Registry
Module Specifications
Archived Version 2020 - 2021
Description
This module will provide students with the fundamental and advanced skills required for data analytics, including data management, processing, summarization, and predictive analytics. It is focused on providing students with a strong theoretical foundation, along with the ability to make practical use of advanced techniques in the field. The Python programming language will be used to demonstrate various techniques throughout the module, giving students practical tools for solving relatively sophisticated and broadly defined real-world problems in a well-established and widely used programming environment.
Learning Outcomes
1. describe several widely used methods for data storage, including specialized file formats, SQL and NoSQL databases, and key-value stores
2. explore datasets using summary statistics, statistical plots, and advanced data visualization methods (e.g. t-SNE)
3. describe supervised machine learning theory, including problem types, best practices for data preparation, model selection, overfitting and underfitting, and bias-variance tradeoff
4. apply fundamental and advanced classification and regression algorithms including: linear and nonlinear regression, discriminant analysis, decision trees, logistic regression, support vector machines, and ensembles
5. perform various types of generic unsupervised data analytics including cluster analysis, density estimation, and dimensionality reduction
6. describe the principles of modern representation learning and deep learning techniques and evaluate the merits of several state-of-the-art models
7. demonstrate a critical appreciation of available software packages for data analysis
8. demonstrate the ability to implement a predictive analytics pipeline
All module information is indicative and subject to change. For further information, students are advised to refer to the University's Marks and Standards and Programme Specific Regulations at: http://www.dcu.ie/registry/examinations/index.shtml
Indicative Content and Learning Activities

Introduction to Python Programming for Data Analytics
Introduce students unfamiliar with Python to the syntax and structure of the language, with a particular focus on working with data in various forms. This will involve introducing some of the standard numeric and scientific computing libraries available in Python and demonstrating how these can be used to perform several standard tasks, including reading and writing data in standard formats, working with data types, slicing and shaping data, and performing standard manipulation tasks.

Data Summarization and Visualization
Discuss various types of univariate (mean, median, variance, standard deviation, quantiles, mode, etc.) and bivariate (covariance, Pearson and Spearman correlation) statistics and how they can be used to summarize data. Illustrate several ways of visualizing single and multi-dimensional data, including basic plots (scatter, line, bar, contour, image), statistical plots (e.g. histograms, density plots, box plots, violin plots, and error bars), and advanced visualization techniques (e.g. t-SNE). Tufte’s principles for the visual display of quantitative information will be used to demonstrate best practices.

Unsupervised Machine Learning
Discussion of the goals of unsupervised learning, with examples, including the types of objectives used in practice. This will include an in-depth discussion of several standard methods for clustering (k-means, hierarchical clustering, linkage types), and an overview of latent variable models and Principal Component Analysis (PCA). Students will also be expected to learn how to use these models in practice with standard and advanced software tools and applications.

Supervised Machine Learning Principles
Overview and objectives of supervised learning. Introduction to standard notation and conventions, problem types (regression, classification, structured prediction), training and test sets, black-box learning principles, training error, test error, generalization error, and out-of-sample error. Discussion of bias-variance tradeoff, overfitting and underfitting, the no free lunch theorem, model selection, cross-validation, data hygiene, and data snooping.

Supervised Machine Learning Algorithms
Discussion of several important classes of machine learning algorithms, including linear regression, decision trees, ensemble methods, logistic regression, support vector machines, and a range of neural network types. Algorithms for optimizing loss functions (gradient descent and stochastic gradient descent). Types of loss functions (convex and non-convex). Kernel methods.

Representation Learning and Deep Learning
Principles of representation learning. Introduction to multi-layer perceptrons, stacked autoencoders, convolutional neural networks, and recurrent neural networks. Practical optimization methods, GPU-based optimization, and software packages. Illustration of several real-world applications (natural language processing, image classification, speech recognition, information retrieval, recommender systems). Comparative discussion of several industry-standard technologies (TensorFlow, Caffe, Torch).
Indicative Reading List
Other Resources
Website: Scikit-learn: Machine Learning in Python, http://scikit-learn.org/stable/
Website: Pandas: Python Data Analysis Library, http://pandas.pydata.org/
Website: Seaborn: Statistical Data Visualization, http://stanford.edu/~mwaskom/software/seaborn/
Website: TensorFlow: Open Source Software Library for Machine Intelligence, https://www.tensorflow.org/
Programme or List of Programmes