Module Specifications
Archived Version 2016 - 2017
 
Description
This module will provide students with the fundamental and advanced skills required for data analytics, including data management, processing, summarization, and predictive analytics. It focuses on giving students a strong theoretical foundation, along with the ability to make practical use of advanced techniques in the field. The Python programming language will be used to demonstrate the various techniques throughout the module, giving students practical tools for solving relatively sophisticated and broadly defined real-world problems in a well-established and widely used programming environment.
Learning Outcomes
1. describe several widely used methods for data storage, including specialized file formats, SQL and NoSQL databases, and key-value stores
2. explore datasets using summary statistics, statistical plots, and advanced data visualization methods (e.g. t-SNE)
3. describe supervised machine learning theory, including problem types, best practices for data preparation, model selection, overfitting and underfitting, and the bias-variance tradeoff
4. apply fundamental and advanced classification and regression algorithms including: linear and nonlinear regression, discriminant analysis, decision trees, logistic regression, support vector machines, and ensembles
5. perform various types of generic unsupervised data analytics including cluster analysis, density estimation, and dimensionality reduction
6. describe the principles of modern representation learning and deep learning techniques and evaluate the merits of several state-of-the-art models
7. demonstrate a critical appreciation of available software packages for data analysis
8. demonstrate the ability to implement a predictive analytics pipeline
All module information is indicative and subject to change. For further information, students are advised to refer to the University's Marks and Standards and Programme Specific Regulations at: http://www.dcu.ie/registry/examinations/index.shtml

Indicative Content and Learning Activities

Introduction to Python Programming for Data Analytics
Introduce students unfamiliar with Python to the syntax and structure of the language, with a particular focus on working with data in various forms. This will involve introducing some of the standard numeric and scientific computing libraries available in Python and demonstrating how they can be used for several standard tasks, including reading and writing data in standard formats, working with data types, slicing and shaping data, and performing standard manipulation tasks.

Data Summarization and Visualization
Discuss various types of univariate (mean, median, variance, standard deviation, quantiles, mode, etc.) and bivariate (covariance, Pearson and Spearman correlation) statistics and how they can be used to summarize data. Illustrate several ways of visualizing single- and multi-dimensional data, including basic plots (scatter, line, bar, contour, image), statistical plots (e.g. histograms, density plots, box plots, violin plots, and error bars), and advanced visualization techniques (e.g. t-SNE). Tufte's principles for the visual display of quantitative information will be used to demonstrate best practices.

Unsupervised Machine Learning
Discussion of the goals of unsupervised learning, with examples, including the types of objectives used in practice. This will include an in-depth discussion of several standard methods for clustering (k-means, hierarchical clustering, linkage types), and an overview of latent variable models and Principal Component Analysis (PCA). Students will also be expected to learn how to use these models in practice with standard and advanced software tools and applications.

Supervised Machine Learning Principles
Overview and objectives of supervised learning. Introduction to standard notation and conventions, problem types (regression, classification, structured prediction), training and test sets, black box learning principles, training error, test error, generalization error, and out-of-sample error. Discussion of the bias-variance tradeoff, overfitting and underfitting, the no free lunch theorem, model selection, cross-validation, data hygiene, and data snooping.

Supervised Machine Learning Algorithms
Discussion of several important classes of machine learning algorithms, including linear regression, decision trees, ensemble methods, logistic regression, support vector machines, and a range of neural network types. Algorithms for optimizing loss functions (gradient descent and stochastic gradient descent). Types of loss functions (convex and non-convex). Kernel methods.

Representation Learning and Deep Learning
Principles of representation learning. Introduction to multilayer perceptrons, stacked autoencoders, convolutional neural networks, and recurrent neural networks. Practical optimization methods, GPU-based optimization, and software packages. Illustration of several real-world applications (natural language processing, image classification, speech recognition, information retrieval, recommender systems). Comparative discussion of several industry-standard technologies (TensorFlow, Caffe, Torch).
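
As an illustration of the gradient descent topic listed under Supervised Machine Learning Algorithms, a minimal sketch in Python (the module's language) might look like the following. It is not taken from the module materials: the function names, learning rate, step count, and synthetic data are all assumptions chosen for illustration, and it uses only the standard library rather than any of the packages listed under Other Resources.

```python
# Minimal batch gradient descent for one-variable least-squares regression.
# Everything here (names, hyperparameters, data) is an illustrative assumption.

def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by minimizing mean squared error, a convex loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every training point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w * x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic, noise-free training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = fit_line(xs, ys)
print(f"w={w:.4f}, b={b:.4f}, training MSE={mse(xs, ys, w, b):.2e}")
```

On this noise-free line the fit recovers the slope and intercept to within floating-point tolerance. A stochastic variant would estimate the same gradients from one randomly chosen point (or a small batch) per step instead of the full training set.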
 
Indicative Reading List
 
Other Resources
Website: Scikit-learn: Machine Learning in Python, http://scikit-learn.org/stable/
Website: Pandas: Python Data Analysis Library, http://pandas.pydata.org/
Website: Seaborn: Statistical Data Visualization, http://stanford.edu/~mwaskom/software/seaborn/
Website: TensorFlow: Open Source Software Library for Machine Intelligence, https://www.tensorflow.org/
Programme or List of Programmes  
