

Module Specifications

Archived Version 2016 - 2017

Module Title
Module Code
School


NFQ level 9
Credit Rating 7.5
Pre-requisite None
Co-requisite None
Compatibles None
Incompatibles None
Description

This module will provide students with the fundamental and advanced skills required for data analytics, including data management, processing, summarization, and predictive analytics. It focuses on giving students a strong theoretical foundation, along with the ability to make practical use of advanced techniques in the field. The Python programming language will be used to demonstrate the various techniques throughout the module, giving students practical tools for solving relatively sophisticated and broadly defined real-world problems in a well-established and widely used programming environment.

Learning Outcomes

1. describe several widely used methods for data storage, including specialized file formats, SQL and NoSQL databases, and key-value stores
2. explore datasets using summary statistics, statistical plots, and advanced data visualization methods (e.g. t-SNE)
3. describe supervised machine learning theory, including problem types, best practices for data preparation, model selection, overfitting and underfitting, and bias-variance tradeoff
4. apply fundamental and advanced classification and regression algorithms including: linear and nonlinear regression, discriminant analysis, decision trees, logistic regression, support vector machines, and ensembles
5. perform various types of generic unsupervised data analytics including cluster analysis, density estimation, and dimensionality reduction
6. describe the principles of modern representation learning and deep learning techniques and evaluate the merits of several state-of-the-art models
7. demonstrate a critical appreciation of available software packages for data analysis
8. demonstrate the ability to implement a predictive analytics pipeline



Workload: Full-time hours per semester
Type               Hours  Description
Lecture            36     Classroom lectures
Independent Study  24     Regular homework
Independent Study  36     Assignment work
Independent Study  92     Self-directed study of materials and study for the examination
Total Workload: 188

All module information is indicative and subject to change. For further information, students are advised to refer to the University's Marks and Standards and Programme Specific Regulations at: http://www.dcu.ie/registry/examinations/index.shtml

Indicative Content and Learning Activities

Introduction to Python Programming for Data Analytics
Introduce students unfamiliar with Python to the syntax and structure of the language, with a particular focus on working with data in various forms. This will involve introducing some of the standard numeric and scientific computing libraries available in Python and demonstrating how these can be used to perform several standard tasks including reading and writing data in standard formats, data types, slicing and shaping data, and performing standard manipulation tasks.
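
The standard tasks listed above (reading and writing data, converting data types, slicing, and shaping) can be sketched as follows. This is a minimal illustration that uses only the Python standard library so that it runs anywhere; the CSV content, column names, and variable names are hypothetical, and the module itself would more likely rely on numeric libraries such as NumPy or pandas.

```python
import csv
import io

# Hypothetical CSV content standing in for a file on disk.
raw = "city,temp_c,humidity\nDublin,11.5,82\nCork,12.1,85\nGalway,10.9,88\n"

# Read: parse rows into dictionaries and convert text fields to numeric types.
rows = [
    {"city": r["city"], "temp_c": float(r["temp_c"]), "humidity": int(r["humidity"])}
    for r in csv.DictReader(io.StringIO(raw))
]

# Slice: take the first two records, and pull one column out as a list.
first_two = rows[:2]
temps = [r["temp_c"] for r in rows]

# Reshape: turn a flat list of 6 values into 2 rows of 3 columns.
flat = [1, 2, 3, 4, 5, 6]
n_cols = 3
grid = [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]

# Write: serialise a derived table (temperature in Fahrenheit) back to CSV text.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["city", "temp_f"])
for r in rows:
    writer.writerow([r["city"], round(r["temp_c"] * 9 / 5 + 32, 1)])
csv_text = out.getvalue()
```

Swapping `io.StringIO` for `open(path, newline="")` would apply the same steps to a real file.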

Data Summarization and Visualization
Discuss various types of univariate (mean, median, variance, stddev, quantiles, mode, etc.) and bivariate (covariance, Pearson and Spearman correlation) statistics and how they can be used to summarize data. Illustrate several ways of visualizing single and multi-dimensional data, including basic plots (scatter, line, bar, contour, image), statistical plots (e.g. histogram, density plot, box plots, violin plots, and error bars), and advanced visualization techniques (e.g. t-SNE). Tufte’s principles for the visual display of quantitative information will be used to demonstrate best practices.
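
As a rough illustration of the univariate and bivariate statistics named above, the sketch below computes them with the Python standard library. The data values are made up, and `covariance`, `pearson`, `ranks`, and `spearman` are helper names introduced here for illustration; Spearman correlation is computed as Pearson correlation on the ranks, with tied values given their average rank.

```python
import math
import statistics

def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Univariate summary of a small hypothetical sample.
data = [2, 4, 4, 5, 7, 9]
summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "variance": statistics.variance(data),   # sample variance
    "stddev": statistics.stdev(data),        # sample standard deviation
    "quartiles": statistics.quantiles(data, n=4),
}

# Bivariate statistics on a monotone but nonlinear relationship (y = x^2).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
cov_xy = covariance(x, y)
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Because `y` increases monotonically with `x` but not linearly, the Spearman coefficient reaches 1 while the Pearson coefficient stays slightly below it, which is the usual reason for reporting both.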

Unsupervised Machine Learning
Discussion of the goals of unsupervised learning with examples, including the types of objectives used in practice. This will include an in-depth discussion of several standard methods for clustering (k-means, hierarchical clustering, linkage types), and an overview of latent variable models and Principal Component Analysis (PCA). Students will also be expected to learn how to use these models in practice with standard and advanced software tools and applications.
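
To make the k-means objective concrete, here is a minimal sketch of Lloyd's algorithm in plain Python. It is an illustration under stated assumptions rather than the module's own material: the points and the initial centroids are hypothetical and supplied by hand, whereas library implementations such as scikit-learn's `KMeans` choose initial centroids automatically.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: alternate assignment and centroid-update steps.

    `centroids` is a list of initial centroid tuples; `n_iter` must be >= 1.
    """
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    return centroids, labels

# Two well-separated hypothetical groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, centroids=[(0, 0), (10, 10)])
```

Each iteration can only lower (or keep) the within-cluster sum of squared distances, which is the objective k-means minimises.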

Supervised Machine Learning Principles
Overview and objectives of supervised learning. Introduction to standard notation and conventions, problem types (regression, classification, structured prediction), training and test sets, black box learning principles, training error, test error, generalization error, and out-of-sample error. Discussion of bias-variance tradeoff, overfitting and underfitting, the no free lunch theorem, model selection, cross-validation, data hygiene, and data snooping.
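
The idea of estimating out-of-sample error with cross-validation can be sketched as below. This is a simplified example, assuming a one-dimensional least-squares line as the model and noiseless hypothetical data; the function names are introduced here for illustration only. The key point is that each fold is held out once and the model is fitted only on the remaining data, so the test points never influence the fit.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_val_mse(xs, ys, k=5, seed=0):
    """Average held-out error over k folds."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the training portion only; the held-out fold is never seen.
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(mse(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / k

# Hypothetical noiseless data from y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
folds = k_fold_indices(len(xs), 5)
cv_error = cross_val_mse(xs, ys, k=5)
```

With noisy data, the gap between training error and this held-out error is one practical signal of overfitting.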

Supervised Machine Learning Algorithms
Discussion of several important classes of machine learning algorithms including linear regression, decision trees, ensemble methods, logistic regression, support vector machines, and a range of neural network types. Algorithms for optimizing loss functions (gradient descent and stochastic gradient descent). Types of loss functions (convex and non-convex). Kernel methods.
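
The difference between gradient descent and stochastic gradient descent can be illustrated on a convex loss. The sketch below minimises mean squared error for a one-dimensional linear model; the data, learning rates, and step counts are assumptions chosen so that both variants converge on this toy problem, not recommended defaults.

```python
import random

# Hypothetical noiseless data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Batch gradient descent: every step uses the gradient over all samples."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def stochastic_gradient_descent(xs, ys, lr=0.02, epochs=2000, seed=0):
    """SGD: every step uses the gradient of the loss on a single sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a fresh random order
        for i in order:
            error = w * xs[i] + b - ys[i]
            w -= lr * 2.0 * error * xs[i]
            b -= lr * 2.0 * error
    return w, b

w_gd, b_gd = gradient_descent(xs, ys)
w_sgd, b_sgd = stochastic_gradient_descent(xs, ys)
```

Both runs recover a slope near 2 and an intercept near 1 here because the loss is convex and the data are noiseless; with noisy data, constant-step SGD would instead hover around the minimum.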

Representation Learning and Deep Learning
Principles of representation learning. Introduction to multi-layer perceptrons, stacked autoencoders, convolutional neural networks, and recurrent neural networks. Practical optimization methods, GPU-based optimization, and software packages. Illustration of several real-world applications (natural language processing, image classification, speech recognition, information retrieval, recommender systems). Comparative discussion of several industry-standard technologies (TensorFlow, Caffe, Torch).
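
A small worked example may help show why a hidden layer matters. The sketch below is the forward pass of a two-layer perceptron that computes XOR, a function no single linear layer can represent. The weights are chosen by hand for illustration (nothing is trained here), and `dense`, `relu`, and `mlp` are names introduced for this sketch.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(W @ inputs + b), row by row."""
    outputs = []
    for weight_row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(weight_row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

# Hand-picked weights (an assumption for illustration, not learned values).
W1 = [[1.0, 1.0], [1.0, 1.0]]   # hidden layer: 2 inputs -> 2 units
b1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]              # output layer: 2 hidden units -> 1 output
b2 = [0.0]

def mlp(x):
    """Forward pass: the hidden ReLU layer re-represents the input so that
    a plain linear output layer can separate the XOR classes."""
    hidden = dense(x, W1, b1, relu)
    return dense(hidden, W2, b2)[0]

predictions = {x: mlp(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

The hidden units compute `relu(x1 + x2)` and `relu(x1 + x2 - 1)`; in that learned-style representation the output is simply the first unit minus twice the second, which is the sense in which the hidden layer provides a more useful representation of the input.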

Assessment Breakdown
Continuous Assessment% Examination Weight%
Course Work Breakdown
Type  Description  % of total  Assessment Date
Reassessment Requirement
Resit arrangements are explained by the following categories:
1 = A resit is available for all components of the module
2 = No resit is available for 100% continuous assessment module
3 = No resit is available for the continuous assessment component
Unavailable
Indicative Reading List

  • Trevor Hastie, Robert Tibshirani, Jerome Friedman: 2009, The Elements of Statistical Learning, Springer, New York, N.Y., 9780387848570
  • Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin: Learning From Data, AMLBook, 1600490069
  • Edward R. Tufte: 2001, The Visual Display of Quantitative Information, Graphics Press, Cheshire, Conn., 0961392142
  • Mark Pilgrim: Dive Into Python, Apress, 1590593561
  • Wes McKinney: Python for Data Analysis, O'Reilly Media, 1449319793
  • Machine Learning in Python: Essential Techniques for Predictive Analysis, John Wiley & Sons, Chichester
Other Resources

  • Website: Scikit-learn: Machine Learning in Python, http://scikit-learn.org/stable/
  • Website: Pandas: Python Data Analysis Library, http://pandas.pydata.org/
  • Website: Seaborn: Statistical Data Visualization, http://stanford.edu/~mwaskom/software/seaborn/
  • Website: TensorFlow: Open Source Software Library for Machine Intelligence, https://www.tensorflow.org/