The remarkable development of computing power and other technology now allows scientists and businesses to routinely collect datasets of immense size and complexity. Most classical statistical methods were designed for situations with many observations and a few, carefully chosen variables. However, we now often gather data with huge numbers of variables, in an attempt to capture as much information as we can about anything that might conceivably influence the phenomenon of interest. This dramatic increase in the number of variables makes modern datasets strikingly different: well-established traditional methods often perform very poorly, or do not work at all.

Developing methods that can extract meaningful information from these large and challenging datasets has recently been an area of intense research in statistics, machine learning and computer science. In this course, we will study some of the methods that have been developed to analyse such datasets.

Arrangements for **examples classes** are given here.

We will have an **extra lecture** on **Wednesday 28 November at 2pm** in **MR9**.

Initial meetings for **Part III essays** will be held on **Wednesday 28 November** in **MR21**.

Timings are 3.30pm for 'Statistical inference using machine learning methods' and 4pm for 'Recent developments in false discovery rate control'.

- The Elements of Statistical Learning (T. Hastie, R. Tibshirani and J. Friedman) has excellent background material for large parts of this course, presented in a less mathematical style.

- Statistics for High-Dimensional Data (P. Bühlmann and S. van de Geer) covers much of our course and in many places goes into much greater depth than we do.

- Statistical Learning with Sparsity (T. Hastie, R. Tibshirani and M. Wainwright) is excellent for the part of the course on the Lasso and its generalisations.

- Notes on the theory of RKHS (D. Sejdinovic, A. Gretton) gives an excellent detailed treatment of the theory of RKHS's.

- Advanced data analysis from an elementary point of view (C. Shalizi) - chapter 20 and the whole of part IV provide some nice background reading for the part of the course on graphical models and causal inference.

- Lecture notes on Causality (J. Peters) is highly recommended if you want to learn more about causal inference. Parts of our notes are based closely on these, though they go into more depth and cover more topics. You can view a mini-course lectured by him here.

- Some preliminary material prepared for another course I teach may be helpful as a source of basic background material on linear algebra.

The code for the demonstrations is written in R and is given below. RStudio is a useful editor for R. Here are some introductory worksheets on R: Sheet 1, (solutions); Sheet 2, (solutions).
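As a taste of the lasso material covered in the course, here is a minimal sketch of the lasso's coordinate descent algorithm with soft-thresholding. The course demonstrations are in R (e.g. via the glmnet package); this standalone Python version, with a hypothetical toy dataset, is purely illustrative and assumes the design matrix has standardised columns.

```python
# Illustrative sketch of lasso coordinate descent (toy data; course code is in R).

def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - X beta||_2^2 + lam * ||beta||_1
    by cyclic coordinate descent. Assumes each column j of X is
    standardised so that (1/n) * sum_i X[i][j]^2 = 1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals, leaving out the contribution of coordinate j
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            z_j = sum(X[i][j] * r[i] for i in range(n)) / n
            beta[j] = soft_threshold(z_j, lam)
    return beta

# Toy example (hypothetical data): y depends on the first variable only.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2, 2, -2, -2]
beta = lasso_cd(X, y, lam=0.5)
# The l1 penalty shrinks the active coefficient and zeroes the inactive one:
# beta is approximately [1.5, 0.0]
```

Because the penalty is applied coordinate-wise, each update has the closed form above; this is essentially the strategy used by glmnet, though the real implementation adds warm starts over a grid of penalty values and many other optimisations.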

- Example Sheet 3 and solutions.

- © 2017 the Statistical Laboratory, University of Cambridge
