The remarkable development of computing power and other technology now allows scientists and businesses to routinely collect datasets of immense size and complexity. Most classical statistical methods were designed for situations with many observations and a few carefully chosen variables. However, we now often gather data with huge numbers of variables, in an attempt to capture as much information as we can about anything that might conceivably influence the phenomenon of interest. This dramatic increase in the number of variables makes modern datasets strikingly different: well-established traditional methods either perform very poorly or do not work at all.

Developing methods that can extract meaningful information from these large and challenging datasets has recently been an area of intense research in statistics, machine learning and computer science. In this course, we will study some of the methods that have been developed to analyse such datasets.

- The Elements of Statistical Learning (T. Hastie, R. Tibshirani and J. Friedman) has excellent background material for large parts of this course, presented in a less mathematical style.

- Statistics for High-Dimensional Data (P. Bühlmann and S. van de Geer) covers much of our course and in many places goes into much greater depth than we do.

- Statistical Learning with Sparsity (T. Hastie, R. Tibshirani and M. Wainwright) is excellent for the part of the course on the Lasso and its generalisations.

- Notes on the theory of RKHS (D. Sejdinovic, A. Gretton) gives an excellent detailed treatment of the theory of reproducing kernel Hilbert spaces (RKHSs).

- Advanced Data Analysis from an Elementary Point of View (C. Shalizi): chapter 20 and the whole of part IV provide some nice background reading for the part of the course on graphical models and causal inference.

- Lecture notes on Causality (J. Peters) is highly recommended if you want to learn more about causal inference. Parts of our notes are based closely on these lecture notes, which go into more depth and cover more topics. You can view a mini-course lectured by him here.

The code for the demonstrations, given below, is written in R. RStudio is a useful editor for R. Here are some introductory worksheets on R: Sheet 1 (solutions); Sheet 2 (solutions).

© 2019 the Statistical Laboratory, University of Cambridge
