The remarkable development of computing power and other technology now allows scientists and businesses to routinely collect datasets of immense size and complexity. Most classical statistical methods were designed for situations with many observations and a few carefully chosen variables. However, we now often gather data with huge numbers of variables, in an attempt to capture as much information as we can about anything that might conceivably influence the phenomenon of interest. This dramatic increase in the number of variables makes modern datasets strikingly different: well-established traditional methods perform very poorly or, in many cases, do not work at all.

Developing methods that can extract meaningful information from these large and challenging datasets has recently been an area of intense research in statistics, machine learning and computer science. In this course, we will study some of the methods that have been developed to analyse such datasets.

- The Elements of Statistical Learning (T. Hastie, R. Tibshirani and J. Friedman) has excellent background material for large parts of this course, presented in a less mathematical style.

- Statistics for High-Dimensional Data (P. Bühlmann and S. van de Geer) covers much of our course and in many places goes into much greater depth than we do.

- Statistical Learning with Sparsity (T. Hastie, R. Tibshirani and M. Wainwright) is excellent for the part of the course on the Lasso and its generalisations.

- Notes on the theory of RKHS (D. Sejdinovic and A. Gretton) gives an excellent detailed treatment of the theory of reproducing kernel Hilbert spaces (RKHSs).

- Advanced data analysis from an elementary point of view (C. Shalizi): Chapter 20 and the whole of Part IV provide some nice background reading for the part of the course on graphical models and causal inference.

- Lecture notes on Causality (J. Peters) is highly recommended if you want to learn more about causal inference. Parts of our notes are based closely on it, though it goes into more depth and covers more topics than we do.

The code for the demonstrations is written in R. RStudio is a useful editor for R. Here are some introductory worksheets on R: Sheet 1 (solutions); Sheet 2 (solutions). The code for the demonstrations is given below.

- Revision questions. You may also wish to look at last year's exam. Note that a different version of the normal tail bound was proved in last year's course, so in question 3, you can ignore the part requiring you to prove a tail bound, or prove an alternative version with an added factor of 2 in the appropriate place.

© 2016 the Statistical Laboratory, University of Cambridge
