Statistics Group



High-dimensional statistics

Kueh, Nickl, Samworth, Shah

Recent technological advances have dramatically increased the volume of data that scientists collect. One prototypical problem arising from high-dimensional data is variable selection, faced, for example, by practitioners conducting a microarray experiment who want to flag important proteins for further, potentially costly, investigation.

We have

  • improved and extended the approach of Stability Selection (Meinshausen and Bühlmann, 2010 JRSS-B read paper), significantly increasing its accuracy and giving it wide applicability.

  • widely generalised the applicability of Sure Independence Screening (Fan and Lv, 2008), a technique popular for its computational speed.

  • proposed locally adaptive wavelet estimators for data on compact homogeneous manifolds, such as d-dimensional unit spheres, useful in statistical analysis of data sets in astrophysics, such as ultra-high energy cosmic rays.

  • determined the optimal choice of k for the ubiquitous k-nearest neighbour classifier.
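To make the last point concrete, a standard data-driven surrogate for choosing k is leave-one-out cross-validation: for each candidate k, classify each point using the remaining data and pick the k with the smallest error. This is only an illustrative sketch (the names `knn_predict` and `choose_k_loocv` and the toy data are ours, not from the paper, which instead characterises the theoretically optimal k):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()

def choose_k_loocv(X, y, k_grid):
    """Choose k by minimising the leave-one-out cross-validation error."""
    n = len(X)
    errors = []
    for k in k_grid:
        wrong = 0
        for i in range(n):
            mask = np.arange(n) != i          # hold out point i
            wrong += knn_predict(X[mask], y[mask], X[i], k) != y[i]
        errors.append(wrong / n)
    return k_grid[int(np.argmin(errors))]

# Toy example: two well-separated Gaussian classes in the plane
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y = np.repeat([0, 1], 40)
best_k = choose_k_loocv(X, y, k_grid=[1, 3, 5, 7, 9])
```

In practice odd values of k are preferred for binary problems to avoid tied votes.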

[Figure] Stability selection involves selecting variables that are `stable' under data resamplings, giving practitioners better guidance on where to place the all-important dividing line between signal (red) and noise (blue).
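The resampling idea above can be sketched as follows: run a base variable selector on many random half-subsamples of the data and record the proportion of subsamples in which each variable is selected; variables whose selection frequency exceeds a threshold are kept. This is a minimal illustration, not the group's refined procedure: the base selector here is simple top-d marginal correlation screening (a stand-in for, e.g., the lasso), and the function names and toy data are ours.

```python
import numpy as np

def selection_frequencies(X, y, top_d, n_subsamples=100, seed=0):
    """Proportion of random half-subsamples in which each variable is
    among the top_d most correlated with the response."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        Xs, ys = X[idx], y[idx]
        # absolute marginal correlation of each column with the response
        cors = np.abs((Xs - Xs.mean(0)).T @ (ys - ys.mean()))
        cors /= Xs.std(0) * ys.std() * len(idx)
        counts[np.argsort(cors)[-top_d:]] += 1
    return counts / n_subsamples

# Toy data: 2 signal variables out of 20
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=100)
freqs = selection_frequencies(X, y, top_d=5)
stable = np.where(freqs >= 0.8)[0]  # keep variables selected in >= 80% of subsamples
```

The threshold (0.8 here) plays the role of the dividing line in the figure: signal variables are selected in almost every subsample, while noise variables rarely recur.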



  • Fan, J., Samworth, R. and Wu, Y. (2009)
    Ultra high dimensional feature selection: beyond the linear model
    Journal of Machine Learning Research, 10, 2013-2038.

  • Hall, P., Park, B. U. and Samworth, R. J. (2008)
    Choice of neighbor order in nearest-neighbor classification
    Annals of Statistics, 36, 2135-2152.

  • Kerkyacharian, G., Nickl, R. and Picard, D. (2011)
    Concentration inequalities and confidence bands for needlet density estimators on compact homogeneous manifolds
    Probability Theory and Related Fields, to appear.