Rajen Shah


R.Shah@statslab.cam.ac.uk
Room: D1.15
Office phone: +44 1223 765923

I am a Professor of Statistics at the Statistical Laboratory, which is part of the Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge. My research interests include high-dimensional statistics, causal inference and large-scale data.

Editorial Service

  • Associate Editor for JRSSB (2016-2020 and 2023-present)

Research

Core Statistical Publications and Preprints

  • Young, E. H. and Shah, R. D. (2024) ROSE Random Forests for Robust Semiparametric Efficient Estimation. Preprint. (.pdf).
  • Klyne, H. and Shah, R. D. (2023) Average partial effect estimation using double machine learning. Preprint. (.pdf).
  • Young, E. H. and Shah, R. D. (2023) Sandwich Boosting for Accurate Estimation in Partially Linear Models for Grouped Data. J. Roy. Statist. Soc., Ser. B., to appear (.pdf).
  • Guo, F. R. and Shah, R. D. (2023) Rank-transformed subsampling: Inference for multiple data splitting and exchangeable p-values. J. Roy. Statist. Soc., Ser. B., to appear (.pdf).
  • Lundborg, A. R., Kim, I., Shah, R. D. and Samworth, R. J. (2022) The Projected Covariance Measure for assumption-lean variable significance testing. Annals of Statistics, to appear. (.pdf).
  • Pein, F. and Shah, R. D. (2021) Cross-validation for change-point regression: pitfalls and solutions. Bernoulli, to appear. (.pdf).
  • Wang, Y. and Shah, R. D. (2020) Debiased Inverse Propensity Score Weighting for Estimation of Average Treatment Effects with High-Dimensional Confounders. Annals of Statistics, to appear. (.pdf) Software: dipw.
  • Shah, R. D. and Bühlmann, P. (2023) Double-estimation-friendly inference for high-dimensional misspecified models. Statistical Science, 38(1), 68-91. (.pdf).
  • Jakobsen, M. E., Shah, R. D., Bühlmann, P. and Peters, J. (2022) Structure Learning for Directed Trees. JMLR, 23, 1-97 (.pdf).
  • Lundborg, A. R., Shah, R. D. and Peters, J. (2022) Conditional Independence Testing in Hilbert Spaces with Applications to Functional Data Analysis. J. Roy. Statist. Soc., Ser. B. 84(5), 1821-1850 (.pdf) Software: ghcm.
  • Stokell, B. G. and Shah, R. D. (2022) High-dimensional regression with potential prior information on variable importance. Statistics and Computing. 32, 52 (.pdf).
  • Zhao, Q., Nianqiao, J., Bacallado, S. and Shah, R. D. (2021) BETS: The dangers of selection bias in early analyses of the coronavirus disease (COVID-19) pandemic. Annals of Applied Statistics, 15(1), 363-390. (.pdf).
  • Stokell, B. G., Shah, R. D. and Tibshirani, R. J. (2021) Modelling High-Dimensional Categorical Data Using Nonconvex Fusion Penalties. J. Roy. Statist. Soc., Ser. B, 83(3), 579-611. (.pdf) Software: CatReg and erratum.
  • Janková, J., Shah, R. D., Bühlmann, P. and Samworth, R. J. (2020) Goodness-of-fit testing in high dimensional generalized linear models. J. Roy. Statist. Soc., Ser. B, 82(3), 773-795. (.pdf) Software: GRPtests.
  • Shah, R. D., Frot, B., Thanei, G. and Meinshausen, N. (2020) Right singular vector projection graphs: fast high dimensional covariance matrix estimation under latent confounding. J. Roy. Statist. Soc., Ser. B., 82(2), 361-389. (.pdf).
  • Shah, R. D. and Peters, J. (2020) The hardness of conditional independence and the generalised covariance measure. Annals of Statistics, 48(3), 1514-1538. (.pdf) Software: GeneralisedCovarianceMeasure.
  • Thanei, G., Meinshausen, N. and Shah, R. D. (2018) The xyz algorithm for fast interaction search in high-dimensional data. JMLR, 19, 1-42. (.pdf).
  • Shah, R. D. and Meinshausen, N. (2018) On b-bit min-wise hashing for large-scale regression and classification with sparse data. JMLR, 18, 1-42. (.pdf)
  • Shah, R. D. and Bühlmann, P. (2017) Goodness-of-fit tests for high-dimensional linear models. J. Roy. Statist. Soc., Ser. B. (.pdf) Software: RPtests.
  • Chen, Y. and Shah, R. D. (2017) Discussion of Random projection ensemble classification by Cannings, T. and Samworth, R. J. J. Roy. Statist. Soc., Ser. B, 79, 1003-1004. (.pdf).
  • Shah, R. D. (2016) Modelling interactions in high-dimensional data with Backtracking. JMLR, 17, 1-31. (.pdf) Software: LassoBacktracking.
  • Shah, R. D. and Samworth, R. J. (2015) Discussion of An adaptive resampling test for detecting the presence of significant predictors by McKeague, I. W. and Qian, M. J. Amer. Statist. Assoc., 110, 1439-1442. (.pdf)
  • Chen, Y., Shah, R. D. and Samworth, R. J. (2014) Discussion of Multiscale change point inference by Frick, K., Munk, A. and Sieling, H. J. Roy. Statist. Soc., Ser. B, 76, 544-546. (.pdf)
  • Shah, R. D. and Meinshausen, N. (2014) Random Intersection Trees. JMLR, 15, 629-654. (.pdf) Software: FSInteract.
  • Shah, R. D. and Samworth, R. J. (2013) Discussion of Correlated variables in regression: clustering and sparse estimation by Bühlmann, Rütimann, van de Geer and Zhang. Journal of Statistical Planning and Inference, 143, 1866-1868. (.pdf)
  • Shah, R. D. and Samworth, R. J. (2013) Variable selection with error control: Another look at Stability Selection. J. Roy. Statist. Soc., Ser. B, 75, 55-80. (.pdf) Associated R code.
  • Shah, R. D. and Samworth, R. J. (2010) Discussion of Stability Selection by Meinshausen and Bühlmann. J. Roy. Statist. Soc., Ser. B, 72, 455-456. (.pdf)

Interdisciplinary Publications

  • Joao Rocha, Satish Arcot Jayaram, Tim J. Stevens, Nadine Muschalik, Rajen D. Shah, Sahar Emran, Cristina Robles, Matthew Freeman, Sean Munro (2023) Functional unknomics: Systematic screening of conserved genes of unknown function. PLOS Biology, 21(8). (.pdf)
  • Mitchell, P. D., Brown, R., Wang, T., Shah, R. D., Samworth, R. J., Deakin, S., Edge, P., Hudson, I., Hutchinson, R., Kaur, K., Lacey, E.-K., Latimer, M., Natarajan, R., Qasim, S., Rehm, A., Sanghrajka, A., Tissingh, E. and Wright, G. (2019) Multi-centre study of non-accidental injury and limb fractures in young children in the East Anglia region, UK. Archives of Disease in Childhood, to appear.
  • Bødker, J. S., Brøndum, R. F., Schmitz, A., Schönherz, A. A., Jespersen, D. S., Sønderkær, M., Vesteghem, C., Due, H., Nøgaard, C. H., Perez-Andres, M., Samur, M. K., Davies, F., Walker, B., Pawlyn, C., Kaiser, M., Johnson, D., Bertsch, U., Broyl, A., van Duin, M., Shah, R., Johansen, P., Nøgaard, M. A., Samworth, R. J., Sonneveld, P., Goldschmidt, H., Morgan, G. J., Orfao, A., Munshi, N., El-Galaly, T., Dybkær, K. and Bøgsted, M. (2018) A multiple myeloma classification system that associates normal B-cell subset phenotypes with prognosis. Blood Advances, 2, 2400-2411.
  • Dybkær, K., Bøgsted, M., Falgreen, S., Bødker, J. S., Kjeldsen, M. K., Schmitz, A., Bilgrau, A. E., Xu-Monette, Z. Y., Li, L., Bergkvist, K. S., Laursen, M. B., Rodrigo-Domingo, M., Marques, S. C., Rasmussen, S. B., Nyegaard, M., Gaihede, M., Møller, M. B., Samworth, R. J., Shah, R. D., Johansen, P., El-Galaly, T. C., Young, K. H. and Johnsen, H. E. (2015) A diffuse large B-cell lymphoma classification system that associates normal B-cell subset phenotypes with prognosis. J. Clinical Oncology, 33, 1379-1388.

Other notes

  • Sparsity. Workshop on Multivariate Analysis Today 2015. (.pdf) Some slides can be found here.
  • High-dimensional data and the Lasso. Eureka 62, 2013 (.pdf)

Group members

PhD students

  • Elliot Young

Alumni

  • Harvey Klyne (Postdoc, Harvard University)
  • Richard Guo (Assistant Professor, University of Washington)
  • Ilmun Kim (Assistant Professor, Yonsei University)
  • Yuhao Wang (Assistant Professor, Tsinghua University)
  • Benjamin Stokell (Quantitative Researcher, G-Research)
  • Gian-Andrea Thanei (Statistical Scientist, Roche)

Teaching