Qingyuan Zhao 赵卿元


About me
I am a University Lecturer in Statistics at the University of Cambridge. [cv] [Google Scholar]
Brief bio
In 2016–2019 I was a postdoctoral fellow in the Statistics Department of the Wharton School, University of Pennsylvania, mentored by Dylan Small and Sean Hennessy. In 2016 I received my Ph.D. in Statistics from Stanford University (advised by Trevor Hastie). In 2011 I received my B.S. in Mathematics from the Special Class for the Gifted Young (SCGY), University of Science and Technology of China (USTC). I worked at eBay (2013) and Google (2014, 2015) as a summer intern.
D1.01, Wilberforce Road, Cambridge CB3 0WB, UK.


<2019-12-11 Wed> Talk on sensitivity analysis in SAMSI workshop   Academic

<2019-10-15 Tue> BSU seminar   Academic

Slides for the MRC Biostatistics Unit seminar: Using sparsity to overcome unmeasured confounding: Two examples

<2019-10-09 Wed> Causal inference reading group   Academic

Together with Rajen Shah, I will be organizing a reading group on causal inference. In the Michaelmas term we will meet every Tuesday, 1:30–3pm, from 22 October; the topic is causal inference and high-dimensional statistics. Readings for the individual meetings can be found on talks.cam. The reading group is intended to be casual and friendly, and all are welcome.

<2019-09-27 Fri> Mac set-up for statisticians   Academic

For modern statisticians and data scientists, using your computing device efficiently is key to success.

I have been using Emacs for my day-to-day statistics work since 2012. I am sharing my set-up here.

<2019-08-01 Thu> Appointment started at University of Cambridge

<2019-07-02 Tue> New page on Mendelian randomization and instrumental variables

The new webpage contains useful information about summary-data Mendelian randomization and related research, including software and my recent talks.

<2019-05-11 Sat> Article accepted by Annals of Statistics   Academic

<2019-04-25 Thu> Article accepted by Journal of the Royal Statistical Society (Series B)   Academic

<2019-03-27 Wed> Mendelian randomization talk   Academic

Click for the slides and handout of the talk I gave at the student lunch seminar in the Wharton Statistics Department.

<2018-12-19 Wed> Article accepted by Statistical Science   Academic

Two-sample instrumental variable analyses using heterogeneous samples (with Jingshu Wang, Wes Spiller, Jack Bowden, Dylan Small).

<2018-11-30 Fri> Slides for Larry Brown Memorial Workshop   Academic

Click the link for my slides.

I had the fortune to have a few conversations with Larry during the first year of my postdoc. Every time he struck me as a brilliant statistician but a humble person. At one time I presented our work on selective inference for effect modification in Larry's Friday workshop and received many incisive comments from him. I wish I could have talked to him more often.

<2018-10-19 Fri> Homepage upgrade   Life

I switched the HTML style from Twitter Bootstrap to the awesome Bigblow theme. This entire homepage is written in org-mode and generated by a single command in Emacs!

<2018-10-16 Tue> –<2018-10-20 Sat> ASHG 2018   Academic Life

This is my first time at the annual meeting of the American Society of Human Genetics. I presented a poster (#3196T, Reviewer's Choice abstract) about our recent work on Mendelian randomization.

Check out this video on Instagram taken when I was enjoying riding the electric scooter at the San Diego waterfront.

<2018-09-24 Mon> New commentary on the causal inference data competition   Academic

<2018-09-12 Fri> New report on selective inference for effect modification   Academic

<2018-08-07 Tue> New article on performance evaluation of mutual funds   Academic

<2018-08-01 Wed> Trip to Vancouver and the Canadian Rockies   Life

Check out this post on Instagram for some highlights of the trip.

<2018-06-19 Tue> –<2018-06-21 Thu> EcoSta 2018   Academic

I presented our new results on estimating the skill of mutual fund managers in an invited session. [slides]

<2018-05-21 Mon> –<2018-05-23 Wed> ACIC 2018   Academic

I organized and spoke in a session on "New advancements in sensitivity analysis of observational studies" about our new percentile bootstrap approach to sensitivity analysis. [slides]

I went to a special workshop on treatment effect heterogeneity and reported the analysis results using selective inference on a dataset provided by the workshop organizers. [slides]

<2018-04-24 Tue> Talks on Mendelian randomization   Academic

I visited University of Minnesota (Stats Department), Johns Hopkins (Biostats Department), UC Berkeley (Biostatistics division) and Stanford (Stats Department) to give seminars about our new work on Mendelian randomization. [slides]


Research interests

I am interested in causal inference, high dimensional statistics and their applications in biomedical and social sciences. Some of my current research interests include

Rajen Shah and I are currently running a reading group in causal inference.


Below is a list of my publications.


Selecting and ranking individualized treatment rules with unmeasured confounding. [preprint]   Sensitivity_Analysis Causal_Inference Multiple_Testing

  • Authors: Bo Zhang, Jordan Weiss, Dylan S. Small, Qingyuan Zhao.
  • Summary: Optimal individualized treatment rules try to assign the best treatment to every individual, but they may be very sensitive to unmeasured confounding bias for groups of people exhibiting a small treatment effect in the observational study. We quantify this idea using Rosenbaum's sensitivity analysis model and propose some general ways to select and rank individualized treatment rules using multiple testing procedures.

The role of lipoprotein subfractions in coronary artery disease: A Mendelian randomization study. [bioRxiv]   Mendelian_Randomization Epidemiology Causal_Inference

  • Authors: Qingyuan Zhao, Jingshu Wang, Zhen Miao, Nancy Zhang, Sean Hennessy, Dylan S Small, Daniel J Rader.
  • Summary: We apply the MR-RAPS method we developed in previous articles to infer the potential causal role of lipoprotein subfractions in CAD. This is motivated by the finding in our earlier IJE paper that the association between genetically-determined HDL-C and CAD is heterogeneous according to instrument strength. In this study, we find that HDL subfraction traits, unlike LDL and VLDL subfractions, appear to have heterogeneous effects on coronary artery disease according to particle size. The concentration of medium HDL particles may have a protective effect on CAD that is independent of traditional lipid factors.

An Interval Estimation Approach to Selection Bias in Observational Studies [arXiv]   Sensitivity_Analysis Causal_Inference

  • Authors: Matthew Tudball, Rachael Hughes, Kate Tilling, Qingyuan Zhao, Jack Bowden.
  • Summary: We use asymptotic results for the sample average approximation in stochastic programming to construct confidence intervals for a class of partially-identified problems in observational studies.


Comment: Will competition-winning methods for causal inference also succeed in practice? In Statistical Science, 2019 [journal link] [preprint]   Machine_Learning Causal_Inference

  • Authors: Qingyuan Zhao, Luke Keele, Dylan Small.
  • Summary: This is an invited commentary for Statistical Science on the causal inference data competition in ACIC 2016.

Selective inference for effect modification: An empirical investigation. In Observational Studies, 2019 [journal link] [preprint]   Causal_Inference Effect_Modification

  • Authors: Qingyuan Zhao, Snigdha Panigrahi.
  • Summary: In a special workshop in ACIC 2018, we were invited to analyze a simulated dataset to detect treatment effect heterogeneity. This article reports our results presented in the workshop. We also tried out more recent selective inference methods based on the selective sampler.

Performance evaluation with latent factors. [preprint] [SSRN]   

  • Authors: Yang Song, Qingyuan Zhao.
  • Summary: We use Confounder Adjusted Testing and Estimating (CATE) proposed in our previous paper to estimate the abnormal return (aka "alpha") of U.S. equity mutual funds. When funds are ranked by the difference between CATE alpha and CAPM alpha, the top decile outperforms the bottom decile by 500 bps per year. We also find evidence that mutual fund flows become less responsive to FFC factors.
  • Slides at EcoSta 2018.

Falsification tests for instrumental variable designs with an application to tendency to operate. Medical Care, 2019 [journal link]   Causal_Inference Instrumental_Variable

  • Authors: Luke Keele, Qingyuan Zhao, Rachel Kelz, Dylan Small.
  • Summary: We propose a falsification test for the IV assumptions using sub-populations of the data with an overwhelming proportion of treated or untreated units. If the IV assumptions hold, we should find that the intention-to-treat effect is zero within these sub-populations. We demonstrate this test using an IV known as tendency to operate (TTO) from health services research.

Powerful three-sample genome-wide design and robust statistical inference in summary-data Mendelian randomization. To appear in International Journal of Epidemiology [preprint] [arXiv]   Causal_Inference Epidemiology Instrumental_Variable Mendelian_Randomization

  • Authors: Qingyuan Zhao, Yang Chen, Jingshu Wang, Dylan Small.
  • Summary: We extend the MR-RAPS method in our previous paper using the empirical partially Bayes framework described by Lindsay, allowing a true genome-wide design for Mendelian randomization.
  • Slides (more accessible version); Handout.

Improving the accuracy of two-sample summary data Mendelian randomization: moving beyond the NOME assumption. To appear in International Journal of Epidemiology. [journal link] [bioRxiv]   Causal_Inference Epidemiology Instrumental_Variable Mendelian_Randomization

  • Authors: Jack Bowden, Fabiola Del Greco M, Cosetta Minelli, Debbie Lawlor, Qingyuan Zhao, Nuala Sheehan, John Thompson, George Davey Smith.
  • Summary: This paper proposes a modified Cochran's \(Q\) statistic to detect horizontal pleiotropy in Mendelian randomization. This extension is quite important when there are many weak genetic instruments.
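
For readers unfamiliar with the starting point, here is a minimal sketch of the standard (first-order) Cochran's Q statistic in two-sample summary-data Mendelian randomization. It is not the modified statistic proposed in the paper; the function and argument names are hypothetical, and the weights ignore the uncertainty in the SNP-exposure estimates (the NOME assumption).

```python
def cochran_q(gamma_hat, Gamma_hat, se_Gamma):
    """Standard first-order Cochran's Q (illustrative sketch, hypothetical names).

    gamma_hat: SNP-exposure effect estimates
    Gamma_hat: SNP-outcome effect estimates
    se_Gamma:  standard errors of the SNP-outcome estimates
    Returns the IVW estimate, the Q statistic and its degrees of freedom.
    """
    # Per-SNP ratio estimates of the causal effect.
    ratios = [G / g for g, G in zip(gamma_hat, Gamma_hat)]
    # First-order inverse-variance weights; uncertainty in gamma_hat is ignored.
    weights = [(g / s) ** 2 for g, s in zip(gamma_hat, se_Gamma)]
    # Inverse-variance weighted (IVW) estimate of the causal effect.
    beta_ivw = sum(w * b for w, b in zip(weights, ratios)) / sum(weights)
    # Weighted sum of squared deviations of the ratio estimates from the IVW estimate.
    q = sum(w * (b - beta_ivw) ** 2 for w, b in zip(weights, ratios))
    return beta_ivw, q, len(ratios) - 1
```

A large Q relative to a chi-squared distribution with the returned degrees of freedom is usually read as evidence of heterogeneity among the instruments.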

Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. To appear in Annals of Statistics. [preprint] [arXiv]   Causal_Inference Epidemiology Instrumental_Variable Mendelian_Randomization

  • Authors: Qingyuan Zhao, Jingshu Wang, Gibran Hemani, Jack Bowden, Dylan Small.
  • Summary: We give a comprehensive theoretical basis for two-sample summary-data Mendelian randomization. We find that horizontal pleiotropy is pervasive in MR studies. We propose a new method, the robust adjusted profile score, that can consistently estimate the causal effect under pervasive balanced pleiotropy and is robust to occasional outliers.
  • Software: R package mr.raps is available on CRAN. It can be directly called from the TwoSampleMR platform; see this documentation.
  • Slides at UMN; Slides at JHU.


Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. To appear in Journal of the Royal Statistical Society (Series B, Statistical Methodology) [journal link] [preprint] [arXiv]   Causal_Inference Sensitivity_Analysis

  • Authors: Qingyuan Zhao, Dylan Small, Bhaswar Bhattacharya.
  • Summary: Rosenbaum’s sensitivity analysis framework has several limitations: 1. It is mostly applicable to matched observational studies; 2. It only tests the sharp null hypothesis; 3. It assumes treatment effect homogeneity to obtain a confidence interval of the causal effect. Seeking to overcome these limitations, we propose a new approach to sensitivity analysis based on the inverse probability weighting estimator. The key ideas are to use numerical optimization to estimate the causal effect bound and to use the percentile bootstrap to quantify the sampling uncertainty.
  • Slides at ACIC '18. Slides at OSU Bayesian causal inference workshop.
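
The percentile bootstrap step mentioned in the summary can be illustrated with a minimal sketch. It deliberately omits the sensitivity model and the numerical optimization of the bound, and it takes the propensity scores as given rather than re-estimating them in each resample, so it is not the procedure from the paper. All names are hypothetical.

```python
import random


def hajek_ipw(y, a, e):
    """Stabilized (Hajek) IPW estimate of the average treatment effect.

    y: outcomes; a: 0/1 treatment indicators; e: propensity scores in (0, 1).
    """
    w1 = [ai / ei for ai, ei in zip(a, e)]
    w0 = [(1 - ai) / (1 - ei) for ai, ei in zip(a, e)]
    mu1 = sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
    mu0 = sum(w * yi for w, yi in zip(w0, y)) / sum(w0)
    return mu1 - mu0


def percentile_bootstrap_ci(y, a, e, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Hajek IPW estimate (sketch)."""
    rng = random.Random(seed)
    n = len(y)
    estimates = []
    while len(estimates) < n_boot:
        # Resample units with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        a_b = [a[i] for i in idx]
        # Redraw resamples that contain only treated or only control units.
        if sum(a_b) in (0, n):
            continue
        y_b = [y[i] for i in idx]
        e_b = [e[i] for i in idx]
        estimates.append(hajek_ipw(y_b, a_b, e_b))
    estimates.sort()
    # Empirical alpha/2 and 1 - alpha/2 quantiles of the bootstrap estimates.
    lower = estimates[int(n_boot * alpha / 2)]
    upper = estimates[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lower, upper
```

In the approach described above, the quantity recomputed in each resample would be the optimized bound under the sensitivity model rather than a single point estimate.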

Graphical diagnosis of confounding bias in instrumental variables analysis. In Epidemiology, 2018. [journal link]   Causal_Inference Instrumental_Variable

  • Authors: Qingyuan Zhao, Dylan Small.
  • Summary: This research letter proposes a new diagnostic plot for IV analysis, so that large bias ratios (compared to the OLS estimator) are not over-interpreted when the covariate is unrelated to the outcome.
  • Software: R functions iv.diagnosis and iv.diagnosis.plot in the package ivmodel on CRAN.

Two-sample instrumental variable analyses using heterogeneous samples. In Statistical Science, 2019. [journal link] [preprint] [arXiv]   Causal_Inference Instrumental_Variable Mendelian_Randomization

  • Authors: Qingyuan Zhao, Jingshu Wang, Wes Spiller, Jack Bowden, Dylan Small.
  • Summary: Many modern IV studies (especially MR) are carried out with the two-sample design, where the samples may come from different populations. We derive a new class of linear IV estimates that are robust to sample heterogeneity. We then attempt to relax the linearity assumption and find that the two-sample design generally requires more untestable assumptions.
  • Slides at MRC Integrative Epidemiology Unit, University of Bristol.

Selective inference for effect modification via the lasso. [preprint] [arXiv]   Causal_Inference Effect_Modification Selective_Inference

  • Authors: Qingyuan Zhao, Dylan Small, Ashkan Ertefaie.
  • Summary: We approach the heterogeneous treatment effect problem in a different way. Instead of trying to obtain the optimal treatment regime, we seek an interpretable model for effect modification using the recently developed selective inference framework.
  • Slides at ACIC '17. Slides at ICSA '18.

Multiple testing when many p-values are uniformly conservative, with application to testing qualitative interaction in educational interventions. To appear in Journal of the American Statistical Association. [journal link] [preprint] [arXiv]   Effect_Modification Selective_Inference Multiple_Testing

  • Authors: Qingyuan Zhao, Dylan Small, Weijie Su.
  • Summary: Qualitative interaction is an extreme form of treatment effect heterogeneity where the treatment can be beneficial for some but harmful for others. We formulated this question as a global testing problem with many conservative null \(p\)-values and proposed a simple technique—conditioning—to greatly improve the statistical power.
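
A minimal sketch of the conditioning idea, as I understand it from the summary: discard the p-values above a threshold and rescale the rest before applying a global test. The threshold name `tau`, its default value and the choice of a Bonferroni global test are assumptions made for illustration, not necessarily the choices in the paper.

```python
def conditional_p_values(p_values, tau=0.5):
    """Keep the p-values at or below tau and rescale them by tau (sketch)."""
    return [p / tau for p in p_values if p <= tau]


def bonferroni_global_test(p_values, alpha=0.05):
    """Reject the global null if the Bonferroni-adjusted minimum p-value is <= alpha."""
    if not p_values:
        return False
    return min(p_values) * len(p_values) <= alpha


# One small p-value among 99 very conservative ones.
p_values = [0.004] + [0.9] * 99
print(bonferroni_global_test(p_values))                             # unconditional test
print(bonferroni_global_test(conditional_p_values(p_values, 0.5)))  # after conditioning
```

In this toy example the unconditional Bonferroni test pays for all 100 hypotheses, while the conditional version only pays for the single p-value that survives the threshold.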

Cross-screening in observational studies that test many hypotheses. In Journal of the American Statistical Association, 2018. [journal link] [preprint] [arXiv]   Causal_Inference Sensitivity_Analysis Multiple_Testing

  • Authors: Qingyuan Zhao, Dylan Small, Paul Rosenbaum.
  • Summary: This paper proposes a new method called "cross-screening" to increase the power of sensitivity analysis when multiple causal hypotheses need to be tested simultaneously.
  • Software: R package CrossScreening and package vignette.

On sensitivity value of pair-matched observational studies. To appear in Journal of the American Statistical Association. [journal link] [preprint] [arXiv]   Causal_Inference Sensitivity_Analysis

  • Authors: Qingyuan Zhao.
  • Summary: A crucial quantity in Rosenbaum’s sensitivity analysis is the "sensitivity value", the amount of unmeasured confounding needed to alter the qualitative conclusions of an observational study. This paper looks into the properties of "sensitivity value" and characterizes its asymptotic behaviors.
  • Slides at JSM '17.

Estimation and prediction in sparse and unbalanced tables. [preprint] [arXiv]   Computation

  • Authors: Qingyuan Zhao, Trevor Hastie, Daryl Pregibon.
  • Summary: When there is a multi-way table where each dimension has a large number of levels, it is computationally intensive to fit even the standard mixed effects models. We propose a novel hierarchical ANOVA representation for such data. Modeling back-fitting requires repeated calculations of sub-table means, which can be efficiently computed when observations are sparse.
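
To illustrate why sparsity helps, here is a minimal sketch of computing the means along one dimension of a sparse multi-way table in a single pass over the observed cells. The data layout (a list of `(levels, value)` pairs) and the function name are hypothetical, and this is not the algorithm from the paper.

```python
from collections import defaultdict


def marginal_means(observations, dim):
    """Mean of the observed values for each level of one table dimension.

    observations: iterable of (levels, value), where levels is a tuple giving
                  the level of each dimension for that observed cell.
    dim:          index of the dimension to average over.
    The cost is proportional to the number of observations, not the table size.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for levels, value in observations:
        key = levels[dim]
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Three observed cells of a (user, item) table.
obs = [(("u1", "i1"), 1.0), (("u1", "i2"), 3.0), (("u2", "i1"), 5.0)]
print(marginal_means(obs, 0))  # means by user
print(marginal_means(obs, 1))  # means by item
```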


Causal interpretations of black-box models. To appear in Journal of Business & Economic Statistics. [journal link] [preprint]   Causal_Inference Machine_Learning

  • Authors: Qingyuan Zhao, Trevor Hastie.
  • Summary: This is an invited discussion paper for Journal of Business & Economic Statistics. We link Friedman's partial dependence plot with Pearl's backdoor adjustment formula. We discuss situations when possible causal interpretations can be made for black-box machine learning models.
  • Slides at JSM '16 (JBES invited session).
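
The link mentioned in the summary can be written out as follows. The notation is assumed here rather than taken from the paper: \(f\) is the fitted black-box model, \(X_S\) the features of interest, and \(X_C\) the remaining features.

```latex
% Friedman's partial dependence of a fitted model f on the features X_S
\[
  \bar{f}_S(x_S) = \mathbb{E}_{X_C}\bigl[ f(x_S, X_C) \bigr]
                 = \int f(x_S, x_C) \, dP(x_C).
\]
% Pearl's backdoor adjustment, valid when X_C satisfies the backdoor criterion
\[
  \mathbb{E}\bigl[ Y \mid \mathrm{do}(X_S = x_S) \bigr]
    = \int \mathbb{E}\bigl[ Y \mid X_S = x_S, X_C = x_C \bigr] \, dP(x_C).
\]
```

The two expressions coincide when \(f(x_S, x_C)\) equals the regression function \(\mathbb{E}[Y \mid X_S = x_S, X_C = x_C]\) and \(X_C\) satisfies the backdoor criterion relative to \(X_S\) and \(Y\).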

Comment on "Causal inference using invariant prediction". In Journal of the Royal Statistical Society (Series B), 2016. [journal link] [preprint]   Causal_Inference

  • Authors: Qingyuan Zhao*, Charles Zheng*, Trevor Hastie and Robert Tibshirani.
  • Summary: This is a contributed discussion on the article "Causal inference using invariant prediction" by Peters et al.

Permutation p-value approximation via generalized Stolarsky invariance. In Annals of Statistics, 2019. [journal link] [preprint] [arXiv]   Genomics

  • Authors: Hera He, Kinjal Basu, Qingyuan Zhao, Art Owen.
  • Summary: This paper uses a generalized Stolarsky's invariance principle to approximate the permutation \(p\)-value for two-sample linear test statistics. Along the way we discovered a simple probabilistic proof of Stolarsky's invariance principle.

Covariate balancing propensity score by tailored loss functions. In Annals of Statistics, 2019. [journal link] [preprint] [arXiv]   Causal_Inference

  • Authors: Qingyuan Zhao.
  • Summary: This paper extends the dual interpretation of entropy balancing to general situations and proposes a tailored loss function. Minimizing this loss function by machine learning algorithms generates approximate covariate balance in large function classes.


Confounder adjustment in multiple hypothesis testing. In Annals of Statistics, 2017. [journal link] [preprint] [arXiv]   Causal_Inference Multiple_Testing Genomics

  • Authors: Jingshu Wang*, Qingyuan Zhao*, Trevor Hastie, Art Owen.
  • Summary: Confounding introduces hidden bias to statistical inference. We show that in modern simultaneous testing it is possible to correct for unmeasured confounders. Previous methods including SVA, LEAPP and RUV are unified in the same framework in this paper. Interestingly, confounder adjustment is as efficient as the oracle linear regression when latent variables are strong.
  • Software: R package cate and vignette.
  • Slides.

Entropy balancing is doubly robust. In Journal of Causal Inference, 2017. [journal link] [preprint] [arXiv]   Causal_Inference

  • Authors: Qingyuan Zhao, Daniel Percival.
  • Summary: We show that a recently proposed method called Entropy Balancing is doubly robust; that is, the causal effect estimator is consistent if the propensity score is logistic and/or the outcome regression model is linear in the covariates.
  • Slides at JSM 2015.

SEISMIC: A Self-Exciting Point Process Model for Predicting Tweet Popularity. In Proceedings of ACM SIGKDD, 2015. [journal link] [preprint]    Computation

  • Authors: Qingyuan Zhao, Murat Erdogdu, Hera He, Anand Rajaraman, Jure Leskovec.
  • Summary: We study a simple, and hence extremely noisy, form of information cascade: the tweet. We use a doubly stochastic self-exciting point process to model the retweet process. The SEISMIC model we develop only requires the timestamps and the graph degrees to make more accurate predictions than the state-of-the-art.
  • Software: R package seismic. More information can be found at this webpage at SNAP.
  • Slides at KDD 2015 (Video on YouTube).


Causal Inference (Michaelmas 2019)

Guest lectures

  • Randomization test in Wharton STATS 341. [Lecture notes]
  • A tutorial on instrumental variables and Mendelian randomization in Johns Hopkins. [Slides]


R packages

Software | Link | Description
bootsens | On GitHub | Bootstrapping sensitivity analysis
mr.raps | On CRAN; Developer version on GitHub | Mendelian randomization via robust adjusted profile score
CrossScreening | On CRAN (package vignette) | Multiple testing in pair-matched observational studies
cate | On CRAN (package vignette) | High-dimensional factor analysis and confounder adjusted testing and estimation
seismic | On CRAN; More information on SNAP | Self-exciting process model for information cascade prediction

Emacs setup

I use GNU Emacs (on a Mac) for my day-to-day work. You can find my Emacs configurations on GitHub.

For me, the two most useful features of this setup are:

  1. PDF-TeX synchronization via SyncTeX and Skim, and non-stop compilation via latexmk;
  2. Flexible writing of R code via the Emacs Speaks Statistics (ESS) add-on.

Useful Mac apps

  • Homebrew: the missing package manager for Mac, which should be used to install almost all apps and packages.
  • Skim: lightweight PDF reader which allows PDF-TeX synchronization.
  • Spectacle: keyboard shortcuts to move and resize windows.
  • Karabiner-Elements: keyboard customizer.
  • Caffeine: tiny app that keeps your Mac awake.

Author: Qingyuan Zhao

Email: qyzhao@statslab.cam.ac.uk

Created: 2019-12-11 Wed 12:01
