Qingyuan Zhao

Professor of Statistics

University of Cambridge

About

I am a Professor of Statistics in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics (DPMMS) at the University of Cambridge, a Fellow of Corpus Christi College, and an Associate Faculty member of the Cambridge Centre for AI in Medicine (CCAIM).

My research interests lie primarily in drawing scientific conclusions about causal relationships using experimental and observational data, a fast-growing area known as “causal inference”. More broadly, I would like to understand how “design”—a principle I view as fundamental yet elusive in statistics—shapes the practice of statistical applications in biomedical and social sciences.

Click here for a bio-sketch in the third person narrative.

Interests

  • Causal inference
  • Selective inference
  • Applied statistics

Education

  • PhD in Statistics, 2016

    Stanford University

  • BSc in Mathematics, 2011

    University of Science and Technology of China (USTC)

News

  • <2024-10-01 Tue> I have been promoted to Professor of Statistics.
  • <2024-09-23 Mon> I am joining the Associate Editor Board of Statistical Science.
  • If you have applied statistics questions, you might be interested in the Statistics Clinic that offers free consulting to University members. I am a regular consultant in the Clinic.
  • If you are interested in causal inference, you might want to check out the weekly Online Causal Inference Seminar.

Contact

  • qyzhao@statslab.cam.ac.uk
  • Centre for Mathematical Sciences, Wilberforce Road, Cambridge, CB3 0WB
  • Pavilion D – Room D1.01

Home

News

  • <2025-06-01 Sun> I will be co-organizing a long-residency program on causal inference at the Isaac Newton Institute from January to June, 2026.

Bio

Bio sketch

Qingyuan was born and raised in Wuhan in central China, a city known for its many lakes and rivers and rich cultural heritage. After high school, he went to the Special Class for the Gifted Young at the University of Science and Technology of China and majored in mathematics. He then went to Stanford University for postgraduate studies and obtained his Ph.D. in Statistics in 2016. He spent three years at the Wharton School of the University of Pennsylvania as a postdoctoral fellow before joining the Statistical Laboratory at the University of Cambridge as a University Lecturer in 2019. He was promoted to Professor of Statistics at the University of Cambridge in 2024.

Qingyuan’s research interests lie primarily in drawing scientific conclusions about causal relationships using experimental and observational data, a fast-growing area known as “causal inference”. More broadly, he strives to understand how “design”—a principle he views as fundamental yet elusive in statistics—shapes the practice of statistical applications in biomedical and social sciences.

Teaching

Causal Inference (Part III, Michaelmas 2019)

This is a 16-lecture course on causal inference, the statistical science of drawing causal conclusions from experimental and non-experimental data.

General information

  • Course syllabus.

  • Time: Tuesday & Thursday, 12-1pm.

  • Location: MR14.

  • Office hour: By appointment.

  • Prerequisites:

    • Familiarity with undergraduate-level probability and statistics.
    • An open mind to apply statistical theory to practical problems.
    • Experience with R or other programming languages is helpful.

Course outline

Lecture notes will be provided after each lecture. I am not quite ready to make the lecture notes public, so email me if you miss the first lecture and need the username and password to access the notes.

| Chapter | Topic | Last updated | Additional materials |
|---------+-------+--------------+----------------------|
| Part I | Motivations | | |
| 1 | Principles of causal inference | <2019-10-16 Wed> | |
| 2 | Randomised experiments | <2019-10-21 Mon> | 2019 Nobel Prize in Economics |
| 3 | Linear structural equation models | <2019-10-23 Wed> | Excerpt from a psychology paper |
| Part II | Languages for causality | | |
| 4 | Probabilistic graphical models | <2019-11-06 Wed> | |
| 5 | Nonparametric structural equations and counterfactuals | <2019-11-06 Wed> | |
| 6 | Causal identification | <2019-11-06 Wed> | |
| Part III | Statistical methods | | |
| 7 | Matching and randomisation inference | <2019-11-14 Thu> | |
| 8 | Semiparametric inference for average treatment effects | <2019-11-19 Tue> | |
| 9 | Instrumental variables | <2019-11-25 Mon> | |
| 10 | Regression discontinuity design | <2019-11-28 Thu> | |
| 11 | Negative control | <2019-11-28 Thu> | |
| 12 | Mediation analysis | <2019-12-02 Mon> | |
| Parts I–III | Full notes (Chapters 1–12) | <2020-01-21 Tue> | |

The full notes include corrections and clarifications to the in-class notes, and solutions or hints to some exercises.

Readings

Example classes

  • Location: MR11.
  • Time: 2-3:30.
| Time | Sheet | Last updated | Solution |
|------+-------+--------------+----------|
| <2019-10-28 Mon> | Example Sheet 1; Dataset; SEM paper | <2019-10-25 Fri> | Mostly in the full lecture notes; R code for Q5 |
| <2019-11-11 Mon> | Example Sheet 2 | <2019-11-06 Wed> | See the full lecture notes. |
| <2019-12-02 Mon> | Example Sheet 3 | <2019-11-25 Mon> | See the full lecture notes. |
| <2020-01-13 Mon> | Reading materials; See below | <2020-01-08 Wed> | |
| <2020-05-11 Mon> | Revision | <2020-05-11 Mon> | Revision notes; Video recording. |

Presentation of applied research articles

The 4th example class will be interactive. The lecturer will provide a list of applied articles (in social sciences, public health, and other areas) before the 3rd example class. Each student will then be asked to pick an article and give a short presentation in the final example class.

It is intended that all students wishing to take the exam of this course can participate in this example class. Please let me know if you have trouble attending this class. More information will be provided during the Michaelmas term.

A tentative list of applied articles (being updated):

  1. Political intolerance and political repression during the McCarthy Red Scare (reprinted in David Freedman’s book, pages 315–342) and Freedman’s comments (Section 6.3).

  2. Finishing high school and starting college: Do Catholic schools make a difference (reprinted in David Freedman’s book, pages 343–376) and Freedman’s comments (Section 7.4).

  3. Education and fertility: Implications for the roles women occupy (reprinted in David Freedman’s book, pages 377–401) and Freedman’s comments (Section 9.5).

  4. Institutional arrangements and the creation of social capital: The effects of public school choice (reprinted in David Freedman’s book, pages 402–430) and Freedman’s comments (Section 9.7).

  5. A new perspective on John Snow’s communicable disease theory (a short version can be found in David Freedman’s book, Section 1.3) and Sociomedical indicators in the cholera epidemic in Ferrara of 1855 (also this editorial and commentary). These articles may be suitable for two presenters, one focusing on John Snow’s analysis and one on the Ferrara study.

  6. Thomas Cook’s commentary on Cochran (pages 140–163, Volume 1 of Observational Studies) has three examples demonstrating the limitations of observational studies that seek to mimic randomised experiments (page 146: Issue 1; page 153: Issue 2; page 157: Issue 3). This article may be suitable for up to three presenters.

  7. Triangulation in aetiological epidemiology has three illustrative examples for corroboration of evidence from different causal inference approaches. Some of the results are in the supplementary materials and can be downloaded from the IJE website. This article may be suitable for up to three presenters.

  8. Those confounded vitamins: what can we learn from the differences between observational versus randomised trial evidence?

  9. Global warming is anthropogenic (Section 3.1 of this book by Fred Bookstein).

Statistical Modelling (Part II, Michaelmas 2020)

General information

  • This course consists of 16 lectures and 8 practical sessions. It complements the Part II Principles of Statistics, but takes a more applied perspective.

  • Course schedule (page 25).

  • Prerequisites: Part IB Statistics.

  • Due to the pandemic, all the lectures and practicals will be online. Videos and handwritten notes will be made available through this webpage. I won’t be providing LaTeX notes but will try to follow the notation in last year’s lecture notes.

  • Please email me or leave a comment below if you find any mistakes or have any questions.

  • Recordings can be found in Moodle.

Lectures

| Number | Date | Topic | Optional Reading |
|--------+------+-------+------------------|
| Part 1 | | Linear models | |
| L1 | <2020-10-08 Thu> | Scope of the course, least squares and its geometry | Agresti 2.1–2.4 |
| L2 | <2020-10-13 Tue> | Gauss-Markov, exact inference under normality | Agresti 2.7, 3.1–3.3 |
| L3 | <2020-10-15 Thu> | Heteroscedasticity, diagnostics, robust regression | Agresti 2.5 |
| L4 | <2020-10-20 Tue> | Model misspecification, bias-variance tradeoff | ISLR 2.2.1–2.2.2 |
| L5 | <2020-10-22 Thu> | Simpson’s paradox, model selection | ISLR 6.1–6.2 |
| L6 | <2020-10-27 Tue> | Review of linear models, likelihood asymptotics | 2019 notes 2.5.1–2.5.3 |
| L7 | <2020-10-29 Thu> | Delta method; From LM to GLM | Common distributions |
| L8 | <2020-11-03 Tue> | Properties of exponential families | Efron’s notes on empirical Bayes |
| L9 | <2020-11-05 Thu> | Conjugate priors, MLE, Deviance | |
| L10 | <2020-11-10 Tue> | Deviance residuals, exponential dispersion families | 2019 notes 2.3 |
| L11 | <2020-11-12 Thu> | GLMs: MLE | Agresti 4.1–4.3 |
| L12 | <2020-11-17 Tue> | GLMs: Analysis of deviance, computation | Agresti 4.4–4.5 |
| L13 | <2020-11-19 Thu> | GLMs: Model selection, diagnostics, binomial models | Agresti 4.4, 4.6 |
| L14 | <2020-11-24 Tue> | Poisson GLMs; Multinomial model and Poisson trick | Agresti 7.1, 7.2 |
| L15 | <2020-11-26 Thu> | Contingency tables and independence | |
| L16 | <2020-12-01 Tue> | Review and look forward | |

Full notes (to L12): PDF (181MB); GoodNotes (146MB).

Practicals

| Number | Date | Topic | Optional Reading |
|--------+------+-------+------------------|
| P1 | <2020-10-10 Sat> | Basic R; Solution | CRAN Intro to R 1, 2, 5, 8 |
| P2 | <2020-10-17 Sat> | Writing functions, linear models; Code; Solution | CRAN Intro to R 6, 10 |
| P3 | <2020-10-24 Sat> | Linear models; Code | CRAN Intro to R 11.1–3 |
| P4 | <2020-10-31 Sat> | Model selection; Code; Solution | |
| P5 | <2020-11-07 Sat> | ANOVA and ANCOVA; Code; Solution | 2019 notes 1.2.5 |
| P6 | <2020-11-14 Sat> | Binomial GLMs; Code; Solution | |
| P7 | <2020-11-21 Sat> | Binomial and Poisson GLMs; Code; Solution | |
| P8 | <2020-11-28 Sat> | Contingency tables and Gamma GLMs; Code; Solution | Agresti 4.7 |

Example sheets

Readings

  • R and statistical computing

Causal Inference (Part III, Michaelmas 2020)

This is a 16-lecture course on causal inference, the statistical science of drawing causal conclusions from experimental and non-experimental data.

General information

  • Course syllabus.
  • Time: Tuesday & Thursday, 11am–12.
  • Location: Live-stream via Zoom (link available in Moodle).
  • Office hour: I will stay on Zoom after each lecture to answer questions. I would also like to chat with every Part III student who is taking this course. Please sign up here for a 20-minute slot.
  • Please email me if you find any mistakes or have any suggestions.

Lectures

| Number | Date | Topic | Optional Reading |
|--------+------+-------+------------------|
| Part 1 | | Motivations | |
| L1 | <2020-10-08 Thu> | Principles of causal inference | Pearl Epilogue |
| L2 | <2020-10-13 Tue> | Potential outcomes and Neyman’s inference | IR 1, 4, 6 |
| L3 | <2020-10-15 Thu> | Randomisation test, regression adjustment | IR 5, 7 |
| L4 | <2020-10-20 Tue> | Regression adjustment; Linear SEM and path analysis | Pearl 5.1 |
| L5 | <2020-10-22 Thu> | Path analysis, correlation versus causation | Review paper by Pearl |
| L6 | <2020-10-27 Tue> | Identification and estimation in linear SEMs | A psychology paper |
| L7 | <2020-10-29 Thu> | Graphical models and Markov properties | |
| L8 | <2020-11-03 Tue> | Structure discovery; Nonparametric SEMs | Talk on SWIGs |
| L9 | <2020-11-05 Thu> | Single world intervention graphs; g-formula | |
| L10 | <2020-11-10 Tue> | Causal identification | HR 6 |
| L11 | <2020-11-12 Thu> | No unmeasured confounders: Randomisation inference | SSRMP tutorial slides |
| L12 | <2020-11-17 Tue> | Sensitivity analysis; Intro to semiparametric inference | Review paper by Kennedy |
| L13 | <2020-11-19 Thu> | No unmeasured confounders: Semiparametric inference | |
| L14 | <2020-11-24 Tue> | Doubly robust estimator; Leveraging specificity | |
| L15 | <2020-11-26 Thu> | Instrumental variables | |
| L16 | <2020-12-01 Tue> | Mediation analysis | |

Full Lecture notes (Last updated: December 16, 2020).

Lecture recordings can be found in Moodle.

Example classes

Readings

The following books/articles are optional. I am providing a short (personal) verdict to help you navigate the literature.

Statistical Modelling (Part II, Michaelmas 2021)

General information

  • This course consists of 16 lectures and 8 practical sessions. It complements the Part II Principles of Statistics, but takes a more applied perspective.

  • Prerequisites: Part IB Statistics.

  • Location: MR5.

  • Please email me or leave a comment below if you find any mistakes or have any questions.

  • Lectures will be recorded and the recordings can be found on Moodle.

Lectures

Practicals

| Number | Date | Topic | Optional Reading |
|--------+------+-------+------------------|
| P1 | <2021-10-09 Sat> | Basic R; Solution | CRAN Intro to R 1, 2, 5, 8 |
| P2 | <2021-10-16 Sat> | Writing functions, linear models; Code; Solution | CRAN Intro to R 6, 10 |
| P3 | <2021-10-23 Sat> | Linear models; Code | CRAN Intro to R 11.1–3 |
| P4 | <2021-10-30 Sat> | Model selection; Code; Solution | |
| P5 | <2021-11-06 Sat> | ANOVA and ANCOVA; Code; Solution | 2019 notes 1.2.5 |
| P6 | <2021-11-13 Sat> | Binomial GLMs; Code; Solution | |
| P7 | <2021-11-20 Sat> | Binomial and Poisson GLMs; Code; Solution | |
| P8 | <2021-11-27 Sat> | Contingency tables and Gamma GLMs; Code; Solution | Agresti 4.7 |

Example sheets

Readings

  • R and statistical computing

    • W. N. Venables, D. M. Smith and the R Core Team. An Introduction to R.

    • H. Wickham. Advanced R (for anyone who wants to really understand R as a programming language).

Causal Inference (Part III, Michaelmas 2021)

This is a 16-lecture course on causal inference, the statistical science of drawing causal conclusions from experimental and non-experimental data.

General information

  • Course syllabus.

  • Office hour: Wednesday at 2pm (if there is no example class) @ CMS, D1.01.

  • Please email me or leave a comment below if you find any mistakes or have any questions.

Lectures

Example classes

  • Time: 13:45–15:15 on 3 November, 17 November, 1 December, 19 January.
  • Instructor: Tobias Freidling, who also provided the solutions below.
  • Location: MR3.

Readings

The following books/articles are optional. I am providing a short (personal) verdict to help you navigate the literature.

Introduction to Causal Inference (MPhil in Population Health, Lent 2022)

Click here for the theory slides.

Click here for the practical below in PDF format.

Randomization in design and analysis

Randomized controlled trials (RCTs) are widely regarded as the “gold standard” for establishing causality. An often forgotten feature of RCTs is that they can be analyzed objectively by a randomization test. Haines and coauthors investigated the impact of disinvestment from weekend allied health services. We will use their dataset to explore the concept of randomization in the design and analysis of an experiment.

  1. [Group] Skim through the abstract and read the section called “Design” of their article. Then answer the following questions: What is the name of the design of the experiment in this study? How was it carried out?

  2. Download the patient-level data, then run the following code in R (you may need to install the readxl package first with install.packages("readxl")). What does the second line do?

    data <- readxl::read_excel("S2 Data.xlsx")
    data <- subset(data, hospital == "Dandenong" & study1 == 1)
    
  3. Unfortunately, this dataset is not very well annotated. The columns index_ward and sw_step contain the identifiers for hospital ward and time step (in calendar month), respectively. In which order do you think the wards crossed over to no weekend health services? You may find the following R code useful.

    table(data[, c("index_ward", "sw_step", "no_we_exposure")])
    
  4. Construct a vector called cross_over_realized that contains the calendar month in which the \(6\) hospital wards crossed over. Then use the following code to define the treatment and outcome of interest (“los” is short for length of stay).

    data$treatment_status <- as.numeric(data$sw_step >= cross_over_realized[data$index_ward])
    data$log_acute_los <- log(data$acute_los)
    
  5. [Group] Execute the following code in your R session. Then comment on the two interval estimators of the treatment effect (of no weekend health services on log length of stay).

    confint(lm(log_acute_los ~ treatment_status, data))
    confint(lm(log_acute_los ~ treatment_status + as.factor(index_ward), data))
    
  6. [Group] Next, we explore the randomization analysis of this dataset. First, use potential outcomes to define the null hypothesis that stopping weekend health services has no effect whatsoever. Notice that the treatment is not the same as the variable randomized in the experiment (crossover order). What assumption did you make when defining your null hypothesis? Give an example in which this assumption is not satisfied.

  7. Read the following code, then execute it in your R session (you may need to install the package combinat which contains a function permn that generates all the permutations of a vector). For your reference, the expected output is included.

    get_statistic <- function(index_ward,
                              sw_step,
                              log_acute_los,
                              cross_over) {
      treatment_status <- sw_step >= cross_over[index_ward]
      c(lm(log_acute_los ~ treatment_status)$coef[2],
        lm(log_acute_los ~ treatment_status + as.factor(index_ward))$coef[2])
    }
    
    T_obs <- get_statistic(data$index_ward, data$sw_step,
                           data$log_acute_los, cross_over_realized)
    
    T_random <- sapply(combinat::permn(2:7),
                       get_statistic,
                       index_ward = data$index_ward,
                       sw_step = data$sw_step,
                       log_acute_los = data$log_acute_los)
    
    par(mfrow = c(2, 1))
    for (m in 1:2) {
      hist(T_random[m, ], 20,
           main = paste0("Randomization distribution (model ", m, "): ",
                         "p-value = ", signif(mean(T_random[m, ] >= T_obs[m]), 2)),
           xlab = "Test statistic", xlim = range(T_random))
      abline(v = T_obs[m], col = "red")
    }
    
  8. [Group] Explain what the code above does and discuss the results. Here are some points you may consider:

    • How do the two randomization tests compare with each other?
    • How do the randomization tests compare with the normal linear model? How would you interpret their results?
    • The randomization distribution of the second test statistic is clearly not centered at 0. Why?
    • How can you “invert” the randomization tests to obtain an interval estimator of the treatment effect?
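The last point, inverting a randomization test, can be sketched on simulated data. The example below is a minimal illustration under assumptions that are not part of the practical: the data frame sim, the effect size 0.3 and the noise level are all made up, the null hypothesis is a constant additive effect tau on the log scale, and the randomization distribution is approximated by random crossover orders drawn with sample() instead of being enumerated with combinat::permn. It is not the analysis of Haines and coauthors.

```r
set.seed(1)

# Simulated stepped-wedge data (hypothetical): 6 wards, calendar months 1-8,
# 20 patients per ward and month. Each ward crosses over in a distinct month.
cross_over_realized <- sample(2:7)
sim <- expand.grid(index_ward = 1:6, sw_step = 1:8, patient = 1:20)
sim$treatment_status <- as.numeric(
  sim$sw_step >= cross_over_realized[sim$index_ward])
true_effect <- 0.3  # assumed constant effect on the log scale
sim$log_acute_los <- 1 + 0.1 * sim$index_ward +
  true_effect * sim$treatment_status + rnorm(nrow(sim), sd = 0.5)

# Test statistic: ward-adjusted regression coefficient of the treatment.
get_statistic <- function(cross_over, y, index_ward, sw_step) {
  treatment_status <- as.numeric(sw_step >= cross_over[index_ward])
  unname(coef(lm(y ~ treatment_status + as.factor(index_ward)))[2])
}

# Random crossover orders (one per column) approximating the randomization
# distribution; the same draws are reused for every value of tau.
n_draws <- 199
orders <- replicate(n_draws, sample(2:7))

# Randomization p-value for H0: Y(1) - Y(0) = tau for every patient.
# Under H0 the adjusted outcome y - tau * treatment equals Y(0), which does
# not depend on the crossover order, so H0 reduces to a null of no effect.
p_value <- function(tau) {
  y0 <- sim$log_acute_los - tau * sim$treatment_status
  t_obs <- get_statistic(cross_over_realized, y0, sim$index_ward, sim$sw_step)
  t_rand <- apply(orders, 2, get_statistic, y = y0,
                  index_ward = sim$index_ward, sw_step = sim$sw_step)
  (1 + sum(abs(t_rand) >= abs(t_obs))) / (1 + n_draws)
}

# Interval estimator: all values of tau that are not rejected at level 0.05.
tau_grid <- seq(-0.2, 0.8, by = 0.05)
p_values <- sapply(tau_grid, p_value)
interval <- range(tau_grid[p_values > 0.05])
interval
```

The values of tau that are not rejected form the interval; a finer grid or more random orders gives a sharper approximation. Comparing absolute values of the statistic is one possible two-sided rule, chosen here only for simplicity.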

[Group] Causal diagrams and causal identification

In this group exercise, we will read the article titled “A Note on Posttreatment Selection in Studying Racial Discrimination in Policing”.

  1. Read the section “Review”. Using Figure 1, explain the causal inference problem under investigation. Why do the authors say “the naive treatment effect \(\Delta\) [in Equation 1] can be quite misleading when used to represent the causal effect of race on police violence”? Hint: \(M\) is a collider.
  2. Use your own words to explain Assumption 1.

You may skip the section “Average treatment effects conditional on the mediator”.

  3. Read the first half of the section “A new estimator for the causal risk ratio”, then use your own words to explain Equation 3. Can we use the police admin data to estimate the “bias factor” in this equation?
  4. Read the first three paragraphs in “A reanalysis of the NYPD stop-and-frisk dataset”, then use your own words to explain the results in Table 1. Use Equation 3 and Figure 2 to explain the large discrepancy between the naive and adjusted estimators in Table 1.
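The hint that \(M\) is a collider can be explored with a small simulation. The sketch below is only an illustration under made-up assumptions: the variable names d, u, m, y and every coefficient are hypothetical, race is drawn at random, and it has no direct effect on force given a stop. It is not the model or the data used in the article.

```r
set.seed(2022)
n <- 2e5

# Hypothetical data-generating process; every coefficient is made up.
d <- rbinom(n, 1, 0.5)                     # D: indicator of minority group
u <- rnorm(n)                              # U: unmeasured factor behind stop and force
m <- rbinom(n, 1, plogis(-2 + 2 * d + u))  # M: stopped by police
y <- m * rbinom(n, 1, plogis(-1 + u))      # Y: force, only possible after a stop

# Naive contrast among stopped individuals (conditioning on the collider M).
naive <- mean(y[d == 1 & m == 1]) - mean(y[d == 0 & m == 1])

# Contrast in the whole population; available here only because the simulation
# records everyone, which records of stops alone would not.
total <- mean(y[d == 1]) - mean(y[d == 0])

c(naive = naive, total = total)
```

In this simulated setting the minority group is stopped at lower values of u, so the stopped minority individuals have a lower average u than the stopped majority individuals. The naive contrast is therefore negative, even though belonging to the minority group can only increase the risk of force (through more stops), which is what the population contrast shows.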

Causal Inference with Observational Data: Common Designs and Statistical Methods

Course description

Observational studies are non-interventional empirical investigations of causal effects and are playing an increasingly vital role in healthcare decision making in the era of data science. The study design is particularly important in planning observational studies due to the lack of randomization. Aspects of design include defining the objectives and context under investigation, collecting the right data, and choosing suitable strategies to remove bias from measured and unmeasured confounders. Statistical analysis should also align with the design.

This module covers key concepts and useful methods for designing and analyzing observational studies. The first part of the module will focus on matching and weighting methods for cohort and case-control studies for causal inference. Specific topics include basic tools of matching and weighting, randomization inference, and sensitivity analysis. The second part of the module will focus on methods to address unmeasured confounding via causal exclusion. Specific topics include instrumental variables, negative controls, and difference-in-differences. Participants will also gain practical experience by applying these methods to real datasets using R.

Target audiences for this module are:

  1. clinical researchers who need to use observational data to generate evidence of causality;
  2. biostatisticians who are interested in understanding how causal inference can be reliably made in practice.

Background in statistical inference and some knowledge of R are recommended.

General information

  • Instructors: Ting Ye, Qingyuan Zhao.
  • Teaching assistant: Marlena Bannick.
  • Time: July 25–27, 2022.
  • SISCER page.
  • You should have access to the Slack channel for this module. If not, please contact us.
  • Lectures will be delivered via Zoom and be recorded. The recordings will be posted on the course website when they are available. Practical sessions will not be recorded.

Teaching materials

Computing environment

Before the module starts, please ensure that you have installed the latest version of R. We also recommend using an integrated development environment such as RStudio.

Project

IV & MR (project page) Instrumental-Variables, Mendelian-Randomization

COVID-19 (project page) Infectious-Diseases

Randomization (project page) Randomization

A related page is Instrumental Variables and Mendelian randomization.

Students

Postdocs

Ph.D. students

  • Matt Tudball (co-supervision): 2018–2023.
  • Tobias Freidling: 2020–2025.
  • Joakim Blach Andersen: 2021–2025.
  • Max Zhu: 2022–
  • Martina Scauda: 2023–

Undergraduate students

  • Etaash Katiyar: July–September, 2020.
  • Naomi Wei: July–September, 2021.
  • Thalia Seale: July–September, 2021.
  • Junshi Wang: June–August, 2022.
  • Timothée Foutot: April–July, 2023.

Useful links

Maths & Stats

Computing