I am a Professor of Statistics in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics (DPMMS) at the University of Cambridge, a Fellow of Corpus Christi College, and an Associate Faculty member of the Cambridge Centre for AI in Medicine (CCAIM).
My research interests lie primarily in drawing scientific conclusions about causal relationships using experimental and observational data, a fast-growing area known as “causal inference”. More broadly, I would like to understand how “design”—a principle I view as fundamental yet elusive in statistics—shapes the practice of statistical applications in biomedical and social sciences.
Click here for a bio-sketch in the third person narrative.
PhD in Statistics, 2016
Stanford University
BSc in Mathematics, 2011
University of Science and Technology of China (USTC)
Qingyuan was born and raised in Wuhan in central China, a city known for its many lakes and rivers and its rich cultural heritage. After high school, he joined the Special Class for the Gifted Young at the University of Science and Technology of China, where he majored in mathematics. He then went to Stanford University for postgraduate studies and obtained his PhD in Statistics in 2016. He spent three years at the Wharton School of the University of Pennsylvania as a postdoctoral fellow before joining the Statistical Laboratory at the University of Cambridge as a University Lecturer in 2019. He was promoted to Professor of Statistics at the University of Cambridge in 2024.
Qingyuan’s research interests lie primarily in drawing scientific conclusions about causal relationships using experimental and observational data, a fast-growing area known as “causal inference”. More broadly, he strives to understand how “design”—a principle he views as fundamental yet elusive in statistics—shapes the practice of statistical applications in biomedical and social sciences.
This is a 16-lecture course on causal inference, the statistical science of drawing causal conclusions from experimental and non-experimental data.
Time: Tuesday & Thursday, 12-1pm.
Location: MR14.
Office hour: By appointment.
Prerequisites:
Lecture notes will be provided after each lecture. I am not quite ready to make the lecture notes public, so email me if you miss the first lecture and need the username and password to access the notes.
Chapter | Topic | Last updated | Additional materials |
---|---|---|---|
Part I | Motivations | ||
1 | Principles of causal inference | ||
2 | Randomised experiments | | 2019 Nobel Prize in Economics |
3 | Linear structural equation models | | Excerpt from a psychology paper |
Part II | Languages for causality | ||
4 | Probabilistic graphical models | ||
5 | Nonparametric structural equations and counterfactuals | ||
6 | Causal identification | ||
Part III | Statistical methods | ||
7 | Matching and randomisation inference | ||
8 | Semiparametric inference for average treatment effects | ||
9 | Instrumental variables | ||
10 | Regression discontinuity design | ||
11 | Negative control | ||
12 | Mediation analysis | ||
Parts I–III | Full notes (Chapters 1–12) |
The full notes include corrections and clarifications to the in-class notes, and solutions or hints to some exercises.
Causal Inference for Statistics, Social, and Biomedical Sciences by Guido Imbens and Donald Rubin.
Causality: Models, Reasoning, and Inference by Judea Pearl.
Statistical Models: Theory and Practice by David Freedman.
Graphical Models by Steffen Lauritzen.
Observational Studies by Paul Rosenbaum.
Causal Inference by Miguel Hernán and James Robins.
Mostly Harmless Econometrics: An Empiricist’s Companion by Joshua Angrist and Jörn-Steffen Pischke.
Linear models: A useful “microscope” for causal analysis by Judea Pearl.
Single World Intervention Graphs by Thomas Richardson and James Robins.
Time | Sheet | Last updated | Solution |
---|---|---|---|
Example Sheet 1; Dataset; SEM paper | Mostly in the full lecture notes; R code for Q5 | ||
Example Sheet 2 | See the full lecture notes. | ||
Example Sheet 3 | See the full lecture notes. | ||
Reading materials; See below | |||
Revision | Revision notes; Video recording. |
The 4th example class will be interactive. The lecturer will provide a list of applied articles (in social sciences, public health, and other areas) before the 3rd example class. Each student will then be asked to pick an article and give a short presentation in the final example class.
It is intended that all students wishing to take the exam of this course can participate in this example class. Please let me know if you have trouble attending this class. More information will be provided during the Michaelmas term.
A tentative list of applied articles (being updated):
Political intolerance and political repression during the McCarthy Red Scare (reprinted in David Freedman’s book, page 315–342) and Freedman’s comments (Section 6.3).
Finishing high school and starting college: Do Catholic schools make a difference (reprinted in David Freedman’s book, page 343–376) and Freedman’s comments (Section 7.4).
Education and fertility: Implications for the roles women occupy (reprinted in David Freedman’s book, page 377–401) and Freedman’s comments (Section 9.5).
Institutional arrangements and the creation of social capital: The effects of public school choice (reprinted in David Freedman’s book, page 402–430) and Freedman’s comments (Section 9.7).
A new perspective on John Snow’s communicable disease theory (a short version can be found in David Freedman’s book, Section 1.3) and Sociomedical indicators in the cholera epidemic in Ferrara of 1855 (also this editorial and commentary). These articles may be suitable for two presenters, one focusing on John Snow’s analysis and one on the Ferrara study.
Thomas Cook’s commentary to Cochran (page 140–163, 1st Volume of Observational Studies) has three examples demonstrating the limitations of observational studies that seek to mimic randomised experiments (Page 146: Issue 1; Page 153: Issue 2; Page 157: Issue 3). This article may be suitable for up to three presenters.
Triangulation in aetiological epidemiology has three illustrative examples for corroboration of evidence from different causal inference approaches. Some of the results are in the supplementary materials and can be downloaded from the IJE website. This article may be suitable for up to three presenters.
Global warming is anthropogenic (Section 3.1 of this book by Fred Bookstein).
This course consists of 16 lectures and 8 practical sessions. It complements the Part II Principles of Statistics, but takes a more applied perspective.
Course schedule (page 25).
Prerequisites: Part IB Statistics.
Due to the pandemic, all the lectures and practicals will be online. Videos and handwritten notes will be made available through this webpage. I won’t be providing LaTeX notes but will try to follow the notation in last year’s lecture notes.
Please email me or leave a comment below if you find any mistakes or have any questions.
Recordings can be found in Moodle.
Number | Date | Topic | Optional Reading |
---|---|---|---|
Part 1 | Linear models | ||
L1 | Scope of the course, least squares and its geometry | Agresti 2.1–2.4 | |
L2 | Gauss-Markov, exact inference under normality | Agresti 2.7, 3.1–3.3 | |
L3 | Heteroscedasticity, diagnostics, robust regression | Agresti 2.5 | |
L4 | Model misspecification, bias-variance tradeoff | ISLR 2.2.1–2.2.2 | |
L5 | Simpson’s paradox, model selection | ISLR 6.1–6.2 | |
L6 | Review of linear models, likelihood asymptotics | 2019 notes 2.5.1–2.5.3 | |
L7 | Delta method; From LM to GLM | Common distributions | |
L8 | Properties of exponential families | Efron’s notes on empirical Bayes | |
L9 | Conjugate priors, MLE, Deviance | ||
L10 | Deviance residuals, exponential dispersion families | 2019 notes 2.3 | |
L11 | GLMs: MLE | Agresti 4.1–4.3 | |
L12 | GLMs: Analysis of deviance, computation | Agresti 4.4–4.5 | |
L13 | GLMs: Model selection, diagnostics, binomial models | Agresti 4.4, 4.6 | |
L14 | Poisson GLMs; Multinomial model and Poisson trick | Agresti 7.1, 7.2 | |
L15 | Contingency tables and independence | ||
L16 | Review and look forward |
Full notes (to L12): PDF (181MB); GoodNotes (146MB);
Number | Date | Topic | Optional Reading |
---|---|---|---|
P1 | Basic R; Solution | CRAN Intro to R 1,2,5,8 | |
P2 | Writing functions, linear models; Code; Solution | CRAN Intro to R 6, 10 | |
P3 | Linear models; Code | CRAN Intro to R 11.1–3 | |
P4 | Model selection; Code; Solution | ||
P5 | ANOVA and ANCOVA; Code; Solution | 2019 notes 1.2.5 | |
P6 | Binomial GLMs; Code; Solution | ||
P7 | Binomial and Poisson GLMs; Code; Solution | ||
P8 | Contingency tables and Gamma GLMs; Code; Solution | Agresti 4.7 |
Theory for LM and GLM
A. Agresti. Foundations of Linear and Generalized Linear Models. Wiley 2015. [Agresti]
G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning (with Applications in R). Springer 2013. [ISLR]
Last year’s lecture notes.
Prof Richard Weber’s notes for IB Statistics.
Prof Brad Efron’s notes on exponential families (I, II) and generalised linear models (III). (These notes are quite advanced and are only for the most ambitious students.)
R and statistical computing
This is a 16-lecture course on causal inference, the statistical science of drawing causal conclusions from experimental and non-experimental data.
Full Lecture notes (Last updated: December 16, 2020).
Lecture recordings can be found in Moodle.
The following books/articles are optional. I am providing a short (personal) verdict to help you navigate the literature.
Causal Inference for Statistics, Social, and Biomedical Sciences by Guido Imbens and Donald Rubin [IR]. This book provides a gentle introduction to potential outcomes and statistical methods for simple randomised experiments and observational studies with no unmeasured confounders.
Causal Inference: What If by Miguel Hernán and James Robins [HR]. This book provides a comprehensive treatment of causal inference, both without and with models.
Causality: Models, Reasoning, and Inference by Judea Pearl [Pearl]. A great book if you are interested in the philosophical debates in causal inference.
Statistical Models: Theory and Practice by David Freedman. A less technical textbook that is well suited to someone who wants to learn the basic ideas in causal inference through practical examples.
Graphical Models by Steffen Lauritzen. A good reference for probabilistic graphical models.
Observational Studies by Paul Rosenbaum. A good book for randomisation inference and sensitivity analysis.
Mostly Harmless Econometrics: An Empiricist’s Companion by Joshua Angrist and Jörn-Steffen Pischke. A very clearly written book from an applied econometrics point of view, with a lot of useful intuition.
This course consists of 16 lectures and 8 practical sessions. It complements the Part II Principles of Statistics, but takes a more applied perspective.
Prerequisites: Part IB Statistics.
Location: MR5.
Please email me or leave a comment below if you find any mistakes or have any questions.
Lectures will be recorded and the recordings can be found on Moodle.
By Chapter
Combined
Number | Date | Topic | Optional Reading |
---|---|---|---|
P1 | <2021-10-9 Sat> | Basic R; Solution | CRAN Intro to R 1,2,5,8 |
P2 | Writing functions, linear models; Code; Solution | CRAN Intro to R 6, 10 | |
P3 | Linear models; Code | CRAN Intro to R 11.1–3 | |
P4 | Model selection; Code; Solution | ||
P5 | ANOVA and ANCOVA; Code; Solution | 2019 notes 1.2.5 | |
P6 | Binomial GLMs; Code; Solution | ||
P7 | Binomial and Poisson GLMs; Code; Solution | ||
P8 | Contingency tables and Gamma GLMs; Code; Solution | Agresti 4.7 |
Theory for LM and GLM
A. Agresti. Foundations of Linear and Generalized Linear Models. Wiley 2015.
G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning (with Applications in R). Springer 2013.
Prof Richard Weber’s notes for IB Statistics.
R and statistical computing
W. N. Venables, D. M. Smith and the R Core Team. An Introduction to R.
H. Wickham. Advanced R (for anyone who wants to really understand R as a programming language).
This is a 16-lecture course on causal inference, the statistical science of drawing causal conclusions from experimental and non-experimental data.
Office hour: Wednesday at 2pm (if there is no example class) @ CMS, D1.01.
Please email me or leave a comment below if you find any mistakes or have any questions.
Lectures will be recorded and the recordings can be found on Moodle.
The following books/articles are optional. I am providing a short (personal) verdict to help you navigate the literature.
Causal Inference for Statistics, Social, and Biomedical Sciences by Guido Imbens and Donald Rubin [IR]. This book provides a gentle introduction to potential outcomes and statistical methods for simple randomised experiments and observational studies with no unmeasured confounders.
Causal Inference: What If by Miguel Hernán and James Robins [HR]. This book provides a comprehensive treatment of causal inference, both without and with models.
Causality: Models, Reasoning, and Inference by Judea Pearl [Pearl]. A great book if you are interested in the philosophical debates in causal inference.
Statistical Models: Theory and Practice by David Freedman. A less technical textbook that is well suited to someone who wants to learn the basic ideas in causal inference through practical examples.
Graphical Models by Steffen Lauritzen. A good reference for probabilistic graphical models.
Observational Studies by Paul Rosenbaum. A good book for randomisation inference and sensitivity analysis.
Mostly Harmless Econometrics: An Empiricist’s Companion by Joshua Angrist and Jörn-Steffen Pischke. A very clearly written book from an applied econometrics point of view, with a lot of useful intuition.
Click here for the theory slides.
Click here for the practical below in PDF format.
Randomized controlled trials (RCTs) are widely regarded as the “gold standard” for establishing causality. An often forgotten feature of RCTs is that they can be analyzed objectively by a randomization test. Haines and coauthors investigated the impact of disinvestment from weekend allied health services. We will use their dataset to explore the concept of randomization in the design and analysis of an experiment.
[Group] Skim through the abstract and read the section called “Design” of their article. Then answer the following questions: What is the name of the design of the experiment in this study? How was it carried out?
Download the patient-level data, then run the following code in R (you may need to install the `readxl` package first by `install.packages("readxl")`). What does the second line do?
data <- readxl::read_excel("S2 Data.xlsx")
data <- subset(data, hospital == "Dandenong" & study1 == 1)
Unfortunately, this dataset is not very well annotated. The columns `index_ward` and `sw_step` contain the identifiers for hospital ward and time step (in calendar month), respectively. In which order do you think the wards crossed over to no weekend health services? You may find the following R code useful.
table(data[, c("index_ward", "sw_step", "no_we_exposure")])
Construct a vector called `cross_over_realized` that contains the calendar month in which each of the \(6\) hospital wards crossed over. Then use the following code to define the treatment and outcome of interest (“los” is short for length of stay).
data$treatment_status <- as.numeric(data$sw_step >= cross_over_realized[data$index_ward])
data$log_acute_los <- log(data$acute_los)
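The indexing `cross_over_realized[data$index_ward]` may be easier to read on a toy example. The sketch below is not based on the trial data: the data frame `toy`, the vector `toy_cross_over` and the crossover months are made up, and it assumes (as the indexing above implicitly does) that wards are labelled by consecutive integers starting from 1.

```r
# Toy stepped-wedge layout: 3 hypothetical wards observed in months 1 to 4.
toy <- expand.grid(index_ward = 1:3, sw_step = 1:4)

# Hypothetical crossover months: ward 1 crosses over in month 2,
# ward 2 in month 3 and ward 3 in month 4.
toy_cross_over <- c(2, 3, 4)

# Look up each row's crossover month by ward, then compare with its time step.
toy$treatment_status <- as.numeric(toy$sw_step >= toy_cross_over[toy$index_ward])

# Rows are wards, columns are months; 0 before crossover, 1 from crossover on.
xtabs(treatment_status ~ index_ward + sw_step, toy)
```

The resulting table shows the staircase pattern that gives the design its name.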
[Group] Execute the following code in your R session. Then comment on the two interval estimators of the treatment effect (of no weekend health services on log length of stay).
confint(lm(log_acute_los ~ treatment_status, data))
confint(lm(log_acute_los ~ treatment_status + as.factor(index_ward), data))
[Group] Next, we explore the randomization analysis of this dataset. First, use potential outcomes to define the null hypothesis that stopping weekend health services has no effect whatsoever. Notice that the treatment is not the same as the variable randomized in the experiment (crossover order). What assumption did you make when defining your null hypothesis? Give an example in which this assumption is not satisfied.
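One possible way to set up the notation for this question is sketched below. It is only a suggestion: the symbols \(Z\), \(w_i\), \(t_i\) and \(A_i\) are not defined in the practical and are introduced here as assumptions, and other formalisations are equally valid.

```latex
% Assumed notation (not from the original practical):
%   z = (z_1, ..., z_6): a crossover schedule, z_w = crossover month of ward w
%   w_i, t_i: ward and calendar month of admission of patient i
%   A_i(z) = 1{ t_i >= z_{w_i} }: exposure of patient i under schedule z
%   Y_i(.) is written both as a function of z and, with a slight abuse of
%   notation, as a function of the exposure a in {0, 1}.
\begin{align*}
  \text{Assumption:}\quad
    & Y_i(z) = Y_i\bigl(A_i(z)\bigr)
      \quad \text{for all schedules } z \text{ and all patients } i, \\
  \text{Sharp null } H_0:\quad
    & Y_i(0) = Y_i(1)
      \quad \text{for all patients } i.
\end{align*}
```

Under this reading, the assumption says that a patient's outcome depends on the randomized schedule only through that patient's own exposure. Combined with \(H_0\), it implies that the observed outcomes would have been the same under every possible schedule.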
Read the following code, then execute it in your R session (you may need to install the package `combinat`, which contains a function `permn` that generates all the permutations of a vector). For your reference, the expected output is included.
get_statistic <- function(index_ward,
                          sw_step,
                          log_acute_los,
                          cross_over) {
  treatment_status <- sw_step >= cross_over[index_ward]
  c(lm(log_acute_los ~ treatment_status)$coef[2],
    lm(log_acute_los ~ treatment_status + as.factor(index_ward))$coef[2])
}
T_obs <- get_statistic(data$index_ward, data$sw_step,
                       data$log_acute_los, cross_over_realized)
T_random <- sapply(combinat::permn(2:7),
                   get_statistic,
                   index_ward = data$index_ward,
                   sw_step = data$sw_step,
                   log_acute_los = data$log_acute_los)
par(mfrow = c(2, 1))
for (m in 1:2) {
  hist(T_random[m, ], 20,
       main = paste0("Randomization distribution (model ", m, "): ",
                     "p-value = ", signif(mean(T_random[m, ] >= T_obs[m]), 2)),
       xlab = "Test statistic", xlim = range(T_random))
  abline(v = T_obs[m], col = "red")
}
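To see the logic of a randomization test without the trial data, here is a minimal, self-contained sketch on simulated data. Everything in it is hypothetical (the data frame `sim`, the helper `stat`, the sample sizes and the seed). Unlike the code above, it draws random crossover schedules with `sample()` instead of enumerating all permutations with `combinat::permn`, so it is a Monte Carlo approximation that uses only base R.

```r
set.seed(1)

# Simulated stepped-wedge trial (all values hypothetical):
# 6 wards, calendar months 1 to 8, 10 patients per ward-month.
sim <- expand.grid(index_ward = 1:6, sw_step = 1:8, patient = 1:10)
ward_baseline <- rnorm(6)
# Outcomes are generated with no treatment effect, so the sharp null holds.
sim$y <- ward_baseline[sim$index_ward] + rnorm(nrow(sim))

# Ward-adjusted coefficient of treatment under a given crossover schedule.
stat <- function(cross_over, d) {
  treated <- as.numeric(d$sw_step >= cross_over[d$index_ward])
  unname(coef(lm(y ~ treated + factor(index_ward), data = d))["treated"])
}

# The schedule "actually" randomized in this simulated experiment.
cross_over_obs <- sample(2:7)
T_obs <- stat(cross_over_obs, sim)

# Monte Carlo draws from the randomization distribution: the outcomes stay
# fixed and only the crossover schedule is re-randomized.
n_draws <- 999
T_rand <- replicate(n_draws, stat(sample(2:7), sim))

# One-sided p-value; the +1 counts the observed schedule itself.
p_value <- (1 + sum(T_rand >= T_obs)) / (1 + n_draws)
p_value
```

Because the simulated outcomes do not depend on the schedule, repeating the simulation with different seeds should give p-values that are roughly uniformly distributed.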
[Group] Explain what the code above does and discuss the results. Here are some points you may consider
In this group exercise, we will read the article titled “A Note on Posttreatment Selection in Studying Racial Discrimination in Policing”.
You may skip the section “Average treatment effects conditional on the mediator”.
Observational studies are non-interventional empirical investigations of causal effects and are playing an increasingly vital role in healthcare decision making in the era of data science. The study design is particularly important in planning observational studies due to the lack of randomization. Aspects of design include defining the objectives and context under investigation, collecting the right data, and choosing suitable strategies to remove bias from measured and unmeasured confounders. Statistical analysis should also align with the design.
This module covers key concepts and useful methods for designing and analyzing observational studies. The first part of the module will focus on matching and weighting methods for cohort and case-control studies for causal inference. Specific topics include basic tools of matching and weighting, randomization inference, and sensitivity analysis. The second part of the module will focus on methods to address unmeasured confounding via causal exclusion. Specific topics include instrumental variables, negative controls, and difference-in-differences. Participants will also gain practical experience by applying these methods to real datasets using R.
Target audiences for this module are:
Background in statistical inference and some knowledge of R are recommended.
Before the module starts, please ensure that you have installed the latest version of R. We also recommend using an integrated development environment such as RStudio.
A related page is Instrumental Variables and Mendelian randomization.