Discussion on (quasi-)randomization inference

2023-01-10

The following is an email exchange with Hyunseung Kang on the nature of randomization inference. This was sparked by a couple of papers by David Freedman and David Lane: A Nonstochastic Interpretation of Reported Significance Levels; Significance Testing in a Nonstochastic Setting.

Disclaimer: I removed some of the irrelevant discussion. Thank you Hyunseung for kindly agreeing to post our discussion here.

Email 1

From: Qingyuan Zhao <qyzhao@statslab.cam.ac.uk> Sent: Monday, December 12, 2022 3:58 PM To: HYUNSEUNG KANG <hkang84@wisc.edu> Subject: LD/Fisher

Hi Hyunseung,

I had a look at Freedman-Lane. Very interesting. Fisher was often quite vague and his ideas sometimes have different interpretations. According to his daughter Joan Fisher Box (and I guess you must know she was at Madison; actually is she still alive?), Fisher initially used the geometry generated by randomization to justify his ANOVA. This seems quite similar to what Freedman-Lane suggest and it’s unclear if they knew it. You might be interested in this post which summarizes the talk I gave at the Fisher conference earlier this year: http://www.statslab.cam.ac.uk/post/randomization-origin/.

Have you seen this review paper I wrote about randomization tests? https://arxiv.org/pdf/2203.10980.pdf. I think you know most if not all the technical ideas there, but I would like to ask your opinion on one thing. What do you think about the name “quasi-randomization test” for a test that is algorithmically equivalent to a randomization test (like what Freedman-Lane suggest) but actually does not involve any randomization? I think a lot of the confusion about randomization tests just arises from poor (or a lack of) terminology. Using the prefix “quasi” seems useful to clarify a lot of things.

Best wishes,

Qingyuan

Email 2

From: HYUNSEUNG KANG <hkang84@wisc.edu> Date: Tuesday, December 13, 2022 at 1:13 PM To: Qingyuan Zhao <qyzhao@statslab.cam.ac.uk> Subject: Re: LD/Fisher

Hi Qingyuan,

I wanted to share my (admittedly long) thoughts with you about the quasi-randomization question you asked and my brief re-reading of your excellent paper on this.

I personally think Freedman and Lane’s example with “randomizing” gender makes the distinction between the classic randomization test and your quasi-randomization tests clearer than the example you have on pages 2/3 on RCT vs observational study. Btw, there is also a more precise, “technical” angle/example to delineate randomization and permutation tests in Lehmann and Romano’s Testing Statistical Hypotheses book (chapters on 2 by 2 contingency tables, of all places); this book also cites Freedman and Lane, along with other papers.

For me, after reading Freedman and Lane and going down the (very, very deep) rabbit hole on this a few months ago, I agree with your final point that both randomization and quasi-randomization are permutation tests that give the exact same p-values/rejection regions. But they are based on different assumptions and thus carry different meanings. Also, deviating from your thoughts in the paper, I tend to see p-values from RCTs and observational studies as those based on “randomization tests”, as they still deal with the effect of a theoretically manipulable treatment and the statistical uncertainty arises from the assumed null distribution about this manipulable treatment.

Personally, I would define quasi-randomization tests (or I would call them “tests of unusualness” or “tests of aberrance” or “tests of fishiness”) as tests where the treatment (or any variable, for that matter) cannot be physically manipulated, so that assuming a “null” distribution about this treatment amounts to a conceptually (but not mathematically) implausible conditionally independent null distribution. For example, say I want to test that there is no race effect on hiring for Google and I’m going to test this by permuting black/white within pairs of candidates that look very similar to each other. This null is mathematically well-defined, but conceptually implausible since permuting race is a lot more difficult to grasp than permuting treatment/placebo. You kind of hinted at this in your observational study example where the two units that are paired may actually be different in some future characteristic that may actually be related to race. Instead, quasi-randomization tests create extra randomness that is conceptually plausible/interpretable and this randomness is designed to measure the “unusualness” of the observed value; in other words, quasi-randomization tests create a “control group of numbers” to compare your observed number against. Critically, this set of numbers is often based on permuting the entire data of the individual (not just the treatment itself).

As a concrete example, in the hiring example again, I can frame the testing question as whether the observed race effect is “unusual” or “aberrant” instead of “statistically significant” by comparing the observed effect to a set of numbers generated by

(a) looking at their shoe size and assigning individual \(1\) in a pair to “large feet”/“small feet” if individual \(1\) has larger/smaller feet than individual \(2\)
(b) computing the difference in hiring between the individual with “large feet” versus “small feet”

Shoe size should not affect a candidate’s hiring and thus should have a null/zero effect on hiring (this is akin to a negative control treatment). I can repeat (a) with other characteristics (or things completely irrelevant to the candidate) that I know should not affect hiring, including a random coin toss. Then, the resulting “distribution” is a set of “null” differences of hiring. If the observed race effect is “aberrant” or “unusual” compared to this set (i.e. say, it’s larger, smaller, etc.), then something is fishy about this race-based difference in hiring rates compared to other null-based ways of comparing hiring rates (e.g. difference in hiring based on shoe size, difference in hiring based on whether candidate 1 got heads in a coin toss, etc.). In fact, if you replace (a) with a random process like tossing a coin, what I described to you is the permutation test.
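To make steps (a) and (b) concrete, here is a minimal Python sketch added for illustration. It is not code from the exchange: the hiring outcomes, shoe sizes and all function names are made up, and the reference set is taken to be every possible way of swapping the two labels within each pair, which is what a coin toss per pair would generate.

```python
from itertools import product

def paired_differences(pairs):
    """Hiring difference (first minus second candidate) within each pair."""
    return [a - b for a, b in pairs]

def relabel_by(pairs, feature):
    """Step (a): orient each pair by an irrelevant feature, larger value first."""
    return [(a, b) if fa >= fb else (b, a)
            for (a, b), (fa, fb) in zip(pairs, feature)]

def sign_flip_reference_set(diffs):
    """All 2^n mean differences obtained by swapping labels within pairs."""
    n = len(diffs)
    return [sum(s * d for s, d in zip(signs, diffs)) / n
            for signs in product((1, -1), repeat=n)]

def unusualness(observed, reference):
    """Share of reference values at least as extreme as the observed one."""
    return sum(abs(r) >= abs(observed) - 1e-12 for r in reference) / len(reference)

# Hypothetical data: hired (1) or not (0) for the two candidates in each pair,
# with the pair oriented by the attribute of interest.
hiring = [(1, 0), (1, 0), (1, 1), (1, 0), (0, 0), (1, 0), (1, 0), (0, 1)]
# Hypothetical shoe sizes for the same candidates, in the same order.
shoe_sizes = [(42, 44), (45, 41), (43, 43), (40, 46),
              (44, 42), (41, 45), (46, 40), (42, 43)]

diffs = paired_differences(hiring)
observed = sum(diffs) / len(diffs)

# Step (b) for the shoe-size labelling: one "control number".
shoe_diff = sum(paired_differences(relabel_by(hiring, shoe_sizes))) / len(hiring)

# The full set of control numbers and the resulting measure of unusualness.
reference = sign_flip_reference_set(diffs)
p_value = unusualness(observed, reference)

print(observed, shoe_diff, p_value)
```

The shoe-size difference is just one element of the reference set, and the final proportion is computed the same way whether the label swaps are read as coin tosses or as a list of irrelevant relabellings. Exact enumeration is only feasible for a small number of pairs; with more pairs one would sample swaps at random instead.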

Also, while I’m not 100% sure, I think this approach to thinking about the null distribution and computing a p-value subtly avoids the sharp/weak null distinction, even though its p-value is identical to the one we would get had we assumed a sharp null and done a “quasi-randomization” or randomization test as you defined it. That is, you can arrive at the same p-value under different assumptions about stochasticity (or non-stochasticity) of your data. More broadly, from a conceptual perspective, I think what I’m conjecturing in this paragraph is equivalent to interpreting an OLS estimate as an estimate of a parameter of a linear model or an estimate of the best L2 projection of Y onto the column space of X. The OLS estimator’s value is the same under both setups, but its interpretation differs because of the different assumptions you make about the X~Y relationship.

In general, after all of this reading, I now like to think of a permutation test (or any test) as some stochastic (or non-stochastic) approach to generating a set of control numbers against which we can compare our observed number. Different assumptions on sampling, population, exchangeability, sharp/weak/partially sharp hypotheses create different sets of control numbers, and the substantive debate for practitioners is to ask which assumption that generated these control numbers is conceptually easier from the perspective of interpreting “unusualness” of the observed number.
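The OLS analogy can be checked numerically. The following Python sketch is an added illustration with made-up numbers and hypothetical names: it computes the usual least-squares intercept and slope for one regressor, then verifies the projection reading, namely that the residual is orthogonal to both columns of X (the constant and x). No model for Y is assumed anywhere in the computation.

```python
def ols_simple(x, y):
    """Least-squares intercept and slope for a single regressor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, chosen only for illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

intercept, slope = ols_simple(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Projection reading: the residual is orthogonal to each column of X.
orth_const = sum(residuals)
orth_x = sum(r * xi for r, xi in zip(residuals, x))

print(intercept, slope, orth_const, orth_x)
```

The same two numbers can be reported as estimates of the parameters of a linear model or as the coefficients of the best L2 approximation of y by a linear function of x; only the accompanying assumptions, and hence the interpretation, differ.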

Happy to discuss this more and my exploration of this literature. My student and I are currently writing a paper on this, w.r.t. how algorithmic fairness (or fairness in law) is statistically assessed/interpreted and if you are interested, we can share our draft with you once completed.

Best,

-Hyunseung

Email 3

From: Qingyuan Zhao <qyzhao@statslab.cam.ac.uk> Sent: Tuesday, December 13, 2022 3:27 PM To: HYUNSEUNG KANG <hkang84@wisc.edu> Subject: Re: LD/Fisher

Hi Hyunseung,

Thanks for sharing your thoughts on this.

The name “quasi-randomization test” was inspired by the term “quasi-experiment” used in social science and coined by Donald Campbell (which, I think, basically means observational studies in our community). The question is: Is it the right name for the thought experiment you described (which reads similar to Freedman-Lane and what they quoted about Fisher)?

The most notable feature in your example is that there is nothing random; this is also what Freedman and Lane emphasized. In other words, there is no probability space that can be naturally defined. I wanted to say this is because we don’t know what randomizing race means in the real world, but that’s not exactly true, because permuting race and insisting that the p-value of “aberrance” has any practical implication means that we have some model of race in mind. So if you would like to convince me that this test says anything about the race effect on hiring for Google, you need to first convince me that the model implied by taking permutations is reasonable.

So, the beauty of randomization is that you no longer need to convince me that the model implied by taking permutations is reasonable. This is especially true if you tell me that someone else (who I trust) performed the randomization, so there is little room for colluding.

The tension here is that this model implied by permuting race is not obviously probabilistic: it is unclear what the right probability space is. Or maybe the more precise way to put this is: there is no probabilistic model that can be universally accepted, because race seems to depend on numerous random events that happened to our ancestors. I guess gender is quite different: it is essentially randomized at every birth (although the sex ratio differs slightly from 1:1). So permuting gender seems quite reasonable, but permuting gender while holding other gender-related things fixed is not. (An extreme example is to permute gender but hold fixed whether they go to an all-boys or all-girls school.)

Maybe the solution lies in interpreting probability as a degree of plausibility/belief rather than a measure of some natural phenomenon. This is related to the book by E T Jaynes that I am reading; in the last section that I read he was talking about ESP (extrasensory perception) and I was amazed by how this way of reasoning resolves some conflicting thoughts I have had. (Digression: I found it extremely interesting when I learned that randomized experiments were first used by psychologists in the 1880s to study ESP, before Fisher was born.) I will need to read more of the book to consider this possibility seriously.

But I guess this at least makes the role of randomization much clearer. By using randomization, the basis of inference is our understanding of the device of randomization. We just need to believe that drawing balls from an urn is accurately described by the binomial distribution. So in this sense randomization makes probability (as a measure of plausibility) objective. I have had the feeling for a while that randomization inference is somewhere in between Bayesian and frequentist. This seems to support that.

Fisher’s stance on this seems quite unclear: of course he did a lot of frequentist stuff and often talked about drawing a sample from an infinite population. That seems quite frequentist. But he of course introduced randomization and made it crystal clear that it provides the physical basis of inference. He strongly objected to Bayesian inference but his fiducial argument just feels like a poor man’s Bayesian (although admittedly I don’t understand much about fiducial inference).

The last (and probably quite controversial) point I want to make is: the randomization inference for observational studies (I am thinking about Paul Rosenbaum’s writing) is in its essence a quasi-randomization inference. There are two reasons. First, there is no physical randomization involved, so calling it randomization inference is in some sense cheating. Second, there is an inconsistency in the theory from the frequentist perspective: the inference conditions on matching, but if we have another draw of the treatment variables, the matching result will be different. I’ve had a discussion with Sam Pimentel about this and he said Paul was not very worried about this. Now I think the right way to look at it is to think about the conditional probability given the matching result, not in terms of the sampling distribution from a hypothetical population, but as a form of plausible reasoning like what you described in the race example. If our domain expert is willing to view the subjects within a matched group as nearly exchangeable, then the (quasi-)randomization test would make sense to that domain expert. The basis of inference is our domain expert’s belief about exchangeability, not some i.i.d. sampling assumption from a hypothetical population.
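As an added illustration of conditioning on the matching result, here is a minimal Python sketch under stated assumptions; it is not taken from the exchange or from any of the papers mentioned. It assumes each matched set contains exactly one treated subject, treats every subject in a set as equally plausible to have been the treated one (the exchangeability judgement), and uses a hypothetical test statistic and made-up outcomes.

```python
from itertools import product

def set_contrast(outcomes, treated):
    """Treated outcome minus the mean outcome of the other subjects in the set."""
    others = [y for i, y in enumerate(outcomes) if i != treated]
    return outcomes[treated] - sum(others) / len(others)

def statistic(matched_sets, assignment):
    """Average within-set contrast for a given choice of treated subjects."""
    return sum(set_contrast(y, t)
               for y, t in zip(matched_sets, assignment)) / len(matched_sets)

# Hypothetical outcomes for three matched sets (the matching itself is held fixed).
matched_sets = [[5.0, 3.0, 2.0], [4.0, 4.0], [1.0, 6.0, 2.0]]
# Index of the subject that actually received treatment in each set.
observed_assignment = (0, 0, 1)

observed = statistic(matched_sets, observed_assignment)

# Reference set: every way of picking one "treated" subject per matched set.
reference = [statistic(matched_sets, a)
             for a in product(*(range(len(y)) for y in matched_sets))]

# One-sided proportion of reference values at least as large as the observed one.
p_value = sum(r >= observed - 1e-12 for r in reference) / len(reference)

print(observed, len(reference), p_value)
```

Nothing in the computation refers to sampling from a population: the only input beyond the data is the judgement that subjects within a matched set are interchangeable, which is the sense in which the resulting number rests on the domain expert's belief.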

All in all, I still think “quasi-randomization” is not a bad term to add to our vocabulary. The discussion above seems to indicate that it is closer to Bayesian thinking/subjective probability, but interestingly most people who use it (not just Paul Rosenbaum, Lehmann-Romano, but also the people who do conformal inference for example) would not consider themselves as Bayesians. What’s David Freedman’s stance on this based on your reading? I know he did a lot of work on Bayesian statistics initially but his later writing seems to be very frequentist oriented.

Best wishes,

Qingyuan

Email 4

From: HYUNSEUNG KANG <hkang84@wisc.edu> Sent: Thursday, December 15, 2022 10:32:04 PM To: Qingyuan Zhao <qyzhao@statslab.cam.ac.uk> Subject: Re: LD/Fisher

Hi Qingyuan,

Thanks for your detailed thoughts on this! Your thoughts on different probabilistic (frequentist or Bayesian) models for race (or other attributes) made things clear for me; this was something that was not present in Freedman’s earlier papers that I have read and my impression is that he seemed to have gotten more frequentist in his later years with bootstrapping and re-sampling methods. I also think your terminology makes more sense in light of your argument, especially the point you raised here:

If our domain expert is willing to view the subjects within a matched group as nearly exchangeable, then the (quasi-)randomization test would make sense to that domain expert. The basis of inference is our domain expert’s belief about exchangeability, not some i.i.d. sampling assumption from a hypothetical population.

I think the part about exchangeability and i.i.d. sampling is a great point. Also, I should have thought more clearly about the Bayesian/frequentist analogy mentioned in your e-mail. I’m actually wondering whether I need to read even further back than David Freedman’s work to better understand how randomization inference could be classified.

As for David Freedman’s perspective, while I haven’t read all of his papers/opinions on randomization inference and/or permutation tests, I think he would probably be more inclined to describe what you call “quasi-randomization” as “as-if randomized”, a phrase that, in general, is more frequently used in the context of a natural experiment. I’ve also seen Paul sometimes use this terminology/phrasing to describe exchangeability within pairs, but I can’t seem to pinpoint a paper where he actually started using this phrasing. Not sure if this is helpful, though…

In any case, I’ll also send any papers that may be relevant to what you’re thinking about. Happy Christmas!