Consider a simple causal claim: “α causes β in γ”. One type of event (say, caffeine after dinner) tends to cause another type of event (disrupted sleep) in a certain range of conditions (among typical North American college students).
Now consider a formal study you could run to test this. You design an intervention: 20 ounces of Peet’s Dark Roast in a white cup, served at 7 p.m. You design a control condition: 20 ounces of Peet’s decaf, served at the same time. You recruit a population: 400 willing undergrads from Bigfoot Dorm, delighted to have free coffee. Finally, you design a measure of disrupted sleep: wearable motion sensors that normally go quiet when a person is sleeping soundly.
You write it up for the campus newspaper: “Caffeine After Dinner Interferes with Sleep among College Students”.
But do you know that?
Of course it’s plausible. And you have excellent internal validity. But to get to a general claim of that sort, from your observation of 400 undergrads, requires further assumptions that we ought to be careful about. What we know, based on considerations of internal validity alone, is that this particular intervention (20 oz. of Peet’s Dark Roast) caused this particular outcome (more motion from 2 to 4 a.m.) the day and place the experiment was performed (Bigfoot Dorm, February 16, 2021). In fact, even calling the intervention “20 oz. of Peet’s Dark Roast” hides some assumptions — for of course, the roast was from a particular batch, brewed in a particular way by a particular person, etc. All you really know based on the methodology, if you’re going to be super conservative, is this: Whatever it is that you did that differed between treatment and control had an effect on whatever it was you measured.
Call whatever it was you did in the treatment condition “A” and whatever it was you did differently in the control condition “-A”. Call whatever it was you measured “B”. And call the conditions, including both the environment and everything that was the same or balanced between treatment and control, “C” (that it was among Bigfoot Dorm students, using white cups, brewed at an average temperature of 195°F, etc.).
What we know, then, is that the probability, p, of B (whatever outcome you measured) was greater given A (whatever you did in the treatment condition) than given -A (whatever you did in the control condition), in C (the exact conditions in which the experiment was performed). In other words:
p(B|A&C) > p(B|-A&C). [Read this as “The probability of B given A and C is greater than the probability of B given not-A and C.”]
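To make the inequality concrete, here is a minimal Python sketch. The counts are invented purely for illustration (they are not results from any study), and the names `Arm` and `effect_observed` are hypothetical. It compares raw proportions only and ignores sampling uncertainty, which a real analysis would have to address.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    """One arm of the experiment, run under the shared conditions C."""
    n: int          # participants in this arm
    n_outcome: int  # participants for whom the measured outcome B occurred

    def rate(self) -> float:
        # Sample estimate of p(B | this arm & C)
        return self.n_outcome / self.n


def effect_observed(treatment: Arm, control: Arm) -> bool:
    # True when the estimate of p(B|A&C) exceeds the estimate of p(B|-A&C)
    return treatment.rate() > control.rate()


# Hypothetical counts, assumed for illustration only
treatment = Arm(n=200, n_outcome=120)  # A: whatever was done in treatment
control = Arm(n=200, n_outcome=80)     # -A: whatever was done in control

print(treatment.rate(), control.rate(), effect_observed(treatment, control))
```

Notice that nothing in the code mentions caffeine, sleep, or college students. It knows only about A, B, and C, which is exactly the point.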
But remember, what you claimed was both more specific and more general than that. You claimed “caffeine after dinner interferes with sleep among college students”. To put it in the Greek-letter format with which we began, you claimed that α (caffeine after dinner) causes β (poor sleep) in γ (among college students, presumably in normal college dining and sleeping contexts in North America, though this was not clearly specified).
What you asserted, then, was not merely the modest sentence
p(B|A&C) > p(B|-A&C)
but rather the more ambitious and specific sentence
p(β|α&γ) > p(β|-α&γ).
In order to get from one to the other, you need to do what Esterling, Brady, and I call causal specification.
You need to establish, or at least make plausible, that α is what mattered about A. You need to establish that it was the caffeine that had the observed effect on B, rather than something else that differed between treatment and control, like tannin levels (which differed slightly between the dark roast and decaf). The internally valid study tells you that the intervention had causal power, but nothing inside the study could possibly tell you what aspect of the intervention had the causal power. It may seem likely, based on your prior knowledge, that it would be the caffeine rather than the tannins or any of the other things that differ between treatment and control (if you’re creative, the list could be endless).
One way to represent this is to say that alongside α (the caffeine) are some presumably inert elements, θ (the tannins, etc.), that also differ between treatment and control. The intervention A is really a bundle of α and θ: A = α&θ. Now substituting α&θ for A, what the internally valid experiment established was
p(B|(α&θ)&C) > p(B|-(α&θ)&C).
If θ is causally inert, with no influence on the measured outcome B, you can drop the θ, thus inferring from the sentence above to
p(B|α&C) > p(B|-α&C).
In this case, you have what Esterling, Brady, and I call construct validity of the cause. You have correctly specified the element that is doing the causal work. It’s not just A as a whole, but α in particular, the caffeine. Of course, you can’t just assert this. You ought to establish it somehow. That’s the process of establishing construct validity of the cause.
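A toy Python sketch may help show why the experiment itself can't settle which part of the bundle did the work. Everything here is an assumption made for illustration: the two "worlds", the function name `p_outcome`, the probabilities 0.4 and 0.2, and the simplification that α and θ are each simply present or absent (in the coffee example the tannin levels only differed slightly).

```python
def p_outcome(alpha: bool, theta: bool, world: str) -> float:
    """Hypothetical probability of the measured outcome B under conditions C.

    world="caffeine": only alpha raises the probability of B.
    world="tannin":   only theta raises the probability of B.
    """
    base, boost = 0.4, 0.2  # made-up numbers, for illustration only
    if world == "caffeine":
        return base + (boost if alpha else 0.0)
    if world == "tannin":
        return base + (boost if theta else 0.0)
    raise ValueError(f"unknown world: {world}")


# The experiment only ever observes the bundle: both present, or both absent.
observed_cells = [(True, True), (False, False)]
for alpha, theta in observed_cells:
    # The two worlds predict identical results in every observed cell.
    print(alpha, theta,
          p_outcome(alpha, theta, "caffeine"),
          p_outcome(alpha, theta, "tannin"))

# The worlds come apart only in a cell the experiment never ran:
# caffeine without the extra tannins.
print(p_outcome(True, False, "caffeine"), p_outcome(True, False, "tannin"))
```

Both worlds fit the data equally well. Telling them apart takes something from outside the study: prior knowledge that θ is inert, or a further experiment that separates α from θ.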
Analogous reasoning applies to the relationship between B (measured motion-sensor outputs) and β (disrupted sleep). If you can establish the right kind of relationship between B and β you can move from a claim about B to a conclusion about β, thus moving from
p(B|α&C) > p(B|-α&C)
to
p(β|α&C) > p(β|-α&C).
If this can be established, you have correctly specified the outcome and have achieved construct validity of the outcome. You’re really measuring disrupted sleep, as you claim to be, rather than something else (like non-disruptive limb movement during sleep).
And finally, if you can establish that the right kind of relationship holds between the actual testing conditions and the conditions to which you generalize (college students in typical North American eating and sleeping environments) — then you can move from C to γ. This will be so if your actual population is representative and the situation isn’t strange. More specifically, since what is “representative” and “strange” depends on what causes what, the specification of γ requires knowing what background conditions are required for α to have its effect on β. If you know that, you can generalize to populations beyond your sample where the relevant conditions γ are present (and refrain from generalizing to cases where the relevant conditions are absent). You can thus substitute γ for C, generating the causal generalization that you had been hoping for from the beginning:
p(β|α&γ) > p(β|-α&γ).
In this way, internal, construct, and external validity fit together. Moving from finite, historically particular data to a general causal claim requires all three. It requires establishing not only internal validity but also construct validity of the cause, construct validity of the outcome, and external validity. Otherwise, you don’t have the well-supported generalization you think you have.
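The substitutions can be summarized in a small Python sketch. The function `licensed_claim` and its parameter names are hypothetical, and it treats the three substitutions as independent switches, which is a simplification of the step-by-step argument above. It only tracks which sentence you are entitled to assert given which kinds of validity you have established.

```python
from typing import Optional


def licensed_claim(internal: bool,
                   construct_cause: bool,
                   construct_outcome: bool,
                   external: bool) -> Optional[str]:
    """Return the strongest sentence supported by the validities established.

    Without internal validity, no causal sentence is licensed at all.
    """
    if not internal:
        return None
    cause = "α" if construct_cause else "A"        # A -> α
    outcome = "β" if construct_outcome else "B"    # B -> β
    context = "γ" if external else "C"             # C -> γ
    return f"p({outcome}|{cause}&{context}) > p({outcome}|-{cause}&{context})"


# Internal validity alone: something caused something.
print(licensed_claim(True, False, False, False))
# All four established: the generalization you wanted.
print(licensed_claim(True, True, True, True))
```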
Although internal validity is often privileged in social scientists’ discussions of causal inference, with internal validity alone, you know only that the particular intervention you made (whatever it was) had the specific effect you measured (whatever that effect amounts to) among the specific population you sampled at the time you ran the study. You know only that something caused something. You don’t know what causes what.
Here’s another way to think about it. If you claim that “α causes β in γ”, there are four ways you could go wrong:
(1.) Something might cause β in γ, but that something might not be α. (The tannin rather than the caffeine might disrupt sleep.)
(2.) α might cause something in γ, but it might not cause β. (The caffeine might cause more movement at night without actually disrupting sleep.)
(3.) α might cause β in some set of conditions, but not γ. (Caffeine might disrupt sleep only in unusual circumstances particular to your school. Maybe students are excitable because of a recent earthquake and wouldn’t normally be bothered.)
(4.) α might have some relationship to β in γ, but it might not be a causal relationship of the sort claimed. (Maybe, through an error in assignment procedures, only students on the noisy floors got the caffeine.)
Practices that ensure internal validity protect only against errors of Type 4. To protect against errors of Type 1-3, you need proper causal specification, with both construct and external validity.
Note 1: Throughout the post, I assume that causes monotonically increase the probability of their effects, including in the presence of other causes.