Joseph Henrich’s ambitious tome, The WEIRDest People in the World, is driving me nuts. It’s good enough and interesting enough that I want to read it. Henrich’s general idea is that people in Western, Educated, Industrial, Rich, Democratic (WEIRD) societies differ psychologically from people in more traditionally structured societies, and that the family policies of the Catholic Church in medieval Europe lie at the historical root of this difference. It’s very cool and I’m almost convinced!
Despite my fascination with his argument, I find that when Henrich touches on topics I know something about, he tends to distort and simplify things. Maybe this is inevitable in a book of such sweeping scope. However, it does lead me to mistrust his judgment and wonder how accurate his presentation is on topics where I have no expertise.
Early in my reading, I was struck by Henrich’s presentation of the famous (or notorious) “marshmallow test.” Here’s his description:
To measure self-control in children, researchers sit them in front of a single marshmallow and explain that if they wait until the experimenter returns to the room, they can have two marshmallows instead of just the one. The experimenter departs and then secretly watches to see how long it takes for the kid to cave and eat the marshmallow. Some kids eat the lone marshmallow right away. A few wait 15 or more minutes until the experimenter gives up and returns with the second marshmallow. The remainder of the children cave somewhere in between. A child’s self-control is measured by the number of seconds they wait.
Psychological tasks like these are often powerful predictors of real-life behavior (p. 40).
It’s a cute test! However, I have a graduate student who is currently writing a dissertation chapter on problems with this test. Maybe the test measures self-control, but it could also measure how much the child trusts the experimenter to actually deliver on the promise, how much the child desires the experimenter’s social approval, how comfortable the child is with strange laboratory experiments of this sort, how hungry they are, or how much they want to end the situation and reunite with their waiting parent, among other things. Indeed, a recent conceptual replication of the experiment mostly does not find the kind of predictive value claimed in early studies once statistical controls are introduced for race, gender, home background, parents’ education, vocabulary, and other possible covariates.
In general, if you’ve been influenced, as I have, by the “replication crisis” and other recent methodological critiques of social science and medicine, this might be the kind of result that should set off your skeptical tinglers. The idea that how long a four-year-old waits before eating a marshmallow reveals how much self-control they have, which then “powerfully predicts” real-life behavior outside of the laboratory (e.g., college admission test scores over a decade later, as is sometimes claimed) — well, it could be true. I’m not saying it’s not. But I don’t think I’d have written it up as Henrich does, without skeptical caveats, as though there’s general consensus among psychologists that a child’s behavior with a single marshmallow in this peculiar laboratory situation is a valid, powerful measure of self-control with excellent predictive value. Its prominent placement near the beginning of the book furthermore suggests that Henrich regards this test as part of the general theoretical foundation on which psychological work like his appropriately builds.
In this matter, my knowledgeable judgment and Henrich’s differ. That’s fine. Researchers can differently weigh the considerations. But if I hadn’t had the background knowledge I did, his quick presentation might have led me into a much more optimistic assessment of the value of the marshmallow test than I would have arrived at from a more thorough presentation that acknowledged the caveats. So there’s a sense in which Henrich’s presentation is a bad fit for my theoretical inclinations.
Here’s another passage that bothered me:
Upon entering the economics laboratory, you are greeted by a friendly student assistant who takes you to a private cubicle. There, via a computer terminal, you are given $20 and placed into a group with three strangers. Then, all four of you are given an opportunity to contribute any portion of your endowment — from nothing at all to $20 — to a “group project.” After everyone has had an opportunity to contribute, all contributions to the group project are increased by 50 percent and then divided equally among all four group members. Since players get to keep any money that they don’t contribute to the group project, it’s obvious that players always make the most money if they give nothing to the project. But, since any money contributed to the project increases ($20 becomes $30), the group as a whole makes more money when people contribute more of their endowment. Your group will repeat the interaction for 10 rounds, and you’ll receive all of your earnings in cash at the end. Each round, you’ll see the anonymous contributions made by others and your own total income. If you were a player in this game, how much would you contribute in the first round with this group of strangers?
This is the Public Goods Game (PGG). It’s an experiment designed to capture the basic economic trade-offs faced by individuals when they decide to act in the interest of their broader communities…. societies with more intensive kin-based institutions contribute less on average to the group project in the first round (pp. 210-211).
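For readers who want the incentive structure in the quoted description spelled out, here is a minimal Python sketch of one round’s payoffs. It assumes the numbers Henrich gives (four players, a $20 endowment, pooled contributions increased by 50 percent and split equally among all four). The function name `round_payoffs` and its parameters are illustrative choices, not taken from Henrich or from any of the cited studies.

```python
def round_payoffs(contributions, endowment=20.0, growth=0.5):
    """Return each player's earnings for one round of a simple Public Goods Game.

    contributions: the amount each player puts into the group project.
    growth: fractional increase applied to the pooled contributions
            (0.5 matches the 50 percent in the quoted description).
    Assumes every player starts with the same endowment.
    """
    n = len(contributions)
    pot = sum(contributions) * (1 + growth)
    share = pot / n  # the enlarged pot is split equally, whoever contributed
    # Each player keeps whatever they did not contribute, plus an equal share.
    return [endowment - c + share for c in contributions]


# Everyone contributes everything: each player ends the round with $30.
print(round_payoffs([20, 20, 20, 20]))  # [30.0, 30.0, 30.0, 30.0]

# One free rider among three full contributors comes out well ahead.
print(round_payoffs([0, 20, 20, 20]))   # [42.5, 22.5, 22.5, 22.5]
```

Under these assumptions, each dollar a player contributes returns only 1.5 / 4 = $0.375 to that player (or 1.4 / 4 = $0.35 with a 40 percent increase) while adding $1.50 to the group’s total. That is why giving nothing maximizes individual earnings, while full contribution maximizes the group’s earnings.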
Henrich’s passage describes a study in which each participant would take home $200-$300 over the ten rounds. Of course, it’s rare to award research participants such large amounts of money. If you want, say, 200 participants, you’ll need a budget of up to $60,000! Henrich’s endnotes cite two general books, one brief commentary without empirical data, two classic articles in which participants left the experiment having earned about $30 each on average, and two cross-cultural studies whose payout amounts I couldn’t readily determine from the materials. Also in the notes, Henrich says that one study “increased contributions to the group project by 40 percent, not 50 percent. I’m simplifying” (p. 543). However, most of the cited studies in fact used 40 percent increases, not just the one study to which this caveat was attached.
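The stakes arithmetic in the preceding paragraph can be checked with a short sketch. It assumes, as the quoted description implies, that all players in a group behave alike, so each earns between $20 per round (nobody contributes) and $30 per round (everyone contributes everything); a lone free rider among full contributors could earn more. The helper names are hypothetical, and the sketch ignores the experimental units that real studies convert into cash.

```python
def session_payout_range(rounds=10, endowment=20.0, growth=0.5):
    """Per-participant earnings over a whole session when all players act alike.

    Low end: nobody contributes, so each player just keeps the endowment.
    High end: everyone contributes everything, so each player receives
    endowment * (1 + growth) per round.
    """
    low = rounds * endowment
    high = rounds * endowment * (1 + growth)
    return low, high


def study_budget(participants, per_participant):
    """Total cash needed if every participant is paid per_participant."""
    return participants * per_participant


low, high = session_payout_range()
print(low, high)                  # 200.0 300.0
print(study_budget(200, high))    # 60000.0

# The same design with a $2 endowment instead of $20.
low2, high2 = session_payout_range(endowment=2.0)
print(low2, high2)                # 20.0 30.0
print(study_budget(200, high2))   # 6000.0
```

Scaling the endowment by a factor of 10 scales every figure by the same factor, which is the size of the gap between Henrich’s illustrative stakes and the smaller amounts typically paid out.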
I’m not seeing why 50% is “simpler” than the more accurate 40%. This seems to be a gratuitous inaccuracy. Characterizing the experiment as ten rounds with payoffs of $20-$30 per round is potentially a more serious distortion. In reality, these experiments are run with experimental units that are later exchanged for small amounts of real money. This matters for at least two reasons. First, these experimental monetary units might be psychologically different from real money, possibly encouraging a more game-like attitude. Second, when the actual amounts of money at stake are small, the costs of cooperating (and also the benefits) are smaller, which should amplify concerns about how representative this game-like laboratory behavior is of how participants would behave in the real world, where the stakes are more serious.
Suppose that instead of overstating the stakes by a factor of about 10, Henrich had understated them by a factor of about 10. What if, instead of saying that $20-$30 was at stake per round, when it’s typically more like $2-$3, he had said that $0.20 was at stake per round? I suspect this would make an intuitive difference to most ordinary readers of the book. The leap from “here’s how cooperatively research subjects act with $20” to “here’s how cooperative people in that culture are with strangers in general” is more attractive than the leap from “here’s how cooperatively research subjects act with $0.20” to the same broad conclusion.
In general, I tend to be wary of quick inferences from laboratory behavior to real-world behavior outside the laboratory. Laboratories are strange social situations, and people from different backgrounds are not equally familiar with them. This is the problem of ecological or external validity, and concerns of this sort are why most of my own research on social behavior uses real-world measures. Other researchers, such as Henrich, might not be as worried about the external validity of laboratory and internet studies. There’s room for legitimate debate. But for us readers to judge whether external validity might be an issue in the studies he cites, we at the very least need an accurate description of what the studies involve. Henrich’s presentation does not provide that, and simplification is a poor excuse for this distortion, since $2 is no less simple than $20.
Henrich does not, in my mind, cross over into bald misrepresentation. He doesn’t, for example, say of any particular study that it involves $20 per round. Rather, the presentation seems to be loose. He’s trying to give the general flavor. He’s writing for a moderately broad audience and aiming to synthesize a huge range of work, unavoidably simplifying and idealizing along the way. He could respond to my concerns by saying that his best judgment of the conflicting evidence about the marshmallow test is that it’s a valid and highly predictive measure of self-control and that his simplified presentation of the material conveys that effectively by avoiding concerns and apparent replication failures that would just (in his judgment) be distracting. He could say that his best reading of the literature on external validity is that the difference between $2 and $20 doesn’t matter and that the quick leap to general conclusions about cooperativeness is justified because we can reasonably expect laboratory studies of this sort to be diagnostic. He could say that the reader ought to trust that he’s done his homework behind the scenes.
We must always trust, to some extent, the scientists we’re reading — that they are reporting their data correctly, that there aren’t big problems with the study’s execution that they’re failing to reveal, and so on. Part of this involves relying on their inevitably simplified summaries of material with which we are unfamiliar. We trust the researcher to have digested the material well and fairly, and not to be hiding worries that might legitimately undermine the central claims. The looser the presentation, the more trust is required.
This invites the question of whether there are conditions under which more versus less trust is justified. How much, as a reader, ought you be willing to glide through on trust?
I’d recommend reducing trust under the following three conditions:
(1.) The author has a prior agenda or a big picture theory that might motivate them to interpret and digest results in a biased way. Most scientists have agendas and theories, of course, and certainly Henrich does. But there is more and less agenda-driven work, and adversarial collaboration offers the opportunity for bias to be balanced through scientists’ opposing agendas.
(2.) The author is not as skeptical as you, the reader, are about some of the relevant types of research. If the author is less skeptical than you, they might be interpreting that research more naively, or more at face value, than you would if you had read the same research.
(3.) Where the author makes contact with the issues you know best, they seem to be distorting, misinterpreting, or leaping too quickly to broad conclusions. This might indicate a general bias or sloppiness that is also present, but harder for you to see, in areas where you know less.
On all three grounds, my trust of Henrich is impaired.
A note on the marshmallow test: deep in an endnote, Henrich acknowledges the concern about statistical controls raised by the recent replication. He responds that “it’s easy to weaken the relationship between measures of patience and later academic performance by statistically removing all the factors that create variation in patience in the first place” (p. 515). It’s a reasonable, though disputable, point. Regardless, few readers are likely to pick up on something buried in the back half of one of hundreds of endnotes.