The Book of Why, by Judea Pearl.
If I could sum up the message of this book in one pithy phrase, it would be that you are smarter than your data. Data do not understand causes and effects; humans do. I hope that the new science of causal inference will enable us to better understand how we do it, because there is no better way to understand ourselves than by emulating ourselves. In the age of computers, this new understanding also brings with it the prospect of amplifying our innate abilities so that we can make better sense of data, be it big or small.

I had a hard time with this book, mainly because I’m not a fan of statistics. Rather than asking “why” questions, I usually ask “what if” questions. In other words, I build mathematical models and then analyze them and make predictions. Intermediate Physics for Medicine and Biology has a similar approach. For instance, what if drift and diffusion both act in a pore; which will dominate under what circumstances (Section 4.12 in IPMB)? What if an ultrasonic wave impinges on an interface between tissues having different acoustic impedances; what fraction of the energy in the wave is reflected (Section 13.3)? What if you divide up a round of radiation therapy into several small fractions; will this preferentially spare healthy tissue (Section 16.9)?

Pearl asks a different type of question: the data shows that smokers are more likely to get lung cancer; why? Does smoking cause lung cancer, or is there some confounding effect responsible for the correlation (for instance, some people have a gene that makes them both more susceptible to lung cancer and more likely to smoke)?
Although I can’t say I’ve mastered Pearl’s statistical methods for causal inference, I do like the way he adopts a causal model to test data. Apparently for a long time statisticians analyzed data using no hypotheses, just statistical tests. If they found a correlation, they could not infer causation; does smoking cause lung cancer or does lung cancer cause smoking? Pearl draws many causal diagrams to make his causation assumptions explicit. He then uses these illustrations to derive his statistical model. These drawings remind me of Feynman diagrams that we physicists use to calculate the behavior of elementary particles.
Simpson’s Paradox
Just when my interest in The Book of Why was waning, Pearl shocked me back to attention with Simpson’s paradox.

Imagine a doctor—Dr. Simpson, we’ll call him—reading in his office about a promising new drug (Drug D) that seems to reduce the risk of a heart attack. Excitedly, he looks up the researcher’s data online. His excitement cools a little when he looks at the data on male patients and notices that their risk of a heart attack is actually higher if they take Drug D. “Oh well,” he says, “Drug D must be very effective for women.”

But then he turns to the next table, and his disappointment turns to bafflement. “What is this?” Dr. Simpson exclaims. “It says here that women who took Drug D were also at higher risk of a heart attack. I must be losing my marbles! This drug seems to be bad for women, bad for men, but good for people.”

To illustrate this effect, consider the example analyzed by Pearl. In a clinical trial some patients received a drug (treatment) and some didn’t (control). Patients who subsequently had heart attacks are indicated by red boxes, and patients who did not by blue boxes. In the figure below, the data is analyzed by gender: males and females.
One out of twenty (5%) of the females in the control group had heart attacks, while three out of forty (7.5%) in the treatment group did. For women, the drug caused heart attacks! Among the males, twelve out of forty (30%) in the control group suffered heart attacks, and eight out of twenty (40%) in the treatment group did. The drug caused heart attacks for the men too!
Now combine the data for men and women.
In the control group, 13 out of 60 patients had a heart attack (22%). In the treatment group, 11 out of 60 patients had one (18%). The drug prevented heart attacks! This seems impossible, but if you don’t believe me, count the boxes; it’s not a trick. What do we make of this? As Pearl says, “A drug can’t simultaneously cause me and you to have a heart attack and at the same time prevent us both from having heart attacks.”
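The arithmetic above can be checked with a few lines of Python. This is only a sketch: the counts are the ones quoted in the text, while the names `counts`, `rate`, and `pooled` are made up for illustration and do not come from the book.

```python
from fractions import Fraction

# (heart attacks, patients) for each gender and arm, as quoted in the text.
counts = {
    "women": {"control": (1, 20), "treatment": (3, 40)},
    "men": {"control": (12, 40), "treatment": (8, 20)},
}


def rate(attacks, patients):
    """Exact heart-attack rate as a fraction."""
    return Fraction(attacks, patients)


def pooled(arm):
    """Heart-attack rate for one arm with women and men lumped together."""
    attacks = sum(counts[group][arm][0] for group in counts)
    patients = sum(counts[group][arm][1] for group in counts)
    return rate(attacks, patients)


# Within each gender the treatment arm does worse than the control arm...
for group in counts:
    control = rate(*counts[group]["control"])
    treatment = rate(*counts[group]["treatment"])
    print(f"{group}: control {float(control):.1%}, treatment {float(treatment):.1%}")

# ...yet the pooled treatment arm does better than the pooled control arm.
print(f"pooled: control {float(pooled('control')):.1%}, "
      f"treatment {float(pooled('treatment')):.1%}")
```

Using `Fraction` keeps the comparisons exact, so the reversal cannot be blamed on rounding.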
To resolve the paradox, Pearl notes that this was not a randomized clinical trial. Patients could decide whether to take the drug, and women chose it more often than men. Gender, which influences both the choice to take the drug and the risk of a heart attack, is what Pearl calls a “confounder.” The chance of having a heart attack is much greater for men than for women, but more women than men elected to join the treatment group. Therefore, the treatment group was overweighted with low-risk women, and the control group was overweighted with high-risk men, so when data was pooled the treatment group looked like they had fewer heart attacks than the control group. In other words, the difference between treatment and control got mixed up with the difference between men and women. Thus, the apparent effectiveness of the drug in the pooled data is an artifact of this imbalance, not a real benefit. A randomized trial would have shown similar data for men and women, but a different result when the data was pooled. The drug causes heart attacks.
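One way to see what a balanced comparison might look like is to weight each gender's heart-attack rate by that gender's share of the whole study (60 women and 60 men), instead of by who happened to choose the drug. The sketch below does that. It assumes gender is the only factor that influences both the choice of treatment and the risk of a heart attack, and the names `risk`, `share`, and `adjusted_rate` are hypothetical, chosen for illustration.

```python
# Heart-attack rates within each gender and arm, from the counts in the text.
risk = {
    "women": {"control": 1 / 20, "treatment": 3 / 40},
    "men": {"control": 12 / 40, "treatment": 8 / 20},
}

# Share of each gender in the whole study: 60 women and 60 men out of 120.
share = {"women": 60 / 120, "men": 60 / 120}


def adjusted_rate(arm):
    """Average the per-gender risk using the study-wide gender mix.

    Assumption: gender is the only confounder, so weighting both arms by
    the same gender mix removes the imbalance described in the text.
    """
    return sum(risk[group][arm] * share[group] for group in risk)


control = adjusted_rate("control")
treatment = adjusted_rate("treatment")
print(f"adjusted control {control:.2%}, adjusted treatment {treatment:.2%}")
```

With both arms weighted by the same mix of women and men, the treatment arm comes out worse than the control arm, in line with the conclusion that the drug causes heart attacks.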
Mathematics
The Book of Why contains only a little mathematics; Pearl tries to make the discussion accessible to a wide audience. He does, however, use lots of math in his research. His opinion of math is similar to mine and to IPMB’s.

Many people find formulas daunting, seeing them as a way of concealing rather than revealing information. But to a mathematician, or to a person who is adequately trained in the mathematical way of thinking, exactly the reverse is true. A formula reveals everything: it leaves nothing to doubt or ambiguity. When reading a scientific article, I often catch myself jumping from formula to formula, skipping the words altogether. To me, a formula is a baked idea. Words are ideas in the oven.

One goal of IPMB is to help students gain the skills in mathematical modeling so that formulas reveal rather than conceal information. I often tell my students that formulas aren’t things you stick numbers into to get other numbers. Formulas tell a story. This idea is vitally important. I suspect Pearl would agree.
Modeling
The causal diagrams in The Book of Why aid Pearl in deriving the correct statistical equations needed to analyze data. Toy models in IPMB aid students in deriving the correct differential equations needed to predict behavior. I see modeling as central to both activities: you start with an underlying hypothesis about what causes what, you translate that into mathematics, and then you learn something about your system. As Pearl notes, statistics does not always have this approach.

In certain circles there is an almost religious faith that we can find the answers to these questions in the data itself, if only we are sufficiently clever at data mining. However, readers of this book will know that this hype is likely to be misguided. The questions I have just asked are all causal, and causal questions can never be answered from data alone. They require us to formulate a model of the process that generates the data, or at least some aspects of that process. Anytime you see a paper or a study that analyzes the data in a model-free way, you can be certain that the output of the study will merely summarize, and perhaps transform, but not interpret the data.

I enjoyed The Book of Why, even if I didn’t entirely understand it. It was skillfully written, thanks in part to coauthor Dana MacKenzie. It’s the sort of book that, once finished, I should go back and read again because it has something important to teach me. If I liked statistics more I might do that. But I won’t.
Another quote from The Book of Why:
"For baseball fans, here is a lovely example concerning two star baseball players, David Justice and Derek Jeter. In 1995, Justice had a higher batting average, .253 to .250. In 1996, Justice had a higher batting average again, .321 to .314. And in 1997, he had a higher batting average than Jeter for a third season in a row, .329 to .291. Yet, over all three seasons combined, Jeter had the higher average!"
Simpson's paradox at work!