Is the scientific method broken?
A spate of recent articles in respected publications seems to cast doubt on the scientific method, or at least on the statistical side of it.
A brief summary of the articles: Results of many statistical studies, each certified with the 5% significance stamp of approval, turn out on further research to be wrong.
For example, the New Yorker's article features the story of a psychologist, Jonathan Schooler, who conducted a study that showed significant "verbal overshadowing," suggesting that describing one's memories does not improve the ability to remember them. This idea was counterintuitive, and brought him fame. Unfortunately, each time he tried to replicate his experiments, the effect diminished. How could the truth deteriorate? What did nature have against him? As Schooler put it: "One of my mentors told me that my real mistake was trying to replicate my work. He told me doing that was just setting myself up for disappointment."
I exploit this phenomenon in STAT 100. I do an ESP experiment, asking students to guess the suits of eight playing cards. I look at the cards one by one, sending out ESP waves to the class. Here are the numbers guessed correctly for a class of 77:
| # Correct | 1 | 2 | 3 | 4 | 5 | 7 |
|---|---|---|---|---|---|---|
| # People | 13 | 25 | 23 | 7 | 8 | 1 |
So one person guessed 7 correctly. The chance of guessing 7 or more? 0.000381. Very significant! Eight guessed five. The chance of five or more? 0.027298. Again, significant. Most students showed nothing, but some do have ESP. OK, then there was a second try. I tell them my goal is to make the people who did well the first time do worse, and the people who did poorly the first time (i.e., got just 1 correct) improve.
| # Correct first time | 1 | 2 | 3 | 4 | 5 | 7 |
|---|---|---|---|---|---|---|
| Average # correct second time | 2.08 | 2.6 | 2.09 | 2.43 | 1.50 | 0 |
I really did a number on the student who initially had 7 correct. The second time he got none correct! The eight who had had 5 correct averaged only 1.5 correct the second time, not even reaching the random-guessing average of 2. But the people who had only one right the first time at least managed to climb a bit over 2.
So do I have extraordinary powers? Did the ESP of the best people evaporate because of fatigue? Bad karma? No, everything can be explained by randomness. Out of 77 people, even if everyone is randomly guessing, some are going to do well, and ones who are lucky the first time are likely to have average luck the second. The latter is the "regression effect."
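To see how much pure guessing can produce in a class of 77, here is a minimal sketch. It assumes each of the eight suit guesses is independent and correct with chance 1/4 (a plain binomial model, which is my assumption about the setup); the function names are made up for illustration.

```python
from math import comb

N_CARDS = 8       # cards guessed per student, as in the experiment
P_GUESS = 0.25    # assumption: each suit guess is right with chance 1/4
CLASS_SIZE = 77   # students in the class

def pmf(k, n=N_CARDS, p=P_GUESS):
    """Chance of exactly k correct guesses out of n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def tail(k, n=N_CARDS, p=P_GUESS):
    """Chance of k or more correct guesses out of n."""
    return sum(pmf(j, n, p) for j in range(k, n + 1))

def chance_someone_reaches(k, class_size=CLASS_SIZE):
    """Chance that at least one of class_size independent guessers gets k or more."""
    return 1 - (1 - tail(k)) ** class_size

# Expected head count at each score if everyone is just guessing.
for k in range(N_CARDS + 1):
    print(f"{k} correct: expect about {CLASS_SIZE * pmf(k):4.1f} students")

# One student's chance versus the chance that somebody in the room does it.
for k in (5, 7):
    print(f"{k} or more: one student {tail(k):.6f}, "
          f"someone in the class {chance_someone_reaches(k):.3f}")
```

Under these assumptions, a score of five or more is rare for any one student, but the chance that somebody in a class of 77 gets there is close to 90%. And since the second try is independent of the first, everyone's expected second score is 2, whatever they got the first time.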
This ESP example is extreme, but even when there really is a true effect, as in Prof. Schooler's work, it will seem to diminish because of the regression effect. Sports offers countless such examples. Take the top ten in one year, e.g., the people with the ten highest batting averages, and see how they do the next year. On average they'll have a lower batting average, but still be very good. That's because in year one, they were good and lucky. In year two, they're still good, but will have only average luck.
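The "good and lucky" story can be simulated. The sketch below is only an illustration: the league average, the spread of true skill, and the size of the luck term are invented numbers, and a season's batting average is modeled simply as skill plus independent normal luck.

```python
import random
from statistics import mean

LEAGUE_AVG = 0.260   # assumed league-wide batting average (made up)
SKILL_SD = 0.015     # assumed spread of true skill (made up)
LUCK_SD = 0.020      # assumed spread of one season's luck (made up)

def top_ten_two_seasons(rng, n_players=300):
    """Return (year-one mean, year-two mean) for year one's top ten hitters."""
    skill = [rng.gauss(LEAGUE_AVG, SKILL_SD) for _ in range(n_players)]
    year1 = [s + rng.gauss(0, LUCK_SD) for s in skill]  # skill + luck
    year2 = [s + rng.gauss(0, LUCK_SD) for s in skill]  # same skill, fresh luck
    top = sorted(range(n_players), key=lambda i: year1[i], reverse=True)[:10]
    return mean(year1[i] for i in top), mean(year2[i] for i in top)

def average_over_leagues(n_leagues=200, seed=1):
    """Average the top-ten results over many simulated leagues."""
    rng = random.Random(seed)
    results = [top_ten_two_seasons(rng) for _ in range(n_leagues)]
    return mean(r[0] for r in results), mean(r[1] for r in results)

first, second = average_over_leagues()
print(f"league average:        {LEAGUE_AVG:.3f}")
print(f"top ten, year one:     {first:.3f}")
print(f"same players, year two: {second:.3f}")
```

The year-two figure should land between the league average and the year-one figure: the effect is real (these players are better than average), yet it looks smaller on the second look.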
So is the scientific method broken? Not in theory, but the implementation can be troublesome. And I certainly do not think the truth deteriorates. What's happening then?
- First, a lot of studies have bias due to being poorly designed. Everybody knows the basics of randomized controls, double-blind, placebos, etc. Right? Maybe not. One letter responding to the Atlantic article had anecdotal evidence that 70% of physicians do not. (Yes, that percentage is not from a well-designed study.)
- Second, 5% is not 5%: A significance level of 5% does not mean that the chance is 95% that the effect is real. Watch the YouTube video, What the p-value?
- Third, a 5% significance level isn't even a 5% significance level. For example, if it took three tries to get a result significant at the 5% level, the real significance level is more like 15%. Do enough experiments, and some are bound to be significant. (About 5%, if there is nothing to be found.) Consider Alfred Hitchcock's The Mail Order Prophet: A con man sends series of letters to thousands of people, predicting certain events, like the outcome of an election, a horse race, etc. Each time, half the letters predict one thing, half the opposite. Eventually, there is one guy who has received all correct predictions. The con man hits him up for a big score.
- Fourth, there's a selection (and regression) effect. Only significant studies are published. Who knows whether there are 19 studies that showed no effect for a given significant one? They're not even submitted to the journal. But once one is submitted, the regression effect takes hold, and future articles may appear to discount the results of the original.
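The arithmetic behind the third point can be sketched in a few lines. This assumes the tries are independent and that there is truly nothing to be found; the function names are hypothetical, chosen for this example.

```python
def chance_of_false_alarm(tries, alpha=0.05):
    """Chance that at least one of `tries` independent tests comes out
    significant at level alpha when there is no real effect."""
    return 1 - (1 - alpha) ** tries

def letters_needed(rounds, outcomes=2):
    """How many people the con man must start with so that one of them
    is guaranteed to see `rounds` correct predictions in a row."""
    return outcomes ** rounds

for tries in (1, 3, 20):
    print(f"{tries:2d} tries: chance of at least one 'significant' result = "
          f"{chance_of_false_alarm(tries):.3f}")

print(f"10 straight correct predictions need {letters_needed(10)} initial letters")
```

Three tries give about 14%, which is the "more like 15%" above. With twenty tries, a "significant" result is more likely than not.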
Another thing to keep in mind is that many studies do hold up.
The American Statistical Association assembled a number of statistical experts to
on the ASA's policy on p-values. The overall report, and a number of individual reports, was published in The American Statistician in 2016 (volume 70).
Their main principles:
- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency; p-values and related analyses should not be reported selectively.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
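The point that a p-value does not measure the size of an effect can be illustrated with a toy calculation. The sketch assumes a one-sample z-test with known standard deviation 1, and the two scenarios (effect sizes and sample sizes) are invented for the example.

```python
from math import erfc, sqrt

def two_sided_p(effect_in_sd, n):
    """Two-sided p-value of a z-test when the observed mean is
    `effect_in_sd` standard deviations from zero with n observations."""
    z = abs(effect_in_sd) * sqrt(n)
    return erfc(z / sqrt(2))   # equals 2 * (1 - Phi(z)) for the standard normal

# A negligible effect measured on a huge sample.
tiny_effect = two_sided_p(effect_in_sd=0.01, n=1_000_000)

# A large effect measured on a small sample.
big_effect = two_sided_p(effect_in_sd=0.5, n=10)

print(f"effect 0.01 SD, n = 1,000,000: p = {tiny_effect:.2e}")
print(f"effect 0.50 SD, n = 10:        p = {big_effect:.3f}")
```

The negligible effect is overwhelmingly "significant," while the large one does not pass the 5% threshold. The p-value mixes effect size with sample size, so it cannot stand in for either.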
Another group of researchers (72 of them) took on the task to Redefine Statistical Significance. Their conclusion: "For fields where the threshold for defining statistical significance for new discoveries is p < 0.05, we propose a change to p < 0.005."
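A rough way to see what a stricter threshold buys is to ask how often a "significant" finding reflects a real effect. The sketch below is a back-of-the-envelope calculation, not the researchers' own analysis: it assumes that 10% of tested hypotheses are true and that real effects are detected 80% of the time at either threshold. Both figures are invented for illustration.

```python
def chance_effect_is_real(alpha, prior=0.10, power=0.80):
    """Among results significant at level alpha, the share that are real,
    given the assumed share of true hypotheses (prior) and the assumed
    chance of detecting a real effect (power)."""
    true_hits = prior * power            # real effects that reach significance
    false_alarms = (1 - prior) * alpha   # null effects that reach significance
    return true_hits / (true_hits + false_alarms)

for alpha in (0.05, 0.005):
    print(f"threshold p < {alpha}: "
          f"{chance_effect_is_real(alpha):.0%} of significant results are real")
```

Under these assumptions, about 64% of results significant at 0.05 are real, versus roughly 95% at 0.005. Neither number is 95% by default; both depend heavily on the assumed prior and power. In practice a stricter threshold also lowers power unless sample sizes grow, which this sketch ignores.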