Is the scientific method broken?

A spate of articles has appeared recently in respected publications, each seeming to cast doubt on the scientific method, or at least on its statistical side. In brief: results of many statistical studies, each certified with the 5% significance stamp of approval, turn out on further investigation to be wrong.

For example, the New Yorker's article features the story of a psychologist, Jonathan Schooler, who conducted a study demonstrating a significant "verbal overshadowing" effect: verbally describing a memory can actually impair one's ability to recall it. The idea was counterintuitive, and it brought him fame. Unfortunately, each time he tried to replicate his experiments, the effect diminished. How could the truth deteriorate? What did nature have against him? As Schooler recalled: "One of my mentors told me that my real mistake was trying to replicate my work. He told me doing that was just setting myself up for disappointment."

I exploit this phenomenon in STAT 100. I do an ESP experiment, asking students to guess the suits of eight playing cards as I look at the cards one by one, sending out ESP waves to the class. Here are the numbers of correct guesses for a class of 77 students:
# Correct:   1    2    3    4    5    7
# People:   13   25   23    7    8    1

So one person guessed 7 correctly. The chance of guessing 7 or more out of 8? About 0.0004. Very significant! Eight students guessed five correctly: the chance of five or more? About 0.027. Again, significant. Most students showed nothing, but some, it would seem, do have ESP. OK, then comes a second try. I tell the class that my goal is to make the people who did well the first time do worse, and the people who did poorly the first time (i.e., those who got just 1 correct) improve.
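
Here's a minimal check of those tail probabilities, treating the 8 guesses as independent with a 1-in-4 chance apiece (guesses against a real deck are only approximately independent):

```python
# Tail probabilities for the ESP experiment: with 8 guesses, each
# matching the card's suit with probability 1/4, the number correct
# is (approximately) Binomial(n=8, p=1/4).
from math import comb

def binom_tail(k, n=8, p=0.25):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

print(f"P(X >= 7) = {binom_tail(7):.6f}")  # 0.000381
print(f"P(X >= 5) = {binom_tail(5):.6f}")  # 0.027298
```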

# Correct first time:            1     2     3     4     5     7
Average # correct second time:   2.08  2.60  2.09  2.43  1.50  0.00

I really did a number on the student who initially had 7 correct: the second time, he got none right! The eight who had had 5 correct averaged only 1.5 correct the second time, below even the random-guessing average of 2. But the people who had only one right the first time at least managed to climb a bit above 2.

So do I have extraordinary powers? Did the ESP of the best people evaporate because of fatigue? Bad karma? No, everything can be explained by randomness. Out of 77 people, even if everyone is guessing at random, some are going to do well, and those who are lucky the first time are likely to have only average luck the second. That is the "regression effect."
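
To see the regression effect in action, here is a quick simulation sketch of the two-round experiment with 77 pure guessers; no matter how a student scores the first time, the expected second-round score is the chance level of 2:

```python
# Pure-luck simulation of the two-round experiment: 77 students, 8
# guesses per round, each guess correct with probability 1/4, rounds
# independent. Grouped by first-round score, the second-round
# averages all hover near the chance level of 2.
import random
from collections import defaultdict

random.seed(1)  # reproducible run
N_STUDENTS, N_CARDS, P = 77, 8, 0.25

def score():
    """Number of correct guesses in one round of pure guessing."""
    return sum(random.random() < P for _ in range(N_CARDS))

second_by_first = defaultdict(list)
for _ in range(N_STUDENTS):
    second_by_first[score()].append(score())

for first in sorted(second_by_first):
    seconds = second_by_first[first]
    avg = sum(seconds) / len(seconds)
    print(f"{first} correct first time: {len(seconds):2d} students, "
          f"average second time {avg:.2f}")
```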

This ESP example is extreme, but even when there is a true effect, as in Prof. Schooler's work, it will seem to diminish because of the regression effect. Sports offers endless such examples. Take the top ten in one year, e.g., the players with the ten highest batting averages, and see how they do the next year. On average, their batting averages will be lower, though still very good. That's because in year one they were good and lucky; in year two they're still good, but will have only average luck.
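
Here's a small sketch of that skill-plus-luck story, with entirely made-up numbers: 300 hypothetical players, each with a fixed true ability plus fresh luck each season:

```python
# Skill-plus-luck sketch of the batting example (all numbers are
# hypothetical): each player's season average is a fixed true ability
# plus independent yearly noise. The year-one top ten drop back in
# year two, yet stay well above the league mean: the skill is real,
# the extra luck was not.
import random

random.seed(1)
N = 300
abilities = [random.gauss(0.260, 0.020) for _ in range(N)]  # true talent

def season(ability):
    return ability + random.gauss(0.0, 0.015)  # yearly luck

year1 = [season(a) for a in abilities]
year2 = [season(a) for a in abilities]

top10 = sorted(range(N), key=lambda i: year1[i], reverse=True)[:10]
print(f"league mean, year 2:   {sum(year2) / N:.3f}")
print(f"top ten, year 1:       {sum(year1[i] for i in top10) / 10:.3f}")
print(f"same players, year 2:  {sum(year2[i] for i in top10) / 10:.3f}")
```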

So is the scientific method broken? Not in theory, but the implementation can be troublesome. And I certainly do not think the truth deteriorates. What's happening then?

Another thing to keep in mind is that many studies do hold up.

The American Statistical Association assembled a number of statistical experts to prepare a statement of the ASA's policy on p-values. The overall report, along with a number of individual commentaries, was published in The American Statistician in 2016 (volume 70). The main principles:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency; p-values and related analyses should not be reported selectively.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Another group of researchers (72 of them) took up the cause in a paper titled "Redefine Statistical Significance." Their conclusion: "For fields where the threshold for defining statistical significance for new discoveries is p < 0.05, we propose a change to p < 0.005."
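
To see what the tighter threshold buys, here is a quick sketch simulating studies in which the null hypothesis is exactly true: roughly 5% of them clear p < 0.05 by luck alone, while only about 0.5% clear p < 0.005:

```python
# False alarms under the two thresholds: simulate 10,000 studies in
# which the null hypothesis is exactly true (the test statistic is a
# standard normal z), and count two-sided p-values below each cutoff.
import random
from math import erf, sqrt

random.seed(1)

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

pvals = [two_sided_p(random.gauss(0, 1)) for _ in range(10_000)]
print(f"'significant' at p < 0.05:  {sum(p < 0.05 for p in pvals)}")   # about 500
print(f"'significant' at p < 0.005: {sum(p < 0.005 for p in pvals)}")  # about 50
```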