In 2011, Daryl Bem published a paper in a top psychology journal claiming evidence for precognition, the ability to perceive future events. Eight of the nine experiments reported statistically significant results (p < .05), yet the scientific community was almost universally skeptical: the effect sizes were small, and “significant” doesn’t mean what most people think it means. The Bem (2011) affair crystallized a growing problem: the overreliance on p-values and the neglect of effect sizes.
What does a p-value actually tell you?
The p-value is perhaps the most misunderstood statistic in all of science. Formally, it is defined as:
The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
Notice what the p-value does not tell you:
- It is not the probability that the null (or alternative) hypothesis is true
- It does not tell you how large or important the effect is
- It does not tell you whether the result will replicate
A p-value of .03 means: “If there truly were no effect, there would be a 3% chance of seeing data this extreme or more extreme.” That’s all. Yet researchers routinely treat p < .05 as “there is an effect” and p > .05 as “there is no effect” — both incorrect inferences that have distorted decades of scientific literature.
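The definition above can be made concrete with a permutation test, which builds the "no effect" world directly by shuffling group labels. This is a minimal sketch, not a recommended analysis pipeline: the function name `permutation_p_value` and the two small score lists are hypothetical, invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Shuffling the pooled scores simulates a world in which group
    membership has no effect (the null hypothesis).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical scores for two small groups (illustrative only).
treatment = [12, 15, 14, 18, 16, 17, 13, 19]
control = [11, 13, 12, 15, 14, 12, 10, 16]
print(permutation_p_value(treatment, control))
```

The returned number is the share of shuffled datasets whose group difference is at least as large as the observed one. It describes the data under the null; it says nothing about how probable the null itself is.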
Why is the p < .05 threshold problematic?
The .05 threshold was popularized by Ronald Fisher in the 1920s as a rough rule of thumb, not a rigid decision boundary. Yet it has calcified into the primary arbiter of what gets published, funded, and believed, with several systematic costs.
The file drawer effect. Studies with p > .05 are far less likely to be published. Rosenthal (1979) called this the “file drawer problem”: for every significant finding in print, an unknown number of non-significant studies languish unpublished, biasing the literature toward inflated effects.
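A small simulation shows how selective publication inflates effects. It is a sketch under stated assumptions: every study estimates the same true effect (d = 0.2), the standard deviation is known to be 1 so a z-test applies, and only results with p < .05 are "published". The function name and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_file_drawer(true_d=0.2, n_per_group=20, n_studies=5000, seed=2):
    """Return (mean effect across all studies, mean effect across published ones)."""
    rng = random.Random(seed)
    # Standard error of a mean difference when both groups have SD = 1.
    se = sqrt(2 / n_per_group)
    critical_diff = NormalDist().inv_cdf(0.975) * se  # two-sided .05 cutoff
    all_effects, published = [], []
    for _ in range(n_studies):
        treat = fmean([rng.gauss(true_d, 1) for _ in range(n_per_group)])
        ctrl = fmean([rng.gauss(0, 1) for _ in range(n_per_group)])
        diff = treat - ctrl
        all_effects.append(diff)
        if abs(diff) > critical_diff:  # only "significant" studies get published
            published.append(diff)
    return fmean(all_effects), fmean(published)


all_mean, published_mean = simulate_file_drawer()
print(f"mean effect, all studies:       {all_mean:.2f}")
print(f"mean effect, published studies: {published_mean:.2f}")
```

Because a small study can only cross the significance cutoff when its estimate happens to be large, the published subset overstates the true effect even though no individual study did anything wrong.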
P-hacking. Flexibility in analysis — choosing which variables to include, which outliers to exclude, when to stop collecting data — can push borderline results below .05. Simmons, Nelson, and Simonsohn (2011) demonstrated that with common “researcher degrees of freedom,” false-positive rates can exceed 60%, far above the nominal 5%.
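One of these degrees of freedom, deciding when to stop collecting data, is easy to simulate. The sketch below is not a reproduction of the Simmons, Nelson, and Simonsohn analysis; it isolates optional stopping only, assumes a known standard deviation of 1 so that a z-test is exact, and uses hypothetical names and settings throughout.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def z_test_p(a, b):
    """Two-sided p-value for a mean difference, assuming SD = 1 in both groups."""
    se = sqrt(1 / len(a) + 1 / len(b))
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(peek, n_sims=2000, start_n=10, max_n=50, step=5, seed=1):
    """Share of null-effect studies that reach p < .05 at any allowed look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: the null is true.
        a = [rng.gauss(0, 1) for _ in range(max_n)]
        b = [rng.gauss(0, 1) for _ in range(max_n)]
        looks = range(start_n, max_n + 1, step) if peek else [max_n]
        if any(z_test_p(a[:n], b[:n]) < 0.05 for n in looks):
            hits += 1
    return hits / n_sims


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"test once at n = 50:       {fixed_rate:.3f}")
print(f"test after every 5 added:  {peeking_rate:.3f}")
```

Testing once keeps the false-positive rate near the nominal 5%, while checking repeatedly and stopping at the first p < .05 pushes it well above that. Combining several such choices is what drives the much larger rates reported in the literature.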
Dichotomous thinking. The .05 boundary creates an absurd situation where p = .049 is “significant” and publishable while p = .051 is “non-significant” — despite the two values carrying essentially identical evidential weight.
What is an effect size and why does it matter?
An effect size quantifies the magnitude of an observed phenomenon, independent of sample size. A p-value can be made significant simply by collecting more data — any non-zero effect reaches significance with a large enough sample — but an effect size tells you whether the finding is large enough to be meaningful. The most common effect sizes in psychological research:
| Effect Size | What It Measures | Small | Medium | Large |
|---|---|---|---|---|
| Cohen’s d | Standardized mean difference between groups | 0.20 | 0.50 | 0.80 |
| Pearson’s r | Correlation between two variables | .10 | .30 | .50 |
| Eta-squared (η²) | Proportion of variance explained (ANOVA) | .01 | .06 | .14 |
| Odds ratio | Ratio of odds between groups | 1.5 | 2.5 | 4.3 |
| R² | Proportion of variance explained (regression) | .02 | .13 | .26 |
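Most of the benchmarks above come from Cohen’s (1988) widely cited guidelines; the odds-ratio values correspond approximately to his d benchmarks under the standard logistic conversion. They provide useful reference points but should not be applied mechanically: the practical significance of an effect size depends entirely on context.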
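To see why an effect size and a p-value answer different questions, the sketch below computes Cohen's d with a pooled standard deviation and then shows how the p-value for one fixed, tiny effect changes as the sample grows. The p-value uses a normal approximation to the two-sample test, which is an assumption of this example; the function names and the value d = 0.05 are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def approx_p_value(d, n_per_group):
    """Two-sided p-value for a given d with equal group sizes (normal approximation)."""
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The effect stays at d = 0.05; only the sample size changes.
for n in (50, 500, 5_000, 50_000):
    print(f"n per group = {n:>6}: p = {approx_p_value(0.05, n):.4g}")
```

The effect is identical in every row, far below even the "small" benchmark, yet the p-value moves from clearly non-significant to vanishingly small purely because n increases.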
When is a small effect size actually important?
Cohen himself warned against rigid use of his benchmarks, and with good reason: context determines whether an effect matters. A drug that reduces mortality risk by d = 0.10 saves thousands of lives across millions of patients; the aspirin/heart-attack finding had an r of just .034. A tutoring program that raises scores by d = 0.15 looks trivial until you learn that it costs $10 per student and reaches millions. The relationship between IQ and income (r ≈ .30) seems modest in any single year but compounds into hundreds of thousands of dollars over a career. Conversely, an intensive brain-training program that improves working memory by d = 0.80 with no real-world transfer has near-zero practical significance despite the impressive-sounding number.
What did the ASA statement say about p-values?
In 2016, the American Statistical Association issued its first-ever formal policy statement on a specific statistical practice. Wasserstein and Lazar (2016) summarized six principles:
- P-values can indicate how incompatible the data are with a specified statistical model
- P-values do not measure the probability that the studied hypothesis is true
- Scientific conclusions should not be based only on whether a p-value passes a specific threshold
- Proper inference requires full reporting and transparency
- A p-value does not measure the size or importance of an effect
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis
The 2019 follow-up went further, calling for the abandonment of “statistically significant” altogether (Wasserstein, Schirm, & Lazar, 2019). That was a striking position to appear in the ASA’s own journal, although it was an editorial introducing a special issue rather than a second official ASA policy statement.
How did the replication crisis expose p-value overreliance?
The replication crisis in psychology, sparked by large-scale replication projects from around 2011, exposed how fragile many “significant” findings were. The Open Science Collaboration (2015) replicated 100 published psychology studies; only 36% of replications produced significant results (compared to 97% of the originals), and the average replication effect size was roughly half the originals’. Subsequent Many Labs projects found some classic effects replicated robustly while others — including widely taught findings — failed consistently.
The pattern was clear: an incentive structure rewarding significance, combined with flexibility in analysis, produced a literature contaminated with false positives and inflated effects. Effect sizes taken seriously from the start would have flagged the fragile findings — a d of 0.10 in one underpowered study is far less credible than a d of 0.60 replicated across labs. The concern extends to psychometric work, where measurement reliability bounds any effect-size estimate.
What should researchers and readers do instead?
The emerging consensus is not to abandon significance testing but to supplement it with better practices.
Report effect sizes with confidence intervals. An effect of d = 0.40, 95% CI [0.05, 0.75] tells a very different story from d = 0.40, 95% CI [0.35, 0.45]. The interval communicates both magnitude and precision; the point estimate alone communicates only magnitude.
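The two intervals in that example differ only in how much data stands behind them. The sketch below uses a common large-sample approximation to the standard error of Cohen's d; the approximation, the function name, and the group sizes are assumptions of this example rather than details from any particular study.

```python
from math import sqrt
from statistics import NormalDist


def d_confidence_interval(d, n_a, n_b, confidence=0.95):
    """Approximate confidence interval for Cohen's d (large-sample formula)."""
    se = sqrt((n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b)))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d - z * se, d + z * se


# Same point estimate, two hypothetical sample sizes per group.
for n in (64, 3135):
    low, high = d_confidence_interval(0.40, n, n)
    print(f"n per group = {n:>4}: d = 0.40, 95% CI [{low:.2f}, {high:.2f}]")
```

Under this approximation, about 64 participants per group yields an interval close to [0.05, 0.75], while narrowing it to roughly [0.35, 0.45] takes on the order of 3,000 per group.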
Conduct power analyses a priori. Calculate the sample size needed to detect a meaningful effect at adequate power (typically 80%) before collecting data — preventing both underpowered studies and massively overpowered studies that render trivial effects “significant” by sheer N.
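As a rough illustration of an a priori power analysis, the sketch below computes the per-group sample size for a two-group comparison using the normal approximation n = 2((z₁₋α/₂ + z_power) / d)². The function name is hypothetical, and dedicated power software based on the t distribution will give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n to detect effect size d in a two-sided, two-group test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.80, 0.50, 0.20):
    print(f"d = {d:.2f}: about {n_per_group(d)} participants per group")
```

Required sample size grows with the inverse square of the effect size, so halving the effect you want to detect roughly quadruples the participants you need.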
Pre-register hypotheses and analysis plans. Committing to specific hypotheses and analytical decisions before seeing the data eliminates most opportunities for p-hacking.
Use meta-analysis to synthesize evidence. No single study is definitive. Pooled effect sizes across studies are far more reliable than any one estimate and can detect publication bias — the growth-mindset evidence, for example, settled around an effect size an order of magnitude smaller than the original popular claims only after competing meta-analyses were run.
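The simplest form of pooling is an inverse-variance weighted average, sketched below. It assumes a fixed-effect model (all studies estimate one common true effect), which is a simplification; random-effects models are often preferred when studies differ. The function name and the three study results are hypothetical, and this sketch does not test for publication bias.

```python
from math import sqrt
from statistics import NormalDist


def pool_fixed_effect(effects, standard_errors, confidence=0.95):
    """Inverse-variance weighted mean effect with a confidence interval."""
    weights = [1 / se**2 for se in standard_errors]  # precise studies count more
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return pooled, pooled - z * pooled_se, pooled + z * pooled_se


# Hypothetical study results: (effect size d, standard error).
studies = [(0.60, 0.30), (0.15, 0.10), (0.20, 0.08)]
pooled, low, high = pool_fixed_effect(
    [d for d, _ in studies], [se for _, se in studies]
)
print(f"pooled d = {pooled:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The small, imprecise study with the largest effect contributes little weight, so the pooled estimate sits close to the two more precise studies.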
Interpret effect sizes in context. Given the domain, the cost of intervention, and the importance of the outcome, is this effect large enough to matter? A small effect can be revolutionary in one context and trivial in another.
The bottom line
The p-value answers a narrow question: how surprising would these data be if there were no effect at all? The effect size answers the one that actually matters: is the effect large enough to care about? Treat significance as a minimum threshold rather than a sufficient one, and always ask, alongside “is it significant?”, “how big is it, and does it matter?”
Frequently asked questions
Is a small p-value the same as a strong effect?
No. A small p-value indicates the data are unlikely under the null, but says nothing about effect magnitude. With a large enough sample, even a trivially small effect can produce p < .001. Always read the effect size before concluding that a “highly significant” finding is meaningful.
What’s the difference between statistical significance and practical significance?
Statistical significance asks whether an effect is reliably different from zero. Practical significance asks whether it is large enough to matter clinically, educationally, or economically. Large effects in small samples sometimes fail significance tests; trivial effects in huge samples often pass them.
What effect size counts as “small,” “medium,” or “large”?
The conventional benchmarks are Cohen’s (1988): d = 0.20 / 0.50 / 0.80, or r = .10 / .30 / .50. These are reference points, not rules. Cohen himself cautioned against mechanical use; whether a given effect is meaningful depends on the domain, the cost of intervention, and the importance of the outcome.
Why don’t researchers just stop using p-values?
Some statisticians argue for exactly that: the 2019 follow-up to the ASA statement called for retiring “statistically significant” entirely. Most working researchers take a hybrid approach, reporting p-values alongside effect sizes and confidence intervals and pre-registering their analyses. The goal is to break the dichotomous publish/don’t-publish rule that the .05 threshold enabled, not to ban hypothesis testing.
How does the replication crisis relate to p-values?
It exposed how easily a literature filled with significant p-values can still be wrong. When the Open Science Collaboration (2015) replicated 100 psychology studies, only 36% produced significant replications, and replication effect sizes averaged about half the originals’. Effect sizes proved more reliable than significance status as a guide to which findings would survive.
References
- Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. https://doi.org/10.1037/a0021524
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
- Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
- Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. https://doi.org/10.1037/0033-2909.86.3.638
- Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
- Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
- Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.” The American Statistician, 73(sup1), 1–19. https://doi.org/10.1080/00031305.2019.1583913
Jouve, X. (2026, March 27). Effect Size vs. P-Value Explained. PsychoLogic. https://www.psychologic.online/effect-size-vs-p-value/

