Effect Size vs. P-Value Explained

Published: March 27, 2026 · Last reviewed: May 6, 2026


In 2011, Daryl Bem published a paper in a top psychology journal claiming evidence for precognition, the ability to perceive future events. The results were statistically significant in eight of the nine experiments, yet the scientific community was almost universally skeptical: the effect sizes were tiny, and “significant” does not mean what most people think it means. The Bem (2011) affair crystallized a growing problem: the overreliance on p-values and the neglect of effect sizes.

What does a p-value actually tell you?

The p-value is perhaps the most misunderstood statistic in all of science. Formally, it is defined as:

The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.

Notice what the p-value does not tell you:

It is not the probability that the null (or alternative) hypothesis is true
It does not tell you how large or important the effect is
It does not tell you whether the result will replicate

A p-value of .03 means: “If there truly were no effect, there would be a 3% chance of seeing data this extreme or more extreme.” That’s all. Yet researchers routinely treat p < .05 as “there is an effect” and p > .05 as “there is no effect” — both incorrect inferences that have distorted decades of scientific literature.
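
To make the definition concrete, here is a minimal sketch that computes an exact two-sided p-value for a hypothetical coin-flip experiment (15 heads in 20 flips, with a fair coin as the null hypothesis). The scenario and the function name two_sided_binomial_p are illustrative assumptions, not taken from any study discussed here.

```python
from math import comb

def two_sided_binomial_p(heads, flips):
    """Exact two-sided p-value under a fair-coin null (p = 0.5).

    Adds up the probability of every outcome at least as far from the
    expected count (flips / 2) as the observed one.
    """
    observed_distance = abs(heads - flips / 2)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_distance
    )
    return extreme_outcomes / 2 ** flips

# Hypothetical data: 15 heads in 20 flips.
p = two_sided_binomial_p(15, 20)
print(f"p = {p:.4f}")  # p = 0.0414
```

The result says only how surprising 15 or more heads (or 5 or fewer) would be if the coin were fair. It is not the probability that the coin is fair, and it says nothing about how biased the coin might be.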

Why is the p < .05 threshold problematic?

The .05 threshold was popularized by Ronald Fisher in the 1920s as a rough rule of thumb, not a rigid decision boundary. Yet it has calcified into the primary arbiter of what gets published, funded, and believed, with several systematic costs.

The file drawer effect. Studies with p > .05 are far less likely to be published. Rosenthal (1979) called this the “file drawer problem”: for every significant finding in print, an unknown number of non-significant replications languish unpublished, biasing the literature toward inflated effects.

P-hacking. Flexibility in analysis — choosing which variables to include, which outliers to exclude, when to stop collecting data — can push borderline results below .05. Simmons, Nelson, and Simonsohn (2011) demonstrated that with common “researcher degrees of freedom,” false-positive rates can exceed 60%, far above the nominal 5%.
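
One simplified way to see how analytic flexibility inflates false positives is to treat each analysis choice as a separate, independent test at α = .05. That independence is an assumption made purely for illustration: real researcher degrees of freedom are correlated, and this is not how Simmons, Nelson, and Simonsohn arrived at their figure.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    falls below `alpha` when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests

# How quickly the nominal 5% error rate erodes as analyses multiply.
for k in (1, 5, 10, 20):
    print(k, round(chance_of_false_positive(k), 3))
```

Under this simplification, the chance of at least one spurious “significant” result is about 40% after ten tries and about 64% after twenty.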

Dichotomous thinking. The .05 boundary creates an absurd situation where p = .049 is “significant” and publishable while p = .051 is “non-significant” — despite the two values carrying essentially identical evidential weight.

What is an effect size and why does it matter?

An effect size quantifies the magnitude of an observed phenomenon, independent of sample size. A p-value can be made significant simply by collecting more data — any non-zero effect reaches significance with a large enough sample — but an effect size tells you whether the finding is large enough to be meaningful. The most common effect sizes in psychological research:

Effect Size	What It Measures	Small	Medium	Large
Cohen’s d	Standardized mean difference between groups	0.20	0.50	0.80
Pearson’s r	Correlation between two variables	.10	.30	.50
Eta-squared (η²)	Proportion of variance explained (ANOVA)	.01	.06	.14
Odds ratio	Ratio of odds between groups	1.5	2.5	4.3
R²	Proportion of variance explained (regression)	.02	.13	.26

The benchmarks above come from Cohen’s (1988) widely cited guidelines. They provide useful reference points but should not be applied mechanically — the practical significance of an effect size depends entirely on context.
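
A minimal sketch of how Cohen’s d is computed from two groups, and of why a fixed, tiny effect can still reach significance. The helper names are hypothetical, and the p-value uses a normal approximation to the two-sample test with equal group sizes, an assumption that is reasonable only for large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

def approx_two_sided_p(d, n_per_group):
    """Normal approximation to the two-sample p-value (equal group sizes)."""
    z = abs(d) * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(z))

# Toy data: means 4 and 3, pooled variance 4, so d = 0.5.
print(cohens_d([2, 4, 6], [1, 3, 5]))

# Hold the effect fixed at d = 0.05 and vary only the sample size.
tiny_d = 0.05  # well below Cohen's "small" benchmark of 0.20
for n in (100, 1_000, 10_000):
    print(n, f"{approx_two_sided_p(tiny_d, n):.4f}")
```

With d held at 0.05, the approximate p-value is far above .05 at 100 participants per group and below .001 at 10,000 per group. The effect size itself never changes.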

When is a small effect size actually important?

Cohen himself warned against rigid use of his benchmarks, and with good reason: context determines whether an effect matters. A drug that reduces mortality risk by d = 0.10 saves thousands of lives across millions of patients; the aspirin/heart-attack finding had an r of just .034. A tutoring program that raises scores by d = 0.15 looks trivial until you learn that it costs $10 per student and reaches millions. The relationship between IQ and income (r ≈ .30) seems modest in any single year but compounds into hundreds of thousands of dollars over a career. Conversely, an intensive brain-training program that improves working memory by d = 0.80 with no real-world transfer has near-zero practical significance despite the impressive-sounding number.
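
Because these examples mix d and r, it can help to put them on one scale. The sketch below uses the standard conversion d = 2r / √(1 − r²), which assumes two groups of equal size; the function names are hypothetical.

```python
from math import sqrt

def r_to_d(r):
    """Convert a correlation to Cohen's d (assumes equal group sizes)."""
    return 2 * r / sqrt(1 - r ** 2)

def d_to_r(d):
    """Convert Cohen's d to a correlation (assumes equal group sizes)."""
    return d / sqrt(d ** 2 + 4)

print(f"r = .034 -> d = {r_to_d(0.034):.3f}")
print(f"r = .30  -> d = {r_to_d(0.30):.3f}")
print(f"d = 0.10 -> r = {d_to_r(0.10):.3f}")
```

By this conversion, the aspirin r of .034 corresponds to a d of roughly 0.07, well under Cohen’s “small” benchmark, which is exactly why the benchmark alone is a poor guide to importance.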

What did the ASA statement say about p-values?

In 2016, the American Statistical Association issued its first-ever formal policy statement on a specific statistical practice. Wasserstein and Lazar (2016) summarized six principles:

P-values can indicate how incompatible the data are with a specified statistical model
P-values do not measure the probability that the studied hypothesis is true
Scientific conclusions should not be based only on whether a p-value passes a specific threshold
Proper inference requires full reporting and transparency
A p-value does not measure the size or importance of an effect
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

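The 2019 follow-up editorial in The American Statistician went further, calling for the abandonment of “statistically significant” altogether (Wasserstein, Schirm, & Lazar, 2019). It was a striking position to appear in a journal of the field’s preeminent professional organization, although it was an editorial rather than a formal ASA policy statement.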

How did the replication crisis expose p-value overreliance?

The replication crisis in psychology, sparked by large-scale replication projects from around 2011, exposed how fragile many “significant” findings were. The Open Science Collaboration (2015) attempted to replicate 100 published psychology studies; only 36% of replications produced significant results (compared to 97% of the originals), and the average replication effect size was roughly half the originals’. Subsequent Many Labs projects found that some classic effects replicated robustly while others, including widely taught findings, failed consistently.

The pattern was clear: an incentive structure rewarding significance, combined with flexibility in analysis, produced a literature contaminated with false positives and inflated effects. Effect sizes taken seriously from the start would have flagged the fragile findings: a d of 0.10 in one underpowered study is far less credible than a d of 0.60 replicated across labs. The concern extends to psychometric work, where unreliable measures attenuate observed effects, so measurement reliability sets a ceiling on any effect-size estimate.

What should researchers and readers do instead?

The emerging consensus is not to abandon significance testing but to supplement it with better practices.

Report effect sizes with confidence intervals. An effect of d = 0.40, 95% CI [0.05, 0.75] tells a very different story from d = 0.40, 95% CI [0.35, 0.45]. The interval communicates both magnitude and precision; the point estimate alone communicates neither.
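
As a rough sketch of where such intervals come from, the code below uses a common large-sample approximation to the standard error of Cohen’s d. The function name and the sample sizes are assumptions; the sample sizes were picked only because, under this approximation, they reproduce the two intervals quoted above.

```python
from math import sqrt
from statistics import NormalDist

def d_confidence_interval(d, n_a, n_b, level=0.95):
    """Approximate confidence interval for Cohen's d.

    Uses the large-sample standard error
    sqrt((n_a + n_b) / (n_a * n_b) + d^2 / (2 * (n_a + n_b))).
    """
    se = sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se

# Same point estimate, very different precision.
for n in (64, 3135):
    low, high = d_confidence_interval(0.40, n, n)
    print(f"n = {n} per group: d = 0.40, 95% CI [{low:.2f}, {high:.2f}]")
```

Under these assumptions the wide interval corresponds to about 64 participants per group and the narrow one to more than 3,000 per group, which is the precision difference the point estimate hides.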

Conduct power analyses a priori. Calculate the sample size needed to detect a meaningful effect at adequate power (typically 80%) before collecting data — preventing both underpowered studies and massively overpowered studies that render trivial effects “significant” by sheer N.
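
A minimal sketch of an a priori power calculation for a two-group comparison, assuming a two-sided test, equal group sizes, and a normal approximation. The function name is hypothetical, and exact t-based calculations from dedicated power software give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect effect size d.

    Normal-approximation formula: n = 2 * (z_(1 - alpha/2) + z_power)^2 / d^2.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

# Cohen's small, medium, and large benchmarks for d.
for d in (0.20, 0.50, 0.80):
    print(f"d = {d:.2f}: about {n_per_group(d)} per group")
```

Because the required sample size scales with 1/d², halving the smallest effect worth detecting roughly quadruples the number of participants needed.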

Pre-register hypotheses and analysis plans. Committing to specific hypotheses and analytical decisions before seeing the data eliminates most opportunities for p-hacking.

Use meta-analysis to synthesize evidence. No single study is definitive. Pooled effect sizes across studies are far more reliable than any one estimate, and meta-analytic methods can help detect publication bias. The growth-mindset evidence, for example, settled around an effect size an order of magnitude smaller than the original popular claims only after competing meta-analyses were run.
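
The simplest form of pooling is a fixed-effect, inverse-variance weighted average, sketched below. The three studies are invented for illustration, the function name is hypothetical, and a fixed-effect model assumes every study estimates the same true effect; real meta-analyses often use random-effects models instead.

```python
from math import sqrt

def pooled_effect(effects, standard_errors):
    """Fixed-effect pooled estimate: weight each study by 1 / SE^2."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se

# Hypothetical studies: a small, noisy original and two larger follow-ups.
effects = [0.60, 0.25, 0.10]
standard_errors = [0.30, 0.15, 0.05]

pooled, pooled_se = pooled_effect(effects, standard_errors)
print(f"pooled d = {pooled:.2f} (SE = {pooled_se:.3f})")
```

In this made-up example the pooled estimate lands near d = 0.13, far closer to the large, precise study than to the small, striking one, because precise studies carry most of the weight.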

Interpret effect sizes in context. Given the domain, the cost of intervention, and the importance of the outcome, is this effect large enough to matter? A small effect can be revolutionary in one context and trivial in another.

The bottom line

The p-value answers a narrow question: is the effect likely to be non-zero? The effect size answers the one that actually matters: is it large enough to care about? Treat significance as a minimum threshold rather than a sufficient one, and always ask — alongside “is it significant?” — “how big is it, and does it matter?”

Frequently asked questions

Is a small p-value the same as a strong effect?

No. A small p-value indicates the data are unlikely under the null, but says nothing about effect magnitude. With a large enough sample, even a trivially small effect can produce p < .001. Always read the effect size before concluding that a “highly significant” finding is meaningful.

What’s the difference between statistical significance and practical significance?

Statistical significance asks whether an effect is reliably different from zero. Practical significance asks whether it is large enough to matter clinically, educationally, or economically. Large effects in small samples sometimes fail significance tests; trivial effects in huge samples often pass them.

What effect size counts as “small,” “medium,” or “large”?

The conventional benchmarks are Cohen’s (1988): d = 0.20 / 0.50 / 0.80, or r = .10 / .30 / .50. These are reference points, not rules. Cohen himself cautioned against mechanical use; whether a given effect is meaningful depends on the domain, the cost of intervention, and the importance of the outcome.

Why don’t researchers just stop using p-values?

Some statisticians argue for exactly that: the 2019 follow-up to the ASA statement called for retiring “statistically significant” entirely. Most working researchers take a hybrid approach, reporting p-values alongside effect sizes and confidence intervals and pre-registering their analyses. The goal is to break the dichotomous publish/don’t-publish rule that the .05 threshold enabled, not to ban hypothesis testing.

How does the replication crisis relate to p-values?

It exposed how easily a literature filled with significant p-values can still be wrong. When the Open Science Collaboration (2015) replicated 100 psychology studies, only 36% produced significant replications, and replication effect sizes averaged about half the originals’. Effect sizes proved more reliable than significance status as a guide to which findings would survive.

References

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. https://doi.org/10.1037/a0021524
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. https://doi.org/10.1037/0033-2909.86.3.638
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.” The American Statistician, 73(sup1), 1–19. https://doi.org/10.1080/00031305.2019.1583913

Xavier Jouve, Ph.D., Psychometrician

Xavier Jouve, Ph.D., is a psychometrician and quantitative psychologist specializing in cognitive ability measurement, item response theory, and test development. He is Head of Research at Cogn-IQ, where he has designed and validated seven cognitive assessment instruments — including the JCTI (inductive reasoning), JCCES (crystallized intelligence), IAW (vocabulary), JCFS (figurative sequences), JCWS (verbal reasoning), GIE (general knowledge), and WN (logical inference) — collectively normed on over 13,000 examinees. His work applies 2PL IRT modeling, computerized adaptive testing, and advanced composite scoring methods (including the modified Tellegen & Briggs Formula 4 with cubic correction) to produce research-grade cognitive measures available online. ORCID: 0009-0006-1283-045X
