Statistical Methods in Psychology

Effect Size vs. P-Value: Why Statistical Significance Isn’t Enough

Published: March 27, 2026

In 2011, Daryl Bem published a paper in a top psychology journal claiming to find evidence for precognition — the ability to perceive future events. The results were statistically significant (p < .05 across nine experiments). Yet the scientific community was almost universally skeptical. Why? Because the effect sizes were tiny, the methodology was questionable, and "significant" didn't mean what most people think it means. The Bem affair crystallized a growing crisis in psychology: the overreliance on p-values and the neglect of effect sizes.

Key Takeaway: A p-value tells you whether an effect is likely to be non-zero; an effect size tells you whether the effect is large enough to matter. The American Statistical Association’s 2016 statement formally warned against using p-values as the sole basis for scientific conclusions, and leading psychology journals now require effect size reporting alongside significance tests.

What does a p-value actually tell you?

Key Takeaway: A p-value is the probability of obtaining data at least as extreme as those observed, assuming the null hypothesis is true. It says nothing about how large, important, or replicable an effect is.

The p-value is perhaps the most misunderstood statistic in all of science. Formally, it is defined as:

The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.

Notice what the p-value does not tell you:

  • It is not the probability that the null hypothesis is true
  • It is not the probability that the finding is a fluke
  • It is not the probability that the alternative hypothesis is true
  • It does not tell you how large or important the effect is
  • It does not tell you whether the result will replicate

A p-value of .03 means: “If there truly were no effect, there would be a 3% chance of seeing data this extreme or more extreme.” That’s all. Yet researchers routinely interpret p < .05 as “the effect is real and important” and p > .05 as “there is no effect” — both incorrect inferences that have distorted decades of scientific literature.
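To make the definition concrete, here is a minimal simulation in Python (the two-group design and sample sizes are invented for illustration): when the null hypothesis is true, data “at least as extreme as” p ≤ .03 turn up in roughly 3% of experiments, which is all a p-value of .03 claims.

```python
# Minimal sketch: under a true null hypothesis, p-values are uniformly
# distributed, so p <= .03 occurs in about 3% of experiments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_per_group = 10_000, 30

p_values = []
for _ in range(n_experiments):
    # Both groups drawn from the same distribution: the null is true.
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
print(f"P(p <= .03 | null true): {np.mean(p_values <= 0.03):.3f}")  # ~0.03
print(f"P(p <  .05 | null true): {np.mean(p_values <  0.05):.3f}")  # ~0.05
```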

Why is the p < .05 threshold problematic?

Key Takeaway: The .05 threshold began as Ronald Fisher's rough rule of thumb, but it has hardened into the primary gatekeeper of publication, fueling the file drawer effect, p-hacking, inflated effect sizes, and dichotomous thinking.

The .05 threshold was popularized by Ronald Fisher in the 1920s as a rough rule of thumb — not as a rigid decision boundary. Yet it has calcified into the primary arbiter of what gets published, funded, and believed. This creates several systematic problems:

The file drawer effect: Studies with p > .05 are far less likely to be published, creating a literature that systematically overestimates effect sizes. Rosenthal (1979) called this the “file drawer problem” — for every published significant finding, an unknown number of non-significant replications languish unpublished.

P-hacking: Researchers can exploit the flexibility in data analysis — choosing which variables to include, which outliers to exclude, when to stop collecting data — to push borderline results below .05. Simmons, Nelson, and Simonsohn (2011) demonstrated that with common “researcher degrees of freedom,” false-positive rates can exceed 60%, far above the nominal 5%.
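A rough sketch of one such researcher degree of freedom, optional stopping, shows how checking the data repeatedly and halting as soon as p < .05 pushes the false-positive rate well beyond 5% even when there is no true effect. The peeking schedule and sample sizes below are illustrative assumptions, not the exact design used by Simmons et al.

```python
# Sketch of optional stopping: peek at the data every 10 participants per
# group and stop as soon as p < .05. The null is true throughout, yet the
# false-positive rate is inflated well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, max_n, peek_every = 5_000, 100, 10

false_positives = 0
for _ in range(n_studies):
    a = rng.normal(size=max_n)   # null is true: both groups identical
    b = rng.normal(size=max_n)
    for n in range(peek_every, max_n + 1, peek_every):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_positives += 1
            break

print(f"False-positive rate with optional stopping: {false_positives / n_studies:.2%}")
# Typically lands around 17-20%, not 5%.
```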

Inflated effect sizes: When only significant results are published, the published effects are systematically larger than the true effects — a phenomenon called the “winner’s curse.” This helps explain why replication studies consistently find smaller effects than original studies.

Dichotomous thinking: The .05 boundary creates an absurd situation where p = .049 is “significant” and publishable while p = .051 is “non-significant” and often unpublishable — despite the two values carrying essentially identical evidential weight.

What is an effect size and why does it matter?

Key Takeaway: An effect size quantifies the magnitude of an observed phenomenon, independent of sample size. While a p-value can be made significant simply by collecting more data (any non-zero effect will reach significance with a large enough sample), an effect size tells you whether the finding is large enough to be meaningful.

An effect size quantifies the magnitude of an observed phenomenon, independent of sample size. While a p-value can be made significant simply by collecting more data (any non-zero effect will reach significance with a large enough sample), an effect size tells you whether the finding is large enough to be meaningful.
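A short simulation makes the contrast concrete: with a fixed, trivially small true effect (d = 0.05, chosen arbitrarily for illustration), the p-value can be pushed below .05 simply by collecting more data, while Cohen's d barely moves.

```python
# Sketch: the p-value shrinks with sample size; the effect size does not.
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
for n in (50, 500, 5_000, 50_000):
    a = rng.normal(loc=0.05, scale=1.0, size=n)  # true effect: d = 0.05
    b = rng.normal(loc=0.00, scale=1.0, size=n)
    p = stats.ttest_ind(a, b).pvalue
    # d stays near 0.05 at every n; p typically crosses .05 once n is large enough.
    print(f"n per group = {n:>6}:  d = {cohens_d(a, b):+.3f},  p = {p:.4f}")
```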

The most commonly used effect sizes in psychological research include:

Effect Size        | What It Measures                              | Small | Medium | Large
Cohen’s d          | Standardized mean difference between groups   | 0.20  | 0.50   | 0.80
Pearson’s r        | Correlation between two variables             | .10   | .30    | .50
Eta-squared (η²)   | Proportion of variance explained (ANOVA)      | .01   | .06    | .14
Odds ratio         | Ratio of odds between groups                  | 1.5   | 2.5    | 4.3
R² (R-squared)     | Proportion of variance explained (regression) | .02   | .13    | .26

The benchmarks above come from Jacob Cohen’s (1988) widely cited guidelines. They provide useful reference points but should not be applied mechanically — the practical significance of an effect size depends entirely on context.
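For readers who need to move between these metrics, the textbook conversion formulas are easy to code. The helpers below are an illustrative sketch (the d-to-r conversion assumes two groups of roughly equal size), not functions taken from any particular library.

```python
# Hypothetical helpers for converting between effect-size metrics.
import math

def d_to_r(d: float) -> float:
    """Convert Cohen's d to a point-biserial r (equal group sizes assumed)."""
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r: float) -> float:
    """Convert a correlation r back to Cohen's d."""
    return 2 * r / math.sqrt(1 - r ** 2)

def eta_squared_to_f(eta_sq: float) -> float:
    """Convert eta-squared to Cohen's f, used in ANOVA power analysis."""
    return math.sqrt(eta_sq / (1 - eta_sq))

# Cohen's benchmarks line up across metrics only approximately:
print(d_to_r(0.50))            # ~0.24, not exactly the "medium" r of .30
print(r_to_d(0.30))            # ~0.63, a bit above the "medium" d of 0.50
print(eta_squared_to_f(0.06))  # ~0.25, Cohen's "medium" f
```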

When is a small effect size actually important?

Key Takeaway: Cohen himself warned against rigid use of his benchmarks. Whether an effect matters depends on context: the severity of the outcome, whether the effect accumulates over time, and how cheaply the intervention scales.

Cohen himself warned against rigid use of his benchmarks, and with good reason. Context determines whether an effect matters:

When the outcome is severe: A drug that reduces mortality risk by d = 0.10 might save thousands of lives when applied across millions of patients. The aspirin-heart attack finding — one of the most important medical discoveries of the 20th century — had an r of just .034.

When the effect accumulates: A small daily advantage compounds over time. The relationship between IQ and income (r ≈ .30–.40) may seem modest in any given year, but over a career it translates to hundreds of thousands of dollars in earning differences.

When the intervention is cheap and scalable: A tutoring program that raises test scores by d = 0.15 might not seem impressive, but if it costs $10 per student and reaches millions, the aggregate benefit is enormous.

Conversely, a large effect size isn’t always meaningful. If a 12-month intensive brain training program costing $5,000 improves working memory by d = 0.80 but shows no transfer to real-world tasks, the practical significance is near zero despite the impressive-sounding effect.

What did the ASA statement say about p-values?

Key Takeaway: In 2016, the American Statistical Association (ASA) took the extraordinary step of issuing a formal statement on p-values — the first time the organization had made a policy statement about a specific statistical practice.

In 2016, the American Statistical Association (ASA) took the extraordinary step of issuing a formal statement on p-values — the first time the organization had made a policy statement about a specific statistical practice. The statement outlined six key principles:

  1. P-values can indicate how incompatible the data are with a specified statistical model
  2. P-values do not measure the probability that the studied hypothesis is true
  3. Scientific conclusions should not be based only on whether a p-value passes a specific threshold
  4. Proper inference requires full reporting and transparency
  5. A p-value does not measure the size or importance of an effect
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

In 2019, the ASA followed up with editorial guidance in The American Statistician calling for the abandonment of the term “statistically significant” altogether — a dramatic statement from the field’s preeminent professional organization.

How did the replication crisis expose p-value overreliance?

Key Takeaway: Large-scale replication projects beginning around 2011 showed how fragile many "significant" findings were; only about a third of 100 replicated studies reached significance, and replication effect sizes averaged roughly half the originals.

The replication crisis in psychology — sparked by large-scale replication projects beginning around 2011 — exposed just how fragile many “significant” findings were:

The Open Science Collaboration (2015) attempted to replicate 100 published psychology studies. Only 36% of replications produced significant results (compared to 97% of the originals), and the average effect size in replications was half that of the original studies.

The Many Labs projects found that some classic effects replicated robustly across diverse samples while others — including several widely taught findings — failed consistently.

The pattern was clear: the scientific ecosystem’s incentive structure (publish significant results or perish) combined with the flexibility of statistical analysis had produced a literature contaminated with false positives and inflated effects. Effect sizes, if they had been taken seriously from the beginning, would have flagged many of these fragile findings — a d of 0.10 found once in an underpowered study is far less credible than a d of 0.60 replicated across multiple laboratories.

These concerns about measurement reliability extend to psychometric research as well, where the precision of our instruments directly affects the accuracy of our effect size estimates.

What should researchers and readers do instead?

Key Takeaway: The emerging consensus is not to abandon significance testing but to supplement it: report effect sizes with confidence intervals, run a priori power analyses, pre-register analysis plans, synthesize evidence through meta-analysis, consider Bayesian alternatives, and interpret effect sizes in context.

The emerging consensus is not to abandon significance testing, but to supplement it with better practices:

Always report effect sizes with confidence intervals. A confidence interval around an effect size communicates both the magnitude and the precision of the estimate. An effect of d = 0.40, 95% CI [0.05, 0.75] tells a very different story from d = 0.40, 95% CI [0.35, 0.45].
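As a sketch of what such reporting can look like in practice, the following computes Cohen's d with a 95% confidence interval using the standard large-sample approximation to the variance of d; the data and group sizes are invented for illustration.

```python
# Sketch: Cohen's d with a 95% CI via the large-sample variance approximation.
import numpy as np
from scipy import stats

def cohens_d_ci(a, b, confidence=0.95):
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    d = (a.mean() - b.mean()) / np.sqrt(pooled_var)
    # Approximate standard error of d (large-sample formula)
    se = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return d, d - z * se, d + z * se

rng = np.random.default_rng(7)
treatment = rng.normal(loc=0.4, scale=1.0, size=40)
control = rng.normal(loc=0.0, scale=1.0, size=40)
d, lo, hi = cohens_d_ci(treatment, control)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")  # wide interval with only 40 per group
```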

Conduct power analyses a priori. Calculate the sample size needed to detect a meaningful effect size with adequate power (typically 80%). This prevents both underpowered studies (which produce unreliable estimates) and massively overpowered studies (which find trivial effects “significant”).
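In Python, an a priori power analysis for a two-group design can be run with statsmodels; the smallest effect size of interest (d = 0.40 below) is an assumption the researcher must justify on substantive grounds, not something the data supply.

```python
# Sketch of an a priori power analysis for an independent-samples t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.40, alpha=0.05, power=0.80,
                                   alternative='two-sided')
print(f"Participants needed per group: {n_per_group:.0f}")  # about 100 per group
```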

Pre-register hypotheses and analysis plans. Committing to specific hypotheses and analytical decisions before seeing the data eliminates most opportunities for p-hacking.

Use meta-analysis to synthesize evidence. No single study is definitive. Meta-analyses that aggregate effect sizes across studies provide far more reliable estimates and can detect publication bias.
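The core of a fixed-effect meta-analysis is just inverse-variance weighting, as the toy example below illustrates; the five study effect sizes and variances are invented for demonstration.

```python
# Sketch of fixed-effect pooling: weight each study's d by the inverse of its variance.
import numpy as np

d = np.array([0.55, 0.30, 0.42, 0.18, 0.36])    # per-study Cohen's d (made up)
var = np.array([0.04, 0.02, 0.05, 0.01, 0.03])  # their sampling variances (made up)

weights = 1 / var
d_pooled = np.sum(weights * d) / np.sum(weights)
se_pooled = np.sqrt(1 / np.sum(weights))
print(f"Pooled d = {d_pooled:.2f}, "
      f"95% CI [{d_pooled - 1.96 * se_pooled:.2f}, {d_pooled + 1.96 * se_pooled:.2f}]")
```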

Consider Bayesian alternatives. Bayes factors quantify the relative evidence for competing hypotheses and allow researchers to distinguish “evidence of absence” from “absence of evidence” — a distinction p-values cannot make.
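As a rough illustration, the sketch below uses the BIC approximation to the Bayes factor (Wagenmakers, 2007) rather than the default JZS Bayes factor most software reports; it shows how two groups with no true difference can yield positive evidence for the null, something a non-significant p-value can never provide.

```python
# Sketch: approximate BF01 (evidence for the null) via the BIC approximation.
import numpy as np

def bic_bayes_factor_01(a, b):
    """Approximate BF01 for a two-group mean comparison (Wagenmakers, 2007)."""
    y = np.concatenate([a, b])
    n = len(y)
    sse_null = np.sum((y - y.mean()) ** 2)                               # H0: one common mean
    sse_alt = np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)  # H1: two group means
    # BIC difference = n * ln(SSE1/SSE0) + (extra parameters) * ln(n); one extra parameter here.
    delta_bic = n * np.log(sse_alt / sse_null) + np.log(n)
    return np.exp(delta_bic / 2)

rng = np.random.default_rng(3)
a = rng.normal(size=200)   # two groups drawn from the same distribution
b = rng.normal(size=200)
bf01 = bic_bayes_factor_01(a, b)
print(f"BF01 ~ {bf01:.1f}: the data are roughly {bf01:.0f}x more likely under the null")
```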

Interpret effect sizes in context. Ask: Given the domain, the intervention, and the outcome, is this effect large enough to matter? A small effect might be revolutionary in one context and trivial in another.

The bottom line

Key Takeaway: The p-value answers a useful but narrow question: is the effect likely to be non-zero? The effect size answers the question that actually matters: is the effect large enough to care about? Psychology's replication crisis has demonstrated, painfully, what happens when an entire field bases its conclusions on the first question while ignoring the second.

The p-value answers a useful but narrow question: is the effect likely to be non-zero? The effect size answers the question that actually matters: is the effect large enough to care about? Psychology’s replication crisis has demonstrated, painfully, what happens when an entire field bases its conclusions on the first question while ignoring the second. The solution isn’t to abandon statistical testing, but to treat significance as a minimum threshold rather than a sufficient one — and to always ask, alongside “is it significant?”, the more important question: “how big is it, and does it matter?”

People Also Ask

Does Music Training Increase IQ? What the Research Actually Shows

Few claims in popular science are as persistent as the idea that music makes you smarter. From the "Mozart Effect" craze of the 1990s — which sent pregnant women rushing to buy classical CDs — to today's parents enrolling toddlers in Suzuki violin, the belief that music training enhances general intelligence has deep cultural roots. But what does the research actually show? The answer is nuanced, sometimes contradictory, and more interesting than the headlines suggest.

Read more →
Does Birth Order Affect Intelligence? What Large-Scale Studies Reveal

The belief that firstborn children are smarter than their younger siblings is one of the most persistent ideas in folk psychology. Parents joke about it, media repeats it, and surprisingly, the research largely supports it — though the effect is far smaller than most people assume and the reasons behind it are still debated.

Read more →
The Flynn Effect: Are Humans Getting Smarter — or Dumber?

In 1984, political scientist James Flynn published a finding that would reshape how we think about intelligence: IQ scores had been rising steadily across the developed world for as long as records existed. The gains averaged roughly 3 points per decade — meaning the average person today would score in the gifted range on a test normed 70 years ago. But the story doesn't end there. Recent evidence suggests the trend may be reversing. Are humans getting smarter, getting dumber, or is the question itself misleading?

Read more →
Can You Actually Increase Your IQ? What the Research Shows

Few questions in psychology generate as much debate as whether intelligence is fixed or malleable. The idea that IQ is set in stone — hardwired by genetics and sealed by early childhood — persists in popular culture, but the scientific picture is considerably more nuanced. Decades of research show that IQ scores can and do change, though the mechanisms, magnitude, and permanence of those changes vary widely. Here is what the evidence actually supports.

Read more →

📋 Cite This Article

Jouve, X. (2026, March 27). Effect Size vs. P-Value: Why Statistical Significance Isn’t Enough. PsychoLogic. https://www.psychologic.online/2026/03/27/effect-size-vs-p-value-why-statistical-significance-isnt-enough/