Statistical Methods in Psychology

Effect Size vs. P-Value: Why Statistical Significance Isn’t Enough

Published: March 27, 2026

In 2011, Daryl Bem published a paper in a top psychology journal claiming to find evidence for precognition — the ability to perceive future events. The results were statistically significant (p < .05 across nine experiments). Yet the scientific community was almost universally skeptical. Why? Because the effect sizes were tiny, the methodology was questionable, and "significant" didn't mean what most people think it means. The Bem affair crystallized a growing crisis in psychology: the overreliance on p-values and the neglect of effect sizes.

Key Takeaway: A p-value tells you whether an effect is likely to be non-zero; an effect size tells you whether the effect is large enough to matter. The American Statistical Association’s 2016 statement formally warned against using p-values as the sole basis for scientific conclusions, and leading psychology journals now require effect size reporting alongside significance tests.

What does a p-value actually tell you?

Key Takeaway: A p-value is the probability of obtaining data at least as extreme as those observed, assuming the null hypothesis is true. It says nothing about how large, important, or replicable an effect is.

The p-value is perhaps the most misunderstood statistic in all of science. Formally, it is defined as:

The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.

Notice what the p-value does not tell you:

  • It is not the probability that the null hypothesis is true
  • It is not the probability that the finding is a fluke
  • It is not the probability that the alternative hypothesis is true
  • It does not tell you how large or important the effect is
  • It does not tell you whether the result will replicate

A p-value of .03 means: “If there truly were no effect, there would be a 3% chance of seeing data this extreme or more extreme.” That’s all. Yet researchers routinely interpret p < .05 as “the effect is real and important” and p > .05 as “there is no effect” — both incorrect inferences that have distorted decades of scientific literature.
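To make the definition concrete, here is a minimal simulation in Python (the two-group design and sample sizes are invented for illustration): when the null hypothesis is true, data “at least as extreme as” p ≤ .03 turn up in roughly 3% of experiments, which is all a p-value of .03 claims.

```python
# Minimal sketch: under a true null hypothesis, p-values are uniformly
# distributed, so p <= .03 occurs in about 3% of experiments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_per_group = 10_000, 30

p_values = []
for _ in range(n_experiments):
    # Both groups drawn from the same distribution: the null is true.
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
print(f"P(p <= .03 | null true): {np.mean(p_values <= 0.03):.3f}")  # ~0.03
print(f"P(p <  .05 | null true): {np.mean(p_values <  0.05):.3f}")  # ~0.05
```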

Why is the p < .05 threshold problematic?

Key Takeaway: The .05 threshold began as Ronald Fisher's rough rule of thumb, but it has hardened into the primary gatekeeper of publication, fueling the file drawer effect, p-hacking, inflated effect sizes, and dichotomous thinking.

The .05 threshold was popularized by Ronald Fisher in the 1920s as a rough rule of thumb — not as a rigid decision boundary. Yet it has calcified into the primary arbiter of what gets published, funded, and believed. This creates several systematic problems:

The file drawer effect: Studies with p > .05 are far less likely to be published, creating a literature that systematically overestimates effect sizes. Rosenthal (1979) called this the “file drawer problem” — for every published significant finding, an unknown number of non-significant replications languish unpublished.

P-hacking: Researchers can exploit the flexibility in data analysis — choosing which variables to include, which outliers to exclude, when to stop collecting data — to push borderline results below .05. Simmons, Nelson, and Simonsohn (2011) demonstrated that with common “researcher degrees of freedom,” false-positive rates can exceed 60%, far above the nominal 5%.
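A rough sketch of one such researcher degree of freedom, optional stopping, shows how checking the data repeatedly and halting as soon as p < .05 pushes the false-positive rate well beyond 5% even when there is no true effect. The peeking schedule and sample sizes below are illustrative assumptions, not the exact design used by Simmons et al.

```python
# Sketch of optional stopping: peek at the data every 10 participants per
# group and stop as soon as p < .05. The null is true throughout, yet the
# false-positive rate is inflated well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, max_n, peek_every = 5_000, 100, 10

false_positives = 0
for _ in range(n_studies):
    a = rng.normal(size=max_n)   # null is true: both groups identical
    b = rng.normal(size=max_n)
    for n in range(peek_every, max_n + 1, peek_every):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_positives += 1
            break

print(f"False-positive rate with optional stopping: {false_positives / n_studies:.2%}")
# Typically lands around 17-20%, not 5%.
```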

Inflated effect sizes: When only significant results are published, the published effects are systematically larger than the true effects — a phenomenon called the “winner’s curse.” This helps explain why replication studies consistently find smaller effects than original studies.

Dichotomous thinking: The .05 boundary creates an absurd situation where p = .049 is “significant” and publishable while p = .051 is “non-significant” and often unpublishable — despite the two values carrying essentially identical evidential weight.

What is an effect size and why does it matter?

Key Takeaway: An effect size quantifies the magnitude of an observed phenomenon, independent of sample size. While a p-value can be made significant simply by collecting more data (any non-zero effect will reach significance with a large enough sample), an effect size tells you whether the finding is large enough to be meaningful.

An effect size quantifies the magnitude of an observed phenomenon, independent of sample size. While a p-value can be made significant simply by collecting more data (any non-zero effect will reach significance with a large enough sample), an effect size tells you whether the finding is large enough to be meaningful.
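A short simulation makes the contrast concrete: with a fixed, trivially small true effect (d = 0.05, chosen arbitrarily for illustration), the p-value can be pushed below .05 simply by collecting more data, while Cohen's d barely moves.

```python
# Sketch: the p-value shrinks with sample size; the effect size does not.
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
for n in (50, 500, 5_000, 50_000):
    a = rng.normal(loc=0.05, scale=1.0, size=n)  # true effect: d = 0.05
    b = rng.normal(loc=0.00, scale=1.0, size=n)
    p = stats.ttest_ind(a, b).pvalue
    # d stays near 0.05 at every n; p typically crosses .05 once n is large enough.
    print(f"n per group = {n:>6}:  d = {cohens_d(a, b):+.3f},  p = {p:.4f}")
```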

The most commonly used effect sizes in psychological research include:

Effect Size        | What It Measures                              | Small | Medium | Large
Cohen’s d          | Standardized mean difference between groups   | 0.20  | 0.50   | 0.80
Pearson’s r        | Correlation between two variables             | .10   | .30    | .50
Eta-squared (η²)   | Proportion of variance explained (ANOVA)      | .01   | .06    | .14
Odds ratio         | Ratio of odds between groups                  | 1.5   | 2.5    | 4.3
R² (R-squared)     | Proportion of variance explained (regression) | .02   | .13    | .26

The benchmarks above come from Jacob Cohen’s (1988) widely cited guidelines. They provide useful reference points but should not be applied mechanically — the practical significance of an effect size depends entirely on context.
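For readers who need to move between these metrics, the textbook conversion formulas are easy to code. The helpers below are an illustrative sketch (the d-to-r conversion assumes two groups of roughly equal size), not functions taken from any particular library.

```python
# Hypothetical helpers for converting between effect-size metrics.
import math

def d_to_r(d: float) -> float:
    """Convert Cohen's d to a point-biserial r (equal group sizes assumed)."""
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r: float) -> float:
    """Convert a correlation r back to Cohen's d."""
    return 2 * r / math.sqrt(1 - r ** 2)

def eta_squared_to_f(eta_sq: float) -> float:
    """Convert eta-squared to Cohen's f, used in ANOVA power analysis."""
    return math.sqrt(eta_sq / (1 - eta_sq))

# Cohen's benchmarks line up across metrics only approximately:
print(d_to_r(0.50))            # ~0.24, not exactly the "medium" r of .30
print(r_to_d(0.30))            # ~0.63, a bit above the "medium" d of 0.50
print(eta_squared_to_f(0.06))  # ~0.25, Cohen's "medium" f
```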

When is a small effect size actually important?

Key Takeaway: Cohen himself warned against rigid use of his benchmarks. Whether an effect matters depends on context: the severity of the outcome, whether the effect accumulates over time, and how cheaply the intervention scales.

Cohen himself warned against rigid use of his benchmarks, and with good reason. Context determines whether an effect matters:

When the outcome is severe: A drug that reduces mortality risk by d = 0.10 might save thousands of lives when applied across millions of patients. The aspirin-heart attack finding — one of the most important medical discoveries of the 20th century — had an r of just .034.

When the effect accumulates: A small daily advantage compounds over time. The relationship between IQ and income (r ≈ .30–.40) may seem modest in any given year, but over a career it translates to hundreds of thousands of dollars in earning differences.

When the intervention is cheap and scalable: A tutoring program that raises test scores by d = 0.15 might not seem impressive, but if it costs $10 per student and reaches millions, the aggregate benefit is enormous.

Conversely, a large effect size isn’t always meaningful. If a 12-month intensive brain training program costing $5,000 improves working memory by d = 0.80 but shows no transfer to real-world tasks, the practical significance is near zero despite the impressive-sounding effect.

What did the ASA statement say about p-values?

Key Takeaway: In 2016, the American Statistical Association (ASA) took the extraordinary step of issuing a formal statement on p-values — the first time the organization had made a policy statement about a specific statistical practice.

In 2016, the American Statistical Association (ASA) took the extraordinary step of issuing a formal statement on p-values — the first time the organization had made a policy statement about a specific statistical practice. The statement outlined six key principles:

  1. P-values can indicate how incompatible the data are with a specified statistical model
  2. P-values do not measure the probability that the studied hypothesis is true
  3. Scientific conclusions should not be based only on whether a p-value passes a specific threshold
  4. Proper inference requires full reporting and transparency
  5. A p-value does not measure the size or importance of an effect
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

In 2019, the ASA followed up with editorial guidance in The American Statistician calling for the abandonment of the term “statistically significant” altogether — a dramatic statement from the field’s preeminent professional organization.

How did the replication crisis expose p-value overreliance?

Key Takeaway: Large-scale replication projects beginning around 2011 showed how fragile many "significant" findings were; only about a third of 100 replicated studies reached significance, and replication effect sizes averaged roughly half the originals.

The replication crisis in psychology — sparked by large-scale replication projects beginning around 2011 — exposed just how fragile many “significant” findings were:

The Open Science Collaboration (2015) attempted to replicate 100 published psychology studies. Only 36% of replications produced significant results (compared to 97% of the originals), and the average effect size in replications was half that of the original studies.

The Many Labs projects found that some classic effects replicated robustly across diverse samples while others — including several widely taught findings — failed consistently.

The pattern was clear: the scientific ecosystem’s incentive structure (publish significant results or perish) combined with the flexibility of statistical analysis had produced a literature contaminated with false positives and inflated effects. Effect sizes, if they had been taken seriously from the beginning, would have flagged many of these fragile findings — a d of 0.10 found once in an underpowered study is far less credible than a d of 0.60 replicated across multiple laboratories.

These concerns about measurement reliability extend to psychometric research as well, where the precision of our instruments directly affects the accuracy of our effect size estimates.

What should researchers and readers do instead?

Key Takeaway: The emerging consensus is not to abandon significance testing but to supplement it: report effect sizes with confidence intervals, run a priori power analyses, pre-register analysis plans, synthesize evidence through meta-analysis, consider Bayesian alternatives, and interpret effect sizes in context.

The emerging consensus is not to abandon significance testing, but to supplement it with better practices:

Always report effect sizes with confidence intervals. A confidence interval around an effect size communicates both the magnitude and the precision of the estimate. An effect of d = 0.40, 95% CI [0.05, 0.75] tells a very different story from d = 0.40, 95% CI [0.35, 0.45].
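As a sketch of what such reporting can look like in practice, the following computes Cohen's d with a 95% confidence interval using the standard large-sample approximation to the variance of d; the data and group sizes are invented for illustration.

```python
# Sketch: Cohen's d with a 95% CI via the large-sample variance approximation.
import numpy as np
from scipy import stats

def cohens_d_ci(a, b, confidence=0.95):
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    d = (a.mean() - b.mean()) / np.sqrt(pooled_var)
    # Approximate standard error of d (large-sample formula)
    se = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return d, d - z * se, d + z * se

rng = np.random.default_rng(7)
treatment = rng.normal(loc=0.4, scale=1.0, size=40)
control = rng.normal(loc=0.0, scale=1.0, size=40)
d, lo, hi = cohens_d_ci(treatment, control)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")  # wide interval with only 40 per group
```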

Conduct power analyses a priori. Calculate the sample size needed to detect a meaningful effect size with adequate power (typically 80%). This prevents both underpowered studies (which produce unreliable estimates) and massively overpowered studies (which find trivial effects “significant”).
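In Python, an a priori power analysis for a two-group design can be run with statsmodels; the smallest effect size of interest (d = 0.40 below) is an assumption the researcher must justify on substantive grounds, not something the data supply.

```python
# Sketch of an a priori power analysis for an independent-samples t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.40, alpha=0.05, power=0.80,
                                   alternative='two-sided')
print(f"Participants needed per group: {n_per_group:.0f}")  # about 100 per group
```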

Pre-register hypotheses and analysis plans. Committing to specific hypotheses and analytical decisions before seeing the data eliminates most opportunities for p-hacking.

Use meta-analysis to synthesize evidence. No single study is definitive. Meta-analyses that aggregate effect sizes across studies provide far more reliable estimates and can detect publication bias.
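The core of a fixed-effect meta-analysis is just inverse-variance weighting, as the toy example below illustrates; the five study effect sizes and variances are invented for demonstration.

```python
# Sketch of fixed-effect pooling: weight each study's d by the inverse of its variance.
import numpy as np

d = np.array([0.55, 0.30, 0.42, 0.18, 0.36])    # per-study Cohen's d (made up)
var = np.array([0.04, 0.02, 0.05, 0.01, 0.03])  # their sampling variances (made up)

weights = 1 / var
d_pooled = np.sum(weights * d) / np.sum(weights)
se_pooled = np.sqrt(1 / np.sum(weights))
print(f"Pooled d = {d_pooled:.2f}, "
      f"95% CI [{d_pooled - 1.96 * se_pooled:.2f}, {d_pooled + 1.96 * se_pooled:.2f}]")
```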

Consider Bayesian alternatives. Bayes factors quantify the relative evidence for competing hypotheses and allow researchers to distinguish “evidence of absence” from “absence of evidence” — a distinction p-values cannot make.
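As a rough illustration, the sketch below uses the BIC approximation to the Bayes factor (Wagenmakers, 2007) rather than the default JZS Bayes factor most software reports; it shows how two groups with no true difference can yield positive evidence for the null, something a non-significant p-value can never provide.

```python
# Sketch: approximate BF01 (evidence for the null) via the BIC approximation.
import numpy as np

def bic_bayes_factor_01(a, b):
    """Approximate BF01 for a two-group mean comparison (Wagenmakers, 2007)."""
    y = np.concatenate([a, b])
    n = len(y)
    sse_null = np.sum((y - y.mean()) ** 2)                               # H0: one common mean
    sse_alt = np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)  # H1: two group means
    # BIC difference = n * ln(SSE1/SSE0) + (extra parameters) * ln(n); one extra parameter here.
    delta_bic = n * np.log(sse_alt / sse_null) + np.log(n)
    return np.exp(delta_bic / 2)

rng = np.random.default_rng(3)
a = rng.normal(size=200)   # two groups drawn from the same distribution
b = rng.normal(size=200)
bf01 = bic_bayes_factor_01(a, b)
print(f"BF01 ~ {bf01:.1f}: the data are roughly {bf01:.0f}x more likely under the null")
```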

Interpret effect sizes in context. Ask: Given the domain, the intervention, and the outcome, is this effect large enough to matter? A small effect might be revolutionary in one context and trivial in another.

The bottom line

Key Takeaway: The p-value answers a useful but narrow question: is the effect likely to be non-zero? The effect size answers the question that actually matters: is the effect large enough to care about? Psychology's replication crisis has demonstrated, painfully, what happens when an entire field bases its conclusions on the first question while ignoring the second.

The p-value answers a useful but narrow question: is the effect likely to be non-zero? The effect size answers the question that actually matters: is the effect large enough to care about? Psychology’s replication crisis has demonstrated, painfully, what happens when an entire field bases its conclusions on the first question while ignoring the second. The solution isn’t to abandon statistical testing, but to treat significance as a minimum threshold rather than a sufficient one — and to always ask, alongside “is it significant?”, the more important question: “how big is it, and does it matter?”

People Also Ask

Does Music Training Increase IQ? What the Research Actually Shows

Few claims in popular science are as persistent as the idea that music makes you smarter. From the "Mozart Effect" craze of the 1990s — which sent pregnant women rushing to buy classical CDs — to today's parents enrolling toddlers in Suzuki violin, the belief that music training enhances general intelligence has deep cultural roots. But what does the research actually show? The answer is nuanced, sometimes contradictory, and more interesting than the headlines suggest.

Read more →
Does Birth Order Affect Intelligence? What Large-Scale Studies Reveal

The belief that firstborn children are smarter than their younger siblings is one of the most persistent ideas in folk psychology. Parents joke about it, media repeats it, and surprisingly, the research largely supports it — though the effect is far smaller than most people assume and the reasons behind it are still debated.

Read more →
The Flynn Effect: Are Humans Getting Smarter — or Dumber?

In 1984, political scientist James Flynn published a finding that would reshape how we think about intelligence: IQ scores had been rising steadily across the developed world for as long as records existed. The gains averaged roughly 3 points per decade — meaning the average person today would score in the gifted range on a test normed 70 years ago. But the story doesn't end there. Recent evidence suggests the trend may be reversing. Are humans getting smarter, getting dumber, or is the question itself misleading?

Read more →
Can You Actually Increase Your IQ? What the Research Shows

Few questions in psychology generate as much debate as whether intelligence is fixed or malleable. The idea that IQ is set in stone — hardwired by genetics and sealed by early childhood — persists in popular culture, but the scientific picture is considerably more nuanced. Decades of research show that IQ scores can and do change, though the mechanisms, magnitude, and permanence of those changes vary widely. Here is what the evidence actually supports.

Read more →

📋 Cite This Article

Jouve, X. (2026, March 27). Effect Size vs. P-Value: Why Statistical Significance Isn’t Enough. PsychoLogic. https://www.psychologic.online/2026/03/27/effect-size-vs-p-value-why-statistical-significance-isnt-enough/