Addressing the Divide Between Psychology and Psychometrics
Published: December 19, 2024

In 2024, Psychometrika ran an unusual exchange. Three senior psychometricians — Klaas Sijtsma, Jules Ellis, and Denny Borsboom — published a focus article arguing that the humble sum score, the simple total of right-or-wrong answers on a test, is psychometrics’ greatest accomplishment and should remain central to practice. Two commentaries, one by Daniel McNeish and one by Robert Mislevy, pushed back. The authors then published a rejoinder. Read together, the four papers describe a real fault line in measurement: a growing gap between what psychometricians can do and what psychologists actually do, and disagreement about which side needs to move first.

The exchange matters beyond psychometrics journals because it directly affects how researchers, clinicians, and educators interpret almost every test result they encounter — depression inventories, IQ tests, school assessments, personality questionnaires.

The core claim: sum scores are mathematically defensible

The sum score is what almost everyone outside specialist measurement circles uses. Add up the right answers; that’s the score. For decades, methodological reformers have argued that this is a primitive practice that should be replaced by latent variable models — Item Response Theory (IRT) and structural equation modeling — that more accurately recover the underlying construct.

Sijtsma, Ellis, and Borsboom’s 2024 focus article makes three technical points against this dismissal:

  • Stochastic ordering. Across a wide variety of standard IRT models, the sum score stochastically orders the latent variable — meaning higher sum scores correspond, on average, to higher levels of the underlying trait. The sum score is not a competitor to IRT; it is a quantity that IRT itself certifies as ordinally informative.
  • Reliability is not opaque. Classical Test Theory, often dismissed as obsolete, provides a family of lower bounds on reliability — including coefficient alpha and several superior alternatives — that under reasonable conditions are close to the true reliability of a test.
  • Value comes from prediction. The ultimate test of a score is whether it predicts practically relevant outcomes. Sum scores frequently do this nearly as well as scores derived from more sophisticated models, while remaining transparent. (All three points are illustrated in the simulation sketch after this list.)
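
The following sketch illustrates all three points with data simulated from a two-parameter logistic (2PL) IRT model. It is written for this article, not taken from any of the papers, and the sample size, item parameters, and outcome model are arbitrary choices.

```python
# A minimal simulation sketch (assumptions: 2PL model, arbitrary parameters),
# not code from any of the Psychometrika papers.
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items = 5000, 20

theta = rng.normal(0, 1, n_persons)    # latent trait
a = rng.uniform(0.8, 2.0, n_items)     # item discriminations
b = rng.normal(0, 1, n_items)          # item difficulties

# 2PL response probabilities, then dichotomous (right/wrong) responses
p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
x = (rng.uniform(size=p.shape) < p).astype(int)

sum_score = x.sum(axis=1)

# 1) Stochastic ordering: mean latent trait rises with the sum score
for s in range(0, n_items + 1, 5):
    group = sum_score == s
    if group.any():
        print(f"sum score {s:2d}: mean theta = {theta[group].mean():+.2f}")

# 2) Coefficient alpha, a CTT lower bound on reliability:
#    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
k = n_items
alpha = k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / sum_score.var(ddof=1))
print(f"coefficient alpha = {alpha:.3f}")

# 3) Prediction: an outcome driven by the trait is predicted nearly as well
#    by the transparent sum score as by the latent trait itself
outcome = theta + rng.normal(0, 1, n_persons)
print(f"r(outcome, theta)     = {np.corrcoef(outcome, theta)[0, 1]:.3f}")
print(f"r(outcome, sum score) = {np.corrcoef(outcome, sum_score)[0, 1]:.3f}")
```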

The position is not that latent variable models are useless — the authors are explicit that they have unique uses, especially in test equating and computerized adaptive testing — but that displacing the sum score in routine practice is a methodological luxury most empirical research does not need.

The McNeish commentary: practical limits of sum scoring

McNeish’s 2024 commentary largely agrees with the mathematical content of the Sijtsma et al. article and then asks a different question: are sum scores actually positioned to improve psychometric practice in psychology, education, and adjacent fields? He raises three areas of concern:

  • Likert and ordinal response scales. Sum scores treat ordinal categories as if they had equal numerical spacing. This is mathematically questionable when the response options run from “strongly disagree” to “strongly agree” rather than right/wrong; the sketch after this list illustrates the issue.
  • Multidimensional constructs. Many psychological scales tap multiple correlated facets. A sum score collapses this structure, sometimes obscuring meaningful subscale differences.
  • Moderated and heterogeneous associations. When the relationship between a construct and an outcome differs across groups, simple rank ordering by sum score may not be enough to characterize that heterogeneity. Latent variable models offer tools for representing measurement non-invariance that sum scores do not.
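
A minimal sketch of the first concern, using randomly generated responses and one hypothetical alternative spacing of the five categories (nothing here comes from McNeish’s commentary itself): summing fixes the spacing between response options, and a different but equally defensible spacing preserves rank order far better than it preserves score distances.

```python
# Minimal sketch of the ordinal-spacing issue (hypothetical 5-point Likert
# data and an arbitrary alternative category spacing).
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 1000, 8

# Responses coded 0..4 ("strongly disagree" .. "strongly agree")
responses = rng.integers(0, 5, size=(n_persons, n_items))

equal = np.array([0, 1, 2, 3, 4])          # what sum scoring assumes
unequal = np.array([0, 0.5, 2, 3.5, 4])    # a hypothetical alternative

score_equal = equal[responses].sum(axis=1)
score_unequal = unequal[responses].sum(axis=1)

# The two scorings correlate highly, so rank order is largely preserved...
print(f"correlation = {np.corrcoef(score_equal, score_unequal)[0, 1]:.3f}")

# ...but distances between particular respondents can shift, which matters
# whenever sum scores are treated as interval-level quantities.
i, j = 0, 1
print(f"gap, equal spacing:   {score_equal[i] - score_equal[j]}")
print(f"gap, unequal spacing: {score_unequal[i] - score_unequal[j]:.1f}")
```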

McNeish’s framing is pragmatic: he agrees with the Sijtsma et al. theory and questions whether the sum score, as deployed in practice, is doing the work that the theory permits.

The Mislevy commentary: evidentiary reasoning

Mislevy’s commentary takes a different tack. Writing from the perspective of evidentiary reasoning — a framework that treats test scores as evidence for claims about latent attributes — he questions whether the sum score’s success reflects genuine psychometric accomplishment or a kind of intuitive folk theory that happens to work most of the time. The provocative title — “Are Sum Scores a Great Accomplishment of Psychometrics or Intuitive Test Theory?” — sets up the question directly. His comments are short and concentrated, but the implicit challenge is pointed: calling sum scores psychometrics’ greatest accomplishment may credit the discipline for an artifact whose practical robustness emerged largely without psychometric input.

The rejoinder: the real problem is the gap

Sijtsma, Ellis, and Borsboom’s rejoinder concedes individual technical points where appropriate and then re-frames the disagreement around a broader concern: the growing gap between psychology and psychometrics. Several themes:

  • Psychometrics outreach is underdeveloped. Most empirical psychologists have only superficial training in measurement, and psychometricians have not done enough to communicate their methods in usable form.
  • Different methods often produce similar results. Where they disagree, the disagreement frequently traces to an under-specified theory of the attribute being measured, not to the choice of estimator.
  • Sum scores serve communication. Test users, clinicians, and clients understand and can act on a sum score in ways that latent variable estimates rarely permit. This communication function is not an embarrassment to be eliminated; it is part of what makes a score useful.
  • Latent variables shine in advanced applications. Equating across test forms, computerized adaptive testing, and certain forms of differential item functioning analysis genuinely require latent variable models. The argument is for appropriate matching of method to use case, not blanket replacement.
  • Decisions are usually coarse. Most consequential test-based decisions are binary (refer or not, intervene or not, classify above or below a cutoff) or use a small number of categories. Granular precision below that level adds little; the sketch after this list makes the point concrete.
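
A sketch of the coarse-decision point, again with simulated 2PL data and arbitrary parameters: when the decision is a binary cutoff, the sum score and a model-based expected-a-posteriori (EAP) estimate of the latent trait flag nearly the same people.

```python
# Minimal sketch (simulated 2PL data, known item parameters, arbitrary
# values): binary cutoff decisions from sum scores vs. EAP trait estimates.
import numpy as np

rng = np.random.default_rng(2)
n_persons, n_items = 2000, 20
theta = rng.normal(0, 1, n_persons)
a = rng.uniform(0.8, 2.0, n_items)
b = rng.normal(0, 1, n_items)
p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
x = (rng.uniform(size=p.shape) < p).astype(int)

sum_score = x.sum(axis=1)

# EAP estimate of theta on a quadrature grid (standard normal prior)
grid = np.linspace(-4, 4, 81)
pg = 1 / (1 + np.exp(-a * (grid[:, None] - b)))          # (grid, items)
loglik = x @ np.log(pg).T + (1 - x) @ np.log(1 - pg).T   # (persons, grid)
post = np.exp(loglik) * np.exp(-grid ** 2 / 2)           # unnormalized posterior
eap = (post * grid).sum(axis=1) / post.sum(axis=1)

# Flag the top 20% under each score and compare the classifications
q = 0.80
flag_sum = sum_score >= np.quantile(sum_score, q)
flag_eap = eap >= np.quantile(eap, q)
print(f"classification agreement: {(flag_sum == flag_eap).mean():.1%}")
```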

The rejoinder closes with the position that psychology and psychometrics need to work together — that neither side can solve the problem alone, and that continued insulation between them will keep producing a literature in which sophisticated measurement tools coexist with widespread questionable measurement practices.

Why this exchange matters: the measurement crisis context

The Sijtsma–McNeish–Mislevy exchange is happening in the broader context of what Flake and Fried (2020) called the “measurement schmeasurement” problem — a documented pattern of questionable measurement practices in published psychology research. Flake and Fried describe practices including failure to report measurement reliability, undocumented modifications to validated scales, ad hoc subscale construction, and lack of transparency about how scores were computed. They argue these practices are common, hide a substantial source of researcher degrees of freedom, and pose a serious threat to the cumulative validity of psychological knowledge.

This is the context in which the sum score debate is consequential. If empirical psychology is using validated scales loosely, modifying them informally, and reporting only minimal psychometric information, then the marginal value of arguing about sum scores versus IRT in the abstract is small. The more pressing issue is whether any measurement practice is being done with the rigor that either approach assumes.

That framing — that the gap between psychology and psychometrics is the real problem and the choice between sum scores and IRT is a downstream technical question — is what unifies all four papers in the Psychometrika exchange, despite their disagreements on specifics.

What this means for researchers and clinicians

Several practical implications emerge from reading the exchange together:

  • Sum scores are not a methodological embarrassment. Reporting a sum score for a validated instrument with adequate reliability is defensible practice, not a fallback. The mathematical work is sound.
  • Match the method to the question. If you are equating across test forms, comparing groups under suspected non-invariance, or building adaptive testing, IRT is the right tool. If you are summarizing performance on a single fixed instrument for prediction or screening, a sum score is often equally good and more interpretable.
  • Transparency is more important than method choice. Reporting how items were selected, how missing data was handled, what reliability evidence was computed, and what modifications were made matters more for valid inference than the choice between summing and modeling.
  • Communication still uses sum scores. Even when latent variable models are used internally, downstream communication with clients, patients, schools, and policymakers almost universally relies on sum scores or transformations of them. This is structural, not a failure.
  • Distrust strong claims either way. Anyone who tells you that sum scoring is an unredeemed error, or that latent variable modeling is an unjustified luxury, is selling the simpler version of a real disagreement.

What the exchange does not resolve

Several questions remain genuinely open:

  • How serious is the practical impact of ordinal-versus-interval treatment? McNeish’s concerns about Likert scales are real, but the resulting distortions are not always large. The empirical literature on when this matters and when it does not is uneven.
  • How should multidimensional instruments be reported? A single sum score, multiple subscale scores, or a latent variable representation are all defensible. Field consensus on when each is appropriate is incomplete.
  • What does psychometrics outreach actually look like? Both sides agree training and outreach are needed. The institutional structures that would deliver them — required graduate methods curricula, journal reporting standards, software defaults — are still under-developed.
  • Will the exchange change practice? Methodological exchanges in Psychometrika reach a relatively small specialist audience, and the empirical psychology mainstream may continue with the same practices regardless.

Frequently Asked Questions

Is summing test items a primitive practice?

No. Across a broad class of standard psychometric models, sum scores stochastically order the underlying trait, and they predict practical outcomes nearly as well as more sophisticated estimates. The view that sum scores are pre-scientific is not supported by the technical literature.

Why use IRT at all if sum scores are so good?

IRT is genuinely necessary for some applications: equating different forms of a test, computerized adaptive testing, evaluating measurement invariance across groups, and certain item-level diagnostics. For routine summarization of a fixed scale, the marginal benefit over sum scoring is often small.
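
As a concrete illustration of the adaptive-testing case, here is a sketch of the standard maximum-information item-selection step, using a hypothetical five-item bank with made-up 2PL parameters (this is the generic textbook rule, not any particular CAT system). The selection depends on item parameters and a provisional trait estimate, quantities a sum score alone cannot supply.

```python
# Minimal sketch of maximum-information item selection in a CAT
# (hypothetical item bank; parameters are made up for illustration).
import numpy as np

a = np.array([1.2, 0.9, 1.8, 1.5, 0.7])   # discriminations
b = np.array([-1.0, 0.0, 0.5, 1.2, 2.0])  # difficulties

def next_item(theta_hat, administered):
    """Return the unadministered item with maximal 2PL Fisher information
    at the current ability estimate: I(theta) = a^2 * p * (1 - p)."""
    p = 1 / (1 + np.exp(-a * (theta_hat - b)))
    info = a ** 2 * p * (1 - p)
    info[list(administered)] = -np.inf     # exclude items already given
    return int(np.argmax(info))

# After two items, pick the most informative next item at theta = 0.3
print(next_item(theta_hat=0.3, administered={0, 2}))
```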

What is “the gap between psychology and psychometrics”?

The observation that psychology routinely uses measurement instruments without engaging the full technical apparatus that psychometrics has developed. The result is that empirical research often reports inadequate measurement evidence even when better methods are available.

What are “questionable measurement practices”?

A label introduced by Flake and Fried for choices researchers make that compromise the validity of measurement: not reporting reliability, modifying validated scales without documentation, ad hoc subscale construction, lack of transparency about scoring procedures, and similar.

Does this debate affect everyday clinical practice?

Indirectly. Clinical instruments are typically scored by summing item responses, and that practice is defensible. Concerns enter when scales are used in non-validated populations, modified without psychometric checks, or interpreted with more precision than their reliability supports.

What should I cite if I want to enter this conversation?

The four papers in the 2024 Psychometrika exchange (Sijtsma et al. focus article, McNeish commentary, Mislevy commentary, Sijtsma et al. rejoinder) form the core. Flake and Fried (2020) provides the broader measurement-crisis context.

References

Flake, J. K., & Fried, E. I. (2020). Measurement schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science, 3(4), 456–465.

McNeish, D. (2024). Practical implications of sum scores being psychometrics’ greatest accomplishment. Psychometrika, 89(1).

Mislevy, R. J. (2024). Are sum scores a great accomplishment of psychometrics or intuitive test theory? Psychometrika, 89(1).

Sijtsma, K., Ellis, J. L., & Borsboom, D. (2024). Recognize the value of the sum score, psychometrics’ greatest accomplishment. Psychometrika, 89(1).

Sijtsma, K., Ellis, J. L., & Borsboom, D. (2024). Rejoinder to McNeish and Mislevy. Psychometrika, 89(1).



📋 Cite This Article

Jouve, X. (2024, December 19). Addressing the Divide Between Psychology and Psychometrics. PsychoLogic. https://www.psychologic.online/2024/12/19/psychology-psychometrics-divide/
