Addressing the Divide Between Psychology and Psychometrics
Published: December 19, 2024

In 2024, Psychometrika ran an unusual exchange. Three senior psychometricians — Klaas Sijtsma, Jules Ellis, and Denny Borsboom — published a focus article arguing that the humble sum score, the simple total of right-or-wrong answers on a test, is psychometrics’ greatest accomplishment and should remain central to practice. Two commentaries, one by Daniel McNeish and one by Robert Mislevy, pushed back. The authors then published a rejoinder. Read together, the four papers describe a real fault line in measurement: a growing gap between what psychometricians can do and what psychologists actually do, and disagreement about which side needs to move first.

The exchange matters beyond psychometrics journals because it directly affects how researchers, clinicians, and educators interpret almost every test result they encounter — depression inventories, IQ tests, school assessments, personality questionnaires.

The core claim: sum scores are mathematically defensible

The sum score is what almost everyone outside specialist measurement circles uses. Add up the right answers; that’s the score. For decades, methodological reformers have argued that this is a primitive practice that should be replaced by latent variable models — Item Response Theory (IRT) and structural equation modeling — that more accurately recover the underlying construct.

Sijtsma, Ellis, and Borsboom’s 2024 focus article makes three technical points against this dismissal:

  • Stochastic ordering. Across a wide variety of standard IRT models, the sum score stochastically orders the latent variable — meaning higher sum scores correspond, on average, to higher levels of the underlying trait. The sum score is not a competitor to IRT; it is a quantity that IRT itself certifies as ordinally informative.
  • Reliability is not opaque. Classical Test Theory, often dismissed as obsolete, provides a family of lower bounds on reliability — including coefficient alpha and several superior alternatives — that under reasonable conditions are close to the true reliability of a test.
  • Value comes from prediction. The ultimate test of a score is whether it predicts practically relevant outcomes. Sum scores frequently do this nearly as well as scores derived from more sophisticated models, while remaining transparent. (All three points are illustrated in the simulation sketch after this list.)
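
The following sketch illustrates all three points with data simulated from a two-parameter logistic (2PL) IRT model. It is written for this article, not taken from any of the papers, and the sample size, item parameters, and outcome model are arbitrary choices.

```python
# A minimal simulation sketch (assumptions: 2PL model, arbitrary parameters),
# not code from any of the Psychometrika papers.
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items = 5000, 20

theta = rng.normal(0, 1, n_persons)    # latent trait
a = rng.uniform(0.8, 2.0, n_items)     # item discriminations
b = rng.normal(0, 1, n_items)          # item difficulties

# 2PL response probabilities, then dichotomous (right/wrong) responses
p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
x = (rng.uniform(size=p.shape) < p).astype(int)

sum_score = x.sum(axis=1)

# 1) Stochastic ordering: mean latent trait rises with the sum score
for s in range(0, n_items + 1, 5):
    group = sum_score == s
    if group.any():
        print(f"sum score {s:2d}: mean theta = {theta[group].mean():+.2f}")

# 2) Coefficient alpha, a CTT lower bound on reliability:
#    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
k = n_items
alpha = k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / sum_score.var(ddof=1))
print(f"coefficient alpha = {alpha:.3f}")

# 3) Prediction: an outcome driven by the trait is predicted nearly as well
#    by the transparent sum score as by the latent trait itself
outcome = theta + rng.normal(0, 1, n_persons)
print(f"r(outcome, theta)     = {np.corrcoef(outcome, theta)[0, 1]:.3f}")
print(f"r(outcome, sum score) = {np.corrcoef(outcome, sum_score)[0, 1]:.3f}")
```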

The position is not that latent variable models are useless — the authors are explicit that they have unique uses, especially in test equating and computerized adaptive testing — but that displacing the sum score in routine practice is a methodological luxury most empirical research does not need.

The McNeish commentary: practical limits of sum scoring

McNeish’s 2024 commentary largely agrees with the mathematical content of the Sijtsma et al. article and then asks a different question: are sum scores actually positioned to improve psychometric practice in psychology, education, and adjacent fields? He raises three areas of concern:

  • Likert and ordinal response scales. Sum scores treat ordinal categories as if they had equal numerical spacing. This is mathematically questionable when the response options run from “strongly disagree” to “strongly agree” rather than right/wrong; the sketch after this list illustrates the issue.
  • Multidimensional constructs. Many psychological scales tap multiple correlated facets. A sum score collapses this structure, sometimes obscuring meaningful subscale differences.
  • Moderated and heterogeneous associations. When the relationship between a construct and an outcome differs across groups, simple rank ordering by sum score may not be enough to characterize that heterogeneity. Latent variable models offer tools for representing measurement non-invariance that sum scores do not.
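
A minimal sketch of the first concern, using randomly generated responses and one hypothetical alternative spacing of the five categories (nothing here comes from McNeish’s commentary itself): summing fixes the spacing between response options, and a different but equally defensible spacing preserves rank order far better than it preserves score distances.

```python
# Minimal sketch of the ordinal-spacing issue (hypothetical 5-point Likert
# data and an arbitrary alternative category spacing).
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 1000, 8

# Responses coded 0..4 ("strongly disagree" .. "strongly agree")
responses = rng.integers(0, 5, size=(n_persons, n_items))

equal = np.array([0, 1, 2, 3, 4])          # what sum scoring assumes
unequal = np.array([0, 0.5, 2, 3.5, 4])    # a hypothetical alternative

score_equal = equal[responses].sum(axis=1)
score_unequal = unequal[responses].sum(axis=1)

# The two scorings correlate highly, so rank order is largely preserved...
print(f"correlation = {np.corrcoef(score_equal, score_unequal)[0, 1]:.3f}")

# ...but distances between particular respondents can shift, which matters
# whenever sum scores are treated as interval-level quantities.
i, j = 0, 1
print(f"gap, equal spacing:   {score_equal[i] - score_equal[j]}")
print(f"gap, unequal spacing: {score_unequal[i] - score_unequal[j]:.1f}")
```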

McNeish’s framing is pragmatic: he agrees with the Sijtsma et al. theory and questions whether the sum score, as deployed in practice, is doing the work that the theory permits.

The Mislevy commentary: evidentiary reasoning

Mislevy’s commentary takes a different tack. Writing from the perspective of evidentiary reasoning — a framework that treats test scores as evidence for claims about latent attributes — he questions whether the sum score’s success reflects genuine psychometric accomplishment or a kind of intuitive folk theory that happens to work most of the time. The provocative title — “Are Sum Scores a Great Accomplishment of Psychometrics or Intuitive Test Theory?” — sets up the question directly. His comments are short and concentrated, but the implicit challenge is pointed: calling sum scores psychometrics’ greatest accomplishment may credit the discipline for an artifact whose practical robustness emerged largely without psychometric input.

The rejoinder: the real problem is the gap

Sijtsma, Ellis, and Borsboom’s rejoinder concedes individual technical points where appropriate and then re-frames the disagreement around a broader concern: the growing gap between psychology and psychometrics. Several themes:

  • Psychometrics outreach is underdeveloped. Most empirical psychologists have only superficial training in measurement, and psychometricians have not done enough to communicate their methods in usable form.
  • Different methods often produce similar results. Where they disagree, the disagreement frequently traces to an under-specified theory of the attribute being measured, not to the choice of estimator.
  • Sum scores serve communication. Test users, clinicians, and clients understand and can act on a sum score in ways that latent variable estimates rarely permit. This communication function is not an embarrassment to be eliminated; it is part of what makes a score useful.
  • Latent variables shine in advanced applications. Equating across test forms, computerized adaptive testing, and certain forms of differential item functioning analysis genuinely require latent variable models. The argument is for appropriate matching of method to use case, not blanket replacement.
  • Decisions are usually coarse. Most consequential test-based decisions are binary (refer or not, intervene or not, classify above or below a cutoff) or use a small number of categories. Granular precision below that level adds little; the sketch after this list makes the point concrete.
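
A sketch of the coarse-decision point, again with simulated 2PL data and arbitrary parameters: when the decision is a binary cutoff, the sum score and a model-based expected-a-posteriori (EAP) estimate of the latent trait flag nearly the same people.

```python
# Minimal sketch (simulated 2PL data, known item parameters, arbitrary
# values): binary cutoff decisions from sum scores vs. EAP trait estimates.
import numpy as np

rng = np.random.default_rng(2)
n_persons, n_items = 2000, 20
theta = rng.normal(0, 1, n_persons)
a = rng.uniform(0.8, 2.0, n_items)
b = rng.normal(0, 1, n_items)
p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
x = (rng.uniform(size=p.shape) < p).astype(int)

sum_score = x.sum(axis=1)

# EAP estimate of theta on a quadrature grid (standard normal prior)
grid = np.linspace(-4, 4, 81)
pg = 1 / (1 + np.exp(-a * (grid[:, None] - b)))          # (grid, items)
loglik = x @ np.log(pg).T + (1 - x) @ np.log(1 - pg).T   # (persons, grid)
post = np.exp(loglik) * np.exp(-grid ** 2 / 2)           # unnormalized posterior
eap = (post * grid).sum(axis=1) / post.sum(axis=1)

# Flag the top 20% under each score and compare the classifications
q = 0.80
flag_sum = sum_score >= np.quantile(sum_score, q)
flag_eap = eap >= np.quantile(eap, q)
print(f"classification agreement: {(flag_sum == flag_eap).mean():.1%}")
```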

The rejoinder closes with the position that psychology and psychometrics need to work together — that neither side can solve the problem alone, and that continued insulation between them will keep producing a literature in which sophisticated measurement tools coexist with widespread questionable measurement practices.

Why this exchange matters: the measurement crisis context

The Sijtsma–McNeish–Mislevy exchange is happening in the broader context of what Flake and Fried (2020) called the “measurement schmeasurement” problem — a documented pattern of questionable measurement practices in published psychology research. Flake and Fried describe practices including failure to report measurement reliability, undocumented modifications to validated scales, ad hoc subscale construction, and lack of transparency about how scores were computed. They argue these practices are common, hide a substantial source of researcher degrees of freedom, and pose a serious threat to the cumulative validity of psychological knowledge.

This is the context in which the sum score debate is consequential. If empirical psychology is using validated scales loosely, modifying them informally, and reporting only minimal psychometric information, then the marginal value of arguing about sum scores versus IRT in the abstract is small. The more pressing issue is whether any measurement practice is being done with the rigor that either approach assumes.

That framing — that the gap between psychology and psychometrics is the real problem and the choice between sum scores and IRT is a downstream technical question — is what unifies all four papers in the Psychometrika exchange, despite their disagreements on specifics.

What this means for researchers and clinicians

Several practical implications emerge from reading the exchange together:

  • Sum scores are not a methodological embarrassment. Reporting a sum score for a validated instrument with adequate reliability is defensible practice, not a fallback. The mathematical work is sound.
  • Match the method to the question. If you are equating across test forms, comparing groups under suspected non-invariance, or building adaptive testing, IRT is the right tool. If you are summarizing performance on a single fixed instrument for prediction or screening, a sum score is often equally good and more interpretable.
  • Transparency is more important than method choice. Reporting how items were selected, how missing data was handled, what reliability evidence was computed, and what modifications were made matters more for valid inference than the choice between summing and modeling.
  • Communication still uses sum scores. Even when latent variable models are used internally, downstream communication with clients, patients, schools, and policymakers almost universally relies on sum scores or transformations of them. This is structural, not a failure.
  • Distrust strong claims either way. Anyone who tells you that sum scoring is an unredeemed error, or that latent variable modeling is an unjustified luxury, is selling the simpler version of a real disagreement.

What the exchange does not resolve

Several questions remain genuinely open:

  • How serious is the practical impact of ordinal-versus-interval treatment? McNeish’s concerns about Likert scales are real, but the resulting distortions are not always large. The empirical literature on when this matters and when it does not is uneven.
  • How should multidimensional instruments be reported? A single sum score, multiple subscale scores, or a latent variable representation are all defensible. Field consensus on when each is appropriate is incomplete.
  • What does psychometrics outreach actually look like? Both sides agree training and outreach are needed. The institutional structures that would deliver them — required graduate methods curricula, journal reporting standards, software defaults — are still under-developed.
  • Will the exchange change practice? Methodological exchanges in Psychometrika reach a relatively small specialist audience, and the empirical psychology mainstream may continue with the same practices regardless.

Frequently Asked Questions

Is summing test items a primitive practice?

No. Across a broad class of standard psychometric models, sum scores stochastically order the underlying trait, and they predict practical outcomes nearly as well as more sophisticated estimates. The view that sum scores are pre-scientific is not supported by the technical literature.

Why use IRT at all if sum scores are so good?

IRT is genuinely necessary for some applications: equating different forms of a test, computerized adaptive testing, evaluating measurement invariance across groups, and certain item-level diagnostics. For routine summarization of a fixed scale, the marginal benefit over sum scoring is often small.
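
As a concrete illustration of the adaptive-testing case, here is a sketch of the standard maximum-information item-selection step, using a hypothetical five-item bank with made-up 2PL parameters (this is the generic textbook rule, not any particular CAT system). The selection depends on item parameters and a provisional trait estimate, quantities a sum score alone cannot supply.

```python
# Minimal sketch of maximum-information item selection in a CAT
# (hypothetical item bank; parameters are made up for illustration).
import numpy as np

a = np.array([1.2, 0.9, 1.8, 1.5, 0.7])   # discriminations
b = np.array([-1.0, 0.0, 0.5, 1.2, 2.0])  # difficulties

def next_item(theta_hat, administered):
    """Return the unadministered item with maximal 2PL Fisher information
    at the current ability estimate: I(theta) = a^2 * p * (1 - p)."""
    p = 1 / (1 + np.exp(-a * (theta_hat - b)))
    info = a ** 2 * p * (1 - p)
    info[list(administered)] = -np.inf     # exclude items already given
    return int(np.argmax(info))

# After two items, pick the most informative next item at theta = 0.3
print(next_item(theta_hat=0.3, administered={0, 2}))
```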

What is “the gap between psychology and psychometrics”?

The observation that psychology routinely uses measurement instruments without engaging the full technical apparatus that psychometrics has developed. The result is that empirical research often reports inadequate measurement evidence even when better methods are available.

What are “questionable measurement practices”?

A label introduced by Flake and Fried for choices researchers make that compromise the validity of measurement: not reporting reliability, modifying validated scales without documentation, ad hoc subscale construction, lack of transparency about scoring procedures, and similar.

Does this debate affect everyday clinical practice?

Indirectly. Clinical instruments are typically scored by summing item responses, and that practice is defensible. Concerns enter when scales are used in non-validated populations, modified without psychometric checks, or interpreted with more precision than their reliability supports.

What should I cite if I want to enter this conversation?

The four papers in the 2024 Psychometrika exchange (Sijtsma et al. focus article, McNeish commentary, Mislevy commentary, Sijtsma et al. rejoinder) form the core. Flake and Fried (2020) provides the broader measurement-crisis context.

References

Flake, J. K., & Fried, E. I. (2020). Measurement schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science, 3(4), 456–465.

McNeish, D. (2024). Practical implications of sum scores being psychometrics’ greatest accomplishment. Psychometrika, 89(1).

Mislevy, R. J. (2024). Are sum scores a great accomplishment of psychometrics or intuitive test theory? Psychometrika, 89(1).

Sijtsma, K., Ellis, J. L., & Borsboom, D. (2024). Recognize the value of the sum score, psychometrics’ greatest accomplishment. Psychometrika, 89(1).

Sijtsma, K., Ellis, J. L., & Borsboom, D. (2024). Rejoinder to McNeish and Mislevy. Psychometrika, 89(1).



📋 Cite This Article

Jouve, X. (2024, December 19). Addressing the Divide Between Psychology and Psychometrics. PsychoLogic. https://www.psychologic.online/2024/12/19/psychology-psychometrics-divide/
