What is the significance?

The findings have practical implications for the design and administration of credentialing exams in fields where small cohorts are common. By demonstrating the advantages of Rasch methods and the value of data pooling, the research offers actionable strategies for improving fairness and accuracy in score equating. The study also informs future use of Bayesian methods, emphasizing the importance of selecting appropriate priors to avoid potential biases.

What are the future directions?

This research opens opportunities for further exploration into data pooling techniques and the optimization of prior distributions in Bayesian equating methods. Expanding the analysis to include larger sample sizes and diverse testing contexts could provide additional insights and enhance the generalizability of the findings.

Babcock and Hodge's (2020) study makes a valuable contribution to the field of educational measurement by addressing the challenges of equating in small-sample contexts. Their comparison of Rasch and classical methods underscores the importance of leveraging advanced techniques to improve fairness and reliability in exam score interpretation. This research serves as a guide for educators and psychometricians seeking effective solutions for credentialing exams and similar applications.

Babcock, B., & Hodge, K. J. (2020). Rasch Versus Classical Equating in the Context of Small Sample Sizes. Educational and Psychological Measurement, 80(3), 499-521. https://doi.org/10.1177/0013164419878483

Rasch vs Classical Equating in Small Samples

Published: June 2, 2020 · Last reviewed: May 7, 2026

Equating is the procedure that lets two different forms of the same test produce comparable scores. If Form A is slightly harder than Form B, a candidate who scores 70 on Form A may have the same ability as a candidate who scores 73 on Form B; equating is the statistical machinery that produces the conversion. The procedure is well developed for large-sample contexts (large-scale state assessments, college-admissions exams, certification programs with thousands of test-takers), and the standard methods (linear, equipercentile, IRT-based) all work tolerably well when sample sizes exceed about 1,000 per form per administration.

The problem is small-sample equating. Credentialing programs in specialized fields, regional certifications, and many niche professional examinations administer tests to cohorts of fewer than 100 candidates per form per cycle. The standard equating methods either fail to converge, produce wildly unstable estimates, or violate their own assumptions in ways that bias the resulting score conversions. Babcock and Hodge (2020), in Educational and Psychological Measurement, evaluate Rasch-based and classical equating approaches under realistic small-sample credentialing conditions and document where each approach succeeds and fails.

What equating actually requires

Kolen and Brennan’s (2014) textbook treatment lays out the canonical equating framework: two forms are equated if their score distributions are made comparable through a transformation that preserves a defined invariance property (typically that examinees of equal ability would have equal expected scaled scores on either form). Classical equating methods — linear, equipercentile, mean equating — operate on the raw-score distributions and the cross-form correlation structure. They make minimal assumptions about the underlying psychometric model but are correspondingly demanding of sample size: they need enough cross-form data to estimate the joint score distribution reliably.
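A minimal sketch of the two simplest classical methods may help fix ideas. It assumes a random-groups design in which each group takes one form; the function names and the toy score lists are illustrative and not taken from any of the cited studies, and equipercentile equating and smoothing are left out.

```python
from statistics import mean, pstdev

def mean_equate(x, scores_a, scores_b):
    """Map a Form A raw score to the Form B scale by matching means."""
    return x - mean(scores_a) + mean(scores_b)

def linear_equate(x, scores_a, scores_b):
    """Map a Form A raw score to the Form B scale by matching means and SDs."""
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    sd_a, sd_b = pstdev(scores_a), pstdev(scores_b)
    return mean_b + (sd_b / sd_a) * (x - mean_a)

# Toy data (illustrative only): Form A is about three points harder than Form B.
form_a = [60, 65, 70, 75, 80]
form_b = [63, 68, 73, 78, 83]

print(mean_equate(70, form_a, form_b))
print(linear_equate(70, form_a, form_b))
```

With these toy data the two forms differ only by a constant, so both methods give the same conversion; they diverge when the forms also differ in spread. Every quantity in the conversion is a sample mean or standard deviation, which is why the result gets noisy as cohorts shrink.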

IRT-based equating, including Rasch equating, operates on item parameters rather than directly on score distributions. It requires an IRT model fit to the calibration data, typically with a common-item design where some items appear on both forms, and uses the fitted item parameters to derive the score conversion. The advantage is that IRT methods can equate forms that share only a subset of items, which is the usual situation in operational credentialing programs. The cost is the model-fit assumption: if the IRT model is misspecified, the equating inherits that misfit as bias.
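One common way to turn common-item parameters into a score conversion under the Rasch model is mean/mean linking followed by true-score equating. The sketch below illustrates that route under stated assumptions: the difficulty values are invented, the function names are hypothetical, and the article does not say that this is the specific linking method Babcock and Hodge used.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def linking_constant(anchor_base, anchor_new):
    """Mean/mean shift that places new-form difficulties on the base scale."""
    return sum(anchor_base) / len(anchor_base) - sum(anchor_new) / len(anchor_new)

def expected_score(theta, difficulties):
    """Test characteristic curve: expected raw score at ability theta."""
    return sum(rasch_p(theta, b) for b in difficulties)

def theta_for_score(raw, difficulties, lo=-8.0, hi=8.0, tol=1e-8):
    """Invert the test characteristic curve by bisection (it increases in theta)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def true_score_equate(raw_new, new_items, base_items):
    """Base-form true score matching a new-form true score (both on one scale)."""
    return expected_score(theta_for_score(raw_new, new_items), base_items)

# Invented difficulties: the same three anchor items in each calibration.
anchor_base = [-0.5, 0.0, 0.5]
anchor_new = [-0.8, -0.3, 0.2]
shift = linking_constant(anchor_base, anchor_new)

base_form = [-1.0, -0.5, 0.0, 0.5, 1.0]
new_form = [-1.0, -0.8, -0.3, 0.2, 0.6]      # in the new calibration's metric
new_on_base = [b + shift for b in new_form]  # moved onto the base scale

print(round(shift, 3))
print(round(true_score_equate(3.0, new_on_base, base_form), 3))
```

Only one linking constant has to be estimated, which is part of why the Rasch route stays workable with small cohorts. True-score equating is undefined at raw scores of zero and the maximum, so operational programs need a separate rule for those endpoints.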

Rasch equating in particular fits a one-parameter logistic model where item discriminations are constrained to be equal. This restriction is psychometrically demanding — real items rarely have identical discriminations — but it makes calibration tractable with samples that would defeat 2PL or 3PL estimation. The trade-off Babcock and Hodge address is whether the parsimony of Rasch equating is worth the model-fit risk in small-sample contexts where the alternatives are also compromised.

The small-sample equating literature

Skaggs (2005) ran one of the formative empirical evaluations of small-sample equating in the random-groups design, examining how classical equating methods perform when sample sizes drop below 200 per form. The findings were stark: equipercentile equating produced unacceptable error in the tails of the score distribution at sample sizes below 100, and even mean equating — the simplest method — was unstable when sample sizes were very small. The study established a methodological consensus that classical equating below 200 per form was risky and below 100 per form was generally indefensible.
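The sample-size sensitivity described above can be seen in a small Monte Carlo sketch. This is not a reproduction of Skaggs's design: scores are simulated as normal with a standard deviation of 10, the true form difference is fixed at 3 points, and all names and settings are assumptions chosen for illustration.

```python
import random
from statistics import mean, pstdev

def mean_equating_constant(n, rng, true_shift=3.0, sd=10.0):
    """Estimate the Form A -> Form B shift from two random groups of size n."""
    form_a = [rng.gauss(70.0, sd) for _ in range(n)]
    form_b = [rng.gauss(70.0 + true_shift, sd) for _ in range(n)]
    return mean(form_b) - mean(form_a)

def empirical_se(n, reps=300, seed=0):
    """Standard deviation of the estimated shift across simulated replications."""
    rng = random.Random(seed)
    return pstdev([mean_equating_constant(n, rng) for _ in range(reps)])

se_by_n = {n: empirical_se(n) for n in (50, 200, 1000)}
for n, se in se_by_n.items():
    theory = 10.0 * (2.0 / n) ** 0.5  # sd * sqrt(2 / n) for two independent groups
    print(f"n={n:5d}  simulated SE={se:.2f}  theoretical SE={theory:.2f}")
```

Even for mean equating, the simplest method, the standard error of the equating constant at 50 candidates per form is about two raw-score points under these assumptions, which can be enough to move a candidate across a cut score.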

Livingston and Lewis (2009), in an ETS research report, proposed that small-sample equating could be salvaged by incorporating prior information from previous administrations or expert judgment. Their methods — which are partially Bayesian in spirit — pool data across administrations or use informative priors derived from outside the sample. The pooling approach addresses the structural issue that any one small-sample administration is uninformative about its own equating; multiple small-sample administrations together can be more informative.

The methodological landscape that Babcock and Hodge inherit thus has two main strategies for small-sample equating: switch to a more parsimonious psychometric model (Rasch instead of 2PL or 3PL), or supplement the small sample with prior information (Bayesian methods or data pooling).

What Babcock and Hodge (2020) found

Babcock and Hodge ran simulation studies tailored to credentialing exam realities — sample sizes per form below 100, common-item linking design, modest test lengths typical of certification programs. They compared Rasch equating, classical equipercentile equating with smoothing, and Bayesian methods with informative priors derived from previous administrations.

The headline finding: Rasch equating outperformed classical equating across the small-sample conditions tested, with smaller equating errors and more stable score conversions across replications. The Rasch parsimony — equal item discriminations — turned out to be a feature rather than a bug in this regime, because the alternative was a 2PL or classical method that had insufficient data to estimate the additional parameters reliably. The Rasch model misfit (real items don’t have identical discriminations) was a smaller error source than the small-sample noise in the more flexible alternatives.

Combining Rasch equating with data pooling across previous administrations improved estimates further. The pooled estimates of item difficulty stabilized faster than the per-administration estimates, and the resulting score conversions were less sensitive to which particular candidates happened to take the form in any given cycle. For credentialing programs that administer the same form repeatedly to small cohorts over time, this is a practical and immediate methodological recommendation.
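The stabilizing effect of pooling can be illustrated with simple arithmetic. The sketch below uses the logit of an item's proportion correct as a rough stand-in for its Rasch difficulty, which is a simplification (a real Rasch calibration also accounts for examinee ability); the counts are hypothetical.

```python
import math

def logit_difficulty(n_correct, n):
    """Negative logit of proportion correct: a rough proxy for item difficulty."""
    p = n_correct / n
    return -math.log(p / (1.0 - p))

def logit_se(n_correct, n):
    """Delta-method standard error of the logit of a proportion."""
    p = n_correct / n
    return 1.0 / math.sqrt(n * p * (1.0 - p))

# Hypothetical counts for one item across four cohorts of 75 candidates.
cohorts = [(51, 75), (54, 75), (50, 75), (55, 75)]

single_se = logit_se(*cohorts[0])
pooled_correct = sum(correct for correct, _ in cohorts)
pooled_n = sum(n for _, n in cohorts)
pooled_se = logit_se(pooled_correct, pooled_n)

print(f"single cohort: b={logit_difficulty(*cohorts[0]):.2f}, SE={single_se:.3f}")
print(f"pooled:        b={logit_difficulty(pooled_correct, pooled_n):.2f}, SE={pooled_se:.3f}")
```

Quadrupling the number of responses roughly halves the standard error of the difficulty estimate, provided the item behaves the same way in every cohort.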

The authors also documented a Bayesian failure mode: when prior distributions reflect a different item-difficulty profile than the current administration’s true profile, the prior dominates the data and biases the equating in the prior’s direction. The remedy is empirically derived priors from comparable past administrations rather than expert-judgment priors that may be miscalibrated. The lesson is that Bayesian small-sample equating is reliable to the extent that the prior is honestly grounded in similar past data; expert priors that “should” be reasonable can fail badly in practice.
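The failure mode is easy to see in a conjugate normal sketch for a single item difficulty. This is an illustrative model, not the Bayesian procedure evaluated in the study, and the numbers are invented: the data estimate is 0.5 logits with a standard error of 0.25, while the prior is centered at -0.5.

```python
def posterior_normal(prior_mean, prior_sd, data_mean, data_se):
    """Posterior mean and SD for a normal mean with a normal prior."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / data_se ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * data_mean)
    return post_mean, post_var ** 0.5

# A confident but miscalibrated prior versus a vague prior with the same center.
tight_mean, tight_sd = posterior_normal(-0.5, 0.1, 0.5, 0.25)
vague_mean, vague_sd = posterior_normal(-0.5, 1.0, 0.5, 0.25)

print(f"tight prior: posterior mean {tight_mean:.3f}")
print(f"vague prior: posterior mean {vague_mean:.3f}")
```

With the tight prior the posterior stays on the prior's side of zero even though the data point the other way. With the vague prior the posterior sits close to the data estimate. The smaller the sample, the larger the data standard error and the more weight a confident prior carries.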

Practical implications for credentialing programs

For programs with persistent small-sample sizes, the actionable recommendations distill to:

Use Rasch equating in preference to classical methods when sample sizes per form are below approximately 200. The Rasch model’s restrictions are unrealistic in detail but produce more stable equating in this regime than the alternatives.
Pool calibration data across recent administrations when feasible. Three or four cohorts of 75 candidates each behave better methodologically than one cohort of 75 candidates analyzed alone, because the pooled item-difficulty estimates are more stable.
Use empirically-derived priors, not expert priors, in any Bayesian equating. The honesty cost of saying “we don’t have a defensible prior for this form” is smaller than the cost of using an expert prior that biases the equating.
Report the equating’s expected error alongside the conversion table. Small-sample equating produces larger standard errors of equating than large-sample equating, and reporting the error budget honestly is part of the credentialing program’s transparency obligation.
Monitor item drift over time. Pooled calibration assumes items behave similarly across administrations; drift indicates the assumption is breaking down and requires recalibration.
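For the recommendation to report equating error, one option is a bootstrap standard error attached to each row of the conversion table. The sketch below uses linear equating on simulated scores purely for illustration; the data, seeds, and function names are assumptions, and a Rasch-based program would resample examinees and rerun its own calibration instead.

```python
import random
from statistics import mean, pstdev

def linear_equate(x, scores_a, scores_b):
    """Map a Form A raw score to the Form B scale by matching means and SDs."""
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    return mean_b + (pstdev(scores_b) / pstdev(scores_a)) * (x - mean_a)

def bootstrap_se(x, scores_a, scores_b, reps=500, seed=0):
    """SD of the equated score across resamples drawn with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        estimates.append(linear_equate(x, resample_a, resample_b))
    return pstdev(estimates)

# Simulated cohorts of 75 candidates per form (illustrative only).
data_rng = random.Random(1)
form_a = [data_rng.gauss(70, 10) for _ in range(75)]
form_b = [data_rng.gauss(73, 10) for _ in range(75)]

point = linear_equate(70, form_a, form_b)
se = bootstrap_se(70, form_a, form_b)
print(f"raw 70 on Form A -> {point:.1f} on Form B (bootstrap SE {se:.2f})")
```

Repeating the calculation at each raw score, and especially near the cut score, gives the error column for the published conversion table.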

For programs with adequate sample sizes — above ~300 per form — the standard equating methods (equipercentile, 2PL or 3PL IRT-based) work as advertised, and the small-sample machinery adds complexity without benefit. The Babcock-Hodge findings are specifically about the regime where standard methods fail, not a general displacement of large-sample equating practice.

Where this fits in the broader equating literature

Test equating is one of the older subfields of psychometrics, and small-sample equating is one of its most practically pressing problems. Kolen and Brennan’s (2014) textbook is the standard reference; Skaggs (2005) established the empirical case that classical methods fail below specific sample-size thresholds; Livingston and Lewis (2009) opened the Bayesian and pooling pathways; Babcock and Hodge (2020) consolidate the case for Rasch-based small-sample equating with empirical-prior or pooled-data augmentation. The field has not converged on a single recommended practice, partly because the choice depends on the specific program’s data flow and partly because the methodological literature is still working out edge cases.

The connection to broader IRT practice, including Bayesian hierarchical 2PL estimation under automatic differentiation variational inference (ADVI), runs through the prior-information theme. Hierarchical Bayesian methods, properly applied, are explicitly designed to share information across groups and stabilize sparse estimates; small-sample equating is one of the cleanest application cases. The methodological convergence between modern Bayesian psychometric estimation and small-sample equating is one of the more useful theoretical developments of the past decade.

Frequently Asked Questions

Why does classical equating fail with small samples?

Classical methods estimate the joint distribution of scores across forms, which requires enough cross-form data to characterize the joint behavior reliably. Below ~100-200 candidates per form, the empirical joint distribution is too noisy to support stable equating, particularly in the score-distribution tails where high-stakes decisions are most often made.

What does the Rasch model assume that 2PL or 3PL doesn’t?

Rasch (1PL) assumes that all items have equal discrimination: they differentiate between high- and low-ability examinees with the same precision. The 2PL relaxes this and estimates a separate discrimination per item; the 3PL adds a guessing parameter (a lower asymptote) per item. Rasch is more restrictive and therefore tractable with fewer respondents, but the equal-discrimination assumption rarely holds exactly.
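The three models can be written as one item response function with progressively more free parameters. The sketch below uses the standard logistic form; the function names are illustrative, and the Rasch case is represented by fixing the discrimination at 1.

```python
import math

def irf(theta, b, a=1.0, c=0.0):
    """3PL item response function.

    a=1 and c=0 give the Rasch (1PL) form; c=0 alone gives the 2PL.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def n_item_parameters(n_items, model):
    """Number of item parameters to estimate for a test of n_items."""
    return n_items * {"1PL": 1, "2PL": 2, "3PL": 3}[model]

print(irf(0.0, 0.0))                 # Rasch: ability equals difficulty
print(irf(1.0, 0.0, a=2.0))          # 2PL: a steeper item separates examinees more sharply
print(irf(-3.0, 0.0, c=0.2))         # 3PL: low-ability examinees approach the guessing floor
print(n_item_parameters(40, "1PL"), n_item_parameters(40, "3PL"))
```

For a 40-item form the 3PL has three times as many item parameters as the Rasch model, which is the practical reason the simpler model remains estimable with fewer respondents.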

Is data pooling always defensible?

It assumes that the items behave similarly across the pooled administrations. If items drift — through changing test-prep practices, content updates, or population shifts — the pooled calibration mixes incompatible data. Routine monitoring for item drift is the defensible accompanying practice.
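A simple drift screen compares each item's difficulty estimate across two administrations after removing the overall scale shift. The sketch below is one hypothetical way to do this: the 0.5-logit threshold is an assumed rule of thumb, the item values are invented, and the median is used for centering so that a single drifting item does not distort the shift.

```python
from statistics import median

def flag_drift(b_old, b_new, threshold=0.5):
    """Return {item_id: displacement} for items that moved more than threshold logits."""
    common = sorted(set(b_old) & set(b_new))
    diffs = {item: b_new[item] - b_old[item] for item in common}
    scale_shift = median(diffs.values())  # robust estimate of the overall shift
    return {
        item: diff - scale_shift
        for item, diff in diffs.items()
        if abs(diff - scale_shift) > threshold
    }

# Invented difficulty estimates from two administrations of the same items.
old = {"A": -1.0, "B": -0.5, "C": 0.0, "D": 0.5, "E": 1.0}
new = {"A": -0.9, "B": -0.4, "C": 0.1, "D": 1.4, "E": 1.1}

flagged = flag_drift(old, new)
print(flagged)
```

Flagged items are candidates for removal from the pooled calibration or for separate recalibration. With small cohorts the difficulty estimates are themselves noisy, so a flag is a prompt for review rather than proof of drift.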

What is the difference between equating and linking?

Equating produces score conversions that allow scores from one form to be used interchangeably with scores from another. Linking produces score correspondences across measures of related but distinct constructs and does not support full interchangeability. Equating is more demanding and requires that the forms measure the same construct on the same scale; linking is more permissive.

What sample size is large enough for classical equating?

Conventional rules of thumb treat ~1,000 per form per administration as comfortable, ~300-500 as workable for equipercentile equating with smoothing, and below ~200 as risky. The Babcock-Hodge findings sharpen the lower end: below ~100 per form, Rasch-based methods with pooled calibration substantially outperform classical methods.

References

Babcock, B., & Hodge, K. J. (2020). Rasch versus classical equating in the context of small sample sizes. Educational and Psychological Measurement, 80(3), 499–521. https://doi.org/10.1177/0013164419878483
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking (3rd ed.). Springer. https://doi.org/10.1007/978-1-4939-0317-7
Livingston, S. A., & Lewis, C. (2009). Small-sample equating with prior information (ETS Research Report No. RR-09-25). https://doi.org/10.1002/j.2333-8504.2009.tb02182.x
Skaggs, G. (2005). Accuracy of random groups equating with very small samples. Journal of Educational Measurement, 42(4), 309–330. https://doi.org/10.1111/j.1745-3984.2005.00018.x

Xavier Jouve, Ph.D., Psychometrician

Xavier Jouve, Ph.D., is a psychometrician and quantitative psychologist specializing in cognitive ability measurement, item response theory, and test development. He is Head of Research at Cogn-IQ, where he has designed and validated seven cognitive assessment instruments — including the JCTI (inductive reasoning), JCCES (crystallized intelligence), IAW (vocabulary), JCFS (figurative sequences), JCWS (verbal reasoning), GIE (general knowledge), and WN (logical inference) — collectively normed on over 13,000 examinees. His work applies 2PL IRT modeling, computerized adaptive testing, and advanced composite scoring methods (including the modified Tellegen & Briggs Formula 4 with cubic correction) to produce research-grade cognitive measures available online. ORCID: 0009-0006-1283-045X
