What is the significance?

This research provides actionable insights for practitioners dealing with sparse datasets in educational and psychological contexts. By demonstrating the conditions under which each method excels, it informs decisions about how to handle missing data to minimize bias and improve the reliability of ability estimates. The study also emphasizes the importance of understanding the underlying mechanism of missing data when selecting an imputation method.

What are future directions?

The findings suggest opportunities for further research into improving the performance of imputation methods, particularly for datasets where missing data is not random. Additional studies could explore the integration of domain-specific knowledge into imputation algorithms or examine the effects of these methods in real-world assessments with diverse populations.

Xiao and Bulut's (2020) study highlights the challenges of working with sparse data and provides practical guidance for improving ability estimation through appropriate missing data handling techniques. These findings contribute to the broader understanding of psychometric methods and their applications in educational measurement.

Xiao, J., & Bulut, O. (2020). Evaluating the Performances of Missing Data Handling Methods in Ability Estimation From Sparse Data. Educational and Psychological Measurement, 80(5), 932-954. https://doi.org/10.1177/0013164420911136

Missing Data Methods in Educational Testing

Published: October 10, 2020 · Last reviewed: May 7, 2026


Missing data is the rule, not the exception, in educational testing. Examinees skip items they don’t know, run out of time on long tests, encounter technical glitches that drop responses, or have items they were never administered under multistage or adaptive designs. Whatever the cause, the resulting response matrix has holes, and what an analyst does about those holes affects every downstream ability estimate. The choice of method matters most where it is hardest to make: in sparse datasets, where the proportion missing is high or the response matrix is structurally incomplete by design.

Xiao and Bulut (2020) ran one of the more thorough Monte Carlo evaluations of how four common missing-data methods perform under realistic IRT calibration conditions. Their headline finding — full-information maximum likelihood (FIML) outperforms imputation-based alternatives across most realistic conditions — confirms what the methodological literature has been saying for two decades, but their secondary findings sharpen the practical guidance about when each method’s failure modes start to bite.

Why missingness is a substantive problem, not a bookkeeping one

Rubin’s (1976) taxonomy is the framework everything else builds on. Data are missing completely at random (MCAR) when the probability of missingness depends on neither observed nor unobserved values; missing at random (MAR) when it depends on observed values but not on the missing values themselves; and missing not at random (MNAR, sometimes NMAR) when it depends on the missing values themselves even after conditioning on what is observed. The taxonomy matters because the validity of every missing-data method depends on which mechanism is operating, and the mechanism is not directly testable from the data — it has to be reasoned about from substantive knowledge of how the missingness arose.
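To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything taken from Xiao and Bulut (2020): the function names, the 2PL parameter ranges, the 30% base rate, and the specific rules used to generate MAR and MNAR missingness. Missing responses are marked with None.

```python
import math
import random

def simulate_2pl(n_persons, n_items, rng):
    """Complete 0/1 responses from a 2PL model (parameter ranges are assumptions)."""
    theta = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    a = [rng.uniform(0.8, 2.0) for _ in range(n_items)]  # discriminations
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]    # difficulties
    X = [[1 if rng.random() < 1.0 / (1.0 + math.exp(-a[j] * (t - b[j]))) else 0
          for j in range(n_items)] for t in theta]
    return theta, X

def induce_missing(X, mechanism, rng, rate=0.3):
    """Copy X and replace some responses with None according to the mechanism."""
    half = len(X[0]) // 2
    out = []
    for row in X:
        # Score on the first half of the test, which stays fully observed under MAR.
        low_anchor = sum(row[:half]) < half / 2
        new = []
        for j, x in enumerate(row):
            if mechanism == "MCAR":
                # Same probability for every response, whatever its value.
                p = rate
            elif mechanism == "MAR":
                # Depends only on an observed quantity (the first-half score).
                p = 0.0 if j < half else (1.5 * rate if low_anchor else 0.5 * rate)
            elif mechanism == "MNAR":
                # Depends on the value that goes missing: wrong answers vanish more often.
                p = 1.5 * rate if x == 0 else 0.5 * rate
            else:
                raise ValueError(mechanism)
            new.append(None if rng.random() < p else x)
        out.append(new)
    return out

rng = random.Random(2020)
theta, complete = simulate_2pl(500, 20, rng)
sparse = {m: induce_missing(complete, m, rng) for m in ("MCAR", "MAR", "MNAR")}
```

In the MNAR branch the analyst never sees the value that drives the missingness, which is why the mechanism cannot be confirmed from the observed matrix alone.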

For ability estimation in IRT, the worst case is when low-ability examinees disproportionately omit hard items because they don’t know the answers. The mechanism is MNAR: the probability of a missing response depends on the latent ability of the respondent, which is exactly the quantity being estimated. Both naive treatments fail here. Scoring missing responses as wrong biases ability estimates downward for the affected examinees, and dropping respondents with any missing data (listwise deletion) removes those examinees from the analysis altogether, which distorts the calibration sample. Listwise deletion is rarely viable in real testing because it discards too many respondents; the question is what to do with the data that remain.

The four methods Xiao and Bulut compared

Full-information maximum likelihood (FIML) uses every available response and leaves the unobserved ones out of the IRT likelihood rather than filling them in. It does not impute missing values; each examinee’s likelihood contribution is built only from the items that examinee actually answered. FIML is consistent under MCAR and MAR, provided the model is correctly specified, and its standard errors are asymptotically correct. The cost shows up at very high missingness: each examinee contributes few responses, the likelihood surface flattens, and the optimizer has to work harder to converge.
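The idea of using only the observed responses can be sketched at the level of a single examinee. Under local independence, summing a missing response out of the likelihood leaves a factor of one, so the item simply drops out of the product. The sketch below is not a full FIML calibration: it treats the 2PL item parameters as known and computes an expected a posteriori (EAP) ability estimate under a standard normal prior. The function names, the quadrature grid, and the example parameter values are all assumptions made for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_observed(responses, a, b, n_nodes=61):
    """EAP ability estimate under a N(0, 1) prior, using only observed responses.

    responses: list of 1, 0, or None (missing); a, b: known item parameters.
    """
    # Evenly spaced quadrature nodes on [-4, 4].
    nodes = [-4.0 + 8.0 * k / (n_nodes - 1) for k in range(n_nodes)]
    num = den = 0.0
    for t in nodes:
        weight = math.exp(-0.5 * t * t)  # unnormalised N(0, 1) density
        like = 1.0
        for x, aj, bj in zip(responses, a, b):
            if x is None:
                continue  # a missing item contributes a factor of 1
            p = p_correct(t, aj, bj)
            like *= p if x == 1 else 1.0 - p
        num += t * like * weight
        den += like * weight
    return num / den

# Hypothetical item parameters for a three-item example.
a = [1.0, 1.2, 0.9]
b = [0.0, 0.5, -0.5]
theta_hat = eap_observed([1, None, 0], a, b)
```

Skipping the missing item gives the same estimate as scoring the examinee on a two-item test made up of the answered items, which is the sense in which nothing is imputed.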

Zero replacement treats every missing response as a wrong answer. It is the simplest method possible: a one-line patch to the data matrix, after which calibration runs as if no data were missing. Theoretically it should bias estimates downward whenever low-ability respondents are over-represented in the missing set, which is almost always. In practice, the bias is real but bounded; under specific patterns it produces ability estimates that are not far from the truth, and the method is computationally trivial.
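The one-line patch looks roughly like this. The matrix and the function name are hypothetical, and None stands in for a missing response.

```python
def zero_replace(matrix):
    """Return a copy of the response matrix with every missing entry (None) scored 0."""
    return [[0 if x is None else x for x in row] for row in matrix]

# Hypothetical two-examinee, four-item response matrix.
responses = [
    [1, None, 1, 0],
    [None, None, 1, 1],
]
patched = zero_replace(responses)

# Proportion correct over answered items versus over all items after patching.
for raw, fixed in zip(responses, patched):
    answered = [x for x in raw if x is not None]
    print(sum(answered) / len(answered), sum(fixed) / len(fixed))
```

The second examinee answered both attempted items correctly, yet ends up at 50% correct after the patch, which shows how directly the method pushes scores downward.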

MICE-CART and MICE-RFI are multivariate-imputation-by-chained-equations methods using classification and regression trees (CART) or random forest imputation (RFI) as the per-variable conditional model. MICE iteratively imputes each variable with missing values from a regression on the other variables, cycling through the variables until imputations stabilize. CART and RFI use machine-learning regressors as the conditional models, which is more flexible than parametric MICE alternatives but introduces the usual machine-learning vulnerabilities (overfitting on small samples, sensitivity to tuning). The mice R package by van Buuren and Groothuis-Oudshoorn (2011) is the canonical implementation and supports both backends.
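The chained-equations loop described above can be sketched in a few lines. This is not the mice package and not the procedure used by Xiao and Bulut (2020). In particular, the conditional model is a deliberately crude stand-in for CART: a single split of each examinee's rest score (the sum of the other items) at its median, with add-one smoothing in each leaf. The function name and defaults are assumptions.

```python
import random

def chained_imputation(X, n_iter=5, seed=0):
    """Return one imputed copy of a 0/1 matrix X, where None marks a missing entry."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    missing = [[x is None for x in row] for row in X]
    filled = [list(row) for row in X]

    # Step 0: fill each hole with a draw from that item's observed marginal.
    for j in range(m):
        obs = [X[i][j] for i in range(n) if not missing[i][j]]
        p = sum(obs) / len(obs) if obs else 0.5
        for i in range(n):
            if missing[i][j]:
                filled[i][j] = 1 if rng.random() < p else 0

    # Cycle through the items, re-imputing each from the current state of the others.
    for _ in range(n_iter):
        for j in range(m):
            if not any(missing[i][j] for i in range(n)):
                continue
            rest = [sum(filled[i]) - filled[i][j] for i in range(n)]
            cut = sorted(rest)[n // 2]  # median rest score defines the single split
            # Fit: count correct responses per leaf, using rows where item j is observed.
            counts = {True: [0, 0], False: [0, 0]}  # leaf -> [n_correct, n_total]
            for i in range(n):
                if not missing[i][j]:
                    leaf = rest[i] >= cut
                    counts[leaf][0] += X[i][j]
                    counts[leaf][1] += 1
            # Impute: draw each missing value from its leaf's smoothed probability.
            for i in range(n):
                if missing[i][j]:
                    correct, total = counts[rest[i] >= cut]
                    p = (correct + 1) / (total + 2)
                    filled[i][j] = 1 if rng.random() < p else 0
    return filled
```

Calling the function with several different seeds yields several imputed datasets, which is what makes the procedure multiple imputation. Observed responses are never overwritten; only the None entries are redrawn on each cycle.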

What the simulation showed

Xiao and Bulut crossed missing-data mechanism (MCAR, MAR, MNAR), missing proportion (5%, 15%, 30%, 40%), test length (20, 40, 60 items), and sample size (500, 1,000, 3,000) — a fully crossed factorial design that covers most realistic operational scenarios. For each cell they generated 2PL response data, induced missingness according to the mechanism, applied each of the four methods, and recorded the root-mean-square error (RMSE) of the recovered ability parameters against ground truth.
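For a sense of scale, the crossed design described above can be enumerated directly. The factor levels are the ones listed in the text; the variable names and the rmse helper are illustrative, and the replication count per cell is not stated here, so it is left out.

```python
import itertools
import math

mechanisms = ["MCAR", "MAR", "MNAR"]
missing_props = [0.05, 0.15, 0.30, 0.40]
test_lengths = [20, 40, 60]
sample_sizes = [500, 1000, 3000]

# Fully crossed design: one cell per combination of factor levels (3 * 4 * 3 * 3 = 108).
design = list(itertools.product(mechanisms, missing_props, test_lengths, sample_sizes))

def rmse(estimated, true):
    """Root-mean-square error between estimated and true ability values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(true))
```

Each of the 108 cells is then evaluated once per method, with rmse computed between the recovered abilities and the generating values.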

The headline result: FIML produced the lowest RMSE across most conditions, regardless of mechanism. Even under MNAR — where FIML’s consistency is no longer guaranteed — it still outperformed the imputation alternatives in absolute RMSE terms, presumably because the bias from MNAR was smaller than the noise from imputation in the conditions tested.

Zero replacement was the worst method on average, with RMSE consistently higher than the alternatives, but it had a counterintuitive property: at very high missingness proportions (40%), it became competitive. The reason is that imputation methods need enough observed data to fit a credible imputation model; when 40% of responses are missing, the imputation model is being fit on a thin substrate and produces noisy fills, while zero replacement at least delivers a deterministic answer. The crossover is not a recommendation to use zero replacement at high missingness — FIML still wins — but it explains why the simple method is hard to displace in some operational settings.

MICE-CART and MICE-RFI performed similarly to each other, with CART slightly ahead in most conditions but the differences small relative to the gap between either tree-based imputation method and FIML. Both improved as test length increased (more observed items per respondent gives a richer imputation model) and as missingness decreased. Under MAR they were close to FIML; under MNAR they fell behind, as expected.

What this means for practice

The practical implication is straightforward: use FIML when it’s available. Dedicated IRT packages such as mirt in R, flexMIRT, and IRTPRO handle missing responses this way for the standard 1PL/2PL/3PL models, and general-purpose probabilistic programming tools such as Stan and PyMC achieve the same effect when the likelihood is written over the observed responses only. The added computational cost is modest, and the asymptotic guarantees under MAR are real.

The exceptions are scenarios where FIML is structurally unavailable: ability estimation downstream of an unrelated software pipeline that does not expose the FIML option, or models with complex non-IRT components where the FIML integral would be intractable. In those cases, MICE with a flexible conditional model is the next-best option, with the proviso that imputation quality degrades when missingness is high or the test is short. Zero replacement should only be used as a transparent baseline against which other methods are compared, not as a recommended production method.

For sparse-by-design data — multistage testing, computerized adaptive testing where examinees see only a subset of items — the missingness is typically MAR by construction (the routing rules depend on observed responses, not unobserved abilities), and FIML is the standard treatment. The Xiao-Bulut findings extend cleanly to this case: FIML is consistent and efficient, and there is no reason to introduce imputation as an extra layer.

The MNAR caveat

The honest qualifier is that all four methods, including FIML, are biased under MNAR. Rubin’s (1976) original distinction between ignorable and non-ignorable missingness is still binding: when missingness depends on unobserved quantities even conditional on observed ones, no missing-data method can recover unbiased estimates without additional modeling assumptions about the missingness process itself. Pattern-mixture models, selection models, and shared-parameter models can each address MNAR but require the analyst to specify a non-identifiable component of the model — usually via sensitivity analysis across plausible specifications (Enders, 2010).

For high-stakes testing where MNAR is plausible — adaptive tests where ability-driven omissions are common, or accommodations testing where systematic non-response is a feature of the population — sensitivity analysis is the responsible reporting standard. FIML or MICE results are presented as the primary finding, with secondary analyses showing how the conclusions move under alternative MNAR specifications. This is more work than running a single method and reporting the answer, but the cost of mis-reporting an ability estimate that depends on a wrong missingness assumption is paid by the examinees, not by the analyst.

Where this connects to broader psychometric methodology

Missing-data handling is one of several places where the IRT calibration workflow has methodological choices with substantive consequences. The choice of estimator (Bayesian hierarchical with ADVI vs MMLE vs JMLE), the assumed prior structure for item parameters, the treatment of differential item functioning, and the handling of item distributions for reliability estimation all interact with the missingness method. The headline lesson from Xiao and Bulut (2020) — FIML is robust enough to be the default, with explicit alternatives for unusual conditions — generalizes: modern IRT estimation rewards making methodological choices explicitly, defending them in writing, and reporting sensitivity to alternatives that a sophisticated reader might prefer.

Frequently Asked Questions

Is FIML the same as multiple imputation?

No. FIML uses every observed value directly in the likelihood without filling in the missing ones. Multiple imputation generates several complete datasets by imputing missing values, fits the model to each, and pools the results. Both are valid under MAR; FIML is more efficient when applicable because it avoids the imputation step.
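The pooling step is conventionally done with Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by a factor of (1 + 1/m). A minimal sketch follows, with a hypothetical function name and made-up inputs.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)

# Made-up ability estimates for one examinee across three imputed datasets.
pooled_estimate, pooled_se = pool_rubin([0.42, 0.55, 0.47], [0.09, 0.10, 0.08])
```

The between-imputation term is the price of the imputation step: it adds uncertainty that FIML does not have to account for separately.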

When is zero replacement defensible?

As a transparent baseline against which other methods are compared, or in operational settings where the test specification scores omitted items as wrong by definition (some criterion-referenced certification programs do this). Outside those cases, it biases ability estimates downward and the bias is hard to quantify without simulation.

Does the choice of imputation backend matter?

Less than the choice of FIML vs imputation in the first place. Xiao and Bulut (2020) found that MICE-CART and MICE-RFI produced similar RMSE; CART had a small edge. The bigger lever is using FIML when it is available, and reserving imputation for scenarios where it is not.

What if the missingness is MNAR?

No standard method is unbiased under MNAR without additional assumptions about the missingness process. The defensible workflow is sensitivity analysis: report the primary FIML or MICE result, then show how the conclusions move under explicit MNAR specifications. This is the methodological recommendation in Enders (2010) and the consensus practice in modern missing-data analysis.

How much missingness is too much?

There is no universal threshold, but Xiao and Bulut’s results suggest that imputation-based methods degrade noticeably above 30% missing, while FIML remains competitive up through 40%. Beyond 40% the parameter estimates become noisy regardless of method, and the question shifts from “which method to use” to “is the design adequate to support inference at all”.

References

Enders, C. K. (2010). Applied missing data analysis. Guilford.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. https://doi.org/10.1093/biomet/63.3.581
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03
Xiao, J., & Bulut, O. (2020). Evaluating the performances of missing data handling methods in ability estimation from sparse data. Educational and Psychological Measurement, 80(5), 932–954. https://doi.org/10.1177/0013164420911136

Priya Sharma, Ph.D., Educational Psychologist

Priya Sharma, Ph.D., is an educational psychologist and Head of Assessment Development at Cogn-IQ. She serves as contributing editor on the technical manuals for all seven Cogn-IQ cognitive assessments (JCTI, IAW, JCCES, JCFS, JCWS, GIE, WN), overseeing test development standards, norming procedures, and documentation quality. Her broader research focuses on how standardized cognitive assessments can be used more effectively to support diverse learners, and on translating cognitive science findings into evidence-based educational practices. Her work spans child cognitive development, the impact of environmental and socioeconomic factors on learning outcomes, and the design of interventions that bridge the gap between psychometric research and classroom application. ORCID: 0000-0001-8606-4520
