Sequential Generalized Likelihood Ratio Tests for Item Monitoring
Published: June 1, 2023
The Kang (2023) Psychometrika paper applies sequential generalized likelihood ratio (GLR) testing to a problem that has become increasingly pressing as testing has moved online and continuous: detecting when item parameters in an operational test bank have drifted. The core issue is straightforward—if item difficulty, discrimination, or guessing parameters change over time (because items get exposed, leaked, or simply because the population shifts), then ability estimates derived from those items become biased. The methodological challenge is to detect drift quickly enough to act on it without raising too many false alarms, and to do so for multiple parameters jointly rather than one at a time.

Why item-parameter monitoring is operationally consequential

The standard IRT-based testing workflow has three phases: item calibration (estimating item parameters from a pretest sample), operational deployment (using the calibrated parameters to estimate ability for new examinees), and periodic recalibration (updating parameters when the calibration sample is judged to have aged out). The middle phase can last years for many tests, during which the parameters are treated as fixed.

This treatment is correct only as long as the items behave the same way in the operational population as they did in the calibration sample. Three failure modes are well-documented (Glas, 2000): items get exposed (examinees learn answers from prior takers), items leak (compromised by external publication of operational forms), and the operational population shifts (curriculum changes, demographic changes, intervention effects between cohorts). All three appear in the test data as drift in item parameters—typically the difficulty parameter softens (items become easier as more examinees see them) and the discrimination parameter degrades (the item separates ability levels less cleanly).

When drift goes undetected, the consequences propagate. Ability estimates become biased, equating across forms breaks down, and the test’s scores no longer have the meaning they were validated to have. The cost of running an operational test with stale calibrations is real, and the value of timely drift detection is correspondingly substantial.

What sequential GLR offers over standard alternatives

The classical approach to drift monitoring is to fit a fresh IRT model on a recent sample and compare its parameter estimates against the calibration estimates, applying confidence intervals or hypothesis tests at the parameter level. This batch approach requires accumulating enough data for stable re-estimation (typically several hundred administrations per item) and then conducting the comparison. It detects drift, but slowly: in operational contexts, weeks to months can pass between drift onset and detection.

Sequential testing approaches the problem differently. Rather than waiting for batch accumulation, sequential procedures evaluate evidence after every new administration (or every block of K administrations) and decide whether to flag an item, continue monitoring, or terminate. The foundational work is Wald’s (1945) sequential probability ratio test (SPRT), which evaluates the likelihood ratio between two hypothesized parameter values and stops when the ratio crosses one of two predetermined thresholds. The CUSUM (cumulative sum) framework, introduced by Page (1954) for industrial quality control, accumulates standardized deviations from a target value and signals when the cumulative deviation exceeds a threshold. Both have been adapted to item-parameter monitoring (Lee & Lewis, 2021, applied CUSUM specifically to continuous testing applications), and both are designed to provide faster detection than batch methods at the same false-alarm rate.
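
To make the contrast concrete, here is a minimal sketch of Page's one-sided CUSUM recursion as it might be applied to item monitoring, with the item's standardized response residuals as input. The function names, the reference value k, and the threshold h are illustrative choices, not parameters from any of the cited papers.

```python
# Minimal CUSUM sketch (Page, 1954), applied to standardized residuals z_t
# of observed item responses against the calibrated model.

def cusum_update(s_prev: float, z_t: float, k: float = 0.5) -> float:
    """One step of the one-sided recursion S_t = max(0, S_{t-1} + z_t - k)."""
    return max(0.0, s_prev + z_t - k)

def cusum_monitor(residuals, k=0.5, h=5.0):
    """Return the index at which the CUSUM statistic first exceeds the
    decision threshold h, or None if no alarm is raised."""
    s = 0.0
    for t, z in enumerate(residuals):
        s = cusum_update(s, z, k)
        if s > h:
            return t  # alarm: cumulative deviation exceeds threshold
    return None
```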

The Kang (2023) contribution sits in this lineage but generalizes it in two specific ways. First, the procedure is a generalized likelihood ratio rather than an SPRT or CUSUM, which makes it usable when the alternative-hypothesis parameter values are not specified in advance. Second, it monitors multiple item parameters jointly, which exploits the typical correlation structure of drift (difficulty and discrimination tend to drift together when items are exposed) and yields more powerful detection than monitoring each parameter independently.

The Kang procedure

The sequential GLR test compares two hypotheses at each monitoring step:

  1. H₀: parameters unchanged from the calibration values.
  2. H₁: parameters have drifted by some amount, with the magnitude estimated from the running data rather than specified in advance.

The test statistic is the ratio of the likelihood under H₁ (with drift estimated by maximum likelihood from observed responses since calibration) to the likelihood under H₀ (with drift fixed at zero). When the GLR exceeds a critical threshold, the procedure flags the item as drifted.
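
To fix ideas, the following is a minimal sketch of the GLR computation for a single 2PL item, recomputed from scratch on the responses accumulated since calibration. It treats examinee abilities as known (in an operational system they would be estimated from the other items) and uses a general-purpose optimizer; none of the names or numerical choices come from Kang (2023).

```python
import numpy as np
from scipy.optimize import minimize

def p2pl(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def neg_loglik(params, theta, x):
    """Negative Bernoulli log-likelihood of responses x under the 2PL."""
    a, b = params
    p = np.clip(p2pl(theta, a, b), 1e-10, 1.0 - 1e-10)
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

def glr_statistic(theta, x, a0, b0):
    """2 * log GLR for one item: freely estimated (a, b) under H1
    versus the calibration values (a0, b0) under H0."""
    fit = minimize(neg_loglik, x0=[a0, b0], args=(theta, x),
                   method="Nelder-Mead")
    ll_h1 = -fit.fun                          # maximized log-likelihood under H1
    ll_h0 = -neg_loglik([a0, b0], theta, x)   # log-likelihood under H0 (no drift)
    return 2.0 * (ll_h1 - ll_h0)
```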

The multivariate generalization—monitoring multiple parameters jointly—is the methodological core of the paper. For an item with parameters (a, b) under the 2PL model or (a, b, c) under the 3PL, the joint GLR evaluates whether any combination of drift across these parameters exceeds the joint detection threshold. Compared to running separate univariate tests on each parameter, the multivariate approach has two advantages: (1) it accounts for the correlation among parameter estimates from the same response data, avoiding the inflated joint Type I error of independent univariate tests; and (2) it captures patterns of drift that no single parameter would flag on its own (e.g., difficulty rising slightly and discrimination falling slightly, with each individual change below the univariate threshold but the joint pattern clearly anomalous).
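
The contrast with univariate monitoring can be sketched as a pair of decision rules. The chi-square critical values below assume the asymptotic null distribution of a fixed-sample GLR with degrees of freedom equal to the number of monitored parameters; the sequential procedure calibrates its thresholds for repeated testing instead, so this is only an illustration of the logic.

```python
from scipy.stats import chi2

alpha = 0.01
joint_crit = chi2.ppf(1.0 - alpha, df=2)    # one joint test on (a, b)
uni_crit = chi2.ppf(1.0 - alpha / 2, df=1)  # two Bonferroni-corrected tests

def flag_joint(glr_ab: float) -> bool:
    """Flag if the joint (a, b) statistic clears the joint threshold."""
    return glr_ab > joint_crit

def flag_univariate(glr_a: float, glr_b: float) -> bool:
    """Flag if either single-parameter statistic clears its own threshold."""
    return glr_a > uni_crit or glr_b > uni_crit
```

A drift pattern that moves both parameters modestly can push the joint statistic past joint_crit while each single-parameter statistic stays below uni_crit, which is the second advantage described above.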

Monitoring schedules. Kang’s framework supports both continuous monitoring (the test is recomputed after every new administration) and intermittent monitoring (the test is recomputed at fixed intervals or after fixed blocks of administrations). Continuous monitoring offers the fastest detection but is computationally heavier; intermittent monitoring trades a small detection delay for lower computational overhead. The choice depends on operational constraints: high-stakes tests with rapid administration frequency benefit from continuous monitoring, while lower-stakes or lower-volume tests can use intermittent monitoring.
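
A minimal monitoring loop, reusing the glr_statistic sketch above, makes the trade-off concrete: block_size=1 corresponds to continuous monitoring, block_size=K to intermittent monitoring. The warm-up sample size and threshold are placeholders rather than values from the paper.

```python
import numpy as np

def monitor_item(theta_stream, x_stream, a0, b0, block_size=1,
                 threshold=9.2, warmup=30):
    """Sequentially monitor one item; return the administration index at
    which it is flagged, or None if no alarm is raised."""
    thetas, xs = [], []
    for n, (th, x) in enumerate(zip(theta_stream, x_stream), start=1):
        thetas.append(th)
        xs.append(x)
        # Recompute the GLR only on the monitoring schedule, after warm-up.
        if n >= warmup and n % block_size == 0:
            stat = glr_statistic(np.array(thetas), np.array(xs), a0, b0)
            if stat > threshold:
                return n  # flag the item at administration n
    return None
```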

What the simulation and real-data validation show

The 2023 paper validates the procedure on simulated and real assessment data. The headline findings:

  • Detection power is satisfactory for parameter shifts of practical magnitude—shifts on the order of those typically observed in real test compromise. The procedure flags drift with high probability while keeping the false-alarm rate within nominal Type I error bounds.
  • Detection is timely. Average run lengths to detection are short enough to support practical operational response (item retirement, recalibration, or replacement) before substantial bias accumulates in ability estimates.
  • Multivariate monitoring outperforms univariate alternatives. The joint procedure detects drift earlier than parameter-by-parameter univariate tests at matched false-alarm rates, with the advantage growing as the number of monitored parameters increases.
  • The procedure is comparable to or better than existing CUSUM- and SPRT-based methods across the simulated drift conditions, with the multivariate generalization being the specific feature that drives the comparative advantage.

Operational deployment considerations

For test publishers considering implementation, several considerations follow from the paper’s design choices.

Threshold selection. The detection threshold trades false alarms against detection delay. Lower thresholds detect drift faster but generate more false flags; higher thresholds reduce false flags at the cost of slower detection. The appropriate threshold depends on the operational cost of each error type—the cost of investigating a false drift flag versus the cost of running a stale item undetected. Kang (2023) provides simulation guidance for matching threshold choice to operational tolerance.
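
One generic way to calibrate the threshold, in the spirit of (but not identical to) the paper's simulation guidance, is Monte Carlo under H₀: simulate response streams from the calibrated item, estimate how often the monitor falsely flags within a horizon, and sweep candidate thresholds until the rate matches operational tolerance. The sketch reuses p2pl and monitor_item from above; the standard-normal ability distribution and all constants are assumptions.

```python
import numpy as np

def false_alarm_rate(a0, b0, threshold, n_streams=200, horizon=500, seed=0):
    """Monte Carlo estimate of the probability that a non-drifted item is
    falsely flagged within `horizon` administrations."""
    rng = np.random.default_rng(seed)
    alarms = 0
    for _ in range(n_streams):
        theta = rng.standard_normal(horizon)                  # abilities ~ N(0, 1)
        x = (rng.random(horizon) < p2pl(theta, a0, b0)).astype(float)
        if monitor_item(theta, x, a0, b0, block_size=25,
                        threshold=threshold) is not None:
            alarms += 1
    return alarms / n_streams

# Sweep candidate thresholds and keep the smallest one whose estimated
# false-alarm rate stays within tolerance over the horizon.
for h in (6.0, 9.0, 12.0, 15.0):
    print(h, false_alarm_rate(a0=1.2, b0=0.0, threshold=h))
```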

Item-level versus pool-level decisions. The procedure flags individual items. Operational response then requires decisions about what to do with flagged items: retire and replace, recalibrate in place using the recent data, or schedule for re-examination at the next batch recalibration. These response decisions are downstream of the detection procedure and depend on the test’s overall item-management strategy.

Computational footprint. Continuous multivariate GLR monitoring across a full operational item bank is computationally non-trivial. For a bank of 500 items, each with multiple parameters, every new administration triggers updates to the monitoring statistics of the items administered. The Kang formulation is tractable on modern test-delivery infrastructure but requires explicit attention to the implementation rather than treating it as a free addition to existing monitoring workflows.
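
One way to keep the per-administration cost proportional to test length rather than bank size is to hold per-item monitoring state and update only the items actually administered. The sketch below reuses glr_statistic from earlier; the class and its parameters are illustrative, not an implementation from the paper.

```python
import numpy as np

class BankMonitor:
    """Per-item monitoring state for an operational bank; only items seen
    in an administration incur any update cost."""

    def __init__(self, calibration, threshold=9.2, warmup=30):
        self.calibration = calibration               # {item_id: (a0, b0)}
        self.history = {i: ([], []) for i in calibration}
        self.flagged = set()
        self.threshold = threshold
        self.warmup = warmup

    def record(self, item_id, theta, response):
        """Record one response to one item and re-test that item only."""
        if item_id in self.flagged:
            return
        thetas, xs = self.history[item_id]
        thetas.append(theta)
        xs.append(response)
        if len(xs) >= self.warmup:
            a0, b0 = self.calibration[item_id]
            stat = glr_statistic(np.array(thetas), np.array(xs), a0, b0)
            if stat > self.threshold:
                self.flagged.add(item_id)
```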

Position in the methodology landscape

The paper extends a nearly eighty-year line of sequential statistical work, anchored in Wald’s (1945) SPRT and Page’s (1954) CUSUM, into the specific domain of educational and psychological measurement. Glas’s (2000) chapter on item calibration and parameter drift articulated the operational need for ongoing monitoring; Lee and Lewis’s (2021) CUSUM-based framework provided one well-developed answer; Kang’s GLR framework provides another, with explicit multivariate generalization.

The methodological direction across these contributions is consistent: treat the operational test bank as a real-time-monitored measurement instrument rather than as a fixed-once asset. The shift parallels broader moves in adjacent fields (clinical trial sequential analysis, manufacturing quality control, financial fraud detection) where the move from batch evaluation to streaming detection has been the dominant trend over the past two decades.

Where the framework still has development space

Three directions for extension remain underdeveloped relative to the operational need.

The procedures are parametric, assuming the IRT model under which calibration was performed continues to apply. Real drift sometimes manifests as model-fit degradation rather than parameter shift within a fixed model—an item that was a 2PL item at calibration may behave more like a 3PL item under operational conditions because guessing increases under exposure. Sequential model-fit monitoring is a natural complement to sequential parameter monitoring but is not the focus of the 2023 framework.

The procedures monitor known operational items. Detecting compromise of items that have not yet been flagged for monitoring (because they are too new, or because the publisher does not know they have been exposed) requires complementary techniques—response-time anomaly detection, person-fit statistics, or external intelligence about leaks. The 2023 framework is part of a multi-pronged test-security strategy rather than a complete solution.

The procedures detect drift but do not diagnose the source. Drift due to exposure has different operational implications than drift due to population shift, and ideally the response would differ. Distinguishing these mechanisms requires additional information beyond what the GLR statistics provide.

Frequently asked questions

What is item parameter drift in IRT?

Item parameter drift is the change over time in an item’s calibrated IRT parameters (difficulty, discrimination, guessing) when the item is used operationally. Drift can result from item exposure (examinees learning answers), item leaks (compromised forms), or shifts in the operational population. When drift is undetected, ability estimates derived from the affected items become biased.

How is sequential testing different from batch testing for item monitoring?

Batch testing accumulates a recent sample of administrations and re-estimates item parameters all at once, comparing them to calibration values. Sequential testing evaluates evidence after each new administration (or each block of administrations) and decides whether to flag, continue, or stop. Sequential procedures detect drift faster at matched false-alarm rates.

What does the Kang (2023) generalized likelihood ratio approach add to existing methods?

Two things. First, it uses a generalized likelihood ratio rather than Wald’s SPRT or Page’s CUSUM, so the alternative-hypothesis parameter values do not have to be specified in advance. Second, it monitors multiple item parameters jointly, which exploits the typical correlation structure of drift and yields more powerful detection than monitoring each parameter independently.

How quickly does the procedure detect drift?

Average run lengths to detection in the simulation studies are short enough to support practical operational response—item retirement, recalibration, or replacement—before substantial bias accumulates in ability estimates. Exact detection delay depends on the magnitude of the drift and the chosen threshold.

Can sequential GLR be applied to operational testing programs without recalibration?

Yes. The procedure operates on existing calibrated parameters and incoming response data, so a publisher already running an item bank with conventional IRT calibration can layer GLR monitoring on top without recalibrating the bank. The main implementation cost is the computational overhead of updating monitoring statistics after each administration.

What does sequential GLR not detect?

The procedure detects parameter drift in items that are already being monitored, but it does not detect compromise of items that have not been flagged for monitoring, nor does it diagnose the source of drift (exposure vs. population shift vs. model misfit). Complementary techniques—response-time anomaly detection, person-fit statistics, sequential model-fit monitoring—are needed for a complete test-security strategy.

References
