Sequential Generalized Likelihood Ratio Tests for Item Monitoring
Published: June 1, 2023
The Kang (2023) Psychometrika paper applies sequential generalized likelihood ratio (GLR) testing to a problem that has become increasingly pressing as testing has moved online and continuous: detecting when item parameters in an operational test bank have drifted. The core issue is straightforward—if item difficulty, discrimination, or guessing parameters change over time (because items get exposed, leaked, or simply because the population shifts), then ability estimates derived from those items become biased. The methodological challenge is to detect drift quickly enough to act on it without raising too many false alarms, and to do so for multiple parameters jointly rather than one at a time.

Why item-parameter monitoring is operationally consequential

The standard IRT-based testing workflow has three phases: item calibration (estimating item parameters from a pretest sample), operational deployment (using the calibrated parameters to estimate ability for new examinees), and periodic recalibration (updating parameters when the calibration sample is judged to have aged out). The middle phase can last years for many tests, during which the parameters are treated as fixed.

This treatment is correct only as long as the items behave the same way in the operational population as they did in the calibration sample. Three failure modes are well-documented (Glas, 2000): items get exposed (examinees learn answers from prior takers), items leak (compromised by external publication of operational forms), and the operational population shifts (curriculum changes, demographic changes, intervention effects between cohorts). All three appear in the test data as drift in item parameters—typically the difficulty parameter softens (items become easier as more examinees see them) and the discrimination parameter degrades (the item separates ability levels less cleanly).

When drift goes undetected, the consequences propagate. Ability estimates become biased, equating across forms breaks down, and the test’s scores no longer have the meaning they were validated to have. The cost of running an operational test with stale calibrations is real, and the value of timely drift detection is correspondingly substantial.

What sequential GLR offers over standard alternatives

The classical approach to drift monitoring is to fit a fresh IRT model on a recent sample and compare its parameter estimates against the calibration estimates, applying confidence intervals or hypothesis tests at the parameter level. This batch approach requires accumulating enough data for stable re-estimation (typically several hundred administrations per item) and then conducting the comparison. It detects drift, but slowly: in operational contexts, weeks to months can pass between drift onset and detection.

Sequential testing approaches the problem differently. Rather than waiting for batch accumulation, sequential procedures evaluate evidence after every new administration (or every block of K administrations) and decide whether to flag an item, continue monitoring, or terminate. The foundational work is Wald’s (1945) sequential probability ratio test (SPRT), which evaluates the likelihood ratio between two hypothesized parameter values and stops when the ratio crosses one of two predetermined thresholds. The CUSUM (cumulative sum) framework, introduced by Page (1954) for industrial quality control, accumulates standardized deviations from a target value and signals when the cumulative deviation exceeds a threshold. Both have been adapted to item-parameter monitoring (Lee & Lewis, 2021, applied CUSUM specifically to continuous testing applications), and both are designed to provide faster detection than batch methods at the same false-alarm rate.
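
To make the contrast concrete, here is a minimal sketch of Page's one-sided CUSUM recursion as it might be applied to item monitoring, with the item's standardized response residuals as input. The function names, the reference value k, and the threshold h are illustrative choices, not parameters from any of the cited papers.

```python
# Minimal CUSUM sketch (Page, 1954), applied to standardized residuals z_t
# of observed item responses against the calibrated model.

def cusum_update(s_prev: float, z_t: float, k: float = 0.5) -> float:
    """One step of the one-sided recursion S_t = max(0, S_{t-1} + z_t - k)."""
    return max(0.0, s_prev + z_t - k)

def cusum_monitor(residuals, k=0.5, h=5.0):
    """Return the index at which the CUSUM statistic first exceeds the
    decision threshold h, or None if no alarm is raised."""
    s = 0.0
    for t, z in enumerate(residuals):
        s = cusum_update(s, z, k)
        if s > h:
            return t  # alarm: cumulative deviation exceeds threshold
    return None
```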

The Kang (2023) contribution sits in this lineage but generalizes it in two specific ways. First, the procedure is a generalized likelihood ratio rather than an SPRT or CUSUM, which makes it usable when the alternative-hypothesis parameter values are not specified in advance. Second, it monitors multiple item parameters jointly, which exploits the typical correlation structure of drift (difficulty and discrimination tend to drift together when items are exposed) and yields more powerful detection than monitoring each parameter independently.

The Kang procedure

The sequential GLR test compares two hypotheses at each monitoring step:

  1. H₀: parameters unchanged from the calibration values.
  2. H₁: parameters have drifted by some amount, with the magnitude estimated from the running data rather than specified in advance.

The test statistic is the ratio of the likelihood under H₁ (with drift estimated by maximum likelihood from observed responses since calibration) to the likelihood under H₀ (with drift fixed at zero). When the GLR exceeds a critical threshold, the procedure flags the item as drifted.
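
To fix ideas, the following is a minimal sketch of the GLR computation for a single 2PL item, recomputed from scratch on the responses accumulated since calibration. It treats examinee abilities as known (in an operational system they would be estimated from the other items) and uses a general-purpose optimizer; none of the names or numerical choices come from Kang (2023).

```python
import numpy as np
from scipy.optimize import minimize

def p2pl(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def neg_loglik(params, theta, x):
    """Negative Bernoulli log-likelihood of responses x under the 2PL."""
    a, b = params
    p = np.clip(p2pl(theta, a, b), 1e-10, 1.0 - 1e-10)
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

def glr_statistic(theta, x, a0, b0):
    """2 * log GLR for one item: freely estimated (a, b) under H1
    versus the calibration values (a0, b0) under H0."""
    fit = minimize(neg_loglik, x0=[a0, b0], args=(theta, x),
                   method="Nelder-Mead")
    ll_h1 = -fit.fun                          # maximized log-likelihood under H1
    ll_h0 = -neg_loglik([a0, b0], theta, x)   # log-likelihood under H0 (no drift)
    return 2.0 * (ll_h1 - ll_h0)
```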

The multivariate generalization—monitoring multiple parameters jointly—is the methodological core of the paper. For an item with parameters (a, b) under the 2PL model or (a, b, c) under the 3PL, the joint GLR evaluates whether any combination of drift across these parameters exceeds the joint detection threshold. Compared to running separate univariate tests on each parameter, the multivariate approach has two advantages: (1) it accounts for the correlation among parameter estimates from the same response data, avoiding the inflated joint Type I error of independent univariate tests; and (2) it captures patterns of drift that no single parameter would flag on its own (e.g., difficulty rising slightly and discrimination falling slightly, with each individual change below the univariate threshold but the joint pattern clearly anomalous).
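
The contrast with univariate monitoring can be sketched as a pair of decision rules. The chi-square critical values below assume the asymptotic null distribution of a fixed-sample GLR with degrees of freedom equal to the number of monitored parameters; the sequential procedure calibrates its thresholds for repeated testing instead, so this is only an illustration of the logic.

```python
from scipy.stats import chi2

alpha = 0.01
joint_crit = chi2.ppf(1.0 - alpha, df=2)    # one joint test on (a, b)
uni_crit = chi2.ppf(1.0 - alpha / 2, df=1)  # two Bonferroni-corrected tests

def flag_joint(glr_ab: float) -> bool:
    """Flag if the joint (a, b) statistic clears the joint threshold."""
    return glr_ab > joint_crit

def flag_univariate(glr_a: float, glr_b: float) -> bool:
    """Flag if either single-parameter statistic clears its own threshold."""
    return glr_a > uni_crit or glr_b > uni_crit
```

A drift pattern that moves both parameters modestly can push the joint statistic past joint_crit while each single-parameter statistic stays below uni_crit, which is the second advantage described above.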

Monitoring schedules. Kang’s framework supports both continuous monitoring (the test is recomputed after every new administration) and intermittent monitoring (the test is recomputed at fixed intervals or after fixed blocks of administrations). Continuous monitoring offers the fastest detection but is computationally heavier; intermittent monitoring trades a small detection delay for lower computational overhead. The choice depends on operational constraints: high-stakes tests with rapid administration frequency benefit from continuous monitoring, while lower-stakes or lower-volume tests can use intermittent monitoring.
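
A minimal monitoring loop, reusing the glr_statistic sketch above, makes the trade-off concrete: block_size=1 corresponds to continuous monitoring, block_size=K to intermittent monitoring. The warm-up sample size and threshold are placeholders rather than values from the paper.

```python
import numpy as np

def monitor_item(theta_stream, x_stream, a0, b0, block_size=1,
                 threshold=9.2, warmup=30):
    """Sequentially monitor one item; return the administration index at
    which it is flagged, or None if no alarm is raised."""
    thetas, xs = [], []
    for n, (th, x) in enumerate(zip(theta_stream, x_stream), start=1):
        thetas.append(th)
        xs.append(x)
        # Recompute the GLR only on the monitoring schedule, after warm-up.
        if n >= warmup and n % block_size == 0:
            stat = glr_statistic(np.array(thetas), np.array(xs), a0, b0)
            if stat > threshold:
                return n  # flag the item at administration n
    return None
```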

What the simulation and real-data validation show

The 2023 paper validates the procedure on simulated and real assessment data. The headline findings:

  • Detection power is satisfactory for parameter shifts of practical magnitude—shifts on the order of those typically observed in real test compromise. The procedure flags drift with high probability while keeping the false-alarm rate within nominal Type I error bounds.
  • Detection is timely. Average run lengths to detection are short enough to support practical operational response (item retirement, recalibration, or replacement) before substantial bias accumulates in ability estimates.
  • Multivariate monitoring outperforms univariate alternatives. The joint procedure detects drift earlier than parameter-by-parameter univariate tests at matched false-alarm rates, with the advantage growing as the number of monitored parameters increases.
  • The procedure is comparable to or better than existing CUSUM- and SPRT-based methods across the simulated drift conditions, with the multivariate generalization being the specific feature that drives the comparative advantage.

Operational deployment considerations

For test publishers considering implementation, several considerations follow from the paper’s design choices.

Threshold selection. The detection threshold trades false alarms against detection delay. Lower thresholds detect drift faster but generate more false flags; higher thresholds reduce false flags at the cost of slower detection. The appropriate threshold depends on the operational cost of each error type—the cost of investigating a false drift flag versus the cost of running a stale item undetected. Kang (2023) provides simulation guidance for matching threshold choice to operational tolerance.
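
One generic way to calibrate the threshold, in the spirit of (but not identical to) the paper's simulation guidance, is Monte Carlo under H₀: simulate response streams from the calibrated item, estimate how often the monitor falsely flags within a horizon, and sweep candidate thresholds until the rate matches operational tolerance. The sketch reuses p2pl and monitor_item from above; the standard-normal ability distribution and all constants are assumptions.

```python
import numpy as np

def false_alarm_rate(a0, b0, threshold, n_streams=200, horizon=500, seed=0):
    """Monte Carlo estimate of the probability that a non-drifted item is
    falsely flagged within `horizon` administrations."""
    rng = np.random.default_rng(seed)
    alarms = 0
    for _ in range(n_streams):
        theta = rng.standard_normal(horizon)                  # abilities ~ N(0, 1)
        x = (rng.random(horizon) < p2pl(theta, a0, b0)).astype(float)
        if monitor_item(theta, x, a0, b0, block_size=25,
                        threshold=threshold) is not None:
            alarms += 1
    return alarms / n_streams

# Sweep candidate thresholds and keep the smallest one whose estimated
# false-alarm rate stays within tolerance over the horizon.
for h in (6.0, 9.0, 12.0, 15.0):
    print(h, false_alarm_rate(a0=1.2, b0=0.0, threshold=h))
```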

Item-level versus pool-level decisions. The procedure flags individual items. Operational response then requires decisions about what to do with flagged items: retire and replace, recalibrate in place using the recent data, or schedule for re-examination at the next batch recalibration. These response decisions are downstream of the detection procedure and depend on the test’s overall item-management strategy.

Computational footprint. Continuous multivariate GLR monitoring across a full operational item bank is computationally non-trivial. For a bank of 500 items, each with multiple parameters, every new administration triggers updates to the monitoring statistics of the items administered. The Kang formulation is tractable on modern test-delivery infrastructure but requires explicit attention to the implementation rather than treating it as a free addition to existing monitoring workflows.
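
One way to keep the per-administration cost proportional to test length rather than bank size is to hold per-item monitoring state and update only the items actually administered. The sketch below reuses glr_statistic from earlier; the class and its parameters are illustrative, not an implementation from the paper.

```python
import numpy as np

class BankMonitor:
    """Per-item monitoring state for an operational bank; only items seen
    in an administration incur any update cost."""

    def __init__(self, calibration, threshold=9.2, warmup=30):
        self.calibration = calibration               # {item_id: (a0, b0)}
        self.history = {i: ([], []) for i in calibration}
        self.flagged = set()
        self.threshold = threshold
        self.warmup = warmup

    def record(self, item_id, theta, response):
        """Record one response to one item and re-test that item only."""
        if item_id in self.flagged:
            return
        thetas, xs = self.history[item_id]
        thetas.append(theta)
        xs.append(response)
        if len(xs) >= self.warmup:
            a0, b0 = self.calibration[item_id]
            stat = glr_statistic(np.array(thetas), np.array(xs), a0, b0)
            if stat > self.threshold:
                self.flagged.add(item_id)
```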

Position in the methodology landscape

The paper extends a nearly eighty-year line of sequential statistical work, anchored in Wald’s (1945) SPRT and Page’s (1954) CUSUM, into the specific domain of educational and psychological measurement. Glas’s (2000) chapter on item calibration and parameter drift articulated the operational need for ongoing monitoring; Lee and Lewis’s (2021) CUSUM-based framework provided one well-developed answer; Kang’s GLR framework provides another, with explicit multivariate generalization.

The methodological direction across these contributions is consistent: treat the operational test bank as a real-time-monitored measurement instrument rather than as a fixed-once asset. The shift parallels broader moves in adjacent fields (clinical trial sequential analysis, manufacturing quality control, financial fraud detection) where the move from batch evaluation to streaming detection has been the dominant trend over the past two decades.

Where the framework still has development space

Three directions for extension remain underdeveloped relative to the operational need.

The procedures are parametric, assuming the IRT model under which calibration was performed continues to apply. Real drift sometimes manifests as model-fit degradation rather than parameter shift within a fixed model—an item that was a 2PL item at calibration may behave more like a 3PL item under operational conditions because guessing increases under exposure. Sequential model-fit monitoring is a natural complement to sequential parameter monitoring but is not the focus of the 2023 framework.

The procedures monitor known operational items. Detecting compromise of items that have not yet been flagged for monitoring (because they are too new, or because the publisher does not know they have been exposed) requires complementary techniques—response-time anomaly detection, person-fit statistics, or external intelligence about leaks. The 2023 framework is part of a multi-pronged test-security strategy rather than a complete solution.

The procedures detect drift but do not diagnose the source. Drift due to exposure has different operational implications than drift due to population shift, and ideally the response would differ. Distinguishing these mechanisms requires additional information beyond what the GLR statistics provide.

Frequently asked questions

What is item parameter drift in IRT?

Item parameter drift is the change over time in an item’s calibrated IRT parameters (difficulty, discrimination, guessing) when the item is used operationally. Drift can result from item exposure (examinees learning answers), item leaks (compromised forms), or shifts in the operational population. When drift is undetected, ability estimates derived from the affected items become biased.

How is sequential testing different from batch testing for item monitoring?

Batch testing accumulates a recent sample of administrations and re-estimates item parameters all at once, comparing them to calibration values. Sequential testing evaluates evidence after each new administration (or each block of administrations) and decides whether to flag, continue, or stop. Sequential procedures detect drift faster at matched false-alarm rates.

What does the Kang (2023) generalized likelihood ratio approach add to existing methods?

Two things. First, it uses a generalized likelihood ratio rather than Wald’s SPRT or Page’s CUSUM, so the alternative-hypothesis parameter values do not have to be specified in advance. Second, it monitors multiple item parameters jointly, which exploits the typical correlation structure of drift and yields more powerful detection than monitoring each parameter independently.

How quickly does the procedure detect drift?

Average run lengths to detection in the simulation studies are short enough to support practical operational response—item retirement, recalibration, or replacement—before substantial bias accumulates in ability estimates. Exact detection delay depends on the magnitude of the drift and the chosen threshold.

Can sequential GLR be applied to operational testing programs without recalibration?

Yes. The procedure operates on existing calibrated parameters and incoming response data, so a publisher already running an item bank with conventional IRT calibration can layer GLR monitoring on top without recalibrating the bank. The main implementation cost is the computational overhead of updating monitoring statistics after each administration.

What does sequential GLR not detect?

The procedure detects parameter drift in items that are already being monitored, but it does not detect compromise of items that have not been flagged for monitoring, nor does it diagnose the source of drift (exposure vs. population shift vs. model misfit). Complementary techniques—response-time anomaly detection, person-fit statistics, sequential model-fit monitoring—are needed for a complete test-security strategy.

References
