
Interpreting Differential Item Functioning with Response Process Data

Published: December 16, 2024

A test item on which two groups of equally able examinees perform differently is said to show differential item functioning (DIF), and identifying such items is now a routine part of large-scale assessment quality control. The hard part has never been the detection — statistical tests for DIF have been mature for thirty years — but the interpretation: knowing why a flagged item behaves the way it does. Expert content reviewers and statistical DIF flags often disagree, leaving test developers with a list of suspicious items and no clear story about what drives the difference. A 2024 study by Li, Shin, Kuang, and Huggins-Manley shows that response process data — the digital traces of how examinees actually interact with computerized items — can fill in part of this missing layer.

What DIF is, and why detection isn’t enough

DIF occurs when examinees from two groups (commonly defined by gender, native language, or demographic background) who have the same level of the trait being measured nonetheless have different probabilities of getting an item right. The matching on ability is the key feature: a raw difference in pass rates is not DIF, because the groups may genuinely differ in ability. A conditional difference, after equating ability, is what flags the item.
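
In symbols, writing $Y$ for the scored item response, $\theta$ for the matched ability, and $G$ for group membership, DIF means

$$P(Y = 1 \mid \theta, G = g_1) \;\neq\; P(Y = 1 \mid \theta, G = g_2) \quad \text{for some } \theta,$$

whereas a raw pass-rate gap only compares the unconditional probabilities $P(Y = 1 \mid G = g)$.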

Standard DIF methods — Mantel-Haenszel, logistic regression, IRT-based approaches — are good at producing this flag. What they cannot do is explain it. A flagged item might function differently because of:

  • genuinely construct-irrelevant content that one group encounters more readily;
  • differences in test-taking strategy, pacing, or familiarity with the response format;
  • statistical noise plus multiple testing across hundreds of items;
  • group differences on a secondary, correlated trait that the item happens to tap.
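
To ground the detection step, here is a minimal sketch of the logistic-regression DIF test named above (the Swaminathan-Rogers approach), using standard Python libraries. The function names and inputs are illustrative, not from the study:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def logistic_dif_test(item_correct, total_score, group):
    """Logistic-regression DIF test (sketch).

    Compares a baseline model (matched ability only) against an
    augmented model adding group and group-by-ability terms; a
    significant likelihood-ratio improvement flags uniform and/or
    non-uniform DIF. Inputs are 1-D numpy arrays of equal length.
    """
    ability = (total_score - total_score.mean()) / total_score.std()
    X0 = sm.add_constant(ability)  # ability only
    X1 = sm.add_constant(np.column_stack([ability, group, ability * group]))
    m0 = sm.Logit(item_correct, X0).fit(disp=0)
    m1 = sm.Logit(item_correct, X1).fit(disp=0)
    lr = 2 * (m1.llf - m0.llf)     # likelihood-ratio statistic
    return lr, chi2.sf(lr, df=2)   # 2 df: group + interaction term

```

In operational use the matching variable is typically a rest score (total minus the studied item), and p-values are adjusted for the multiple testing across hundreds of items noted above.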

Expert reviewers — content specialists asked to read flagged items and judge whether the difference is meaningful — often produce low agreement with statistical flags and with one another. The result is a chronic gap between “this item is statistically suspicious” and “this is what the suspicion actually means.”

What response process data adds

Computerized assessments such as the Programme for the International Assessment of Adult Competencies (PIAAC) generate detailed log files: time stamps for every action, sequences of clicks and keystrokes, time on each item, time per response option, and patterns of revisiting earlier items. Goldhammer, Hahnel, and Kroehne’s 2020 methodological treatment of PIAAC log files maps out how these traces can be transformed into structured features for analysis — moving from raw event streams to interpretable variables like total response time, dwell time on specific options, and the frequency of particular action sequences.

Unlike a simple right/wrong score, these features describe the process by which an examinee arrived at an answer. If two groups solve an item correctly through different routes, or if one group spends systematically more time on a specific tool or distractor, those patterns carry information that a final-score-only DIF analysis cannot see.
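
To make the raw-stream-to-feature step concrete, here is a minimal sketch of flattening one examinee's per-item event stream into model-ready features. The event schema is hypothetical, not the actual PIAAC log format:

```python
from collections import Counter
from typing import Dict, List

def extract_features(events: List[dict]) -> Dict[str, float]:
    """Flatten one examinee's per-item event stream into features.

    Each event is assumed to look like {"t": 12.4, "action": "click_option_B"},
    with t in seconds from item onset (an assumed schema for illustration).
    """
    if not events:
        return {"total_time": 0.0, "n_actions": 0.0, "time_to_first_action": 0.0}
    times = [e["t"] for e in events]
    actions = [e["action"] for e in events]
    feats: Dict[str, float] = {
        "total_time": times[-1] - times[0],   # overall dwell on the item
        "n_actions": float(len(actions)),
        "time_to_first_action": times[0],     # hesitation before acting
    }
    # Count each bigram of consecutive actions as a sequence feature
    for (a, b), n in Counter(zip(actions, actions[1:])).items():
        feats[f"seq:{a}->{b}"] = float(n)
    return feats
```

The Goldhammer et al. treatment works through exactly this kind of transformation at much greater depth; the point is that timing and action-sequence features end up as ordinary columns that downstream models can consume.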

What Li and colleagues actually found

The 2024 Li et al. study used the PIAAC 2012 computer-based numeracy assessment to investigate gender DIF. Their analytic approach combined:

  • random forest models — to capture potentially non-linear and interactive effects of process features on DIF;
  • logistic regression with ridge regularization — to handle high-dimensional feature sets and produce more stable coefficient estimates;
  • variation in the assumed proportion of DIF items — to test how the methodology behaves under different prevalence assumptions, since real test forms vary.

Their core empirical claim is that the combination of timing features and action-sequence features is informative for distinguishing how the groups respond to flagged items. Neither timing nor sequence alone carried as much information as the combination. The most useful features were not pre-specified: the modeling approach surfaced them empirically from the log file data.
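
As an illustration of the kind of pipeline described (emphatically not the authors' actual code), here is a minimal scikit-learn sketch on placeholder data; a real analysis would substitute the extracted process features and group labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: rows are examinees on one flagged item, columns are
# process features (timing + action-sequence counts); y is group membership.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
y = rng.integers(0, 2, size=500)

# Random forest: captures non-linear and interactive feature effects.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
print("RF AUC:", cross_val_score(rf, X, y, scoring="roc_auc", cv=5).mean())

# Ridge-penalized (L2) logistic regression: stabler coefficients when
# the feature set is large relative to the sample.
ridge = LogisticRegression(penalty="l2", C=0.1, max_iter=5000)
print("Ridge AUC:", cross_val_score(ridge, X, y, scoring="roc_auc", cv=5).mean())

# Surface the process features the forest found most informative.
rf.fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("Top feature indices:", top)
```

If the models can distinguish the groups from process features alone on a flagged item, the features they lean on point to where in the response process the groups diverge.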

The interpretive payoff is that, for a given DIF-flagged item, the analyst can now describe not only that it functions differently but also which aspects of the response process differ between groups — for example, whether one group takes systematically longer, dwells on different distractors, or follows a different sequence of within-item actions. This is the missing bridge between statistical flag and substantive explanation.

Why low statistical-versus-expert-review agreement makes more sense in this light

The chronic disagreement between statistical DIF flags and expert content review can be partially explained by what each method has access to. The statistical flag knows the score outcomes; the content expert knows the item text. Neither has direct access to how examinees solved the problem. If the source of differential functioning lies in the process — e.g., one group skipping over a built-in calculator tool that helps with the item — neither the score nor the item text alone will reveal it.

Process data closes this gap by making the unobserved middle layer observable. The implication, consistent with Li et al.’s framing, is that some historic disagreements are not failures of either method but reflect a genuine information gap that process data can fill.

The ecological context: DIF is not only about items

Woitschach, Zumbo, and Fernández-Alonso (2019) make a complementary point at a different level of analysis. Treating DIF as purely an item-level property, they argue, ignores the multilevel structure of educational assessment data: students are nested in classrooms, schools, regions, and educational systems, and group-level context can drive differential functioning that no item-only analysis will resolve. Their multilevel modeling approach treats DIF as partly an emergent property of the testing context, not a pure item flaw.
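
Schematically, a two-level model of this kind (our notation; a sketch of the general idea rather than the authors' exact specification) lets the DIF effect itself vary across clusters:

$$\operatorname{logit} P(Y_{ij} = 1) = \beta_{0j} + \beta_{1j}\,\theta_{ij} + \beta_{2j}\,G_{ij}, \qquad \beta_{2j} = \gamma_{20} + \gamma_{21} W_j + u_{2j},$$

where $i$ indexes examinees and $j$ clusters (classrooms, schools, regions), $\theta$ is matched ability, $G$ is group membership, $W_j$ is a cluster-level covariate, and $u_{2j}$ is a cluster random effect. A nonzero $\gamma_{21}$ means the size of the DIF effect depends on context, not only on the item.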

Combined with the process-data approach, the picture is layered:

  • Statistical DIF detection identifies items where conditional pass rates differ across groups.
  • Expert review evaluates item content for plausible construct-irrelevant features.
  • Response process analysis (Li et al.) examines how groups actually engage with items.
  • Multilevel ecological analysis (Woitschach et al.) examines how group-level context shapes the patterns observed.

A complete account of why a particular item shows DIF will increasingly draw on more than one of these layers.

What this means for assessment practice

For test developers and large-scale assessment programs, several practical implications follow:

  • Process data should be retained and analyzed, not discarded. Many computerized assessments generate detailed log files that are then archived without systematic analysis. The Li et al. results suggest that information useful for interpreting DIF is sitting in those files.
  • Combined feature sets matter. Single-feature analyses (timing alone, or sequence alone) appear to leave information on the table.
  • The technique does not eliminate the need for expert review. Process features describe what differs; substantive judgment is still required to interpret whether the difference reflects construct-relevant variation or construct-irrelevant noise.
  • The method is not yet operational. Random-forest-and-ridge-regression analysis of log file features is research-grade rather than routine. Operationalization for production assessment programs would require methodological standardization that does not yet exist.

Limits of the current evidence

The Li et al. study is, at the time of writing, a single empirical investigation on one assessment domain (PIAAC numeracy, 2012 cycle) and one DIF dimension (gender). Several open questions remain:

  • Generalization across domains. Whether timing-and-sequence features carry similar information for literacy, problem-solving, or domain-specific tests is not yet established.
  • Generalization across DIF dimensions. Gender DIF may have different process correlates than language-based or age-based DIF.
  • Stability of findings. The features identified as most informative by random forest are not guaranteed to replicate. Different cycles or samples could highlight different features.
  • Causal interpretation. Process features describe correlations between behavior and DIF flags. They do not, on their own, establish that the process difference causes the score difference, only that it accompanies it.

Frequently Asked Questions

What is differential item functioning in plain language?

DIF describes a test item that two equally able people from different demographic groups are not equally likely to get right. The “equally able” qualifier is what distinguishes DIF from a raw group difference in scores.

Why is interpreting DIF harder than detecting it?

Statistical detection produces a flag, but the flag does not tell you why the item is functioning differently. Reasons can range from substantive content bias to test-taking-strategy differences to statistical noise from multiple testing. Distinguishing these requires information that pure score data does not contain.

What is response process data?

The detailed digital trace of how an examinee interacts with an item: time stamps, click sequences, dwell times on options, use of in-item tools, and revisits. Computer-based assessments like PIAAC routinely produce this data even when it is not part of the scoring.

Does this mean DIF analysis should now always include process data?

Not yet operationally. The methods are research-grade and require analytical infrastructure that most assessment programs do not have in production. The direction of travel is toward incorporation, but the field is not there yet.

Could process-data DIF analysis introduce new biases?

Yes, in principle. Process features themselves can be confounded with examinee characteristics like familiarity with the response interface or testing motivation. Treating process data as an unbiased window onto cognition would be a mistake; it is one more layer of evidence, not the final answer.

Is gender DIF on numeracy items always meaningful?

No. Some flagged items reflect minor measurement noise; others reflect substantive content issues; others reflect process differences that may or may not be construct-relevant. The whole point of layered interpretation is that “flagged for DIF” is the start of an analysis, not the conclusion.

Cite This Article

Jouve, X. (2024, December 16). Interpreting Differential Item Functioning with Response Process Data. PsychoLogic. https://www.psychologic.online/2024/12/16/differential-item-functioning-response-data/
