
Interpreting Differential Item Functioning with Response Process Data

Published: December 16, 2024

A test item on which two groups of equally able examinees perform differently is said to show differential item functioning (DIF), and identifying such items is now a routine part of large-scale assessment quality control. The hard part has never been the detection — statistical tests for DIF have been mature for thirty years — but the interpretation: knowing why a flagged item behaves the way it does. Expert content reviewers and statistical DIF flags often disagree, leaving test developers with a list of suspicious items and no clear story about what drives the difference. A 2024 study by Li, Shin, Kuang, and Huggins-Manley shows that response process data — the digital traces of how examinees actually interact with computerized items — can fill in part of this missing layer.

What DIF is, and why detection isn’t enough

DIF occurs when examinees from two groups (commonly defined by gender, native language, or demographic background) who have the same level of the trait being measured nonetheless have different probabilities of getting an item right. The matching on ability is the key feature: a raw difference in pass rates is not DIF, because the groups may genuinely differ in ability. A conditional difference, after equating ability, is what flags the item.
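
In symbols, writing $Y$ for the scored item response, $\theta$ for the matched ability, and $G$ for group membership, DIF means

$$P(Y = 1 \mid \theta, G = g_1) \;\neq\; P(Y = 1 \mid \theta, G = g_2) \quad \text{for some } \theta,$$

whereas a raw pass-rate gap only compares the unconditional probabilities $P(Y = 1 \mid G = g)$.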

Standard DIF methods — Mantel-Haenszel, logistic regression, IRT-based approaches — are good at producing this flag. What they cannot do is explain it. A flagged item might function differently because of:

  • genuinely construct-irrelevant content that one group encounters more readily;
  • differences in test-taking strategy, pacing, or familiarity with the response format;
  • statistical noise plus multiple testing across hundreds of items;
  • group differences on a secondary, correlated trait that the item happens to tap.
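
To ground the detection step, here is a minimal sketch of the logistic-regression DIF test named above (the Swaminathan-Rogers approach), using standard Python libraries. The function names and inputs are illustrative, not from the study:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def logistic_dif_test(item_correct, total_score, group):
    """Logistic-regression DIF test (sketch).

    Compares a baseline model (matched ability only) against an
    augmented model adding group and group-by-ability terms; a
    significant likelihood-ratio improvement flags uniform and/or
    non-uniform DIF. Inputs are 1-D numpy arrays of equal length.
    """
    ability = (total_score - total_score.mean()) / total_score.std()
    X0 = sm.add_constant(ability)  # ability only
    X1 = sm.add_constant(np.column_stack([ability, group, ability * group]))
    m0 = sm.Logit(item_correct, X0).fit(disp=0)
    m1 = sm.Logit(item_correct, X1).fit(disp=0)
    lr = 2 * (m1.llf - m0.llf)     # likelihood-ratio statistic
    return lr, chi2.sf(lr, df=2)   # 2 df: group + interaction term

```

In operational use the matching variable is typically a rest score (total minus the studied item), and p-values are adjusted for the multiple testing across hundreds of items noted above.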

Expert reviewers — content specialists asked to read flagged items and judge whether the difference is meaningful — often produce low agreement with statistical flags and with one another. The result is a chronic gap between “this item is statistically suspicious” and “this is what the suspicion actually means.”

What response process data adds

Computerized assessments such as the Programme for the International Assessment of Adult Competencies (PIAAC) generate detailed log files: time stamps for every action, sequences of clicks and keystrokes, time on each item, time per response option, and patterns of revisiting earlier items. Goldhammer, Hahnel, and Kroehne’s 2020 methodological treatment of PIAAC log files maps out how these traces can be transformed into structured features for analysis — moving from raw event streams to interpretable variables like total response time, dwell time on specific options, and the frequency of particular action sequences.

Unlike a simple right/wrong score, these features describe the process by which an examinee arrived at an answer. If two groups solve an item correctly through different routes, or if one group spends systematically more time on a specific tool or distractor, those patterns carry information that a final-score-only DIF analysis cannot see.
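
To make the raw-stream-to-feature step concrete, here is a minimal sketch of flattening one examinee's per-item event stream into model-ready features. The event schema is hypothetical, not the actual PIAAC log format:

```python
from collections import Counter
from typing import Dict, List

def extract_features(events: List[dict]) -> Dict[str, float]:
    """Flatten one examinee's per-item event stream into features.

    Each event is assumed to look like {"t": 12.4, "action": "click_option_B"},
    with t in seconds from item onset (an assumed schema for illustration).
    """
    if not events:
        return {"total_time": 0.0, "n_actions": 0.0, "time_to_first_action": 0.0}
    times = [e["t"] for e in events]
    actions = [e["action"] for e in events]
    feats: Dict[str, float] = {
        "total_time": times[-1] - times[0],   # overall dwell on the item
        "n_actions": float(len(actions)),
        "time_to_first_action": times[0],     # hesitation before acting
    }
    # Count each bigram of consecutive actions as a sequence feature
    for (a, b), n in Counter(zip(actions, actions[1:])).items():
        feats[f"seq:{a}->{b}"] = float(n)
    return feats
```

The Goldhammer et al. treatment works through exactly this kind of transformation at much greater depth; the point is that timing and action-sequence features end up as ordinary columns that downstream models can consume.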

What Li and colleagues actually found

The 2024 Li et al. study used the PIAAC 2012 computer-based numeracy assessment to investigate gender DIF. Their analytic approach combined:

  • random forest models — to capture potentially non-linear and interactive effects of process features on DIF;
  • logistic regression with ridge regularization — to handle high-dimensional feature sets and produce more stable coefficient estimates;
  • variation in the assumed proportion of DIF items — to test how the methodology behaves under different prevalence assumptions, since real test forms vary.

Their core empirical claim is that the combination of timing features and action-sequence features is informative for distinguishing how the groups respond to flagged items. Neither timing nor sequence alone carried as much information as the combination. The most useful features were not pre-specified: the modeling approach surfaced them empirically from the log file data.
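
As an illustration of the kind of pipeline described (emphatically not the authors' actual code), here is a minimal scikit-learn sketch on placeholder data; a real analysis would substitute the extracted process features and group labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: rows are examinees on one flagged item, columns are
# process features (timing + action-sequence counts); y is group membership.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
y = rng.integers(0, 2, size=500)

# Random forest: captures non-linear and interactive feature effects.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
print("RF AUC:", cross_val_score(rf, X, y, scoring="roc_auc", cv=5).mean())

# Ridge-penalized (L2) logistic regression: stabler coefficients when
# the feature set is large relative to the sample.
ridge = LogisticRegression(penalty="l2", C=0.1, max_iter=5000)
print("Ridge AUC:", cross_val_score(ridge, X, y, scoring="roc_auc", cv=5).mean())

# Surface the process features the forest found most informative.
rf.fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("Top feature indices:", top)
```

If the models can distinguish the groups from process features alone on a flagged item, the features they lean on point to where in the response process the groups diverge.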

The interpretive payoff is that, for a given DIF-flagged item, the analyst can now describe not only that it functions differently but also which aspects of the response process differ between groups — for example, whether one group takes systematically longer, dwells on different distractors, or follows a different sequence of within-item actions. This is the missing bridge between statistical flag and substantive explanation.

Why low statistical-versus-expert-review agreement makes more sense in this light

The chronic disagreement between statistical DIF flags and expert content review can be partially explained by what each method has access to. The statistical flag knows the score outcomes; the content expert knows the item text. Neither has direct access to how examinees solved the problem. If the source of differential functioning lies in the process — e.g., one group skipping over a built-in calculator tool that helps with the item — neither the score nor the item text alone will reveal it.

Process data closes this gap by making the unobserved middle layer observable. The implication, consistent with Li et al.’s framing, is that some historic disagreements are not failures of either method but reflect a genuine information gap that process data can fill.

The ecological context: DIF is not only about items

Woitschach, Zumbo, and Fernández-Alonso (2019) make a complementary point at a different level of analysis. Treating DIF as purely an item-level property, they argue, ignores the multilevel structure of educational assessment data: students are nested in classrooms, schools, regions, and educational systems, and group-level context can drive differential functioning that no item-only analysis will resolve. Their multilevel modeling approach treats DIF as partly an emergent property of the testing context, not a pure item flaw.
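
Schematically, a two-level model of this kind (our notation; a sketch of the general idea rather than the authors' exact specification) lets the DIF effect itself vary across clusters:

$$\operatorname{logit} P(Y_{ij} = 1) = \beta_{0j} + \beta_{1j}\,\theta_{ij} + \beta_{2j}\,G_{ij}, \qquad \beta_{2j} = \gamma_{20} + \gamma_{21} W_j + u_{2j},$$

where $i$ indexes examinees and $j$ clusters (classrooms, schools, regions), $\theta$ is matched ability, $G$ is group membership, $W_j$ is a cluster-level covariate, and $u_{2j}$ is a cluster random effect. A nonzero $\gamma_{21}$ means the size of the DIF effect depends on context, not only on the item.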

Combined with the process-data approach, the picture is layered:

  • Statistical DIF detection identifies items where conditional pass rates differ across groups.
  • Expert review evaluates item content for plausible construct-irrelevant features.
  • Response process analysis (Li et al.) examines how groups actually engage with items.
  • Multilevel ecological analysis (Woitschach et al.) examines how group-level context shapes the patterns observed.

A complete account of why a particular item shows DIF will increasingly draw on more than one of these layers.

What this means for assessment practice

For test developers and large-scale assessment programs, several practical implications follow:

  • Process data should be retained and analyzed, not discarded. Many computerized assessments generate detailed log files that are then archived without systematic analysis. The Li et al. results suggest that information useful for interpreting DIF is sitting in those files.
  • Combined feature sets matter. Single-feature analyses (timing alone, or sequence alone) appear to leave information on the table.
  • The technique does not eliminate the need for expert review. Process features describe what differs; substantive judgment is still required to interpret whether the difference reflects construct-relevant variation or construct-irrelevant noise.
  • The method is not yet operational. Random-forest-and-ridge-regression analysis of log file features is research-grade rather than routine. Operationalization for production assessment programs would require methodological standardization that does not yet exist.

Limits of the current evidence

The Li et al. study is, at the time of writing, a single empirical investigation on one assessment domain (PIAAC numeracy, 2012 cycle) and one DIF dimension (gender). Several open questions remain:

  • Generalization across domains. Whether timing-and-sequence features carry similar information for literacy, problem-solving, or domain-specific tests is not yet established.
  • Generalization across DIF dimensions. Gender DIF may have different process correlates than language-based or age-based DIF.
  • Stability of findings. The features identified as most informative by random forest are not guaranteed to replicate. Different cycles or samples could highlight different features.
  • Causal interpretation. Process features describe correlations between behavior and DIF flags. They do not, on their own, establish that the process difference causes the score difference, only that it accompanies it.

Frequently Asked Questions

What is differential item functioning in plain language?

DIF describes a test item that two equally able people from different demographic groups are not equally likely to get right. The “equally able” qualifier is what distinguishes DIF from a raw group difference in scores.

Why is interpreting DIF harder than detecting it?

Statistical detection produces a flag, but the flag does not tell you why the item is functioning differently. Reasons can range from substantive content bias to test-taking-strategy differences to statistical noise from multiple testing. Distinguishing these requires information that pure score data does not contain.

What is response process data?

The detailed digital trace of how an examinee interacts with an item: time stamps, click sequences, dwell times on options, use of in-item tools, and revisits. Computer-based assessments like PIAAC routinely produce this data even when it is not part of the scoring.

Does this mean DIF analysis should now always include process data?

Not yet operationally. The methods are research-grade and require analytical infrastructure that most assessment programs do not have in production. The direction of travel is toward incorporation, but the field is not there yet.

Could process-data DIF analysis introduce new biases?

Yes, in principle. Process features themselves can be confounded with examinee characteristics like familiarity with the response interface or testing motivation. Treating process data as an unbiased window onto cognition would be a mistake; it is one more layer of evidence, not the final answer.

Is gender DIF on numeracy items always meaningful?

No. Some flagged items reflect minor measurement noise; others reflect substantive content issues; others reflect process differences that may or may not be construct-relevant. The whole point of layered interpretation is that “flagged for DIF” is the start of an analysis, not the conclusion.

Cite This Article

Jouve, X. (2024, December 16). Interpreting Differential Item Functioning with Response Process Data. PsychoLogic. https://www.psychologic.online/2024/12/16/differential-item-functioning-response-data/
