What are future directions?

The study authors recommend further research to evaluate the NIHTB-CB’s ability to measure treatment-induced cognitive changes and to establish thresholds for clinically meaningful improvements in daily functioning. Understanding these links could enhance the tool’s application in practical and therapeutic contexts.

Shields et al. (2023) provide compelling evidence for the utility of the NIHTB-CB in tracking cognitive development in individuals with ID. By identifying both its strengths and areas for further exploration, this research lays the groundwork for its expanded use in clinical trials and intervention studies. This tool shows promise as a reliable and sensitive measure, particularly for diverse ID populations.


Evaluating the NIH Toolbox for Cognitive Change in Intellectual Disabilities

Published: February 22, 2023 · Last reviewed: May 4, 2026


The National Institutes of Health Toolbox Cognition Battery (NIHTB-CB) is a brief, iPad-based battery of cognitive tests that has become one of the most widely used cognitive assessment tools in research over the past decade. It was originally developed for research on healthy populations across the lifespan, then adapted for use with clinical groups. A central question for the field has been whether the NIHTB-CB can serve as a sensitive outcome measure in clinical trials for treatments aimed at cognitive impairment, particularly in individuals with intellectual and developmental disabilities (ID). A 2023 longitudinal study by Shields and colleagues in Neurology answers a key part of that question: yes, the NIHTB-CB detects cognitive change over time in this population, and does so with sensitivity comparable to or exceeding that of an established intelligence battery.

Why measuring cognitive change in ID is hard

Most cognitive assessment tools were designed to measure differences between individuals — to identify whether a child’s cognitive functioning falls in a particular range and to support diagnostic classifications. Measuring change within an individual over time, particularly in response to treatment, is a different and harder problem. Several specific challenges arise in ID populations:

Floor effects. Many standardized tests are designed for typically developing populations. In individuals with severely limited cognitive abilities, scores may cluster at the low end of the scale, leaving little room to detect improvement.
Test-retest reliability under repeated administration. Cognitive batteries developed for one-time diagnostic use may not maintain measurement properties under serial administration.
Heterogeneity of ID etiologies. Down syndrome, fragile X syndrome, and other genetic and idiopathic ID conditions produce distinct cognitive profiles. A measure sensitive to change in one may be insensitive in another.
Construct validity at low cognitive levels. Tests designed to measure executive function or working memory in typical populations may, at very low ability levels, be picking up motivational, attentional, or instructional understanding effects rather than the targeted construct.
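The floor-effect problem described above can be screened for with a very simple check: what share of a sample sits at the lowest (or highest) obtainable score. The sketch below is illustrative only. The function name, the score bounds, and the example data are assumptions made for this example and do not come from the NIHTB-CB manuals or from the studies discussed here.

```python
from typing import Sequence, Tuple


def floor_ceiling_rates(
    scores: Sequence[float], minimum: float, maximum: float
) -> Tuple[float, float]:
    """Return the proportion of scores at the floor and at the ceiling.

    `minimum` and `maximum` are the lowest and highest obtainable scores
    on the (hypothetical) scale being screened.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    at_floor = sum(1 for s in scores if s <= minimum)
    at_ceiling = sum(1 for s in scores if s >= maximum)
    return at_floor / n, at_ceiling / n


# Hypothetical raw scores on a 0-10 scale (not real data).
example_scores = [0, 0, 0, 2, 5, 7, 10, 0]
floor_rate, ceiling_rate = floor_ceiling_rates(example_scores, minimum=0, maximum=10)
print(f"at floor: {floor_rate:.1%}, at ceiling: {ceiling_rate:.1%}")
```

A large share of examinees at the floor means that improvement in those individuals cannot register on the scale, whatever the treatment does.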

These issues have made cognitive endpoint selection one of the most contested aspects of clinical-trial design for ID treatments. Several pharmaceutical trials in fragile X syndrome and Down syndrome have failed to show significant treatment effects on behavioral and cognitive measures — a result that may reflect either ineffective treatments or insensitive outcome measures. Distinguishing these requires assessment tools whose change-detection properties are independently established.

The validation arc: 2016, 2020, 2023

The NIHTB-CB validation work for ID populations has unfolded across three sequential studies, each building on the last.

Hessl et al. (2016) in Journal of Neurodevelopmental Disorders reported three pilot studies with fragile X syndrome (n = 63), Down syndrome (n = 47), and idiopathic ID (n = 16) participants. The mean mental age across groups was approximately 5–6 years. Findings:

Feasibility was good to excellent (≥80% valid scores) for participants above mental age 4 for all tests except list sorting (working memory).
Convergent validity was comparable to or better than that observed in typically developing children for executive function and language measures.
Composite scores correlated moderately to strongly with adaptive behavior and full-scale IQ.
Group-specific cognitive profiles were detected: FXS and DS showed attention and inhibitory control deficits relative to idiopathic ID; FXS showed reading weakness; DS showed receptive vocabulary weakness.

This established that the NIHTB-CB could be administered to ID populations and produced interpretable scores — but did not yet establish that it could detect change.

Shields et al. (2020) in Neurology extended the work to 242 individuals with FXS, DS, or other ID, ages 6–25, who were retested after 1 month. The study demonstrated:

Excellent feasibility above mental age 5.0 across all tests.
Test-retest reliability ranging from moderate to strong.
Convergent validity ranging from moderate to strong.
Each test and the Crystallized and Fluid Composite scores correlated moderately to strongly with IQ.
Known-groups validity through detection of expected syndrome-specific deficits (executive function in FXS, receptive language in DS).

The conclusion was that the NIHTB-CB is reliable and valid for individuals with ID with mental age approximately 5 years and above. The remaining question — whether it could detect developmental and treatment-related change over longer intervals — required a longitudinal study.

Shields et al. (2023) in Neurology answered that question. Using 256 participants aged 6–27 with FXS, DS, and other intellectual disability, the team administered both the NIHTB-CB and the Stanford-Binet Intelligence Scales, Fifth Edition (SB5) at baseline and again two years later, then applied latent change score models to estimate group-level cognitive growth. Key findings:

The NIHTB-CB detected developmental gains comparable to or greater than the SB5. The sensitivity-to-change comparison favored the NIHTB-CB on most domains.
Idiopathic ID participants showed gains across most domains, with significant growth at age 10, continued growth at age 16, and stability into early adulthood (age 22).
FXS participants showed delayed improvements in attention and inhibitory control — a different developmental trajectory than the idiopathic group.
DS participants showed slower growth in receptive vocabulary but notable gains in working memory and attention/inhibitory control during early adulthood.
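For readers unfamiliar with latent change score models, a generic two-occasion form is sketched below. This is the common textbook specification and is not necessarily the exact model fitted by Shields et al.; all symbols are introduced here for illustration.

```latex
\begin{align}
y_{1i} &= \eta_{1i} + \varepsilon_{1i}, \\
y_{2i} &= \eta_{2i} + \varepsilon_{2i}, \\
\eta_{2i} &= \eta_{1i} + \Delta\eta_{i}, \\
\Delta\eta_{i} &= \mu_{\Delta} + \beta\,\eta_{1i} + \zeta_{i}.
\end{align}
```

Here $y_{1i}$ and $y_{2i}$ are person $i$'s observed scores at baseline and follow-up, $\eta_{1i}$ and $\eta_{2i}$ are the corresponding error-free latent scores, and $\varepsilon_{1i}$, $\varepsilon_{2i}$ are measurement errors. The latent change $\Delta\eta_{i}$ is modeled directly, with intercept $\mu_{\Delta}$, a proportional effect $\beta$ of the baseline level, and a residual $\zeta_{i}$. Group-level growth is read from the mean of $\Delta\eta_{i}$, which is why results from this kind of model describe groups and not individual examinees.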

The 2023 paper completes the validation arc: the NIHTB-CB is feasible (2016), reliable and valid at a single time point (2020), and sensitive to longitudinal cognitive change (2023). The three studies together establish it as a defensible outcome measure for clinical research in ID.

What “sensitivity to change” actually requires

A measure can be reliable and valid at a single administration without being a good change detector. The properties needed for a sensitive change measure include:

Adequate floor. Score distributions must extend below the typical performance level of the population so that improvement has somewhere to register.
Adequate ceiling. Conversely, scores cannot saturate at the top, particularly for higher-functioning individuals, or improvement above that point is invisible.
Linear scaling within the relevant range. A 5-point improvement at the low end of the scale should mean roughly the same as a 5-point improvement at the middle.
Low practice effects. Repeated administration of the same items teaches the test and inflates scores in ways that can mimic genuine improvement.
Sensitivity to small effects. Treatment effects in ID trials are typically small. A measure that requires a large effect to detect a signal will fail to reject the null even when treatment is genuinely working.
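The last point can be made concrete with a rough calculation. Under classical test theory, measurement error inflates the outcome's standard deviation, so a true standardized effect d shrinks to roughly d × √reliability when observed. The sketch below combines that attenuation with the usual normal-approximation sample-size formula for a two-group comparison. The effect size and reliability values are illustrative assumptions, not estimates from the NIHTB-CB studies, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def observed_effect(true_d: float, reliability: float) -> float:
    """Attenuate a true standardized effect by outcome reliability.

    Assumes classical test theory: observed SD = true SD / sqrt(reliability).
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return true_d * math.sqrt(reliability)


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per arm for a two-sided, two-group comparison.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2,
    which slightly understates the n required by an exact t-test.
    """
    if d <= 0:
        raise ValueError("d must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Illustrative values only: a true effect of d = 0.3 measured with
# outcomes of differing reliability.
for reliability in (1.0, 0.9, 0.6):
    d_obs = observed_effect(0.3, reliability)
    print(f"reliability={reliability:.1f}  observed d={d_obs:.3f}  "
          f"n per group={n_per_group(d_obs)}")
```

Under these assumptions, a less reliable outcome measure needs noticeably more participants to detect the same underlying treatment effect, which is one reason endpoint selection matters so much in small ID trials.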

The 2023 Shields et al. study addresses several but not all of these properties. Its comparison with the SB5 establishes relative sensitivity but does not by itself establish that absolute change detection is at the level required for a treatment trial. The authors are explicit that further work is needed to (a) test the battery’s sensitivity to treatment-induced changes specifically and (b) establish what magnitude of NIHTB-CB score change corresponds to clinically meaningful improvement in daily functioning.

The general-population reference

The NIHTB-CB’s foundational validation work in adults — Heaton, Akshoomoff, Tulsky, Mungas, Weintraub, and colleagues’ 2014 paper in Journal of the International Neuropsychological Society, and Akshoomoff et al.’s 2013 monograph contribution — established the psychometric properties of the Crystallized, Fluid, and Total Composite scores in healthy populations. The Crystallized Composite combines vocabulary and reading subtests; the Fluid Composite combines the remaining tests measuring executive function, working memory, episodic memory, and processing speed. These composites form the backbone of NIHTB-CB interpretation and are the same metrics used in the ID-population work.

The translation from healthy-adult norms to ID populations is non-trivial: scores produced by an examinee with severe ID will fall well below the normative distribution, and the meaning of those scores in clinical-decision terms requires the additional validation work the Shields-Hessl group has conducted.

Practical implications for clinical research

Several practical implications follow from this body of work:

The NIHTB-CB is now defensible as a primary cognitive endpoint in ID clinical trials, particularly for participants with mental age approximately 5 years or above. The Shields 2023 study removes a major historical objection.
Etiology-specific developmental trajectories matter. Trial designs comparing treatment and control groups within a specific ID syndrome (FXS, DS, or other) need baseline cognitive profiles that account for the syndrome-specific developmental pattern. Pooling across etiologies risks averaging away the very effects a treatment may produce.
The Fluid Composite is often the recommended primary endpoint. For ID populations specifically, the Fluid Composite tends to show the strongest psychometric performance, partly because differentiation between specific subtests is limited at the lower end of the cognitive distribution.
Adaptations are needed for severe ID. The validation work establishes utility above mental age 5; below that level, additional adaptations or alternative measures are required. This is a real limitation for treatment trials including individuals with the most severe cognitive impairment.
Longer-term assessment intervals are likely better than short ones. The 2-year interval in Shields 2023 was sufficient to detect group-level developmental change. Shorter intervals may have insufficient signal for reliable change detection given the slow pace of cognitive change in this population.

What this body of work does not establish

Several limitations are worth emphasizing:

Group-level vs. individual-level change. The Shields 2023 paper used latent change score models to characterize group-level developmental trajectories. Whether the NIHTB-CB is sensitive to individual-level change of a magnitude relevant to clinical decisions is a separate question.
Treatment-induced vs. developmental change. Demonstrating that a battery can detect natural developmental gain over two years does not by itself demonstrate that it can detect treatment-induced gain over a shorter trial window.
Clinically meaningful thresholds. The amount of NIHTB-CB score change that corresponds to a clinically meaningful improvement in daily functioning is not yet established.
Generalization beyond FXS, DS, and idiopathic ID. Several other genetic ID conditions (Williams syndrome, Angelman syndrome, others) have not been included in the validation samples.
Adult lifespan beyond age 27. Cognitive aging trajectories in ID populations may differ from those in typical populations; the NIHTB-CB performance profile across older adult ID samples is less established.
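To illustrate the gap between group-level and individual-level change, one widely used individual-level statistic is the Jacobson-Truax reliable change index (RCI). It is offered here as a generic example and was not part of the Shields et al. analyses. The sketch assumes a baseline standard deviation and test-retest reliability are available; the numbers used (SD of 15, reliability of 0.85) are hypothetical, and this basic form ignores practice effects.

```python
import math


def reliable_change_index(
    baseline: float,
    follow_up: float,
    sd_baseline: float,
    retest_reliability: float,
) -> float:
    """Jacobson-Truax RCI: observed change divided by the standard
    error of the difference between two administrations."""
    if sd_baseline <= 0:
        raise ValueError("sd_baseline must be positive")
    if not 0.0 <= retest_reliability < 1.0:
        raise ValueError("retest_reliability must be in [0, 1)")
    sem = sd_baseline * math.sqrt(1.0 - retest_reliability)  # standard error of measurement
    s_diff = math.sqrt(2.0) * sem                            # standard error of the difference
    return (follow_up - baseline) / s_diff


def is_reliable_change(rci: float, critical: float = 1.96) -> bool:
    """True when the change exceeds what measurement error alone would
    plausibly produce (two-sided, 95% by default)."""
    return abs(rci) >= critical


# Hypothetical examinee: standard score rises from 90 to 100.
rci = reliable_change_index(90, 100, sd_baseline=15, retest_reliability=0.85)
print(f"RCI = {rci:.2f}, reliable change: {is_reliable_change(rci)}")
```

With these illustrative inputs, a 10-point gain falls short of the 1.96 criterion, which shows how a change that is detectable on average across a group can still be indistinguishable from measurement error for a single person.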

Frequently Asked Questions

What is the NIH Toolbox Cognition Battery?

A brief, iPad-administered battery of seven cognitive tests measuring executive function, working memory, episodic memory, processing speed, attention/inhibitory control, vocabulary, and reading. It was developed under the NIH Blueprint for Neuroscience Research and is used widely across research and clinical applications.

Is the NIHTB-CB an IQ test?

Not in the diagnostic sense. It produces composite cognitive scores that correlate strongly with full-scale IQ but is designed primarily as a research and outcome-measurement tool, not as a clinical diagnostic IQ battery. It does not replace the WAIS, Stanford-Binet, or similar batteries for formal IQ assessment.

How long does it take to administer?

Approximately 30 minutes for the full battery in typically developing examinees. Administration time may extend somewhat in ID populations due to instructional supports and accommodations.

Why does it matter that the NIHTB-CB can detect change in ID populations?

Because clinical trials of treatments for FXS, DS, and other ID conditions need a cognitive outcome measure that is sensitive to small improvements. Without such a measure, treatments that genuinely work cannot be distinguished from those that do not.

Is the NIHTB-CB better than the Stanford-Binet for ID populations?

For change detection at group level over two-year intervals, the Shields 2023 data suggest comparable or superior sensitivity. For individual diagnostic classification, full-scale IQ batteries like the SB5 and WAIS retain their primary role.

What’s the lower limit of who can be tested?

The validation work establishes utility for individuals with mental age approximately 5 years and above. For more severely impaired individuals, alternative measures or additional adaptations are needed.

Can the NIHTB-CB detect treatment effects in clinical trials?

The validation arc establishes that the battery detects group-level developmental change. Whether it detects treatment-induced change in trials specifically — and at what threshold a change is clinically meaningful — is the next research priority.

References

Shields, R. H., Kaat, A., Sansone, S. M., Michalak, C., Coleman, J., Thompson, T., McKenzie, F. J., Dakopolos, A., Riley, K., Berry-Kravis, E., Widaman, K. F., Gershon, R. C., & Hessl, D. (2023). Sensitivity of the NIH Toolbox to Detect Cognitive Change in Individuals With Intellectual and Developmental Disability. Neurology, 100(8), e778–e789. https://doi.org/10.1212/WNL.0000000000201528
Shields, R. H., Kaat, A. J., McKenzie, F. J., Drayton, A., Sansone, S. M., Coleman, J., Michalak, C., Riley, K., Berry-Kravis, E., Gershon, R. C., Widaman, K. F., & Hessl, D. (2020). Validation of the NIH Toolbox Cognitive Battery in intellectual disability. Neurology, 94(12), e1229–e1240. https://doi.org/10.1212/WNL.0000000000009131
Hessl, D., Sansone, S. M., Berry-Kravis, E., Riley, K., Widaman, K. F., Abbeduto, L., Schneider, A., Coleman, J., Oaklander, D., Rhodes, K. C., & Gershon, R. C. (2016). The NIH Toolbox Cognitive Battery for intellectual disabilities: three preliminary studies and future directions. Journal of Neurodevelopmental Disorders, 8(1), 35. https://doi.org/10.1186/s11689-016-9167-4
Heaton, R. K., Akshoomoff, N., Tulsky, D., Mungas, D., Weintraub, S., Dikmen, S., Beaumont, J., Casaletto, K. B., Conway, K., Slotkin, J., & Gershon, R. (2014). Reliability and Validity of Composite Scores from the NIH Toolbox Cognition Battery in Adults. Journal of the International Neuropsychological Society, 20(6), 588–598. https://doi.org/10.1017/S1355617714000241
Akshoomoff, N., Beaumont, J. L., Bauer, P. J., Dikmen, S. S., Gershon, R. C., Mungas, D., Slotkin, J., Tulsky, D., Weintraub, S., Zelazo, P. D., & Heaton, R. K. (2013). NIH Toolbox Cognition Battery (CB): Composite scores of crystallized, fluid, and overall cognition. Monographs of the Society for Research in Child Development, 78(4), 119–132. https://doi.org/10.1111/mono.12038

Xavier Jouve, Ph.D., Psychometrician

Xavier Jouve, Ph.D., is a psychometrician and quantitative psychologist specializing in cognitive ability measurement, item response theory, and test development. He is Head of Research at Cogn-IQ, where he has designed and validated seven cognitive assessment instruments — including the JCTI (inductive reasoning), JCCES (crystallized intelligence), IAW (vocabulary), JCFS (figurative sequences), JCWS (verbal reasoning), GIE (general knowledge), and WN (logical inference) — collectively normed on over 13,000 examinees. His work applies 2PL IRT modeling, computerized adaptive testing, and advanced composite scoring methods (including the modified Tellegen & Briggs Formula 4 with cubic correction) to produce research-grade cognitive measures available online. ORCID: 0009-0006-1283-045X
