
Continuous Norming for Cognitive Tests

Improving Norm Score Quality with Regression-Based Continuous Norming
Published: April 14, 2021
The standard practice in psychometric test publication is to develop norm tables by stratifying the standardization sample into age bands and computing percentile-rank tables within each band. The procedure is intuitive and has been the de facto industry standard for the better part of a century, but it has known statistical pathologies: norm scores are discontinuous at age-band boundaries, sampling noise within bands produces non-monotonic norms, missing data at extreme ages cannot be extrapolated, and the within-band sample-size requirements are punishing. Lenhard and Lenhard's (2021) Educational and Psychological Measurement paper compares semiparametric continuous norming (SPCN), a regression-based alternative implemented in their open-source cNORM R package, against conventional banded norming across an extensive simulated test landscape. The result is consistent and methodologically consequential: SPCN matches or exceeds conventional norming with substantially smaller standardization samples, and the gap widens as the conventional approach hits its data-hungry failure modes.

What conventional norming actually does, and where it breaks

For a test administered across an age range, conventional norm-table construction proceeds in three steps. First, the standardization sample is stratified into age bands (often six-month or one-year bands for child tests, larger bands for adult tests). Second, within each band, raw scores are ranked and converted to percentiles, then to standardized norm scores (T-scores, IQ scores, scaled scores, etc.). Third, the resulting age-band-by-raw-score lookup tables are published and used to convert future raw scores into normed scores.
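
The mechanics are easy to reproduce. Here is a minimal sketch in base R with simulated data; the age range, band width, and score model are illustrative choices, not values from the paper:

```r
# Conventional banded norming on simulated data (illustrative values).
set.seed(1)
n   <- 3000
age <- runif(n, 6, 12)                                     # ages 6-12
raw <- rbinom(n, size = 40, prob = plogis((age - 9) / 2))  # raw score rises with age

band <- cut(age, breaks = seq(6, 12, by = 1), include.lowest = TRUE)  # one-year bands

# Within each band: mid-rank percentile -> T-score (M = 50, SD = 10)
to_T <- function(x) 50 + 10 * qnorm((rank(x) - 0.5) / length(x))
T_score <- ave(raw, band, FUN = to_T)

head(data.frame(age = round(age, 2), raw, band, T = round(T_score, 1)))
```

Because each band is normed in isolation, the same raw score can receive a visibly different T-score on either side of a band boundary, which is the first of the pathologies discussed next.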

The procedure is exact when each band has a large, representative sample at every relevant raw-score value. In practice it rarely does, and three pathologies emerge. Boundary discontinuities: a child whose age places her at the upper boundary of one band gets one normed score; a child one day older is reassigned to the next band and may receive a substantially different score for the same raw performance. Within-band sampling noise: rare raw scores within a band may have only a handful of observations, producing percentile estimates that are noisy and sometimes non-monotonic (a higher raw score yielding a lower percentile, simply because of which other participants happen to be in the band). Extrapolation impossibility: if the standardization sample under-covers some age range, the conventional method has no principled way to extrapolate norms into that range; the test simply cannot be normed there.

The combined consequence is that conventional norming requires very large standardization samples (clinical tests commonly field totals of 1,000 to 2,000 cases) to suppress these pathologies to acceptable levels. Test publishers absorb this as the cost of doing business, but the cost is substantial and limits how often tests can be re-normed.

The continuous-norming alternative

Continuous norming approaches replace the within-band percentile-rank computation with a regression model that treats the norm distribution as a smooth function of the explanatory variable (typically age, but the framework generalizes to any continuous covariate). The published norms are then defined by the fitted model rather than by discrete tables, which automatically eliminates boundary discontinuities and within-band sampling noise.

The longest-established continuous-norming framework is the LMS method developed by Cole and Green (1992) for pediatric growth charts. LMS estimates three age-varying parameters of a Box-Cox transformation of the raw score: the Box-Cox power (L), which indexes skewness; the median (M); and the coefficient of variation (S). Each parameter is fitted as a smoothed function of age using cubic splines under penalized likelihood, and percentile curves are derived analytically from the fitted (L, M, S) functions. The LMS approach is dominant in growth-chart applications (height, weight, BMI by age) and has been extended to psychometric applications, but it makes a parametric assumption about the form of the raw-score distribution that is sometimes inappropriate for cognitive-ability data.
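
Concretely, the fitted (L, M, S) functions define every percentile in closed form. For the standard normal quantile z_alpha, the centile curve at age t is (Cole & Green, 1992):

```latex
C_{100\alpha}(t) =
\begin{cases}
M(t)\,\bigl[1 + L(t)\,S(t)\,z_{\alpha}\bigr]^{1/L(t)}, & L(t) \neq 0,\\[4pt]
M(t)\,\exp\bigl(S(t)\,z_{\alpha}\bigr),                & L(t) = 0.
\end{cases}
```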

The GAMLSS framework of Rigby and Stasinopoulos generalizes LMS by allowing arbitrary distributional families (not just Box-Cox), arbitrary link functions, and additive smoothing terms in any of the distributional parameters. Its comprehensive R implementation (Stasinopoulos & Rigby, 2007) makes it the most flexible parametric continuous-norming approach available.
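
As a sketch of what this looks like in practice, the following fits an LMS-type model with the gamlss package on simulated data like that above. Argument names follow the package documentation as I know it; verify against your installed version. BCCG is the Box-Cox Cole-Green family (i.e., the LMS distribution), and pb() fits penalized B-splines:

```r
# LMS-type continuous norming via GAMLSS (illustrative data).
library(gamlss)

set.seed(1)
age <- runif(3000, 6, 12)
raw <- rbinom(3000, 40, plogis((age - 9) / 2)) + 1   # +1: BCCG requires y > 0
d   <- data.frame(age, raw)

fit <- gamlss(raw ~ pb(age),               # mu:    the median M(age)
              sigma.formula = ~ pb(age),   # sigma: coefficient of variation S(age)
              nu.formula    = ~ pb(age),   # nu:    Box-Cox power L(age)
              family = BCCG, data = d)

centiles(fit, xvar = d$age)                # fitted percentile curves by age
```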

Semiparametric continuous norming (SPCN), the Lenhard and Lenhard approach, takes a different route. Rather than parametrizing the raw-score distribution and then modeling parameters as functions of age, SPCN models the joint relationship of raw score, percentile, and the explanatory variable as a Taylor polynomial. The fitted polynomial is a three-dimensional surface from which any norm score can be read off for any combination of raw score and age. The method makes minimal assumptions about the raw-score distribution—the distribution is allowed to take whatever shape the data imply—at the cost of slightly higher computational complexity than parametric alternatives. The implementation is available in the open-source cNORM R package, which Lenhard and colleagues maintain on CRAN.
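
The core mechanic can be sketched in a few lines of base R. This is an illustration of the idea only, not the cNORM implementation, which adds power-parameter selection, model comparison, and consistency checking:

```r
# SPCN-style norming sketched in base R (illustrative, not the cNORM code).
set.seed(2)
n   <- 3000
age <- runif(n, 6, 12)
raw <- rbinom(n, 40, plogis((age - 9) / 2))

# 1. Percentile-rank within narrow preliminary groups, map to the z-scale.
grp <- cut(age, seq(6, 12, 0.5), include.lowest = TRUE)
z   <- qnorm(ave(raw, grp, FUN = function(x) (rank(x) - 0.5) / length(x)))

# 2. Fit raw = f(z, age) as a degree-3 polynomial with interactions --
#    a truncated Taylor expansion of the norming surface.
fit <- lm(raw ~ polym(z, age, degree = 3, raw = TRUE))

# 3. Invert numerically: find the z whose predicted raw score matches an
#    observed raw score at a given age, then rescale to a T-score.
norm_for <- function(raw_obs, age_obs) {
  f <- function(zz) predict(fit, data.frame(z = zz, age = age_obs)) - raw_obs
  uniroot(f, c(-3, 3), extendInt = "yes")$root
}
50 + 10 * norm_for(25, 8.5)   # T-score for raw = 25 at age 8.5
```

Because the fitted surface is smooth in age, the inversion returns a norm score at any age in range, with no band boundaries to cross.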

The 2021 simulation

Lenhard and Lenhard’s (2021) simulation generated a synthetic standardization population of approximately 840,000 cases using an item-response-theory model, with parameters chosen to span realistic test conditions. Test scales varied in number of items (short to long), item difficulty distribution, and discrimination. From this large population, the authors drew repeated standardization samples of varying size, applied both SPCN and conventional banded norming, and assessed how accurately each method recovered the population norm distribution.

The accuracy criteria were the standard psychometric ones: bias of the recovered norm score relative to the population value, root-mean-square error across the raw-score range, and consistency of the recovered norms across age values. The simulation also tested both methods on missing-data and incomplete-coverage scenarios that conventional norming handles poorly.
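
Written out, with T-hat(x, a) the recovered norm score for raw score x at age a and T(x, a) the population value, the first two criteria are the standard definitions (the paper's exact aggregation over the raw-score and age grid may differ):

```latex
\mathrm{Bias} = \mathbb{E}\!\left[\hat{T}(x,a) - T(x,a)\right],
\qquad
\mathrm{RMSE} = \sqrt{\mathbb{E}\!\left[\bigl(\hat{T}(x,a) - T(x,a)\bigr)^{2}\right]}
```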

The findings, summarized:

  • SPCN reaches optimal accuracy with substantially smaller samples than conventional norming. For comparable RMSE in the recovered norms, SPCN required roughly half the sample size that conventional banding required across most simulated test conditions.
  • SPCN handles age-boundary regions and missing data without breakdown. Where conventional norming exhibits norm-score discontinuities at band boundaries and produces unreliable estimates at sparsely sampled ages, SPCN’s smooth regression interpolates and extrapolates with bounded error.
  • SPCN’s relative advantage is largest where conventional norming fails hardest. Small standardization samples, narrow age bands, missing age coverage, and tests with floor or ceiling effects—the regimes where banding produces the noisiest norms—are precisely the regimes where SPCN’s smoothing assumption pays off most.

The authors’ companion 2019 paper in PLOS ONE (Lenhard, Lenhard, & Gary) made a similar comparison between SPCN and parametric continuous norming (LMS-style approaches), finding that the two perform comparably in well-behaved cases but that SPCN is more robust when raw-score distributions deviate from the parametric assumptions of LMS or GAMLSS.

Why this changes test-development economics

The practical implication of the Lenhard and Lenhard results is that test publishers can achieve equivalent norm precision with roughly half the standardization sample if they adopt SPCN instead of conventional banding. For a clinical test that traditionally requires 2,000 cases for adequate norms, this is a difference of 1,000 cases’ worth of recruitment, administration, scoring, and quality-control effort. For tests targeting populations that are difficult or expensive to recruit—clinical samples, geriatric samples, low-incidence diagnostic populations—the sample-size reduction may be the difference between a publishable test and an under-normed one.

A secondary economic implication is that re-norming becomes more tractable. Tests are typically re-normed every 10 to 15 years to address Flynn effects and changing population characteristics. The standardization-sample cost of re-norming has been a barrier to more frequent updates; halving that cost makes more frequent re-norming feasible, which in turn keeps published norms current.

Constraints and open questions

SPCN is not a free lunch. The method’s polynomial smoothing assumption can introduce error if the true norm distribution has features (sharp inflections, distinctly multimodal regions) that polynomial smoothing cannot capture. The cNORM implementation includes diagnostic procedures for detecting such features, but the analyst still bears responsibility for inspecting fitted norm curves and identifying regions where the smoothing is over-aggressive.
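
In cNORM that inspection is a few calls. The sketch below follows the function names in the package vignette; signatures may differ across package versions, and elfe is a demo dataset shipped with the package:

```r
# Fit and inspect an SPCN model with cNORM (names per the package
# vignette; verify against your installed version).
library(cNORM)

model <- cnorm(raw = elfe$raw, group = elfe$group)  # rank, fit, select model

plotPercentiles(model)    # visual check: do fitted percentile curves cross?
checkConsistency(model)   # flags non-monotonic (intersecting) regions
normTable(3, model)       # norm table at explanatory-variable value 3
```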

The method also depends on the explanatory variable being continuous and reasonably well-distributed across the standardization sample. For tests in which the only meaningful stratifier is a discrete categorical variable (sex, language version), SPCN reduces to within-category banding and loses its advantage. For tests where the explanatory variable is continuous but heavily skewed (most of the sample at narrow ages, sparse coverage at others), SPCN performs better than banding but not optimally; the smooth interpolation degrades where data are sparse, just less catastrophically than band-based estimation does.

Finally, the SPCN method has not yet been formally compared to GAMLSS-based parametric continuous norming under a unified simulation framework. The 2019 PLOS ONE paper made a partial comparison; a comprehensive head-to-head benchmark across realistic cognitive-test conditions would strengthen the case for either approach as the field-default continuous-norming method.

The bigger methodological shift

The deeper argument running through Lenhard and Lenhard’s program is that norm-score derivation is a regression problem—the question is how raw scores map to population percentiles as a function of relevant covariates—and that treating it as anything else throws away information. Conventional banding implicitly imposes a step-function structure on a relationship that is in fact smooth, and the cost of that mis-specification is paid in standardization-sample size and in the boundary artifacts that clinicians have learned to tolerate as background noise.

If the field accepts the regression framing, the question is no longer whether to use continuous norming but which form of continuous norming to use. The Lenhard work positions SPCN as the distribution-free option in a toolkit that also includes LMS, GAMLSS, and other parametric formulations. The cNORM R package makes the choice operational rather than aspirational; subsequent test-development projects can pick a continuous-norming approach off the shelf rather than building one from scratch. The 2021 EPM paper is the empirical case for that toolkit choice being the right default for psychometric applications.

Frequently asked questions

What is continuous norming?

Continuous norming replaces conventional age-band norm tables with a regression model that treats the norm distribution as a smooth function of an explanatory variable, typically age. Published norms are defined by the fitted model rather than by discrete tables, which automatically eliminates boundary discontinuities and within-band sampling noise.

What are the failure modes of conventional banded norming?

Three pathologies emerge. Boundary discontinuities: a one-day age difference can move a child from one band to another and produce a substantially different normed score. Within-band sampling noise: rare raw scores within a band may have only a handful of observations, producing percentile estimates that are noisy and sometimes non-monotonic. Extrapolation impossibility: where the standardization sample under-covers some age range, the conventional method has no principled way to extrapolate norms.

What is semiparametric continuous norming (SPCN)?

SPCN, developed by Lenhard and Lenhard, models the joint relationship of raw score, percentile, and the explanatory variable as a Taylor polynomial. The fitted polynomial is a three-dimensional surface from which any norm score can be read off for any combination of raw score and age. The method makes minimal assumptions about the raw-score distribution and is implemented in the open-source cNORM R package.

How much does SPCN reduce sample-size requirements?

For comparable RMSE in the recovered norms, SPCN required roughly half the sample size of conventional banding across most simulated test conditions. For a clinical test that traditionally requires 2,000 cases for adequate norms, the sample-size reduction translates into 1,000 fewer cases of recruitment, administration, scoring, and quality control.

How does SPCN compare to LMS and GAMLSS?

LMS (Cole & Green, 1992) and GAMLSS (Stasinopoulos & Rigby, 2007) are parametric continuous-norming approaches that model raw-score distributional parameters as smooth functions of age. SPCN takes a semiparametric route, modeling the joint raw-score-by-percentile-by-age surface directly. The 2019 Lenhard, Lenhard, and Gary comparison found SPCN more robust when raw-score distributions deviate from the parametric assumptions of LMS or GAMLSS.

What does SPCN not solve?

SPCN’s polynomial smoothing assumption can introduce error if the true norm distribution has sharp inflections or distinctly multimodal regions. It also depends on the explanatory variable being continuous and reasonably well-distributed; for tests stratified only by a discrete category, SPCN reduces to within-category banding and loses its advantage. The cNORM implementation includes diagnostics for detecting these regions.

References

  • Cole, T. J., & Green, P. J. (1992). Smoothing reference centile curves: The LMS method and penalized likelihood. Statistics in Medicine, 11(10), 1305-1319. https://doi.org/10.1002/sim.4780111005
  • Lenhard, A., Lenhard, W., & Gary, S. (2019). Continuous norming of psychometric tests: A simulation study of parametric and semi-parametric approaches. PLOS ONE, 14(9), e0222279. https://doi.org/10.1371/journal.pone.0222279
  • Lenhard, W., & Lenhard, A. (2021). Improvement of norm score quality via regression-based continuous norming. Educational and Psychological Measurement, 81(2), 229-261. https://doi.org/10.1177/0013164420928457
  • Stasinopoulos, D. M., & Rigby, R. A. (2007). Generalized additive models for location, scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1-46. https://doi.org/10.18637/jss.v023.i07



Cite This Article

Jouve, X. (2021, April 14). Continuous Norming for Cognitive Tests. PsychoLogic. https://www.psychologic.online/continuous-norming-methods/
