You’ve received an IQ test report — for yourself, your child, or a client — and what should be a clean answer is a thicket of numbers, percentiles, confidence intervals, index scores, scaled scores, and qualitative descriptors. This guide walks through what each piece actually means and how a psychometrician reads them. The short version: a single Full-Scale IQ number is rarely the most useful piece of information in the report, score discrepancies need both statistical and base-rate scrutiny before they mean anything clinically, and almost every “IQ point” carries a margin of error larger than most readers assume.
What does Full-Scale IQ actually represent?
Full-Scale IQ (FSIQ) is the headline score on the major Wechsler instruments — the Wechsler Adult Intelligence Scale in its fourth or fifth edition, the WISC-V for children, and most other clinical batteries. It is set to a population mean of 100 and a standard deviation of 15, which produces the familiar bell curve described in our overview of the normal distribution of IQ scores.
| FSIQ Range | Wechsler Classification | Percentile Range | Approximate Frequency |
|---|---|---|---|
| 130+ | Extremely High | 98th and above | ~2.2% |
| 120–129 | Very High | 91st–97th | ~6.7% |
| 110–119 | High Average | 75th–90th | ~16.1% |
| 90–109 | Average | 25th–74th | ~50% |
| 80–89 | Low Average | 9th–24th | ~16.1% |
| 70–79 | Very Low | 2nd–8th | ~6.7% |
| Below 70 | Extremely Low | Below 2nd | ~2.2% |
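The percentile column can be reproduced from the normal model alone. A minimal Python sketch, using only the standard library; real instruments read percentiles from empirical norm tables, so treat this as the idealized curve rather than any test's actual conversion:

```python
from math import erf, sqrt

def iq_percentile(score: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Percentile rank of an IQ score under the normal (100, 15) model."""
    z = (score - mean) / sd
    return 100.0 * 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(round(iq_percentile(130), 1))  # 97.7 -> the "Extremely High" threshold
print(round(iq_percentile(100), 1))  # 50.0 -> the population median
```

The same function reproduces the frequency column: about 2.3% of the population sits at or above 130, matching the table's ~2.2%.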
Two things commonly misread: the labels are descriptive bands, not biological categories, and the cut points have changed across test editions and authors. The shift from “Very Superior” to “Extremely High” between WAIS-IV and WAIS-V is cosmetic, not substantive — the underlying score remains the same. Our analysis of WAIS-IV vs. WAIS-V changes documents the practical implications for score interpretation.
FSIQ summarizes overall performance, but it averages across abilities that may be substantially different. As Deary’s (2012) Annual Review of Psychology synthesis emphasizes, the general factor (g) is real and large, but it does not exhaust what cognitive batteries measure. The remaining variance is structured into broad ability domains that the indices below capture — and that is where most of the diagnostic information lives.
Why are index scores more informative than FSIQ?
Modern Wechsler batteries decompose FSIQ into five primary indices, mapping onto the broad abilities of Cattell-Horn-Carroll (CHC) theory described by McGrew (2009):
- Verbal Comprehension Index (VCI): vocabulary, general knowledge, verbal abstraction — broadly, crystallized intelligence (Gc)
- Fluid Reasoning Index (FRI): novel problem-solving, inductive and quantitative reasoning — fluid intelligence (Gf)
- Visual Spatial Index (VSI): visualization, spatial relations, mental rotation (Gv)
- Working Memory Index (WMI): holding and manipulating information in mind (Gwm)
- Processing Speed Index (PSI): rapid visual scanning and decision-making under time pressure (Gs)
Two people with identical FSIQs of 105 can have radically different index profiles — one might score VCI 125 / PSI 85 (large strength in verbal abstraction, marked processing-speed weakness), the other 105 across the board. The single FSIQ number masks every clinically and educationally relevant pattern. The distinction between fluid and crystallized intelligence is particularly important to read separately, because they develop and decline on different schedules across the lifespan (Salthouse, 2010).
Canivez and Watkins (2010), in their factor analysis of WAIS-IV standardization data, showed that the general factor accounts for the majority of common variance among Wechsler subtests, with the four broad indices contributing smaller unique portions. Index scores remain the natural level for framing diagnostic hypotheses; FSIQ is the right level for actuarial prediction.
What does a confidence interval mean on an IQ score?
Every IQ score is an estimate. If the same person took the same test under identical conditions, scores would not be identical — they would scatter around a “true” ability level due to measurement error. The confidence interval (CI) on a score report quantifies that scatter.
The Standard Error of Measurement (SEM) for the WAIS-V FSIQ is approximately 2.6 points, which generates a 95% CI of roughly ±5 points. So an FSIQ of 128 with a 95% CI of 123–133 means: we can be 95% confident the person’s true ability lies in that band. The single number 128 looks decisive, but it might not actually clear the gifted threshold of 130. This matters whenever scores are used to make threshold decisions about giftedness, intellectual disability, or eligibility cutoffs.
Three SEM facts that frequently get lost:
- Index scores are less reliable than FSIQ. Indices are based on fewer subtests, so their SEMs are larger — typically 3.5–5 points. A VCI of 115 with a 95% CI of 108–122 carries genuine ambiguity.
- SEMs grow at the score extremes. Wechsler standardization data show measurement error is larger for scores well above or below the mean, because there are fewer items at the relevant difficulty range.
- Specialized tests report their own SEMs. The JCTI technical manual, for instance, documents an empirical reliability of approximately ρ = 0.87 for its computer-adaptive form (N = 1,003) and α = 0.95 for the fixed-length form (N = 1,020), with norms based on N = 8,297. Comparing the SEMs of the test you took to the score difference you’re trying to interpret is the only way to know whether the difference is signal or noise.
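The link between reliability and the confidence interval is mechanical, and worth seeing once. A short sketch; the 0.97 reliability is an illustrative FSIQ-level figure, and the authoritative SEM is always the one published in the test's technical manual:

```python
from math import sqrt

def score_ci(score: float, reliability: float, sd: float = 15.0,
             z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval from a reliability coefficient.

    SEM = SD * sqrt(1 - reliability); the interval is score +/- z * SEM.
    """
    sem = sd * sqrt(1.0 - reliability)
    return (score - z * sem, score + z * sem)

# Illustrative FSIQ-level reliability of .97 -> SEM ~2.6 -> CI ~123-133
lo, hi = score_ci(128, reliability=0.97)
print(round(lo), round(hi))  # 123 133
```

Dropping the reliability to an index-level 0.92 widens the same interval to roughly ±8 points, which is why index scores carry more ambiguity than FSIQ.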
What counts as a meaningful score discrepancy?
When two index scores differ, the difference may or may not be clinically meaningful. Two separate questions need to be asked:
Is the difference statistically reliable? A 10–15 point difference between Wechsler indices is typically large enough to exceed measurement error at p < 0.05. Smaller differences are within the noise band and should not be interpreted as clinical patterns. The test manual provides the exact critical values for each pair of indices.
Is the difference unusual in the population? Statistical reliability is necessary but not sufficient. A 15-point VCI–PSI gap occurs in roughly 25% of normal adults — it is statistically reliable but not clinically rare. A 25-point gap occurs in only about 5–10% of the population and warrants closer attention.
Both the statistical-reliability question and the base-rate question are answered by tables in the test manual; a competent score report cites both. Watkins (2003), in Scientific Review of Mental Health Practice, argued that clinicians routinely confuse the two and over-interpret normal cognitive variation as pathological. His meta-analytic review concluded that subtest-scatter analysis adds only 2–8% incremental variance to predictions of achievement and learning behavior beyond general ability — far less than its prominence in clinical practice would suggest. Profiles with some scatter are the rule, not the exception; perfectly flat profiles are themselves rare.
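Both questions can be approximated from first principles when the manual is not at hand. The sketch below uses the standard psychometric formulas; the reliabilities and the inter-index correlation (r_ab = 0.55) are illustrative assumptions, not values taken from any Wechsler manual, so the printed figures are ballpark only:

```python
from math import erf, sqrt

SD = 15.0  # index-score standard deviation

def critical_difference(rel_a: float, rel_b: float, z: float = 1.96) -> float:
    """Smallest index difference that exceeds joint measurement error."""
    return z * SD * sqrt(2.0 - rel_a - rel_b)

def base_rate(diff: float, r_ab: float) -> float:
    """Percent of the population with an absolute gap of at least `diff`,
    assuming the two indices are bivariate normal with correlation r_ab."""
    sd_diff = SD * sqrt(2.0 * (1.0 - r_ab))
    phi = 0.5 * (1.0 + erf((diff / sd_diff) / sqrt(2.0)))
    return 100.0 * 2.0 * (1.0 - phi)

# Reliabilities of .95/.93 -> a ~10-point critical value at p < .05
print(round(critical_difference(0.95, 0.93), 1))
# With an assumed inter-index correlation of .55, a 25-point gap is rare
print(round(base_rate(25, r_ab=0.55), 1))  # roughly 8% of adults
```

Note that the two functions answer different questions: the first depends on the reliabilities of the two scores, the second on how strongly they correlate in the population.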
How should subtest scores be interpreted?
Below the index level, individual subtests provide the finest grain of analysis. Wechsler subtests are scaled to a mean of 10 and SD of 3, so a Vocabulary score of 13 is one SD above the mean. Common subtests and what they tap:
- Vocabulary: word knowledge — strongly affected by reading and education
- Similarities: verbal abstraction — identifying conceptual relations
- Block Design: visual-spatial construction
- Matrix Reasoning: nonverbal fluid reasoning by visual induction
- Digit Span / Letter–Number Sequencing: auditory working memory
- Coding / Symbol Search: processing speed under time pressure
The trap is that subtest reliabilities are notably lower than composite reliabilities — typically α = 0.80–0.90 versus 0.95+ for FSIQ — which means individual subtest scores carry substantially more measurement error. A subtest score that deviates 2–3 points from the rest of the profile may be genuine, may reflect a momentary lapse (fatigue, distraction, an unfamiliar word), or may simply be measurement noise. Aggregating multiple indicators is what makes composite scores stable; that is also why short-form IQ estimates such as those studied for the WISC-V sacrifice precision when they drop subtests.
The honest position: a single low or high subtest is a hypothesis to investigate, not a finding to act on. Patterns require convergent evidence — the same weakness showing up across multiple subtests, behavioral observations, academic records, or other instruments.
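The stabilizing effect of aggregation is the Spearman-Brown relationship. A sketch under the idealized assumption of parallel parts; real subtests are not strictly parallel, so actual composite reliabilities are computed differently, but the direction of the effect is the same:

```python
def spearman_brown(unit_reliability: float, k: int) -> float:
    """Reliability of a composite built from k parallel parts."""
    return k * unit_reliability / (1.0 + (k - 1) * unit_reliability)

# One subtest at .85, a 2-subtest index, a 7-subtest FSIQ-style composite
print(round(spearman_brown(0.85, 1), 3))  # 0.85
print(round(spearman_brown(0.85, 2), 3))  # 0.919
print(round(spearman_brown(0.85, 7), 3))  # 0.975
```

Seven subtests at a modest 0.85 each already yield a composite in the 0.97 range, which is why FSIQ is far more trustworthy than any single subtest.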
Normative versus ipsative comparison
Two reference frames matter, and confusing them produces most of the interpretive errors lay readers make:
Normative comparison asks: where does this score sit relative to the general population? An FSIQ of 115 is one SD above the population mean — a normative strength regardless of who took the test.
Ipsative comparison asks: where does this score sit relative to the person’s own profile? A person with an FSIQ of 130 who scored 105 on Processing Speed has an ipsative weakness — their PSI is well below their own overall level — even though 105 is normatively average.
Both perspectives are useful for different questions. Normative comparison answers questions about absolute capacity (“Can this child handle grade-level work?”). Ipsative comparison answers questions about relative profile (“Why is this gifted student frustrated by timed tests despite excelling on untimed material?”). Reports that conflate the two — calling an ipsative weakness an absolute weakness, or dismissing an ipsative strength because it is not normatively exceptional — produce misleading clinical pictures.
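The two frames are easy to compute side by side. A small sketch with a hypothetical gifted profile (the index values are invented for illustration):

```python
def profile_analysis(indices: dict[str, float], pop_mean: float = 100.0) -> dict:
    """Score each index against the population (normative) and against the
    person's own index mean (ipsative)."""
    own_mean = sum(indices.values()) / len(indices)
    return {
        name: {"normative": score - pop_mean,
               "ipsative": round(score - own_mean, 1)}
        for name, score in indices.items()
    }

# Hypothetical gifted profile: the person's own index mean is 119
profile = {"VCI": 130, "FRI": 125, "VSI": 120, "WMI": 115, "PSI": 105}
# PSI is normatively above average (+5) yet a clear ipsative weakness (-14)
print(profile_analysis(profile)["PSI"])  # {'normative': 5.0, 'ipsative': -14.0}
```

The same score earns opposite labels in the two frames, which is exactly the conflation the surrounding text warns against.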
How stable are IQ scores?
Stability depends on age and time interval. Adult test-retest correlations across a few years are typically r = 0.85–0.95; across decades they remain in the 0.60–0.75 range, making IQ one of the most stable psychological measurements known. Childhood scores are less stable — a young child tested at age 5 and again at 10 may show 5–15 point shifts that reflect genuine developmental change as much as measurement error.
Generational change matters too: the Flynn effect shifted scores upward by roughly three points per decade across most of the 20th century, meaning older norms inflate scores. A test normed in 1995 will score someone roughly 9 points higher than the same person tested on a 2024-normed instrument. Threshold decisions made against decades-old norms should be treated with skepticism.
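The drift correction is simple arithmetic. A sketch using the canonical three-points-per-decade average; actual Flynn-effect rates vary by ability domain, country, and era, so published renorming studies should trump this rule of thumb:

```python
def flynn_adjusted(score: float, norm_year: int, test_year: int,
                   points_per_decade: float = 3.0) -> float:
    """Rough Flynn-effect correction: older norms inflate obtained scores."""
    drift = points_per_decade * (test_year - norm_year) / 10.0
    return score - drift

# 112 obtained in 2024 against 1995 norms: ~8.7 points of norm inflation
print(round(flynn_adjusted(112, norm_year=1995, test_year=2024), 1))  # 103.3
```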
Common misinterpretations to avoid
The errors that recur across school records, parent reports, and forensic evaluations:
- Treating IQ as fixed and innate. Adult IQ is highly stable but not unchangeable — education adds roughly one to five points per year of additional schooling, and major environmental changes shift scores measurably. Population means have moved by more than a standard deviation across the 20th century.
- Over-interpreting tiny differences. Index differences of three to seven points are usually within measurement error. Parents agonize over a child scoring 108 on one index and 103 on another — those scores are statistically indistinguishable.
- Ignoring the testing context. Anxiety, fatigue, illness, medication, and noisy testing environments all shift scores, especially on timed indices (PSI, WMI). A score obtained while a child was fighting the flu is not the same datum as one obtained under optimal conditions.
- Equating IQ with worth, potential, or destiny. IQ tests measure a specific cluster of cognitive abilities that predict academic and job-training outcomes robustly (Schmidt & Hunter, 1998; Strenze, 2007). They do not measure creativity, wisdom, social skill, motivation, or character — and the bulk of important life outcomes depend on combinations of these.
- Comparing scores across different tests as if they were the same. A WISC-V FSIQ of 112, a Stanford-Binet 5 FSIQ of 112, and a 112 from a web-based “IQ test” use different norms, different subtests, and different populations. Our analysis of online IQ tests versus professional assessments documents how much the choice of instrument shifts the resulting number.
When does retesting make sense?
- Testing conditions were clearly suboptimal (acute illness, severe anxiety, environmental disruption)
- The score is grossly inconsistent with observed academic, occupational, or behavioral functioning
- A significant intervening event has occurred (head injury, treatment of a mood or attention disorder, major educational intervention)
- The previous test is more than 2–3 years old for a child whose abilities are still developing, or normed on a substantially older population (Flynn-effect drift)
- A high-stakes decision (gifted placement, intellectual-disability classification, forensic evaluation) requires the highest-confidence estimate available
Best practice waits at least 12 months between administrations of the same test to minimize practice effects. When earlier retesting is needed, switching to a parallel instrument (for example, WAIS-V to Stanford-Binet 5) reduces the practice contamination.
Reading a report in order: a step-by-step approach
- Note the FSIQ and its 95% confidence interval first. The interval tells you how seriously to take the point estimate.
- Examine the index scores. Look for differences of 15+ points between the highest and lowest index. Large scatter means FSIQ may not adequately summarize the profile.
- Apply both the statistical-reliability and base-rate tests to any discrepancy you flag — the manual provides both.
- Drop to the subtest level only when index-level patterns warrant it. Do not start with subtest scores; their reliability does not support stand-alone interpretation.
- Read the qualitative observations. A good examiner notes engagement, persistence, anxiety, language fluency, and other context that quantitative scores do not capture.
- Integrate with non-test data. Academic records, behavioral observations, medical history, and concurrent assessments provide the validation a single test cannot.
A well-written psychological report has done all of this synthesis already. Understanding the principles lets you ask sharper questions of an opaque report and notice when oversimplification is doing the heavy lifting.
Frequently Asked Questions
Is FSIQ the most important number on the report?
Only if the index profile is relatively flat. When index scores diverge by 15+ points, FSIQ averages over real heterogeneity and the index scores carry the diagnostic information. For threshold decisions (giftedness, intellectual disability), the relevant index — typically FSIQ for global classification, but sometimes a specific index — is what should be checked against the cutoff alongside its confidence interval.
What does a 95% confidence interval on IQ mean?
The interval is built so that, if the test were administered repeatedly under identical conditions, about 95% of the intervals constructed this way would contain the person's true score. A point estimate of 128 with a CI of 123–133 means the true score is most likely in that band — and might or might not exceed a 130 cutoff.
How much can my IQ change?
Adult test-retest correlations of 0.85–0.95 mean adult IQ is highly stable, but it is not immutable. Education adds roughly 1–5 points per year of schooling; major environmental changes (substantial nutrition improvement, treatment of a medical condition affecting cognition) can shift scores meaningfully. Childhood scores are appreciably less stable than adult scores. Population means have moved more than a standard deviation across the 20th century via the Flynn effect.
Why do I score differently on different IQ tests?
Different tests use different norms, different item pools, different administration formats, and different populations. A 5–10 point discrepancy across instruments is normal even when both tests are well-validated. Larger gaps usually point to differences in what the tests emphasize (verbal-heavy vs. nonverbal-heavy, timed vs. untimed) or to substandard norming on one of the instruments.
What is the difference between IQ classifications across editions?
The bands themselves shift cosmetically. WAIS-IV used “Very Superior” for 130+; WAIS-V uses “Extremely High.” The score boundaries are unchanged. Comparing classifications across editions can mislead a non-specialist; comparing the underlying scores is the right move.
Are subtest scores worth interpreting?
With caution. Subtest reliabilities are lower than composite reliabilities, so individual subtest scores carry more error. Watkins (2003) and the broader scatter-analysis literature show that subtest-profile interpretation adds only modest incremental information beyond general ability. A single low subtest is a hypothesis to investigate, not a finding to act on. Patterns require convergent evidence.
How do percentile ranks relate to IQ scores?
Percentile ranks are non-linear: a 10-point IQ difference near the mean (100 → 110) corresponds to a much larger percentile change than a 10-point difference at the extremes (130 → 140). This is why IQ ranges and the percentiles documented in our overview of high IQ ranges are read off the bell curve, not from arithmetic.
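The non-linearity is easy to verify from the idealized normal model (a sketch; real norm tables will differ slightly):

```python
from math import erf, sqrt

def pct(iq: float) -> float:
    """Percentile rank under the normal (100, 15) model."""
    return 100.0 * 0.5 * (1.0 + erf((iq - 100.0) / 15.0 / sqrt(2.0)))

print(round(pct(110) - pct(100), 1))  # ~24.8 percentile points near the mean
print(round(pct(140) - pct(130), 1))  # ~1.9 percentile points in the tail
```

The same ten IQ points span roughly a quarter of the population near the mean but under two percentile points in the upper tail.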
References
- Canivez, G. L., & Watkins, M. W. (2010). Investigation of the factor structure of the Wechsler Adult Intelligence Scale—Fourth Edition (WAIS–IV): Exploratory and higher order factor analyses. Psychological Assessment, 22(4), 827–836. https://doi.org/10.1037/a0020429
- Deary, I. J. (2012). Intelligence. Annual Review of Psychology, 63, 453–482. https://doi.org/10.1146/annurev-psych-120710-100353
- Jouve, X. (2025). JCTI Technical Manual (Version 2025.3). Cogn-IQ. https://www.cogn-iq.org/methods/jcti-manual/
- McGrew, K. S. (2009). CHC theory and the human cognitive abilities project: Standing on the shoulders of the giants of psychometric intelligence research. Intelligence, 37(1), 1–10. https://doi.org/10.1016/j.intell.2008.08.004
- Salthouse, T. A. (2010). Selective review of cognitive aging. Journal of the International Neuropsychological Society, 16(5), 754–760. https://doi.org/10.1017/S1355617710000706
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262–274. https://doi.org/10.1037/0033-2909.124.2.262
- Strenze, T. (2007). Intelligence and socioeconomic success: A meta-analytic review of longitudinal research. Intelligence, 35(5), 401–426. https://doi.org/10.1016/j.intell.2006.09.004
- Watkins, M. W. (2003). IQ subtest analysis: Clinical acumen or clinical illusion? Scientific Review of Mental Health Practice, 2(2), 118–141.
Jouve, X. (2026, March 15). How to Interpret IQ Test Results. PsychoLogic. https://www.psychologic.online/interpret-iq-results/

