Cronbach’s coefficient alpha is the most widely reported reliability statistic in psychology and educational measurement. It is also one of the most misunderstood. The classical formula assumes that test items measure a single construct with equal factor loadings (tau-equivalence), uncorrelated errors, and continuously distributed scores. Real psychological measurement rarely meets all three assumptions: most scales use Likert responses (discrete), have items with unequal contributions to the construct (congeneric), and produce score distributions that depart from normality. The natural question is how badly alpha breaks under these violations and which alternatives perform better. A 2023 simulation study by Xiao and Hau in Educational and Psychological Measurement provides a systematic answer, with implications for the routine reliability reporting that fills psychometric methods sections.
What coefficient alpha is and what it actually measures
Coefficient alpha is, formally, an estimate of the reliability of a sum score under specific assumptions. When tau-equivalence holds (all items measure the same construct with equal loadings) and errors are uncorrelated, the population value of alpha equals the reliability. When tau-equivalence is violated — which is essentially always, in practice — alpha is a lower bound on reliability, and the true reliability is typically somewhat higher than what alpha reports.
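Concretely, alpha for k items is k/(k−1) times one minus the ratio of the summed item variances to the variance of the sum score. A minimal sketch in Python (numpy only; the four-item, tau-equivalent scale below is an illustrative simulation, not data from any study discussed here):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_persons, k_items) score matrix."""
    k = items.shape[1]
    cov = np.cov(items, rowvar=False)   # k x k item covariance matrix
    item_vars = np.trace(cov)           # sum of item variances
    total_var = cov.sum()               # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Illustrative tau-equivalent scale: 4 items, equal unit loadings,
# unit error variance, so the population reliability is 16/20 = .80
rng = np.random.default_rng(0)
factor = rng.normal(size=(500, 1))
items = factor + rng.normal(size=(500, 4))
alpha = cronbach_alpha(items)
```

Because the simulated items have equal loadings and uncorrelated errors, the estimate should land near the population reliability of .80, the one case where alpha is not merely a lower bound.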
This lower-bound property has been the source of decades of misinterpretation. A scale with alpha = .80 has reliability of at least .80 under the standard assumptions; it does not have reliability of exactly .80. Sijtsma’s 2009 paper in Psychometrika argued that this and related misconceptions render alpha “of very limited usefulness” — a sharp claim that prompted the companion response from Revelle and Zinbarg in the same issue, defending the role of alpha alongside more general omega-family coefficients.
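The lower-bound behavior can be checked analytically. Given standardized loadings (the values below are hypothetical), the population item covariance matrix is λλᵀ + diag(θ), and both population alpha and the true reliability follow directly from it:

```python
import numpy as np

# Hypothetical congeneric scale: unequal standardized loadings
lam = np.array([0.9, 0.7, 0.5, 0.3])
theta = 1 - lam**2                              # uniquenesses (unit item variances)
Sigma = np.outer(lam, lam) + np.diag(theta)     # population covariance matrix

k = len(lam)
alpha = (k / (k - 1)) * (1 - np.trace(Sigma) / Sigma.sum())
true_reliability = lam.sum()**2 / Sigma.sum()   # (sum of loadings)^2 / total variance
# alpha understates true_reliability because the loadings are unequal
```

With these loadings, population alpha comes out near .68 against a true reliability near .71: exactly the gap the congeneric omega coefficients are designed to close.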
The alternatives developed since the 1990s aim to remedy specific limitations:
- Coefficient omega. Computed from factor analysis loadings and uniquenesses rather than item covariances. Unbiased under congeneric models (different loadings per item) where alpha is biased.
- Omega hierarchical (ω_h). Specifically estimates the proportion of variance attributable to a single general factor, useful when the scale has multiple correlated subfactors.
- Omega total / Revelle’s omega total (ω_t / ω_RT). Captures variance from the general factor plus group factors, providing a more general reliability estimate.
- Greatest Lower Bound (GLB). The largest reliability lower bound obtainable from the observed item covariance matrix; by construction at least as large as alpha.
- Coefficient H. A construct reliability index that does not require tau-equivalence and is appropriate for unidimensional models with unequal loadings.
- Ordinal alpha. Computed from polychoric correlations rather than Pearson correlations, addressing the discrete-data assumption violation common with Likert items.
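For the unidimensional congeneric case, omega has a simple closed form from the factor solution: the squared sum of loadings divided by the squared sum of loadings plus the sum of uniquenesses. A sketch, using hypothetical standardized loadings rather than output from any fitted model:

```python
import numpy as np

def congeneric_omega(loadings, uniquenesses):
    """Omega for a unidimensional congeneric model:
    (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses)."""
    lam = np.asarray(loadings, dtype=float)
    common = lam.sum() ** 2
    return common / (common + np.sum(uniquenesses))

# Hypothetical standardized loadings for a 4-item scale
loadings = [0.8, 0.7, 0.6, 0.5]
uniquenesses = [1 - l**2 for l in loadings]   # standardized: theta_i = 1 - lambda_i^2
omega = congeneric_omega(loadings, uniquenesses)
```

In practice the loadings and uniquenesses would come from a fitted CFA (e.g., lavaan or psych in R); the formula itself is unchanged.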
The Xiao and Hau study compared alpha against all of these alternatives across systematically varied conditions of non-normality, scale strength, and discreteness.
What the simulation found
The simulation generated data under multiple distribution shapes (continuous normal, mildly non-normal, severely non-normal, exponential, binomial-beta) and multiple scale-strength conditions (strong, moderate, and weak factor loadings), then computed each reliability index and compared it to the true population reliability. The findings are nuanced.
For continuous data:
- With strong scales (high factor loadings), alpha and its alternatives all performed well, with acceptable bias even under substantial non-normality. The historical concern that alpha “requires” continuous and normally distributed data is overstated for high-loading scales.
- With moderate-strength scales, bias became noticeable and increased with the severity of non-normality.
- With weak scales, bias was substantial across most indices.
The pattern means that scale strength matters more than data distribution in determining how well reliability estimates behave. A weak scale on normally distributed continuous data may produce more biased reliability estimates than a strong scale on non-normal Likert data.
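The scale-strength point is easy to reproduce in miniature. The toy Monte Carlo below (normal data, tau-equivalent loadings, all values chosen for illustration) shows how much noisier alpha becomes when loadings drop from .8 to .3:

```python
import numpy as np

def sample_alpha(loading, n=100, k=4, rng=None):
    """Alpha from one simulated sample of a tau-equivalent scale."""
    f = rng.normal(size=(n, 1))
    errors = rng.normal(scale=np.sqrt(1 - loading**2), size=(n, k))
    x = loading * f + errors                 # standardized items
    cov = np.cov(x, rowvar=False)
    return (k / (k - 1)) * (1 - np.trace(cov) / cov.sum())

rng = np.random.default_rng(42)
strong = [sample_alpha(0.8, rng=rng) for _ in range(200)]
weak = [sample_alpha(0.3, rng=rng) for _ in range(200)]
sd_strong, sd_weak = np.std(strong), np.std(weak)
```

Even with well-behaved normal data, the weak-scale estimates scatter far more widely than the strong-scale ones, before any non-normality enters the picture.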
For Likert-type discrete data:
- With four or more response categories, most reliability indices performed acceptably under non-normal conditions. The exception was omega hierarchical, which showed problematic behavior in some Likert conditions.
- More response categories produced better estimates, especially under severe non-normality. Five- and seven-point scales outperformed four-point scales when the distribution was extreme.
This is one of the most practically useful findings in the paper: using at least four response categories on Likert-type items is more important than choosing among alpha and its alternatives, given otherwise reasonable scale construction.
For exponential and binomial-beta distributions:
- Exponentially distributed data: Omega RT (Revelle’s omega total) and GLB were robust; other indices showed more bias.
- Binomial-beta distributed data: Most indices showed substantial bias. The authors are explicit that no single index they tested handled this distribution shape well.
Real-world data check: The authors examined items from a large-scale international survey and found their items to be at most moderately non-normal. The implication is that the most extreme conditions in the simulation are less common in practice than methodological alarm might suggest, and the working researcher is usually operating in a regime where most reliability indices perform acceptably.
The omega controversy: which omega is right?
A practical complication in moving from alpha to omega is that “omega” refers to a family of related coefficients with different assumptions and interpretations. Flora’s 2020 tutorial in Advances in Methods and Practices in Psychological Science — titled “Your Coefficient Alpha Is Probably Wrong, but Which Coefficient Omega Is Right?” — confronts exactly this question. The key distinctions:
- ω for unidimensional scales with congeneric loadings, computed from a CFA, gives the proportion of sum-score variance attributable to the single common factor.
- ωt (omega total) for multidimensional scales, capturing variance from a general factor plus group factors.
- ωh (omega hierarchical) specifically isolates the general-factor contribution from group-factor contributions.
Flora’s recommendation is that researchers fit a confirmatory factor analysis appropriate to the scale’s hypothesized structure and then compute the omega coefficient that matches the construct interpretation they intend. This is more work than reporting alpha but produces a defensible reliability estimate that respects the scale’s actual structure.
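The distinctions among the omegas reduce to which loadings enter the numerator. With a hypothetical bifactor loading matrix (six items, one general factor, two group factors; all values invented for illustration):

```python
import numpy as np

# Invented standardized bifactor loadings: 6 items, one general factor,
# two group factors (items 1-3 and items 4-6)
general = np.array([0.7, 0.6, 0.6, 0.5, 0.6, 0.5])
group1  = np.array([0.4, 0.4, 0.3, 0.0, 0.0, 0.0])
group2  = np.array([0.0, 0.0, 0.0, 0.4, 0.3, 0.4])
uniq = 1 - general**2 - group1**2 - group2**2

common = general.sum()**2 + group1.sum()**2 + group2.sum()**2
total_var = common + uniq.sum()

omega_total = common / total_var           # general + group factor variance
omega_h = general.sum()**2 / total_var     # general factor variance only
```

Omega total credits both the general and the group factors, while omega hierarchical counts only the general factor, which is why it is always the smaller of the two when group factors carry variance.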
McNeish’s 2018 Psychological Methods paper, sharply titled “Thanks coefficient alpha, we’ll take it from here,” argues for routine replacement of alpha with omega in psychological research. The argument is that alpha’s assumptions are violated almost universally in practice; that alternatives are now well-developed and computationally accessible; and that retaining alpha as the default reliability statistic perpetuates a measurement practice the field has known to be inadequate for decades.
The counter-argument, most clearly articulated in Revelle and Zinbarg’s 2009 piece, is more measured: alpha remains a useful lower-bound estimate that requires no factor-analytic specification, and reporting alpha alongside omega-family coefficients is more informative than replacing one with the other.
Practical recommendations
Synthesizing across the Xiao-Hau simulation, the omega-family literature, and the general psychometric reliability literature:
- Build strong scales first. The Xiao-Hau result that scale strength matters more than data distribution recommends investing in item development (multiple high-loading items per construct) over choosing among reliability indices.
- Use at least four Likert response categories. Four or more points produced acceptable performance for most indices under most non-normality conditions; fewer points amplify the impact of distributional problems.
- For unidimensional scales, report omega alongside alpha. Both are easy to compute in modern statistical software (R packages psych, semTools, and lavaan; Stata and Mplus equivalents), and reporting both signals competence and gives readers the information they need.
- For multidimensional scales, choose omega total or omega hierarchical based on what the construct is. If the scale measures a hierarchical construct with a meaningful general factor, omega hierarchical isolates that factor’s contribution. If it measures a multifactor construct without a meaningful single factor, omega total is more interpretable.
- For severely non-normal continuous distributions, consider omega RT or the GLB. The simulation evidence supports these for exponential-shaped distributions specifically.
- Use ordinal alpha (or ordinal omega) for Likert data when the distributions are skewed. Polychoric-correlation-based estimates handle the discrete, non-normal nature of Likert data more cleanly than product-moment-correlation-based estimates.
- Don’t over-interpret reliability differences in the second decimal place. Reliability point estimates have meaningful sampling variability; differences of .02–.03 between indices are usually within that variability and rarely indicate a substantive difference.
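The sampling-variability point is easy to make concrete with a bootstrap. The sketch below (simulated data; the sample size, loadings, and replication count are arbitrary choices) puts an interval around a single alpha estimate:

```python
import numpy as np

def cronbach_alpha(x):
    k = x.shape[1]
    cov = np.cov(x, rowvar=False)
    return (k / (k - 1)) * (1 - np.trace(cov) / cov.sum())

rng = np.random.default_rng(1)
# Simulated 5-item tau-equivalent scale, n = 200, reliability about .80
f = rng.normal(size=(200, 1))
x = 0.67 * f + rng.normal(scale=np.sqrt(1 - 0.67**2), size=(200, 5))

# Percentile bootstrap: resample persons with replacement, recompute alpha
boot = [cronbach_alpha(x[rng.integers(0, len(x), len(x))]) for _ in range(1000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
```

At n = 200 the 95% interval typically spans several hundredths, which is exactly why differences of .02–.03 between indices rarely mean much.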
Limits of what reliability estimates tell you
Several broader points are worth keeping in mind:
- High reliability does not imply validity. A scale can have alpha of .95 and measure something other than what its name claims. Reliability is necessary but not sufficient for valid inference.
- Reliability is sample-specific. A scale’s alpha or omega is the reliability in the sample being analyzed, not a property of the scale itself. The same scale administered to a different population will produce different estimates.
- Item-level reliability is not the only source of measurement error. Test-retest stability, inter-rater reliability for clinical instruments, and parallel-forms reliability all provide additional information that internal-consistency indices do not capture.
- The single number conceals score-level variation. A reliability of .85 represents an average across the whole score distribution. McNeish and Dumas (2025, in Behavior Research Methods) have shown that reliability often varies substantially across the score range, with some scales reliable in the middle but unreliable at the extremes.
What the literature has not settled
Several questions remain open:
- Default reporting standards. Whether journals and editors should require omega in addition to or instead of alpha is a policy question on which the field has not converged.
- Best practices for two-point items. Dichotomous items remain a special case. KR-20 (the special case of alpha for dichotomous items) has its own properties under non-normality, and the Xiao-Hau simulation focused on Likert-type rather than dichotomous data.
- Multilevel reliability. When data are nested (students within classrooms within schools), the appropriate reliability metric depends on the level at which inferences are made. Standard alpha and omega do not directly handle multilevel structure.
- Reliability of difference scores and change scores. Internal-consistency reliability of a single administration tells you little about how reliably differences between individuals or changes within an individual can be detected.
Frequently Asked Questions
Is coefficient alpha actually wrong?
Not exactly. Alpha is mathematically defined and well-understood. It is widely misinterpreted: it is a lower-bound estimate of reliability under specific assumptions (tau-equivalence, uncorrelated errors), and those assumptions are routinely violated in practice. Reporting alpha is not wrong; treating it as the precise true reliability is.
Do I always need omega?
For most psychological research, reporting omega alongside alpha is a substantial improvement. For high-loading unidimensional scales with normally distributed scores, the practical difference may be small, but the reporting cost is also small.
What’s the difference between omega and omega hierarchical?
Omega (or omega total) is reliability under a factor model that may include multiple correlated factors. Omega hierarchical specifically isolates the proportion of variance attributable to a single general factor, useful when the scale has hierarchical structure with a meaningful overall construct above subfactor groupings.
What does “non-normality” actually do to reliability estimates?
For typical psychological data with reasonable scale strength and four or more response categories, mild-to-moderate non-normality produces small biases that are unlikely to change inferential conclusions. Severe non-normality (heavy skew or kurtosis, exponential-shaped distributions) produces larger biases, particularly for weak scales.
Should I use ordinal alpha for Likert items?
For Likert items with substantial skew or floor/ceiling effects, ordinal alpha based on polychoric correlations is generally a better-behaved estimate than standard alpha based on Pearson correlations. For items with reasonably symmetric distributions, the two are usually close.
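The reason polychoric-based estimates help is that coarsening a continuous response into a few skewed categories attenuates the Pearson correlation between items. A small demonstration (simulated bivariate normal data; the thresholds are chosen only to induce skew):

```python
import numpy as np

rng = np.random.default_rng(7)
# Two continuous item responses with latent correlation .6
xy = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=5000)

# Coarsen into a skewed 4-category Likert response via unequal thresholds
likert = np.digitize(xy, bins=[-0.5, 0.5, 1.5])

r_continuous = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
r_discrete = np.corrcoef(likert[:, 0], likert[:, 1])[0, 1]
# Pearson r on the coarsened scores understates the latent correlation
```

Polychoric correlation estimates model exactly this thresholding process, recovering the latent correlation that the Pearson r on the coarsened scores understates.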
What’s the GLB and when should I use it?
The Greatest Lower Bound (GLB) is the largest possible lower bound on reliability under standard assumptions. It is robust under exponential-shaped non-normal distributions in the Xiao-Hau simulation. Some authors warn that GLB can overestimate reliability in small samples.
If I report alpha and omega and they disagree, which do I trust?
The disagreement itself is informative. If alpha is substantially lower than omega, the scale likely has unequal item loadings (congeneric structure). Omega is generally the more appropriate estimate in that case, but reporting both and explaining the difference is more transparent than choosing one and hiding the other.
References
- Xiao, L., & Hau, K.-T. (2023). Performance of Coefficient Alpha and Its Alternatives: Effects of Different Types of Non-Normality. Educational and Psychological Measurement, 83(1), 5–27. https://doi.org/10.1177/00131644221088240
- Flora, D. B. (2020). Your Coefficient Alpha Is Probably Wrong, but Which Coefficient Omega Is Right? A Tutorial on Using R to Obtain Better Reliability Estimates. Advances in Methods and Practices in Psychological Science, 3(4), 484–501. https://doi.org/10.1177/2515245920951747
- McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23(3), 412–433. https://doi.org/10.1037/met0000144
- Sijtsma, K. (2009). On the Use, the Misuse, and the Very Limited Usefulness of Cronbach’s Alpha. Psychometrika, 74(1), 107–120. https://doi.org/10.1007/s11336-008-9101-0
- Revelle, W., & Zinbarg, R. E. (2009). Coefficients Alpha, Beta, Omega, and the glb: Comments on Sijtsma. Psychometrika, 74(1), 145–154. https://doi.org/10.1007/s11336-008-9102-z
Jouve, X. (2023, February 5). Evaluating Coefficient Alpha and Alternatives in Non-Normal Data. PsychoLogic. https://www.psychologic.online/2023/02/05/coefficient-alpha-alternatives-non-normal-data/

