Cronbach’s coefficient alpha is the most widely reported reliability statistic in psychology and educational measurement. It is also one of the most misunderstood. The classical formula assumes that test items measure a single construct with equal factor loadings (tau-equivalence), uncorrelated errors, and continuously distributed scores. Real psychological measurement rarely meets all three assumptions: most scales use Likert responses (discrete), have items with unequal contributions to the construct (congeneric), and produce score distributions that depart from normality. The natural question is how badly alpha breaks under these violations and which alternatives perform better. A 2023 simulation study by Xiao and Hau in Educational and Psychological Measurement provides a systematic answer, with implications for the routine reliability reporting that fills psychometric methods sections.
What coefficient alpha is and what it actually measures
Coefficient alpha is, formally, an estimate of the reliability of a sum score under specific assumptions. When tau-equivalence holds (all items measure the same construct with equal loadings) and errors are uncorrelated, the population value of alpha equals the reliability. When tau-equivalence is violated — which is essentially always, in practice — alpha is a lower bound on reliability, and the true reliability is typically somewhat higher than what alpha reports.
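Concretely, alpha for k items is k/(k−1) times one minus the ratio of the summed item variances to the variance of the sum score. A minimal sketch in Python (numpy only; the four-item, tau-equivalent scale below is an illustrative simulation, not data from any study discussed here):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_persons, k_items) score matrix."""
    k = items.shape[1]
    cov = np.cov(items, rowvar=False)   # k x k item covariance matrix
    item_vars = np.trace(cov)           # sum of item variances
    total_var = cov.sum()               # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Illustrative tau-equivalent scale: 4 items, equal unit loadings,
# unit error variance, so the population reliability is 16/20 = .80
rng = np.random.default_rng(0)
factor = rng.normal(size=(500, 1))
items = factor + rng.normal(size=(500, 4))
alpha = cronbach_alpha(items)
```

Because the simulated items have equal loadings and uncorrelated errors, the estimate should land near the population reliability of .80, the one case where alpha is not merely a lower bound.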
This lower-bound property has been the source of decades of misinterpretation. A scale with alpha = .80 has reliability of at least .80 under the standard assumptions; it does not have reliability of exactly .80. Sijtsma’s 2009 paper in Psychometrika argued that this and related misconceptions render alpha “of very limited usefulness” — a sharp claim that prompted the companion response from Revelle and Zinbarg in the same issue, defending the role of alpha alongside more general omega-family coefficients.
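The lower-bound behavior can be checked analytically. Given standardized loadings (the values below are hypothetical), the population item covariance matrix is λλᵀ + diag(θ), and both population alpha and the true reliability follow directly from it:

```python
import numpy as np

# Hypothetical congeneric scale: unequal standardized loadings
lam = np.array([0.9, 0.7, 0.5, 0.3])
theta = 1 - lam**2                              # uniquenesses (unit item variances)
Sigma = np.outer(lam, lam) + np.diag(theta)     # population covariance matrix

k = len(lam)
alpha = (k / (k - 1)) * (1 - np.trace(Sigma) / Sigma.sum())
true_reliability = lam.sum()**2 / Sigma.sum()   # (sum of loadings)^2 / total variance
# alpha understates true_reliability because the loadings are unequal
```

With these loadings, population alpha comes out near .68 against a true reliability near .71: exactly the gap the congeneric omega coefficients are designed to close.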
The alternatives developed since the 1990s aim to remedy specific limitations:
- Coefficient omega. Computed from factor analysis loadings and uniquenesses rather than item covariances. Unbiased under congeneric models (different loadings per item) where alpha is biased.
- Omega hierarchical (ω_h). Specifically estimates the proportion of variance attributable to a single general factor, useful when the scale has multiple correlated subfactors.
- Omega total / Revelle’s omega total (ω_t / ω_RT). Captures variance from the general factor plus group factors, providing a more general reliability estimate.
- Greatest Lower Bound (GLB). The largest reliability lower bound obtainable from the observed item covariance matrix; by construction at least as large as alpha.
- Coefficient H. A construct reliability index that does not require tau-equivalence and is appropriate for unidimensional models with unequal loadings.
- Ordinal alpha. Computed from polychoric correlations rather than Pearson correlations, addressing the discrete-data assumption violation common with Likert items.
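For the unidimensional congeneric case, omega has a simple closed form from the factor solution: the squared sum of loadings divided by the squared sum of loadings plus the sum of uniquenesses. A sketch, using hypothetical standardized loadings rather than output from any fitted model:

```python
import numpy as np

def congeneric_omega(loadings, uniquenesses):
    """Omega for a unidimensional congeneric model:
    (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses)."""
    lam = np.asarray(loadings, dtype=float)
    common = lam.sum() ** 2
    return common / (common + np.sum(uniquenesses))

# Hypothetical standardized loadings for a 4-item scale
loadings = [0.8, 0.7, 0.6, 0.5]
uniquenesses = [1 - l**2 for l in loadings]   # standardized: theta_i = 1 - lambda_i^2
omega = congeneric_omega(loadings, uniquenesses)
```

In practice the loadings and uniquenesses would come from a fitted CFA (e.g., lavaan or psych in R); the formula itself is unchanged.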
The Xiao and Hau study compared alpha against all of these alternatives across systematically varied conditions of non-normality, scale strength, and discreteness.
What the simulation found
The simulation generated data under multiple distribution shapes (continuous normal, mildly non-normal, severely non-normal, exponential, binomial-beta) and multiple scale-strength conditions (strong, moderate, and weak factor loadings), then computed each reliability index and compared it to the true population reliability. The findings are nuanced.
For continuous data:
- With strong scales (high factor loadings), alpha and its alternatives all performed well, with acceptable bias even under substantial non-normality. The historical concern that alpha “requires” continuous and normally distributed data is overstated for high-loading scales.
- With moderate-strength scales, bias became noticeable and increased with the severity of non-normality.
- With weak scales, bias was substantial across most indices.
The pattern means that scale strength matters more than data distribution in determining how well reliability estimates behave. A weak scale on normally distributed continuous data may produce more biased reliability estimates than a strong scale on non-normal Likert data.
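The scale-strength point is easy to reproduce in miniature. The toy Monte Carlo below (normal data, tau-equivalent loadings, all values chosen for illustration) shows how much noisier alpha becomes when loadings drop from .8 to .3:

```python
import numpy as np

def sample_alpha(loading, n=100, k=4, rng=None):
    """Alpha from one simulated sample of a tau-equivalent scale."""
    f = rng.normal(size=(n, 1))
    errors = rng.normal(scale=np.sqrt(1 - loading**2), size=(n, k))
    x = loading * f + errors                 # standardized items
    cov = np.cov(x, rowvar=False)
    return (k / (k - 1)) * (1 - np.trace(cov) / cov.sum())

rng = np.random.default_rng(42)
strong = [sample_alpha(0.8, rng=rng) for _ in range(200)]
weak = [sample_alpha(0.3, rng=rng) for _ in range(200)]
sd_strong, sd_weak = np.std(strong), np.std(weak)
```

Even with well-behaved normal data, the weak-scale estimates scatter far more widely than the strong-scale ones, before any non-normality enters the picture.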
For Likert-type discrete data:
- With four or more response categories, most reliability indices performed acceptably under non-normal conditions. The exception was omega hierarchical, which showed problematic behavior in some Likert conditions.
- More response categories produced better estimates, especially under severe non-normality. Five- and seven-point scales outperformed four-point scales when the distribution was extreme.
This is one of the most practically useful findings in the paper: using at least four response categories on Likert-type items is more important than choosing among alpha and its alternatives, given otherwise reasonable scale construction.
For exponential and binomial-beta distributions:
- Exponentially distributed data: Omega RT (Revelle’s omega total) and GLB were robust; other indices showed more bias.
- Binomial-beta distributed data: Most indices showed substantial bias. The authors are explicit that no single index they tested handled this distribution shape well.
Real-world data check: The authors examined items from a large-scale international survey and found their items to be at most moderately non-normal. The implication is that the most extreme conditions in the simulation are less common in practice than methodological alarm might suggest, and the working researcher is usually operating in a regime where most reliability indices perform acceptably.
The omega controversy: which omega is right?
A practical complication in moving from alpha to omega is that “omega” refers to a family of related coefficients with different assumptions and interpretations. Flora’s 2020 tutorial in Advances in Methods and Practices in Psychological Science — titled “Your Coefficient Alpha Is Probably Wrong, but Which Coefficient Omega Is Right?” — confronts exactly this question. The key distinctions:
- ω for unidimensional scales with congeneric loadings, computed from a CFA, gives the proportion of sum-score variance attributable to the single common factor.
- ωt (omega total) for multidimensional scales, capturing variance from a general factor plus group factors.
- ωh (omega hierarchical) specifically isolates the general-factor contribution from group-factor contributions.
Flora’s recommendation is that researchers fit a confirmatory factor analysis appropriate to the scale’s hypothesized structure and then compute the omega coefficient that matches the construct interpretation they intend. This is more work than reporting alpha but produces a defensible reliability estimate that respects the scale’s actual structure.
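The distinctions among the omegas reduce to which loadings enter the numerator. With a hypothetical bifactor loading matrix (six items, one general factor, two group factors; all values invented for illustration):

```python
import numpy as np

# Invented standardized bifactor loadings: 6 items, one general factor,
# two group factors (items 1-3 and items 4-6)
general = np.array([0.7, 0.6, 0.6, 0.5, 0.6, 0.5])
group1  = np.array([0.4, 0.4, 0.3, 0.0, 0.0, 0.0])
group2  = np.array([0.0, 0.0, 0.0, 0.4, 0.3, 0.4])
uniq = 1 - general**2 - group1**2 - group2**2

common = general.sum()**2 + group1.sum()**2 + group2.sum()**2
total_var = common + uniq.sum()

omega_total = common / total_var           # general + group factor variance
omega_h = general.sum()**2 / total_var     # general factor variance only
```

Omega total credits both the general and the group factors, while omega hierarchical counts only the general factor, which is why it is always the smaller of the two when group factors carry variance.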
McNeish’s 2018 Psychological Methods paper, sharply titled “Thanks coefficient alpha, we’ll take it from here,” argues for routine replacement of alpha with omega in psychological research. The argument is that alpha’s assumptions are violated almost universally in practice; that alternatives are now well-developed and computationally accessible; and that retaining alpha as the default reliability statistic perpetuates a measurement practice the field has known to be inadequate for decades.
The counter-argument, most clearly articulated in Revelle and Zinbarg’s 2009 piece, is more measured: alpha remains a useful lower-bound estimate that requires no factor-analytic specification, and reporting alpha alongside omega-family coefficients is more informative than replacing one with the other.
Practical recommendations
Synthesizing across the Xiao-Hau simulation, the omega-family literature, and the general psychometric reliability literature:
- Build strong scales first. The Xiao-Hau result that scale strength matters more than data distribution recommends investing in item development (multiple high-loading items per construct) over choosing among reliability indices.
- Use at least four Likert response categories. Four or more points produced acceptable performance for most indices under most non-normality conditions; fewer points amplify the impact of distributional problems.
- For unidimensional scales, report omega alongside alpha. Both are easy to compute in modern statistical software (R packages psych, semTools, and lavaan; Stata and Mplus equivalents), and reporting both signals competence and gives readers the information they need.
- For multidimensional scales, choose omega total or omega hierarchical based on what the construct is. If the scale measures a hierarchical construct with a meaningful general factor, omega hierarchical isolates that factor’s contribution. If it measures a multifactor construct without a meaningful single factor, omega total is more interpretable.
- For severely non-normal continuous distributions, consider omega RT or the GLB. The simulation evidence supports these for exponential-shaped distributions specifically.
- Use ordinal alpha (or ordinal omega) for Likert data when the distributions are skewed. Polychoric-correlation-based estimates handle the discrete, non-normal nature of Likert data more cleanly than product-moment-correlation-based estimates.
- Don’t over-interpret reliability differences in the second decimal place. Reliability point estimates have meaningful sampling variability; differences of .02–.03 between indices are usually within that variability and rarely indicate a substantive difference.
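The sampling-variability point is easy to make concrete with a bootstrap. The sketch below (simulated data; the sample size, loadings, and replication count are arbitrary choices) puts an interval around a single alpha estimate:

```python
import numpy as np

def cronbach_alpha(x):
    k = x.shape[1]
    cov = np.cov(x, rowvar=False)
    return (k / (k - 1)) * (1 - np.trace(cov) / cov.sum())

rng = np.random.default_rng(1)
# Simulated 5-item tau-equivalent scale, n = 200, reliability about .80
f = rng.normal(size=(200, 1))
x = 0.67 * f + rng.normal(scale=np.sqrt(1 - 0.67**2), size=(200, 5))

# Percentile bootstrap: resample persons with replacement, recompute alpha
boot = [cronbach_alpha(x[rng.integers(0, len(x), len(x))]) for _ in range(1000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
```

At n = 200 the 95% interval typically spans several hundredths, which is exactly why differences of .02–.03 between indices rarely mean much.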
Limits of what reliability estimates tell you
Several broader points are worth keeping in mind:
- High reliability does not imply validity. A scale can have alpha of .95 and measure something other than what its name claims. Reliability is necessary but not sufficient for valid inference.
- Reliability is sample-specific. A scale’s alpha or omega is the reliability in the sample being analyzed, not a property of the scale itself. The same scale administered to a different population will produce different estimates.
- Item-level reliability is not the only source of measurement error. Test-retest stability, inter-rater reliability for clinical instruments, and parallel-forms reliability all provide additional information that internal-consistency indices do not capture.
- The single number conceals score-level variation. A reliability of .85 represents an average across the whole score distribution. McNeish and Dumas (2025, in Behavior Research Methods) have shown that reliability often varies substantially across the score range, with some scales reliable in the middle but unreliable at the extremes.
What the literature has not settled
Several questions remain open:
- Default reporting standards. Whether journals and editors should require omega in addition to or instead of alpha is a policy question on which the field has not converged.
- Best practices for two-point items. Dichotomous items remain a special case. KR-20 (the special case of alpha for dichotomous items) has its own properties under non-normality, and the Xiao-Hau simulation focused on Likert-type rather than dichotomous data.
- Multilevel reliability. When data are nested (students within classrooms within schools), the appropriate reliability metric depends on the level at which inferences are made. Standard alpha and omega do not directly handle multilevel structure.
- Reliability of difference scores and change scores. Internal-consistency reliability of a single administration tells you little about how reliably differences between individuals or changes within an individual can be detected.
Frequently Asked Questions
Is coefficient alpha actually wrong?
Not exactly. Alpha is mathematically defined and well-understood. It is widely misinterpreted: it is a lower-bound estimate of reliability under specific assumptions (tau-equivalence, uncorrelated errors), and those assumptions are routinely violated in practice. Reporting alpha is not wrong; treating it as the precise true reliability is.
Do I always need omega?
For most psychological research, reporting omega alongside alpha is a substantial improvement. For high-loading unidimensional scales with normally distributed scores, the practical difference may be small, but the reporting cost is also small.
What’s the difference between omega and omega hierarchical?
Omega (or omega total) is reliability under a factor model that may include multiple correlated factors. Omega hierarchical specifically isolates the proportion of variance attributable to a single general factor, useful when the scale has hierarchical structure with a meaningful overall construct above subfactor groupings.
What does “non-normality” actually do to reliability estimates?
For typical psychological data with reasonable scale strength and four or more response categories, mild-to-moderate non-normality produces small biases that are unlikely to change inferential conclusions. Severe non-normality (heavy skew or kurtosis, exponential-shaped distributions) produces larger biases, particularly for weak scales.
Should I use ordinal alpha for Likert items?
For Likert items with substantial skew or floor/ceiling effects, ordinal alpha based on polychoric correlations is generally a better-behaved estimate than standard alpha based on Pearson correlations. For items with reasonably symmetric distributions, the two are usually close.
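The reason polychoric-based estimates help is that coarsening a continuous response into a few skewed categories attenuates the Pearson correlation between items. A small demonstration (simulated bivariate normal data; the thresholds are chosen only to induce skew):

```python
import numpy as np

rng = np.random.default_rng(7)
# Two continuous item responses with latent correlation .6
xy = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=5000)

# Coarsen into a skewed 4-category Likert response via unequal thresholds
likert = np.digitize(xy, bins=[-0.5, 0.5, 1.5])

r_continuous = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
r_discrete = np.corrcoef(likert[:, 0], likert[:, 1])[0, 1]
# Pearson r on the coarsened scores understates the latent correlation
```

Polychoric correlation estimates model exactly this thresholding process, recovering the latent correlation that the Pearson r on the coarsened scores understates.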
What’s the GLB and when should I use it?
The Greatest Lower Bound (GLB) is the largest possible lower bound on reliability under standard assumptions. It is robust under exponential-shaped non-normal distributions in the Xiao-Hau simulation. Some authors warn that GLB can overestimate reliability in small samples.
If I report alpha and omega and they disagree, which do I trust?
The disagreement itself is informative. If alpha is substantially lower than omega, the scale likely has unequal item loadings (congeneric structure). Omega is generally the more appropriate estimate in that case, but reporting both and explaining the difference is more transparent than choosing one and hiding the other.
References
- Xiao, L., & Hau, K.-T. (2023). Performance of Coefficient Alpha and Its Alternatives: Effects of Different Types of Non-Normality. Educational and Psychological Measurement, 83(1), 5–27. https://doi.org/10.1177/00131644221088240
- Flora, D. B. (2020). Your Coefficient Alpha Is Probably Wrong, but Which Coefficient Omega Is Right? A Tutorial on Using R to Obtain Better Reliability Estimates. Advances in Methods and Practices in Psychological Science, 3(4), 484–501. https://doi.org/10.1177/2515245920951747
- McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23(3), 412–433. https://doi.org/10.1037/met0000144
- Sijtsma, K. (2009). On the Use, the Misuse, and the Very Limited Usefulness of Cronbach’s Alpha. Psychometrika, 74(1), 107–120. https://doi.org/10.1007/s11336-008-9101-0
- Revelle, W., & Zinbarg, R. E. (2009). Coefficients Alpha, Beta, Omega, and the glb: Comments on Sijtsma. Psychometrika, 74(1), 145–154. https://doi.org/10.1007/s11336-008-9102-z
Jouve, X. (2023, February 5). Evaluating Coefficient Alpha and Alternatives in Non-Normal Data. PsychoLogic. https://www.psychologic.online/2023/02/05/coefficient-alpha-alternatives-non-normal-data/

