Blazej Mrozinski

Ordinal Data


Ordinal data is the data type that most rating scales actually produce, and the data type that most rating-scale analyses pretend isn’t the data type they’re working with. The distinction between ordinal and interval data is a foundational point in measurement theory, and the consequences of ignoring it run through almost every applied psychometric setting — from NPS reporting to engagement-survey averages to leadership-assessment percentile bands. The convention of analyzing 1-to-5 Likert responses with means, standard deviations, and Pearson correlations treats the data as interval; the data is ordinal; and the analysis is wrong in ways that are usually small but sometimes consequential.

The textbook example is Net Promoter Score. The respondent picks a number from 0 to 10 indicating how likely they are to recommend the product. The convention treats the 0-10 response as interval-level data, splits it into three groups (0-6 detractors, 7-8 passives, 9-10 promoters), and reports the percentage of promoters minus the percentage of detractors. The threshold structure encodes the assumption that the difference between a 6 and a 7 is the same as the difference between an 8 and a 9, which is exactly the kind of equal-spacing assumption that ordinal data does not support.
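To make the scoring rule concrete, here is a minimal sketch of the NPS calculation as described above. The function name and the sample responses are made up for illustration.

```python
def nps(responses):
    """Net Promoter Score, in points from -100 to 100, for 0-10 responses."""
    if not responses:
        raise ValueError("need at least one response")
    promoters = sum(1 for r in responses if r >= 9)   # scores 9-10
    detractors = sum(1 for r in responses if r <= 6)  # scores 0-6
    # Passives (7-8) only count toward the denominator.
    return 100.0 * (promoters - detractors) / len(responses)


# Made-up responses, for illustration only.
sample = [10, 9, 9, 8, 7, 7, 6, 5, 3, 10]
print(nps(sample))  # 4 promoters, 3 detractors, 10 responses -> 10.0
```

Note that every response is reduced to one of three values (-1, 0, +1) before anything is averaged, so where a response falls within a band has no effect on the result.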

The Measurement Levels

The standard taxonomy, originating with Stevens (1946), organizes measurement into four levels with increasing information content:

Nominal. Categories with no inherent order. Examples: country, blood type, color preference. The only operation that makes sense on nominal data is counting frequencies; means and orderings are meaningless.

Ordinal. Categories with a meaningful order but no guaranteed equal spacing. Examples: Likert ratings (1=strongly disagree to 5=strongly agree), education level (high school, bachelor’s, master’s, doctorate), tournament rankings (first, second, third). Operations that preserve order are valid (median, percentiles, rank correlations); operations that assume equal spacing (mean, standard deviation, Pearson correlation) are technically not valid but are widely used in practice.

Interval. Ordered with guaranteed equal spacing, but no true zero point. Examples: temperature in Celsius or Fahrenheit, calendar years. Means, standard deviations, and Pearson correlations are valid; ratios are not (40°F is not “twice as warm” as 20°F).

Ratio. Ordered with equal spacing and a true zero point. Examples: length, weight, time duration, count data. All statistical operations are valid.

The line that matters for applied psychometrics is between ordinal and interval. Rating-scale data is ordinal by construction: the respondent who picks 4 over 3 is signaling “more” but not signaling “more by exactly the same amount that 5 is more than 4.” Decades of research indicate that the equal-spacing assumption fails empirically: the psychological distance between adjacent rating categories varies across the scale (the gap between 4 and 5 is often smaller than the gap between 3 and 4) and across respondents (some people use the endpoints, some avoid them).
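One way to see why the ordinal/interval line matters: an ordinal scale only fixes the order of the categories, so any order-preserving relabeling is equally legitimate. The sketch below uses made-up data and a hypothetical recoding (5 becomes 10) to show that a comparison of medians survives the relabeling while a comparison of means can reverse.

```python
from statistics import mean, median

# Made-up responses on a 1-5 scale.
group_a = [3, 3, 3, 3, 3]
group_b = [1, 1, 1, 5, 5]

# Hypothetical order-preserving recoding: same order, different spacing.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}
group_a_recoded = [recode[x] for x in group_a]
group_b_recoded = [recode[x] for x in group_b]

# Means: A is ahead of B on the original codes, behind B after recoding.
print(mean(group_a) > mean(group_b))                  # True
print(mean(group_a_recoded) > mean(group_b_recoded))  # False

# Medians: A is ahead of B under both codings.
print(median(group_a) > median(group_b))                  # True
print(median(group_a_recoded) > median(group_b_recoded))  # True
```

If the conclusion depends on which spacing is assumed, the data alone cannot settle it. That is the practical meaning of "operations that assume equal spacing are not valid."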

Why It Matters in Practice

The conventional defense for treating ordinal data as interval is that the conclusions usually don’t change much. Means, regressions, and ANOVA on ordinal data typically produce qualitatively similar results to the corresponding ordinal methods (medians, ordinal logistic regression, Kruskal-Wallis). For exploratory analysis or rough comparisons, the simplification is often defensible.

The defense breaks down in three recurring situations:

When the score distribution is skewed. Likert-scale responses tend to pile up at the favorable end (4 and 5 on a 5-point scale). Treating the data as interval and computing a mean produces a number that compresses meaningful between-group differences. Two group means of 4.1 and 4.3 might reflect substantially different response distributions, yet the comparison treats the groups as essentially equivalent.
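A toy illustration of how much a mean can hide. The two teams below are invented and have exactly the same mean, but their response distributions tell different stories.

```python
from collections import Counter
from statistics import mean

# Made-up 5-point responses for two hypothetical teams.
team_a = [4] * 10               # everyone answers 4
team_b = [5] * 7 + [2, 2, 1]    # mostly 5s, plus a dissatisfied minority


def shares(responses):
    """Proportion of responses in each category 1-5."""
    counts = Counter(responses)
    n = len(responses)
    return {category: counts.get(category, 0) / n for category in range(1, 6)}


print(mean(team_a), mean(team_b))  # identical means
print(shares(team_a))
print(shares(team_b))
```

Reporting the category proportions (or at least the median and the share at each end) alongside the mean is a cheap way to avoid treating these two teams as interchangeable.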

When the analysis depends on thresholds. NPS is the cleanest case. The 0-6 / 7-8 / 9-10 thresholds for detractor / passive / promoter encode strong assumptions about where the meaningful breaks in the scale are, and the thresholds are not derivable from the data; they are imposed. A response of 7 contributes nothing to the NPS, a response of 6 contributes -1 to the proportional calculation, and the difference between 6 and 7 is therefore treated as larger than the difference between 7 and 8 (which counts for nothing), a weighting that the ordinal nature of the data does not support.

When the scale has known ceiling effects or floor effects. When most responses are at one end of the scale, the interval treatment gives every one-point step the same weight, even though the steps are doing different things. A respondent who picks 5 on a 1-to-5 scale and a respondent who picks 4 might be expressing nearly identical levels of agreement (the scale is saturated at the high end), while a respondent who picks 3 versus 4 might be expressing a meaningfully different level (mid-range is where the scale is doing its discriminating work). Treating the data as interval imposes a uniform-distance assumption that the shape of the response distribution contradicts.

The IRT Alternative

Item response theory is the standard psychometric framework for handling ordinal data correctly. The Partial Credit Model (Masters) and the Rating Scale Model (Andrich) explicitly model the threshold structure between adjacent categories rather than assuming equal spacing. The output is a person ability estimate on a continuous latent scale, derived from the categorical responses without imposing the interval-data assumption.
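As a rough sketch of what "modeling the threshold structure" means, the function below computes category probabilities for a single item under the Partial Credit Model. The step parameters are invented for illustration, not estimates from any real instrument, and a real analysis would estimate them from data with dedicated IRT software.

```python
import math


def pcm_probs(theta, deltas):
    """Category probabilities for one item under the Partial Credit Model.

    theta  : person location on the latent (logit) scale
    deltas : step parameters, one per boundary between adjacent categories
    Returns probabilities for categories 0..len(deltas).
    """
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical, unevenly spaced step parameters for a 5-category item.
steps = [-1.5, -0.5, 0.2, 2.0]
for theta in (-1.0, 0.2, 2.0):
    print(theta, [round(p, 3) for p in pcm_probs(theta, steps)])
```

The steps are free parameters, so the model can represent a scale where the last step (here 2.0) is much harder to cross than the earlier ones. When theta equals a step parameter, the two adjacent categories are equally likely.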

The IRT approach addresses the ordinal-data problem at the cost of computational complexity and interpretability. The mean Likert score is intuitive; the IRT-derived ability estimate is not, and stakeholders who can read a “4.2 out of 5” usually can’t read a “theta = 0.82 on the logit scale.” This is one of the reasons IRT remains under-used in commercial assessment despite being the technically appropriate method — the conversion to a stakeholder-readable number erases the benefits of the more sophisticated estimation.

A middle path is to use IRT for the underlying scoring and conversion (producing the ability estimates) and then present the results on a transformed scale (percentile bands, stanines, T-scores) that is more interpretable while preserving the ordinal-appropriate measurement. This is the approach the more careful commercial instruments take, though it’s not universal.
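A minimal sketch of the conversion step, assuming the ability estimate is already standardized (mean 0, SD 1) and roughly normal in the reference group. That assumption is mine, not something every instrument satisfies, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def report_scores(theta):
    """Convert a latent estimate into T-score, percentile, and stanine.

    ASSUMPTION: theta is standardized (mean 0, SD 1) and approximately
    normal in the norm group, so it can be used directly as a z-score.
    """
    z = theta
    t_score = 50 + 10 * z                   # T-scores: mean 50, SD 10
    percentile = 100 * NormalDist().cdf(z)  # percent of norm group below z
    # Stanines: half-SD-wide bands centered on 5, clipped to 1..9.
    stanine = min(9, max(1, math.floor(2 * z + 5.5)))
    return {"t_score": t_score, "percentile": percentile, "stanine": stanine}


print(report_scores(0.82))
```

For the theta of 0.82 mentioned earlier, this gives a T-score of about 58, a percentile of about 79, and a stanine of 7. All three are monotone transformations of theta, so the ordering of people produced by the IRT model is unchanged.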

What Goes Wrong When the Distinction Is Ignored

The interval-treatment of ordinal data produces several recurring failure modes:

Inflated apparent precision. Reporting a Likert mean to two decimal places (“employee engagement: 4.23”) suggests a level of precision the underlying ordinal data doesn’t support. The same data analyzed at the ordinal level might support only a coarser distinction (above-median, below-median, or quartile bands), and the two-decimal-place mean creates false confidence in small between-group differences.

Misspecified regression models. Using a Likert-scale outcome in a linear regression assumes the outcome is interval. When the outcome is ordinal and has ceiling effects, the linear regression’s coefficients can be biased and the residual structure violates the model’s assumptions. Ordinal logistic regression (most commonly the proportional-odds model) is the appropriate alternative but is used less often because its interpretation is less straightforward.
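To show what the proportional-odds model predicts, here is a small sketch that turns a predictor value into category probabilities. The slope and cutpoints are invented for illustration, and in practice they would be fitted with a statistics package rather than set by hand.

```python
import math


def proportional_odds_probs(x, beta, cutpoints):
    """Category probabilities under a proportional-odds (cumulative logit) model.

    P(Y <= k | x) = logistic(cutpoints[k] - beta * x), with cutpoints increasing.
    Returns probabilities for len(cutpoints) + 1 ordered categories.
    """
    def logistic(v):
        return 1.0 / (1.0 + math.exp(-v))

    cumulative = [logistic(c - beta * x) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for cum in cumulative:
        probs.append(cum - previous)  # P(Y = k) = P(Y <= k) - P(Y <= k-1)
        previous = cum
    return probs


# Hypothetical parameters for a 5-category outcome.
cuts = [-2.0, -1.0, 0.5, 2.0]
slope = 0.8
for x in (0.0, 2.0):
    print(x, [round(p, 3) for p in proportional_odds_probs(x, slope, cuts)])
```

The prediction is a full distribution over the five categories, so it can never fall outside the scale, which a linear model's fitted values can. The cutpoints are unevenly spaced on purpose, since the model does not require equal distances between categories.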

Misleading effect-size reporting. Cohen’s d on ordinal data assumes interval scaling and equal-variance distributions. When the data are ordinal and the distributions are skewed (as Likert distributions usually are), the d-value doesn’t have its conventional interpretation. Rank-based effect sizes (rank-biserial correlation, Cliff’s delta) are the correct alternatives but rarely appear in applied reports.
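Cliff's delta is simple enough to compute by hand: it is the proportion of cross-group pairs where the first group's response is higher, minus the proportion where it is lower. The sketch below uses made-up data and the same hypothetical recoding idea as before to show that it depends only on order.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    if not xs or not ys:
        raise ValueError("both groups need at least one observation")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Made-up 5-point responses for two groups.
group_x = [4, 5, 5, 3, 4]
group_y = [3, 4, 2, 3, 4]
print(cliffs_delta(group_x, group_y))  # 17 pairs higher, 2 lower, of 25 -> 0.6

# Hypothetical order-preserving recoding: the effect size does not change.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}
print(cliffs_delta([recode[v] for v in group_x], [recode[v] for v in group_y]))
```

The pairwise loop is quadratic in the sample sizes, which is fine for survey-sized groups. Larger data sets would call for a rank-based implementation.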

False discoveries from threshold artifacts. When the analysis depends on imposed thresholds (NPS, satisfaction bands, top-2-box reporting), small changes in the underlying distribution can produce large changes in the threshold-derived metric. An organization’s NPS can swing 10 points because a tenth of respondents moved from 6 to 7, a change that raises the mean response by only 0.1 on the 0-10 scale. The headline metric reports a dramatic change that doesn’t reflect a dramatic shift in actual sentiment.
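The threshold artifact is easy to reproduce with invented numbers. In the sketch below, two hypothetical survey waves of 100 responses differ only in that ten respondents answer 7 instead of 6.

```python
def nps_from_counts(counts):
    """NPS in points, from a {score: number_of_responses} mapping."""
    n = sum(counts.values())
    promoters = sum(c for score, c in counts.items() if score >= 9)
    detractors = sum(c for score, c in counts.items() if score <= 6)
    return 100.0 * (promoters - detractors) / n


def mean_from_counts(counts):
    """Mean response, from a {score: number_of_responses} mapping."""
    n = sum(counts.values())
    return sum(score * c for score, c in counts.items()) / n


# Made-up waves: ten respondents move from 6 to 7, nothing else changes.
wave_1 = {3: 5, 5: 10, 6: 20, 7: 15, 8: 20, 9: 20, 10: 10}
wave_2 = {3: 5, 5: 10, 6: 10, 7: 25, 8: 20, 9: 20, 10: 10}

print(nps_from_counts(wave_1), nps_from_counts(wave_2))    # -5.0 5.0
print(mean_from_counts(wave_1), mean_from_counts(wave_2))  # 7.3 7.4
```

The NPS flips sign and moves 10 points while the mean moves by a tenth of a scale point. Neither number is "the truth" about ordinal data, but seeing them together makes it obvious that the swing comes from where the threshold sits.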

When the Simplification Is Defensible

Not every analysis of ordinal data needs the full IRT treatment. The interval-treatment is defensible when:

  • The scale has many categories (7-point or more), the distribution is approximately symmetric and bell-shaped, and the analysis is exploratory or qualitative.
  • The summary statistic (mean, correlation) is being used for rough comparison rather than precise inference.
  • The audience cannot interpret the technically correct methods, and the simplified analysis won’t be used for decisions that depend on precise effect sizes.

The simplification is not defensible when:

  • The scale has few categories (5-point or fewer) and the distribution is skewed or has ceiling/floor effects.
  • The analysis depends on threshold-based scoring (NPS, top-2-box, satisfaction bands).
  • The analysis is being used to support a high-stakes decision (selection, promotion, organizational restructuring).
  • Small differences in the summary statistic are being treated as meaningful.

The honest reporting move is to either use ordinal-appropriate methods (median, rank correlations, ordinal regression, IRT) or to use interval methods and explicitly acknowledge the simplification and what it might miss.

Why the Distinction Is Often Ignored

The interval-treatment of ordinal data persists because it’s operationally easier and because the technically correct alternatives are usually unfamiliar to the stakeholders who consume the analysis. A CHRO asking “what’s the engagement score” doesn’t want a logit-scale ability estimate; they want a number that fits in a quarterly slide. The path of least resistance is to compute the mean, report it as if it were interval data, and accept the interpretive distortions because they’re usually small.

For the instruments I work on at Gyfted, the operating rule is to model the data ordinally underneath (using IRT or ordered-categorical models) and report the results on a converted scale that is intuitive enough to be useful while preserving the underlying measurement properties. The two-layer structure — ordinal-correct estimation, intuitive reporting — is more work than the single-layer approach but is the configuration that maintains both psychometric integrity and stakeholder usability. The configurations that flatten the structure into a single number reported as if it were interval data save effort up front and pay for it in misinterpreted results downstream.
