Blazej Mrozinski

Criterion Validity

Criterion validity is the validity type that finally answers the question every stakeholder actually wants answered: does this test predict anything in the real world that the organization cares about? The other validity types (face, content, construct, convergent, discriminant) are all evidence about what the test measures internally. Criterion validity is the evidence that what the test measures connects to something outside the test. Without it, a beautifully constructed instrument is an internally coherent guess.

The example is concrete. A leadership-assessment vendor sells an instrument that produces a “leadership potential” score. The instrument has strong convergent validity with other leadership scales, strong discriminant validity from cognitive ability and the Big Five, and acceptable internal consistency. What it doesn’t have, in any published study, is evidence that the score predicts being promoted, being rated as effective by direct reports, or being retained in leadership roles two years later. The instrument is psychometrically sound and operationally empty. That gap is exactly what criterion validity is supposed to fill, and its absence is the most common quiet failure in commercially deployed assessments.

The Two Variants

The term covers two related but distinct designs:

Concurrent validity. The test and the criterion are measured at the same time. A new measure of job satisfaction is administered alongside an existing well-validated job-satisfaction measure, and the correlation between the two is reported. Concurrent validity is fast to run — both measurements happen in the same data collection — and is the workhorse design for instrument development. Its limitation is that it tells you whether your measure agrees with another measure, not whether it predicts behavior that hasn’t happened yet.

Predictive validity. The test is measured first, the criterion is measured later. A hiring assessment is administered to candidates, who are then hired and observed for a period (typically six months to two years), and the correlation between assessment score and performance rating is reported. Predictive validity is the design that matters for selection use cases, because the entire point of the selection assessment is to make a prediction about future behavior. It is also expensive — the data takes time to accumulate, attrition reduces sample size, range restriction biases the estimate downward — which is why a lot of selection instruments are sold on concurrent evidence with predictive validity assumed.

The two designs answer different questions and should not be presented as interchangeable. Concurrent evidence answers “does this measure track an established measure of the same thing.” Predictive evidence answers “does this measure forecast outcomes that haven’t happened yet.” For a hiring decision, only the second question is operationally relevant. For an instrument-development cycle, the first is usually the starting point and the second the eventual destination.

What Counts as a Criterion

The choice of criterion is where most criterion-validity studies quietly fail. The instrument can be psychometrically sound, the analysis correct, and the correlation real — but if the criterion was a noisy proxy for the actual outcome the organization cares about, the validity evidence is weaker than the number suggests.

The standard criteria in organizational psychology are:

Supervisor performance ratings. The most common criterion in selection-validation studies because it’s easy to collect. Also the most problematic — supervisor ratings have well-documented unreliability (low inter-rater agreement when multiple supervisors rate the same person), halo effects (a single overall impression contaminating rating dimensions), and leniency bias (most people getting ratings clustered at the high end of the scale). A correlation of 0.30 between an assessment and supervisor ratings is in the typical range for validated selection tools, and that number is partly limited by the unreliability of the rating, not the assessment.

Hard performance metrics. Sales numbers, production counts, customer-satisfaction scores tied to individual employees. More objective than supervisor ratings but available in fewer roles. Hard metrics have their own problems: they reflect circumstance (territory size, customer mix, season) as much as employee performance, and they are vulnerable to gaming when employees know they are being measured.

Turnover and tenure. Often used as a criterion for engagement and culture-fit assessments. Operationally meaningful (turnover is expensive) but multi-causal (people leave for compensation, family, market opportunities, manager problems), so correlations are usually modest.

Promotion rates and career progression. Used for leadership-potential assessments. Operationally meaningful but slow — predictive studies of leadership potential need three to five years to accumulate enough promotions to analyze, which is why this evidence is rare.

Training-program completion or grades. Used in educational and selection-for-development contexts. Fast to collect but measures a different construct than on-the-job performance.

A criterion-validity study reports a coefficient (typically a Pearson correlation, sometimes corrected for range restriction and criterion unreliability) between the assessment score and the criterion. The honest version reports the uncorrected coefficient, the corrected coefficient if applicable, the sample size, and the criterion’s reliability. The promotional version reports only the corrected coefficient.
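As a rough illustration of what the honest version carries, the sketch below computes an uncorrected Pearson coefficient and stores it next to the sample size and the criterion's reliability. The `ValidityReport` structure, its field names, and the toy data are hypothetical, invented for this example; they are not taken from any particular study or tool.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation between paired scores."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    return sxy / sqrt(sxx * syy)


@dataclass
class ValidityReport:
    """Hypothetical container for the figures an honest report shows together."""
    n: int
    r_uncorrected: float
    criterion_reliability: Optional[float] = None  # e.g. inter-rater reliability, if known
    r_corrected: Optional[float] = None  # filled in only when a correction was applied


# Hypothetical toy data: assessment scores and later performance ratings.
scores = [52, 61, 47, 70, 58, 66, 49, 73]
ratings = [3.1, 3.4, 2.9, 3.6, 3.8, 3.2, 3.0, 3.9]

report = ValidityReport(n=len(scores), r_uncorrected=pearson_r(scores, ratings))
print(report)
```

Keeping the uncorrected value and the sample size in the same record as any corrected value makes it harder to quote the corrected number on its own.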

What Counts as Strong Evidence

The interpretation of criterion-validity coefficients in organizational psychology is constrained by what’s achievable. The literature suggests:

  • Coefficients of 0.50 or higher are unusual in selection contexts and typically reflect either very narrow criteria (e.g., a knowledge test predicting test scores in a training program) or methodological inflation (small samples, correction artifacts, criterion contamination).
  • Coefficients of 0.30 to 0.50 are strong for selection assessments and put the instrument in the upper tier of validated tools. Cognitive ability tests typically fall here for general job performance.
  • Coefficients of 0.15 to 0.30 are typical for most personality-based selection assessments. The validity is real, the incremental prediction over alternative methods is meaningful, but the individual-level prediction is too noisy to use as a sole hiring criterion.
  • Coefficients below 0.15 are weak evidence and should not be used to justify selection use. Many off-the-shelf assessments quietly sit here.
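One way to see why even the middle bands are too noisy for individual decisions is to convert a coefficient into variance explained and into the spread of prediction errors. The sketch below assumes simple linear prediction of a standardized criterion; the helper names are made up for this example.

```python
from math import sqrt


def variance_explained(r: float) -> float:
    """Share of criterion variance accounted for by the assessment (r squared)."""
    return r ** 2


def residual_sd_fraction(r: float) -> float:
    """SD of linear prediction errors, as a fraction of the criterion SD."""
    return sqrt(1 - r ** 2)


for r in (0.15, 0.30, 0.50):
    print(
        f"r = {r:.2f}: variance explained = {variance_explained(r):.3f}, "
        f"residual SD = {residual_sd_fraction(r):.3f} criterion SDs"
    )
```

At r = 0.30 the assessment accounts for 9% of criterion variance, and the prediction errors are still about 95% as wide as the criterion's own spread. That is useful in aggregate across many hires and weak for any single candidate.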

The corrected versus uncorrected distinction matters. Corrections for range restriction (the assessment was used to make the hire, so the hired sample is non-random with respect to scores) and for criterion unreliability (supervisor ratings are noisy) routinely double the reported coefficient. A study that reports a corrected 0.50 might be reporting an uncorrected 0.25 — a difference that changes the operational decision about whether to deploy the assessment.
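To make the arithmetic concrete, here is a minimal sketch of the two corrections, using the classical correction for attenuation and Thorndike's Case II formula for direct range restriction. The input values (criterion reliability of 0.52, a restricted-to-unrestricted SD ratio of 0.60) are hypothetical, and the order of the corrections is an assumption of this sketch; real studies should justify both.

```python
from math import sqrt


def correct_for_criterion_unreliability(r: float, criterion_reliability: float) -> float:
    """Classical disattenuation for unreliability in the criterion only."""
    if not 0 < criterion_reliability <= 1:
        raise ValueError("criterion_reliability must be in (0, 1]")
    return r / sqrt(criterion_reliability)


def correct_for_direct_range_restriction(r: float, u: float) -> float:
    """Thorndike Case II. u = restricted SD / unrestricted SD of assessment scores."""
    if not 0 < u <= 1:
        raise ValueError("u must be in (0, 1]")
    big_u = 1 / u  # unrestricted SD relative to restricted SD
    return (r * big_u) / sqrt(1 + r ** 2 * (big_u ** 2 - 1))


# Hypothetical inputs, chosen only to illustrate the size of the effect.
r_observed = 0.25
assumed_criterion_reliability = 0.52
assumed_sd_ratio = 0.60

r_step1 = correct_for_criterion_unreliability(r_observed, assumed_criterion_reliability)
r_corrected = correct_for_direct_range_restriction(r_step1, assumed_sd_ratio)

print(f"uncorrected: {r_observed:.2f}")
print(f"after criterion-unreliability correction: {r_step1:.2f}")
print(f"after range-restriction correction: {r_corrected:.2f}")
```

With these assumed inputs the corrected value comes out at roughly twice the observed one. Both inputs are usually estimates rather than measurements, so the corrected number is only as trustworthy as they are.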

What Goes Wrong

Criterion-validity studies fail in three recurring ways:

The criterion is the wrong thing. A leadership-potential assessment is validated against supervisor ratings, which reflect supervisor preferences as much as leadership behavior. The assessment correlates with the ratings, but the ratings are correlated with politeness, agreement with the supervisor’s worldview, and similarity to the supervisor. The validity coefficient is real but it is partly measuring social compatibility, not leadership.

The criterion isn’t measured well. Typical cases are performance ratings reported without inter-rater agreement statistics, sales numbers without territory normalization, and training-completion rates without selection-into-training controls. Showing that the instrument predicts a criterion means little if the criterion itself is unreliable.

The sample is too restricted. The validation study is conducted on people who were already hired, which means the assessment-score range in the sample is truncated relative to the applicant pool. Range restriction biases the observed correlation downward, and the correction for it can overstate the population value when the assumed applicant-pool variance is wrong. The honest interpretation requires reporting both the restricted-sample and population-corrected estimates with their assumptions visible.

A separate failure mode is statistical: the criterion-validity study is run on a small sample (n < 100), produces a coefficient with wide confidence intervals, and gets reported as if the point estimate were precise. A coefficient of 0.32 with a 95% CI from 0.08 to 0.52 is not the same evidence as a coefficient of 0.32 from a study with n = 1,000 and a CI from 0.26 to 0.37, but they often get reported identically in vendor materials.
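The width of those intervals can be checked with the standard Fisher z approximation. The sketch below is one common way to do it, not necessarily how any given vendor computes intervals. The small-sample size of 65 is an assumption chosen to give an interval of about that width; the text above only says n < 100.

```python
from math import atanh, sqrt, tanh
from typing import Tuple


def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for a Pearson r via the Fisher z transform."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    z = atanh(r)            # transform r to the z scale
    se = 1 / sqrt(n - 3)    # standard error on the z scale
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


for n in (65, 1000):  # 65 is an assumed small-sample size
    low, high = fisher_ci(0.32, n)
    print(f"n = {n}: r = 0.32, 95% CI from {low:.2f} to {high:.2f}")
```

The same point estimate is compatible with a nearly useless instrument at the small sample size and with a solidly mid-tier one at the large sample size, which is why the interval belongs next to the coefficient in any report.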

Where Off-the-Shelf Instruments Quietly Fail

Most commercial assessments report some criterion-validity evidence in their technical manuals. The quality is highly variable. Common gaps:

  • The criterion-validity studies are five to twenty years old, run on populations that no longer match the current user base.
  • The criterion is a different construct than what the buyer assumes (e.g., the assessment was validated against training-program performance and is being sold for on-the-job performance prediction).
  • The reported coefficients are corrected for both range restriction and criterion unreliability with assumed values, producing inflated numbers that don’t reflect what the buyer will see in their own data.
  • The sample sizes for the predictive (as opposed to concurrent) evidence are small, and most of the evidence base is concurrent.

The honest move for an organization adopting an assessment is to run a local validation study against the criteria that matter in that organization, not to rely on vendor-reported evidence. This is expensive and most organizations don’t do it, which is one reason the validity literature is dominated by published academic studies on convenience samples rather than operational evidence from the actual use sites.

Why It’s the One That Matters

Criterion validity is the validity type that the other types build toward. Construct validity without criterion validity is internally coherent measurement of something that doesn’t connect to outcomes. Convergent validity without criterion validity is agreement with other measures of the same internally-coherent-but-not-useful thing. Discriminant validity without criterion validity is a clean factor structure that doesn’t predict anything. The instrument can pass every other validity test and still fail at the only test the stakeholder cares about.

In the work I’ve done at Gyfted, the order of operations is to establish construct validity first (the measure has clean factor structure, internal consistency, discriminant validity from neighboring constructs) and then close on criterion validity against the specific outcomes the deploying organization cares about. The order matters because criterion-validity studies are expensive enough that running them on a poorly constructed measure is wasteful, but they’re necessary enough that no organization should adopt an instrument without local criterion evidence. The construct work is the prerequisite; the criterion work is the contract.
