
Situational Judgment Test (SJT)


The fundamental problem with self-report personality assessment is that it asks respondents to introspect on traits in the abstract. Most people are not particularly good at that, and the people who are best at it are not necessarily the people who behave best in actual work situations. Situational judgment tests address the gap by skipping the abstraction. The respondent reads a scenario — a real-feeling moment from a real-feeling job — and decides what to do or what to evaluate as the best response. The data are richer because the question is closer to the territory the assessment is supposed to predict.

SJTs have been in use since the 1940s and were formalized in industrial-organizational psychology research from the 1990s onward (Motowidlo, Dunnette, and Carter 1990 is the canonical paper that reintroduced the format to applied work). They are now standard in selection assessments for managerial roles, customer-facing roles, healthcare, and any context where practical judgment under realistic conditions matters more than abstract trait standing.

How an SJT Works

A typical item presents a paragraph of context — a specific workplace situation, with enough detail that the respondent can imagine being in it — followed by a set of possible responses to that situation. The item asks the respondent either to pick the best response (best-answer format), to pick best and worst (most-likely / least-likely format), or to rate each response on a scale (Likert format on each option).

A short example, simplified:

A team member misses a deadline that affects your work. They mention they have been struggling with personal issues. What would you most likely do?

A. Help them complete the work and discuss process improvements later.
B. Escalate the missed deadline to your manager immediately.
C. Privately check in on how they are doing before discussing the deadline.
D. Continue your own work and let them handle the consequences.

Scoring is the part that distinguishes SJTs from ordinary multiple choice. There is rarely a single objectively correct answer in the way there is for a math problem. The “correct” answer is determined by some combination of subject-matter expert consensus, top-performer responses, theoretical best-practice frameworks, and empirically derived response keys validated against criterion outcomes.
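
To make key-based scoring concrete, here is a minimal sketch for a most-likely / least-likely item, using the deadline scenario above. The effectiveness ratings are hypothetical stand-ins for what an SME panel or top-performer survey would supply, and the rescaling is one plausible convention, not a canonical formula.

```python
# Consensus-key scoring for one most-likely / least-likely SJT item.
# The ratings are hypothetical: in practice they come from subject-matter
# expert panels or top-performer surveys.

# Mean SME effectiveness rating per option (1 = ineffective, 5 = effective),
# invented here for the deadline scenario.
sme_key = {"A": 4.1, "B": 2.0, "C": 4.6, "D": 1.4}

def score_most_least(most: str, least: str, key: dict[str, float]) -> float:
    """Score a most/least response pair against an SME-derived key.

    Credit for 'most' rises with the chosen option's rated effectiveness;
    credit for 'least' rises as the chosen option's rating falls. The pair
    is rescaled so 0 is the worst possible choice and 1 the best.
    """
    lo, hi = min(key.values()), max(key.values())
    most_credit = (key[most] - lo) / (hi - lo)    # picking the top-rated option -> 1
    least_credit = (hi - key[least]) / (hi - lo)  # rejecting the bottom-rated option -> 1
    return (most_credit + least_credit) / 2

print(score_most_least(most="C", least="D", key=sme_key))  # 1.0: matches the key
print(score_most_least(most="B", least="A", key=sme_key))  # ~0.17: contradicts it
```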

What SJTs Actually Measure

This is the most contested part. SJTs are often described as measuring “judgment” or “competence” in the targeted area, but psychometric research shows the underlying construct is multidimensional and not always cleanly separable from other constructs.

Tacit knowledge. The implicit, experience-based knowledge of how to act in specific situations that does not transfer well to abstract questioning. This is the construct most defensibly measured by well-designed SJTs.

Personality and values traits. SJT responses correlate measurably with conscientiousness, agreeableness, and emotional stability. The scenario-based format gives these traits a behavioral anchor, which often makes the SJT a better predictor of behavior than the underlying personality scale alone.

General cognitive ability. SJTs require reading, comprehension, and evaluation, which produces small-to-moderate correlations with cognitive ability. This is not necessarily a weakness — practical judgment in complex scenarios genuinely benefits from cognitive ability — but it does mean SJTs are not pure measures of personality or values uncontaminated by ability.

Job-relevant knowledge. When SJTs are role-specific, performance partly reflects what respondents know about the role’s typical situations. This is the construct an SJT is often designed to measure in selection contexts.

The mix of constructs is usually a feature rather than a bug: the SJT predicts performance because performance depends on all of those constructs, and the SJT format captures their interaction in ways that single-construct instruments do not. The cost is that interpreting an SJT score as a measure of any one specific construct is usually overreach. The score is best understood as a measure of role-relevant judgment, with the underlying contributions varying across SJTs and across populations.

Why They Predict Performance

The empirical case for SJTs is strong. Meta-analyses (McDaniel et al. 2001, Christian et al. 2010) show consistent operational validity coefficients in the 0.20-0.35 range against job performance, comparable to or better than personality assessment alone, and the addition of an SJT to a personality battery produces incremental validity over the personality measures.
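
What incremental validity means operationally: regress the criterion on personality alone, then on personality plus the SJT, and compare R². A minimal sketch on simulated data follows; the coefficients and sample size are invented for illustration and are not taken from the cited meta-analyses.

```python
# Incremental validity sketch: does adding an SJT score to a personality
# predictor improve prediction of job performance? All data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 500
personality = rng.normal(size=n)
sjt = 0.4 * personality + rng.normal(size=n)          # SJT overlaps with personality
performance = 0.3 * personality + 0.3 * sjt + rng.normal(size=n)

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_p = r_squared(personality[:, None], performance)
r2_ps = r_squared(np.column_stack([personality, sjt]), performance)
print(f"R^2, personality only:  {r2_p:.3f}")
print(f"R^2, personality + SJT: {r2_ps:.3f}")
print(f"incremental validity (delta R^2): {r2_ps - r2_p:.3f}")
```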

The reason is that the SJT format reduces the gap between assessment and target behavior. Personality assessment asks about abstract traits; the prediction has to bridge from abstract trait to concrete behavior in a specific role. Cognitive assessment measures general ability; the prediction has to bridge from general ability to performance in a specific job. SJTs measure responses to scenarios that resemble the target behavior directly. The bridge is shorter, and the prediction tends to be sharper.

SJTs are also more resistant to faking than typical Likert personality measurement, though not as resistant as proper forced-choice formats. The scenario-based format makes “what should I say” less obvious to respondents, particularly when the response options are written so that multiple options are defensible and the differences are subtle. Studies on coaching effects and motivated responding show smaller score inflation under SJT formats than under transparent personality measurement (Hooper et al. 2006, Nguyen et al. 2005).

Construction Is the Hard Part

The format is straightforward. The construction is not. A defensible SJT requires:

Scenarios drawn from job analysis. The situations have to be representative of the role, not generic. Generic SJT scenarios produce generic SJT scores that don’t predict role-specific outcomes. Job analysis interviews with current incumbents and managers are the standard input.

Response options at calibrated effectiveness levels. Each scenario needs response options that span a range of effectiveness, ideally with one or two clearly best responses, one or two clearly poor responses, and one or two intermediate responses. Items where every option is similarly good or similarly bad don’t discriminate.

Empirical or expert-derived scoring keys. The scoring key is the SJT’s measurement instrument. Subject-matter experts rate the response options, top-performing job incumbents are surveyed, or both. The agreement among experts (typically reported as inter-rater reliability) is the basic check on whether the key is defensible; a minimal version of that check is sketched after this list.

Validation against criteria. The full construct-validity argument applies. SJT scores need to correlate with appropriate criteria (job performance, role-specific outcomes), to show discriminant validity from constructs the SJT is not supposed to measure, and to demonstrate generalizability across the populations the SJT will be used in.
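
On the key-agreement check flagged above: a quick screen is the mean pairwise correlation among SME raters' effectiveness ratings; formal reporting would use an intraclass correlation. A minimal sketch with hypothetical ratings:

```python
# Quick screen on scoring-key defensibility: do SME raters agree on option
# effectiveness? Ratings are hypothetical; formal work reports an ICC.
import numpy as np
from itertools import combinations

# Rows = SME raters, columns = response options (pooled across scenarios).
ratings = np.array([
    [4, 2, 5, 1, 3, 4, 2, 5],  # rater 1
    [4, 1, 5, 2, 3, 5, 2, 4],  # rater 2
    [5, 2, 4, 1, 2, 4, 3, 5],  # rater 3
], dtype=float)

pairwise = [np.corrcoef(ratings[i], ratings[j])[0, 1]
            for i, j in combinations(range(len(ratings)), 2)]
print(f"mean pairwise inter-rater correlation: {np.mean(pairwise):.2f}")
# Low agreement means the key is not defensible yet: revise the options or
# the rating instructions before deriving a scoring key from these data.
```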

The construction effort is the reason custom SJTs are expensive, and the reason most organizations rely on commercially available SJTs unless the role and the decision are consequential enough to justify a custom build.

In the Custom Work I Have Built

In the gr8.tech values assessment I designed at Gyfted, the SJT layer sat alongside forced-choice and ranking formats. Forced choice and ranking pushed respondents to express priorities among values; the SJT pushed them to translate those priorities into behavior in concrete workplace scenarios calibrated to the company’s actual operating contexts. The combination distinguishes candidates whose stated values align with their behavioral instincts from candidates who can articulate the values without acting on them.

The construction work for that SJT layer was the most time-intensive part of the engagement. Job-analytic interviews, scenario drafting, response option calibration, expert review, pilot testing, scoring-key derivation, and item statistics had to happen sequentially before the SJT was production-ready. The payoff was an instrument that produced defensible behavioral predictions where a Likert version of the same content would have produced ceiling-saturated noise.
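
The item-statistics step at the end of that pipeline is mostly discrimination checks. Here is a minimal sketch of the standard one, the corrected item-total correlation, on simulated pilot data; the 0.20 review threshold is a common rule of thumb, not a fixed standard.

```python
# Pilot-phase item statistics: flag SJT items that do not discriminate.
# Data are simulated; a real pilot would use scored responses.
import numpy as np

rng = np.random.default_rng(1)
n_respondents, n_items = 300, 10
judgment = rng.normal(size=n_respondents)            # latent judgment
loadings = np.linspace(0.1, 0.7, n_items)            # item quality varies by design
item_scores = judgment[:, None] * loadings + rng.normal(size=(n_respondents, n_items))

total = item_scores.sum(axis=1)
for j in range(n_items):
    rest = total - item_scores[:, j]                 # exclude the item itself
    r = np.corrcoef(item_scores[:, j], rest)[0, 1]
    flag = "  <- review" if r < 0.20 else ""
    print(f"item {j + 1:2d}: corrected item-total r = {r:.2f}{flag}")
```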

The general lesson: SJTs reward the construction effort with predictive validity that is hard to reach by other means. They do not reward shortcuts. A poorly constructed SJT — generic scenarios, untested keys, no validation evidence — looks the same on the surface as a well-constructed one, but it does not predict, and its scores carry the same false confidence as any other under-validated assessment. The format is good. The work to make it good in any specific application is the same work as building any other psychometric instrument: slow, methodologically rigorous, and not skippable.
