Blazej Mrozinski

Item Response Theory (IRT)

Most measurement frameworks treat a test score as a single number: you got 23 out of 30 right. Item Response Theory does something fundamentally different — it models the interaction between a person and each individual item, separately, and uses that to estimate where the person sits on an underlying trait continuum. The result is a more precise, more portable, and more informative picture of what a test actually measures.

What IRT Actually Does

IRT posits that the probability of a person answering an item correctly (or endorsing a statement, or choosing a particular response option) is a function of two things: the person’s level on the latent trait being measured, and the statistical properties of the item itself.

That “latent trait” might be numerical reasoning, conscientiousness, introversion, or anything else you’re trying to measure. IRT puts both people and items on the same scale — usually expressed as a theta (θ) score ranging roughly from -3 to +3 — which makes it possible to say something like “this item is well-calibrated for distinguishing people in the average range, but provides almost no information at the extremes.”
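
In the dichotomous case, that function is typically an S-shaped logistic curve. Here is a minimal sketch, assuming the two-parameter logistic form described further below, with purely illustrative parameter values:

```python
import math

def p_correct(theta, a, b):
    """Probability of a correct response as a function of the person's
    trait level (theta) and two item properties: discrimination (a)
    and difficulty (b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# One item of average difficulty, five people spread across the theta scale:
for theta in (-3, -1.5, 0, 1.5, 3):
    print(f"theta={theta:+.1f}  P(correct)={p_correct(theta, a=1.5, b=0.0):.2f}")
```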

This is the core insight that separates IRT from Classical Test Theory: IRT treats the quality of measurement as a function of where on the trait scale you’re measuring, not as a single average property of the test.

How IRT Differs from Classical Test Theory

Classical Test Theory (CTT) works at the level of the whole test. It gives you reliability coefficients, item difficulty proportions, and item-total correlations. Those statistics are useful, but they’re entangled with the sample you used to compute them. Administer a hard test to a high-ability group and it looks easy. The numbers shift.

IRT parameters, once estimated on a sufficiently representative sample, are theoretically sample-independent. An item’s difficulty parameter stays stable whether you’re running it on students in Warsaw or São Paulo, as long as you’re measuring the same construct. That stability is what makes large-scale standardized testing and adaptive assessment feasible.

The Key Parameters

IRT models vary in complexity. The most commonly used are:

1PL (Rasch model): Items differ only in difficulty (b parameter). All items are assumed equally discriminating. Clean, elegant, but often too constrained for real data.

2PL model: Adds a discrimination parameter (a). Some items are better than others at distinguishing people who are close together on the trait scale. A highly discriminating item produces a steep S-curve; a poorly discriminating item produces a flat one.

3PL model: Adds a pseudo-guessing parameter (c) — the lower asymptote of the curve. Useful for multiple-choice tests where a low-ability person still has some probability of getting the right answer by chance.

For personality and non-cognitive assessments, the Graded Response Model (GRM) and related polytomous models extend IRT to ordered response categories (e.g., Likert scales).
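
As a rough illustration of how these parameters enter the models, here is a sketch of the 3PL curve, with the 2PL and 1PL as special cases, alongside GRM category probabilities. The function names and parameter values are mine, chosen for illustration rather than taken from any calibrated item bank:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def icc_3pl(theta, a, b, c=0.0):
    """3PL item characteristic curve. Setting c = 0 gives the 2PL; additionally
    constraining a to be equal across items gives the 1PL/Rasch model."""
    return c + (1.0 - c) * logistic(a * (theta - b))

def grm_probs(theta, a, thresholds):
    """Graded Response Model sketch: category probabilities for an ordered-response
    item, computed as differences of cumulative boundary curves.
    thresholds holds one ordered boundary difficulty per category boundary."""
    cum = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# A hard multiple-choice item where guessing matters:
print(round(icc_3pl(theta=-2.0, a=1.8, b=1.0, c=0.25), 2))  # near the guessing floor
print(round(icc_3pl(theta=1.0, a=1.8, b=1.0, c=0.25), 2))   # at b: halfway between c and 1

# A four-category Likert-type item:
print([round(p, 2) for p in grm_probs(theta=0.5, a=1.8, thresholds=[-1.0, 0.0, 1.2])])
```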

Item Characteristic Curves

Each item in an IRT framework is characterized by an Item Characteristic Curve (ICC) — an S-shaped function that shows the probability of a correct response at each level of theta. The steeper the curve, the better the item discriminates. The horizontal position of the inflection point marks the item’s difficulty.

These curves let you see immediately whether an item is well-targeted. An item whose inflection sits at theta = +2.5 is useful only for the top few percent of your test population. Using it to assess average-range respondents wastes everyone’s time.
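
To see what poor targeting looks like numerically, take a hypothetical 2PL item with its inflection at theta = +2.5 and an assumed discrimination of 1.5:

```python
import math

def p_correct(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An item whose inflection point sits at theta = +2.5:
for theta in (-2, -1, 0, 1, 2, 3):
    print(f"theta={theta:+d}  P(correct)={p_correct(theta, a=1.5, b=2.5):.2f}")
# Below roughly theta = +1 the success probability stays under 0.10, so the item
# barely separates average-range respondents from one another.
```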

Information Functions

One of the most practically useful features of IRT is the item information function. Each item contributes a specific amount of measurement precision at each point on the trait scale. Sum those functions across all items and you get the test information function — a curve showing where your test measures well and where it measures poorly.

A well-designed test concentrates information at the trait levels that matter most for your use case. If you’re trying to screen for minimum job competency, you want information concentrated at the threshold, not spread across the full range.
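
A sketch of how item information composes into test information under the 2PL, where an item's information at theta is a^2 * P(theta) * (1 - P(theta)) and test information is the sum across items; the item pool below is invented for illustration:

```python
import math

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information, a^2 * P * (1 - P), which peaks at theta = b."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def test_information(theta, items):
    """Test information: the sum of item informations at a given theta."""
    return sum(item_information(theta, a, b) for a, b in items)

# A small invented pool of (discrimination, difficulty) pairs:
items = [(1.5, -1.0), (1.2, -0.5), (2.0, 0.0), (1.8, 0.5), (1.0, 2.0)]

for theta in (-2, -1, 0, 1, 2):
    print(f"theta={theta:+d}  test information={test_information(theta, items):.2f}")
# This pool measures best near the middle of the scale and far less precisely
# at theta = +2, despite containing one hard item.
```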

Adaptive Testing

Item information functions are the foundation of computerized adaptive testing (CAT). In a CAT, the algorithm selects the next item based on the current estimate of the respondent’s trait level — always choosing the item that provides the most information at that estimated level. The result is a shorter test that measures with precision equivalent to or better than that of a fixed-length test.

This is not a theoretical nicety. In practice, CAT can reduce test length by 50% or more while maintaining measurement accuracy. For large-scale assessments, that translates directly into a better candidate experience, higher completion rates, and lower costs.
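
A stripped-down sketch of the selection step only, assuming a 2PL item bank; real CAT engines add exposure control, content balancing, and a proper theta update after every response, none of which is shown here:

```python
import math

def item_information(theta, a, b):
    """2PL item information at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_next_item(theta_estimate, item_bank, administered):
    """Pick the not-yet-administered item that is most informative
    at the current provisional theta estimate."""
    candidates = [i for i in range(len(item_bank)) if i not in administered]
    return max(candidates,
               key=lambda i: item_information(theta_estimate, *item_bank[i]))

# Invented bank of (discrimination, difficulty) pairs:
bank = [(1.5, -2.0), (1.4, -1.0), (1.8, 0.0), (1.6, 1.0), (1.7, 2.0)]

# Early on, the provisional estimate is near 0, so a mid-difficulty item wins;
# once the estimate moves up the scale, the selection follows it.
print(select_next_item(theta_estimate=0.0, item_bank=bank, administered=set()))  # item 2
print(select_next_item(theta_estimate=1.8, item_bank=bank, administered={2}))    # item 4
```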

Applications in Real Assessment Work

I apply IRT regularly in assessment work at Gyfted and in tools like the HSE Career Quiz. The practical uses include:

  • Item bank development: Calibrating items so you can draw from a pool and assemble tests with known psychometric properties
  • Differential Item Functioning (DIF) analysis: Detecting items that behave differently across demographic groups — an essential fairness check
  • Score equating: Making scores comparable across different test forms
  • Trait estimation: Generating person scores that account for item-specific properties, rather than treating every correct answer as equivalent (see the scoring sketch after this list)
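
To make that last point concrete, here is a minimal sketch of one common estimator, expected a posteriori (EAP) scoring over a theta grid with a standard normal prior; the items and response patterns are invented for illustration:

```python
import math

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_theta(responses, items, grid_step=0.05):
    """Expected a posteriori trait estimate for a 2PL response pattern.
    responses: list of 0/1 item scores; items: matching (a, b) pairs.
    Uses an unnormalized standard normal prior over a fixed theta grid."""
    grid = [-4.0 + grid_step * i for i in range(int(8.0 / grid_step) + 1)]
    posterior = []
    for theta in grid:
        prior = math.exp(-0.5 * theta * theta)
        likelihood = 1.0
        for x, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            likelihood *= p if x == 1 else (1.0 - p)
        posterior.append(prior * likelihood)
    total = sum(posterior)
    return sum(t * w for t, w in zip(grid, posterior)) / total

items = [(1.5, -1.0), (1.2, -0.5), (2.0, 0.0), (1.8, 0.5), (1.0, 1.5)]

# Two respondents with the same raw score (3 of 5 correct) but different
# response patterns get different theta estimates, because the items they
# answered correctly differ in difficulty and discrimination.
print(round(eap_theta([1, 1, 0, 0, 1], items), 2))
print(round(eap_theta([0, 0, 1, 1, 1], items), 2))
```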

IRT requires more data to estimate parameters than CTT — typically several hundred to a few thousand respondents per item, depending on the model. That’s a real constraint for early-stage tools. But once parameters are estimated and validated, IRT-based assessments are more precise, more portable, and more transparent about where they work and where they don’t.
