Discriminant Validity in Practice: How to Tell a New Construct From a Renamed One
Every few months the management literature names a new psychological construct. Trust ambiguity, agency erosion, AI anxiety, quiet quitting, grit, languishing. Each arrives with a definition, a prescription, and the implicit claim that it picks out something the existing vocabulary missed. Some of them do. Most of the durable constructs in psychology earned their place by surviving a specific test, and the test is older and more demanding than the rate of new construct names suggests.
The test is discriminant validity, and it answers a question that sounds trivial until you try to settle it with data: is this a new thing, or is it an old thing wearing a new name? The question matters well beyond academic psychology. If you run an engagement survey, build a hiring assessment, or read the leadership press for ideas worth acting on, you are constantly being offered new constructs to measure and manage. Knowing which ones are real is the difference between adding signal and adding a synonym.
What follows is what discriminant validity actually requires, the three procedures that test it, one construct that failed the test in public, and one current construct that has not been tested yet but invites it.
Table of Contents
- The jangle fallacy
- Construct validity has two halves
- The three tests for discriminant validity
- A worked numerical example
- A construct that failed in public: grit and conscientiousness
- A construct that has not been tested: trust ambiguity
- What to ask before you accept a new construct
- The discipline behind the question
The jangle fallacy
In 1927 the psychometrician Truman Kelley described two opposite mistakes that researchers make with construct names. The jingle fallacy is assuming that two things called by the same name are the same construct. The jangle fallacy is the reverse: assuming that two things called by different names are different constructs. The jangle fallacy is the one that produces the steady supply of fresh construct labels, because inventing a name is cheap and demonstrating distinctiveness is expensive.
A new construct that commits the jangle fallacy is not useless. It can still be a vivid way of pointing at something, and the prescriptions attached to it can still be sensible. The problem is narrower and more practical. If the new construct is empirically the same as one we already measure, then measuring both is redundant, any instrument built for the new one inherits the validity evidence of the old one rather than earning its own, and any claim that the new construct predicts an outcome “over and above” the old one is almost certainly false. Redundancy is not a moral failing. It is a measurement cost that someone pays downstream, usually the team that builds a survey around the new label and then cannot explain why it correlates 0.9 with the survey they already had.
Discriminant validity is the formal answer to whether a label has earned its independence. It was named in 1959 by Donald Campbell and Donald Fiske, in the paper that introduced the multitrait-multimethod matrix, and it has been a precondition for taking a construct seriously ever since.
Construct validity has two halves
Construct validity is the broad question of whether an instrument measures the theoretical thing it claims to measure. Cronbach and Meehl set the modern terms for it in 1955, with the idea that a construct lives inside a nomological net of expected relationships, and that you validate it by checking whether the data honor those expectations. Campbell and Fiske made one part of that net concrete and testable, and split it into two requirements that have to hold together.
Convergent validity is the requirement that measures which should be related actually are. Items meant to tap the same construct should hang together, and different measures of the same construct should correlate. If you write ten items for a construct and they do not cohere, you do not have a measure of one thing.
Discriminant validity is the requirement that measures which should be unrelated, or only modestly related, actually are. Your construct should correlate less with measures of different constructs than it does with measures of itself. This is the half that new constructs routinely skip, because it is the half that can embarrass the construct.
The two halves are not independent tests you can pass one at a time. A construct can show beautiful convergent validity, with items that cohere tightly, and still fail discriminant validity because the tight-knit thing those items measure turns out to be a construct that already has a name. Convergent validity tells you the items measure one thing. Discriminant validity tells you whether that one thing is new. You need both, and the second is the one that is usually missing.
The three tests for discriminant validity
There are three procedures in common use. They escalate in rigor, and a serious construct-validation effort runs more than one.
The Fornell-Larcker criterion. Proposed by Claes Fornell and David Larcker in 1981, this works from a quantity called the average variance extracted, or AVE. The AVE of a construct is the average of the squared standardized loadings of its items, which is the share of the items’ variance that the latent construct accounts for rather than measurement error. An AVE of 0.50 or higher is the usual convergent-validity threshold, because it means the construct accounts for at least half of the variance in its indicators, so error never outweighs the construct. The Fornell-Larcker criterion for discriminant validity is then simple to state: the square root of a construct’s AVE should be larger than that construct’s correlation with any other construct in the model. Equivalently, the AVE itself should be larger than the squared correlation. Put the other way, each construct should share more variance with its own indicators than it shares with any neighbor. When the square root of AVE drops below an inter-construct correlation, the two constructs are sharing more variance with each other than one of them shares with its own items, and the claim of distinctness is in trouble.
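The criterion is short enough to sketch in a few lines of Python. The loadings and function names below are hypothetical, chosen only to illustrate the arithmetic; they do not come from any study.

```python
import math

def average_variance_extracted(loadings):
    """AVE: the mean of the squared standardized loadings."""
    return sum(l * l for l in loadings) / len(loadings)

def fornell_larcker_passes(loadings_a, loadings_b, latent_corr):
    """True when sqrt(AVE) of BOTH constructs exceeds their correlation."""
    sqrt_ave_a = math.sqrt(average_variance_extracted(loadings_a))
    sqrt_ave_b = math.sqrt(average_variance_extracted(loadings_b))
    return min(sqrt_ave_a, sqrt_ave_b) > abs(latent_corr)

# Hypothetical standardized loadings for a new and an established construct.
loadings_new = [0.80, 0.75, 0.70, 0.75]
loadings_old = [0.85, 0.80, 0.80, 0.75]

ave_new = average_variance_extracted(loadings_new)
ave_old = average_variance_extracted(loadings_old)
print(f"AVE new = {ave_new:.3f}, sqrt = {math.sqrt(ave_new):.3f}")
print(f"AVE old = {ave_old:.3f}, sqrt = {math.sqrt(ave_old):.3f}")

# The same pair of constructs under two hypothetical latent correlations.
print("r = 0.86:", fornell_larcker_passes(loadings_new, loadings_old, 0.86))
print("r = 0.40:", fornell_larcker_passes(loadings_new, loadings_old, 0.40))
```

Both constructs clear the 0.50 convergent-validity threshold here, which is the point: convergent validity can look fine while the discriminant check still fails at the higher correlation.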
The HTMT ratio. The heterotrait-monotrait ratio of correlations, introduced by Jörg Henseler, Christian Ringle, and Marko Sarstedt in 2015, exists because the Fornell-Larcker criterion has a known blind spot: in their simulations it frequently failed to flag discriminant-validity problems that were really there, especially when the item loadings were all of similar size. HTMT estimates what the correlation between two constructs would be if both were measured without error. It does so by dividing the average correlation between items of different constructs by the geometric mean of the average correlations among items within each construct. A value near 1.0 says the two constructs are, after correcting for measurement error, the same construct. Henseler and colleagues suggested a threshold of 0.85 for constructs that are supposed to be conceptually distinct, and a more lenient 0.90 for constructs that are conceptually close to begin with. The value of HTMT is that it catches the relabelings the Fornell-Larcker criterion lets through, so the two are often reported together.
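As a rough sketch, HTMT can be computed directly from an item correlation matrix: the average between-construct item correlation divided by the geometric mean of the two average within-construct item correlations. The matrix and the function name here are hypothetical, and the sketch assumes every item is keyed in the same direction so that all correlations are positive.

```python
import math
from itertools import combinations

def htmt(corr, items_a, items_b):
    """Heterotrait-monotrait ratio for two blocks of items.

    corr: square item correlation matrix as a list of lists
    items_a, items_b: row/column indices of each construct's items
    """
    def mean(values):
        return sum(values) / len(values)

    # Correlations between items of different constructs.
    hetero = [corr[i][j] for i in items_a for j in items_b]
    # Correlations among items of the same construct.
    mono_a = [corr[i][j] for i, j in combinations(items_a, 2)]
    mono_b = [corr[i][j] for i, j in combinations(items_b, 2)]
    return mean(hetero) / math.sqrt(mean(mono_a) * mean(mono_b))

# Hypothetical data: three items per construct, within-construct
# correlations of 0.60 and between-construct correlations of 0.55.
R = [
    [1.00, 0.60, 0.60, 0.55, 0.55, 0.55],
    [0.60, 1.00, 0.60, 0.55, 0.55, 0.55],
    [0.60, 0.60, 1.00, 0.55, 0.55, 0.55],
    [0.55, 0.55, 0.55, 1.00, 0.60, 0.60],
    [0.55, 0.55, 0.55, 0.60, 1.00, 0.60],
    [0.55, 0.55, 0.55, 0.60, 0.60, 1.00],
]

value = htmt(R, items_a=[0, 1, 2], items_b=[3, 4, 5])
print(f"HTMT = {value:.3f}")
print("above 0.85:", value > 0.85, "| above 0.90:", value > 0.90)
```

The item-level correlations in this made-up matrix never exceed 0.60, yet the ratio lands above 0.90, because items from the two constructs correlate almost as strongly with each other as items within a construct do.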
Nested model comparison in confirmatory factor analysis. This is the most direct test, and it uses confirmatory factor analysis to ask the question as a model comparison. You fit one model in which the two constructs are allowed to correlate freely, and a second model in which their correlation is fixed at 1.0, which is the statistical statement that they are a single construct. If forcing the correlation to 1.0 makes the model fit significantly worse, by a chi-square difference test, the data prefer to keep the two constructs separate and you have evidence of distinctness. If fixing the correlation at 1.0 costs you almost nothing in fit, the data are telling you the two factors are one. James Anderson and David Gerbing laid this out as part of their two-step modeling approach in 1988, and a common refinement is to check whether the confidence interval around the freely estimated correlation excludes 1.0.
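The comparison step itself is simple once the two models have been fitted in an SEM package. The sketch below covers only that last step, a chi-square difference test with one degree of freedom, using the standard library. The fit statistics are hypothetical placeholders, not output from any real model, and the function names are illustrative.

```python
import math

def chi_square_sf_1df(x):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def chi_square_difference(chi2_fixed, df_fixed, chi2_free, df_free):
    """Compare a model with the correlation fixed at 1.0 to the free model."""
    if df_fixed - df_free != 1:
        raise ValueError("this sketch handles a single fixed parameter only")
    # Clamp at zero: a nested model should not fit better than its parent.
    delta = max(chi2_fixed - chi2_free, 0.0)
    return delta, chi_square_sf_1df(delta)

# Hypothetical free model, in which the two factors correlate freely.
chi2_free, df_free = 112.4, 34

# Scenario 1: fixing the correlation at 1.0 barely changes fit.
delta_same, p_same = chi_square_difference(113.6, 35, chi2_free, df_free)
# Scenario 2: fixing the correlation at 1.0 damages fit badly.
delta_diff, p_diff = chi_square_difference(158.9, 35, chi2_free, df_free)

print(f"scenario 1: delta chi2 = {delta_same:.1f}, p = {p_same:.3f}")
print(f"scenario 2: delta chi2 = {delta_diff:.1f}, p = {p_diff:.2g}")
```

In the first scenario the constraint costs almost nothing, so the data give no reason to keep two factors. In the second the constrained model fits significantly worse, which is the evidence of distinctness the test is looking for.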
The three tests can disagree at the margins, which is itself informative. A construct that passes Fornell-Larcker but fails HTMT and the nested comparison is a construct in trouble, and the reason it scraped past the first test is usually the blind spot Henseler and colleagues built HTMT to cover.
A worked numerical example
Numbers make the procedure concrete. The figures in the table below are illustrative, chosen to show what a pass and a fail look like rather than drawn from any real study. Suppose a researcher proposes a new construct, call it Construct N, and measures it alongside an established neighbor, Construct E, and a third construct, Construct U, that nobody expects to be related to either.
| Construct pair | Correlation r | √AVE (lower of the two) | Fornell-Larcker | HTMT | Verdict |
|---|---|---|---|---|---|
| N with E (the suspected twin) | 0.86 | 0.75 | Fails (0.75 < 0.86) | 0.91 | Not distinct |
| N with U (the unrelated check) | 0.19 | 0.75 | Passes (0.75 > 0.19) | 0.22 | Distinct |
Read the first row. The new construct correlates 0.86 with its established neighbor. The square root of the new construct’s AVE, the lower of the pair, is 0.75, which is below that 0.86 correlation, so the Fornell-Larcker criterion fails: Construct N is sharing more variance with Construct E than with its own items. HTMT comes in at 0.91, above even the lenient 0.90 threshold, which says that once you correct for measurement error the two constructs are effectively the same. A nested CFA on these data would almost certainly show that fixing the N-to-E correlation at 1.0 barely dents model fit. Three tests, one conclusion: Construct N is Construct E with a new label.
The second row is the control that keeps you honest. Against an unrelated construct, the same new measure behaves exactly as a distinct construct should, correlating weakly and clearing every threshold with room to spare. Discriminant validity is never a property of a construct in isolation. It is always a property of a construct relative to its specific neighbors, and the neighbors that matter are the ones a skeptic would propose as the thing your construct is secretly measuring.
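The same two rows can be restated in variance terms, which is where the phrase “sharing more variance” comes from: square the correlation and compare it with the AVE. The sketch below uses only the illustrative figures from the table. The dictionary layout and function name are assumptions made for the example, and the HTMT cutoff is set to the lenient 0.90.

```python
# Illustrative figures copied from the table above (not real data).
rows = {
    "N with E": {"r": 0.86, "sqrt_ave": 0.75, "htmt": 0.91},
    "N with U": {"r": 0.19, "sqrt_ave": 0.75, "htmt": 0.22},
}

def assess(r, sqrt_ave, htmt, htmt_cutoff=0.90):
    """Apply the Fornell-Larcker and HTMT rules to one construct pair."""
    shared_variance = r * r           # variance shared between constructs
    ave = sqrt_ave * sqrt_ave         # variance shared with own items
    fornell_larcker_ok = ave > shared_variance
    htmt_ok = htmt < htmt_cutoff
    return {
        "shared_variance": shared_variance,
        "ave": ave,
        "fornell_larcker_ok": fornell_larcker_ok,
        "htmt_ok": htmt_ok,
        "distinct": fornell_larcker_ok and htmt_ok,
    }

results = {pair: assess(**figures) for pair, figures in rows.items()}
for pair, res in results.items():
    print(
        f"{pair}: shared variance {res['shared_variance']:.2f} "
        f"vs AVE {res['ave']:.2f} -> distinct: {res['distinct']}"
    )
```

For the suspected twin, the constructs share about 74 percent of their variance while the new construct shares only about 56 percent with its own items. For the unrelated check, the shared variance is under 4 percent.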
A construct that failed in public: grit and conscientiousness
Grit is the cleanest real example, because the test actually got run. Grit, defined as perseverance and passion for long-term goals, became one of the most cited constructs in applied psychology in the 2010s, with its own scale and a large popular following. The discriminant-validity question was obvious from the start: how is grit different from conscientiousness, the long-established Big Five trait that already covers persistence, self-discipline, and industriousness?
In 2017 Marcus Credé, Michael Tynan, and Peter Harms published a meta-analytic synthesis of the grit literature in the Journal of Personality and Social Psychology. Pooling across studies, they found that the higher-order grit construct correlated with conscientiousness at roughly 0.84 once the correlation was corrected for measurement error, with the raw observed correlation closer to 0.66. The corrected value is the one that speaks to whether the constructs are the same thing, because it estimates the relationship net of the noise in each scale, and 0.84 is high enough that the two are difficult to defend as separate constructs at that level of measurement. They also found that grit’s relationship with performance outcomes was modest, and that the perseverance facet was doing most of the predictive work while the passion facet contributed little, which mirrors what conscientiousness research had already established. The practical conclusion was that grit, as a broad construct, was largely a relabeling of an existing trait, and that the incremental prediction it offered over conscientiousness was small.
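The gap between the observed and corrected figures follows from the classical correction for attenuation, which divides an observed correlation by the square root of the product of the two scales’ reliabilities. The sketch below is a back-of-the-envelope illustration using the two rounded figures quoted above. A meta-analysis applies its corrections study by study, so this is not a reconstruction of the published analysis, and the function names are hypothetical.

```python
import math

def disattenuate(r_observed, reliability_x, reliability_y):
    """Correct an observed correlation for unreliability in both scales."""
    return r_observed / math.sqrt(reliability_x * reliability_y)

def implied_mean_reliability(r_observed, r_corrected):
    """Geometric-mean reliability implied by an observed/corrected pair."""
    return r_observed / r_corrected

r_observed, r_corrected = 0.66, 0.84

# What average scale reliability would turn 0.66 into 0.84?
implied = implied_mean_reliability(r_observed, r_corrected)
print(f"implied geometric-mean reliability: {implied:.2f}")

# Round trip: applying that reliability to both scales recovers 0.84.
round_trip = disattenuate(r_observed, implied, implied)
print(f"corrected correlation: {round_trip:.2f}")
```

The implied reliability is in the high 0.70s, an ordinary value for short self-report scales. In other words, no unusual assumption is needed for a raw correlation in the 0.60s to conceal a latent correlation in the 0.80s.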
This is the jangle fallacy caught in the act, with numbers. None of it makes the grit research worthless or the advice to persevere wrong. It means that an organization choosing what to measure for selection or development gains little by adding a grit scale on top of a well-validated conscientiousness measure, and that claims about grit predicting success “beyond” personality have to survive a discriminant-validity check the broad construct does not clearly pass. The construct was useful as a rallying idea and weak as an independent measurement target, and the only way to know which was to do the test.
A construct that has not been tested: trust ambiguity
The current example has no numbers yet, which is exactly why it is worth flagging. In their Harvard Business Review work on psychological safety in teams that include AI, Amy Edmondson and her co-author introduce a construct they call trust ambiguity, a state in which team members believe trust is warranted but do not actually feel it, and in which that gap feels undiscussable. I wrote about the measurement implications of that piece in my walk-through of three popular leadership constructs and the instruments that measure them, where the short version was that trust ambiguity sits suspiciously close to low psychological safety, which Edmondson herself defined and validated with a seven-item scale in 1999.
Here is what testing it would actually take, now that the procedures are on the table. You would administer the existing psychological safety scale and a new trust-ambiguity scale to the same teams, estimate the correlation between the two latent constructs, and run the three tests. If the square root of trust ambiguity’s AVE comes in below its correlation with psychological safety, if HTMT lands above 0.85, and if a nested CFA cannot tell the two factors apart, then trust ambiguity is the jangle fallacy again, and the right move is to keep using the psychological safety scale that already has twenty-five years of validation behind it. If trust ambiguity clears the thresholds, then it has earned its name and its own instrument, and the field has a genuinely new thing to measure.
I want to be careful about what I am claiming. Nobody has published this test for trust ambiguity, so I am not asserting that it would fail. The construct is plausible, and it is possible to define trust ambiguity so that it captures something psychological safety misses, for instance the specific case where the lack of trust is directed at an opaque system rather than a person. The point is narrower: the construct is being prescribed before the discriminant-validity work that would tell us whether it is new. That ordering is the norm in the leadership press, and it is the reverse of the ordering that produced the constructs worth keeping.
What to ask before you accept a new construct
When you encounter a new construct, in a journal or a vendor deck or an HBR piece, four questions tell you how much weight it can bear.
The first is which existing construct it most resembles. Every new construct has a nearest neighbor in the established literature, and naming that neighbor is the whole game. For trust ambiguity it is psychological safety. For grit it was conscientiousness. If the people promoting the construct cannot name the neighbor, they have not done the work.
The second is whether anyone has measured the correlation between the new construct and that neighbor. A correlation in the 0.8s is a strong signal that you are looking at one construct with two names. A correlation in the 0.3s leaves room for the new construct to be real.
The third is whether the new construct predicts anything over and above its neighbor. This is incremental validity, and it is the practical payoff of distinctness. A construct that is statistically distinguishable but adds no predictive power over the measure you already have is a curiosity, not a tool.
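Incremental validity is usually reported as the gain in R-squared when the new construct is added to a regression that already contains its neighbor. With two predictors that gain can be computed from three correlations alone. The values below are hypothetical, picked to contrast a near-twin with a construct that is clearly separate from its neighbor.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R-squared for an outcome regressed on two correlated predictors."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

def incremental_r_squared(r_y_old, r_y_new, r_old_new):
    """Gain in R-squared from adding the new construct to the old one."""
    full = r_squared_two_predictors(r_y_old, r_y_new, r_old_new)
    return full - r_y_old**2

# Hypothetical correlations with the outcome: 0.30 (old) and 0.28 (new).
gain_twin = incremental_r_squared(0.30, 0.28, r_old_new=0.86)
gain_distinct = incremental_r_squared(0.30, 0.28, r_old_new=0.30)

print(f"gain when constructs correlate 0.86: {gain_twin:.4f}")
print(f"gain when constructs correlate 0.30: {gain_distinct:.4f}")
```

Under these assumed values, the same new scale adds well under one percentage point of explained variance when it correlates 0.86 with its neighbor, and roughly four points when it correlates 0.30. The scale’s own correlation with the outcome is identical in both cases, so the difference comes entirely from redundancy.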
The fourth is whether the prescription depends on the construct being new. A great deal of useful advice survives regardless of how the measurement question resolves. The advice to build trust on AI-augmented teams is sound whether or not trust ambiguity is distinct from psychological safety. Knowing that the prescription is robust to the construct question lets you take the advice without buying the construct.
The discipline behind the question
Discriminant validity is unglamorous, and that is most of why it gets skipped. Coining a construct is generative and quotable. Testing whether the construct survives a correlation with its nearest neighbor is the part that can end with the construct dissolving back into something we already knew. The incentives in both academic publishing and management writing favor the coining and not the testing, which is why the supply of new construct names runs so far ahead of the supply of demonstrated distinctness.
The habit worth building is the one that treats a new construct as a hypothesis rather than a discovery. The hypothesis is that the construct is distinct, and the hypothesis has a standard test with three established procedures and clear thresholds. Run the test, or find the study that ran it, before you build a measurement program on the name. I have made versions of this argument across this series, in the critique of the Q12 engagement instrument, in the analysis of what Net Promoter Score loses by collapsing the top of its scale, and in the account of leader traits like tolerance for ambiguity that have scales the advice ignores. The thread connecting all of them is the same: a claim about people is downstream of a measurement, and the measurement is the part you have to actually examine.
Grit had its test, and the test told us something specific and useful about what grit is and is not. Trust ambiguity has not had its test yet. The constructs that end up mattering are the ones whose promoters were willing to find out, and discriminant validity is how they found out.