The Interpretation of Double-Normed Tests

A Psychometric "Theory of Relativity"

©2009 by Daniel Bruns, PsyD and John Mark Disorbio, EdD

All Rights Reserved.




One distinctive feature of some health psychology tests, such as the BHI™ 2, BBHI™ 2 and P-3®, is that, unlike most other psychological inventories, they are double-normed. This means that there are two reference or norm groups. At first glance, it may seem confusing that each scale produces two different scores. However, this is not nearly as complex as it may first seem. Once this approach is understood, it can be a powerful tool for extracting more accurate information from a psychological test.

Psychometrics 101

Let's begin with a brief reminder about T-scores. A T-score is a standardized score, based on a normal curve, with a mean of 50 and a standard deviation of 10. A T-score of 50 is equal to the average score of some comparison or norm group. Approximately two-thirds of the scores in that norm group will fall between a T-score of 40 and a T-score of 60. This is the average range. The relationship of T-scores to percentiles is shown below:

Score Level    T Score    Percentile Rank
Very High      > 70       > 97% of the reference group
               > 60       > 84% of the reference group
               50         50th percentile of the reference group
               < 40       < 16% of the reference group
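
The relationship between raw scores, T-scores and percentile ranks summarized above can be sketched in a few lines of Python. This is an illustration only. It assumes that scores in the norm group are normally distributed and uses the standard T-score scaling (mean 50, standard deviation 10). The function names t_score and percentile_from_t are hypothetical and are not part of any test's scoring software.

```python
from statistics import NormalDist


def t_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a T-score relative to one norm group."""
    z = (raw - norm_mean) / norm_sd  # distance from the norm mean in SD units
    return 50 + 10 * z


def percentile_from_t(t):
    """Percentile rank for a T-score, assuming a normal distribution."""
    return 100 * NormalDist(mu=50, sigma=10).cdf(t)


# Percentile ranks for the T-score cut points discussed in the text.
for t in (40, 50, 60, 70):
    print(t, round(percentile_from_t(t), 1))
```

Under the normality assumption, T-scores of 40, 60 and 70 correspond to roughly the 16th, 84th and 98th percentiles, which agrees with the cut points in the table.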

With tests like the BHI 2, BBHI 2 or P-3, this whole process was done twice. At this point, sensible people are probably asking, "Why would anybody want to do this?" Actually, there is a method to this madness, and it makes a great deal of intuitive sense. Rather than starting out with a psychometric approach, which can be a little confusing at first, it will be clearer if we begin with an extreme example. After the inherent advantages of this process are understood, we can return to the topic of the assessment of medical patients to see how it can be applied.

The Psychometrics Of Crime

To illustrate the advantages of a double-norming approach, let's use a hypothetical scale as an example. Suppose we have created a scale to measure antisocial traits. Suppose a high score on this scale indicates that the person is very antisocial, while a low score indicates that the person is a good citizen. Suppose we then go on to administer this scale to a group of psychopathic prison inmates, and also administer it to a group of people in the community. Using these data, we can now produce two separate sets of standardized scores: a psychopath T score and a community T score.

If someone gets a psychopath T score of 50, this tells us that the person's score is equal to that of the average psychopath. In contrast, if someone gets a community T score of 50, this tells us that the person's score is equal to that of the average person in our community. These are two average scores, but one would hope that they would be quite different. We might choose to use one T score or the other, depending on what specific question we were trying to answer.
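
To make the idea of two T scores from one raw score concrete, here is a minimal sketch in Python. The means and standard deviations below are invented purely for illustration (the Antisocial scale itself is hypothetical), and the names NormGroup and double_normed are assumptions of this sketch, not part of any published scoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormGroup:
    """Summary statistics for one reference (norm) group."""
    name: str
    mean: float
    sd: float

    def t_score(self, raw: float) -> float:
        # Standard T-score scaling: mean 50, standard deviation 10.
        return 50 + 10 * (raw - self.mean) / self.sd


# Invented values for the hypothetical Antisocial scale.
COMMUNITY = NormGroup("community", mean=10.0, sd=5.0)
PSYCHOPATH = NormGroup("psychopath", mean=25.0, sd=8.0)


def double_normed(raw: float) -> dict:
    """Score the same raw value against both norm groups."""
    return {group.name: group.t_score(raw) for group in (COMMUNITY, PSYCHOPATH)}


print(double_normed(25.0))  # raw score of the average psychopath
print(double_normed(10.0))  # raw score of the average community member
```

With these made-up norms, the raw score that earns a psychopath T score of 50 earns a community T score of 80, while the raw score that earns a community T score of 50 earns a psychopath T score of about 31. The raw score never changes; only the group it is compared against does.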

For the purposes of the following discussion, we will assume that our Antisocial scale has been proven to have excellent validity and reliability.

Relatively Difficult Scenarios

Suppose we administer our new Antisocial scale to John, a new prison inmate. John has been referred for an evaluation to see how he might fare in a prison setting. Suppose that after testing him, we find that John's community T score equals 73. We could then say, relative to people in our neighborhood, he has a very high score (this would translate into the 99th percentile). This would indicate that he is more antisocial than 99% of persons in our community. However, this might not be very meaningful, as one would think that most persons in the penitentiary would have a higher antisocial score than people in our neighborhood.

Suppose we then compare John's score to those of other prisoners, and find that he has a prisoner T-score of 60. We could then say that, compared to the other prisoners, John is at the 84th percentile with regard to antisocial traits. This suggests that John is not only more antisocial than the average person in our community, but also substantially more antisocial than the average prisoner. This could suggest that extra precautions might need to be taken in the prison. For this particular question, the prisoner T-score is the most useful.

Let's consider another scenario. Suppose we have a new neighbor named Bob, who is asking to borrow some money. To be safe, we first administer our antisocial scale to Bob, and find that his prisoner T-score is 40, which translates into the 16th percentile. This suggests that Bob is less antisocial than the average prisoner. This is good, but let's look further.

Suppose that we find that while Bob has a T-score of 40 relative to the average prisoner, he has a T-score of 60 relative to the average community member. This means that while Bob is less antisocial than the average prisoner, he is considerably more antisocial than the average person in our neighborhood. This is bad. In this case, the community-based T-score is the most useful.

Now let's suppose that Bob gets a prisoner T-score of 30 and a community T-score of 40. This means that he is not only much less antisocial than the average prisoner, but also less antisocial than the average person in our community. He might be a good person to have as a neighbor.

These scenarios exemplify what we call the "psychometric theory of relativity." In the first example, John's score was very high relative to community scores, but this was to be expected. More importantly, his score was still high relative to prison scores. In the second example, Bob's score was low relative to prisoners, but high relative to persons in the community.

From these examples we can see that, depending on the question we have, different relative comparisons might be useful. If the question we are asking is, "Is this an unusually antisocial prisoner who may require solitary confinement?" then we will want to refer to the prisoner scores. On the other hand, if we are administering this scale to our neighbors and wondering, "Which neighbor do I trust with my house key when I go on vacation?" then we might prefer to consider each person's score relative to our other neighbors.

Having a double norm group brings the meaning of standardized scores into sharp focus. A standardized score is always a score relative to some group. When one remembers this, a double-normed scale is not confusing, but rather offers great advantages.

Double Norms and Medical Patients

The BHI 2, BBHI 2 and P-3 are examples of double-normed tests designed for medical patients. On these tests, the two bases of comparison are a community sample and a patient sample. As with the above examples, this allows for a dual basis of comparison. Any observed score can be assessed relative to the average person in the community or relative to the average patient. Let's look at an example of how this might be useful, using a Depression scale.

Suppose we administered this test to a patient with a back injury, and discovered that this patient's score on Depression, relative to other patients, is a T-score of 54. Because this falls between 41 and 59, we would conclude that, relative to other patients in rehabilitation, this patient's Depression score is in the average range. In contrast, suppose we look at this same patient's profile, but instead look at the community T-score. Now we find that this same patient has a community T-score of 60, which is mildly elevated. How are we to interpret this?

Based on a national sample of people in the community, and a national sample of patients in rehabilitation, the BHI 2 research has shown that the average patient produces a higher score on Depression and most other BHI 2 scales than does the average person in the community. Research done with the P-3 is also consistent with this, and this does not appear to be an artifact. It seems reasonable to conclude that the average patient, after suffering a significant illness or injury, is exhibiting more depressive traits than the average person in the community.

Which T-score then, 54 or 60, is the "correct" one? Both are correct. However, one may be more useful than the other, depending on what question you are asking. In a clinical setting, one might be testing people entering a rehabilitation program, and asking which of these patients is unusually depressed and might need to be singled out for individual assistance. To answer this question, we need to ask, "What is this patient's T-score relative to other patients?"

In contrast, we may be asking a more basic question: is this person depressed at all? In this case, the community T-score of 60 indicates that some depression is probably present. Overall then, we can conclude that while our patient is not unusually depressed for a patient with that diagnosis, the patient shows more depressive features than the average healthy person in the community, and mild depression appears to be present.

Another way of looking at this phenomenon is through the notion of "piggy back elevations." This means that a patient may be more depressed than the average patient in rehabilitation. However, it must also be remembered that the average patient in rehabilitation is more depressed than the average person in the community. Consequently, saying that a patient is depressed relative to the average patient implies the piggy back elevation, which also indicates that the person is much more depressed than persons who aren't injured. It is this effect that makes a patient elevation on a double-normed Depression scale so meaningful.
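
The piggy back effect can be illustrated with a short Python sketch. The norm values below are hypothetical and are NOT actual BHI 2 or P-3 norms. They were chosen only so that a patient T-score of 54 lines up with a community T-score of 60, as in the example above. The function name community_t_from_patient_t is likewise an assumption of this sketch.

```python
def community_t_from_patient_t(patient_t, patient_mean, patient_sd,
                               community_mean, community_sd):
    """Re-express a patient-normed T-score as a community-normed T-score."""
    # Recover the raw scale score implied by the patient T-score.
    raw = patient_mean + patient_sd * (patient_t - 50) / 10
    # Standardize that same raw score against the community norm group.
    return 50 + 10 * (raw - community_mean) / community_sd


# Hypothetical norms (illustration only): the average patient scores
# higher on Depression than the average community member.
PATIENT_MEAN, PATIENT_SD = 13.0, 5.0
COMMUNITY_MEAN, COMMUNITY_SD = 10.0, 5.0

for patient_t in (50, 54, 60):
    community_t = community_t_from_patient_t(
        patient_t, PATIENT_MEAN, PATIENT_SD, COMMUNITY_MEAN, COMMUNITY_SD
    )
    print(f"patient T = {patient_t}, community T = {community_t:.0f}")
```

Under these assumed norms, even the average patient (patient T-score of 50) sits above the community average, and a patient T-score of 60 rides on top of that difference to land at a community T-score of 66.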

An important question to answer at this point is, do we really know that the higher Depression score seen in patients means that they are more depressed, or could it simply be produced by some kind of artifact? It is our opinion at this point that this is not an artifact, and that patients are actually slightly depressed. This would seem to be a reasonable conclusion. It would seem understandable that a person who has had a substantial injury, which could jeopardize income and lead to a possible loss of career, not to mention having to contend with pain and the aggravations of dealing with insurance companies, would be depressed. Additionally, when the BHI 2 scales were constructed, care was taken to avoid the use of medical symptoms as items. Instead, the items consist primarily of cognition, affect and behavioral traits. Logically, these would seem to be less likely to be the direct physical result of an illness or injury.

To use a colloquialism, testing medical patients with a psychological inventory that was normed on a non-medical population is like comparing apples to oranges. Using community norms when testing medical patients can produce false positives, because patients report more symptoms (that is why they are patients!).

In general, it is often more helpful to be able to compare apples to apples and oranges to oranges - or in this case, to compare patients to other patients. In most cases, comparing patients to other patients offers a clearer comparison. If one is looking for a good apple or a bad apple, it is helpful to assess its quality relative to other apples: it is harder to find a bad apple when the only things you have to compare it to are oranges. On the other hand, there are certain times when comparing apples to oranges is better - such as when you are trying to find the best kind of fruit. The best kind of comparison always depends on what your question is.

You are free to copy this document for educational use or for your personal informational use provided:

1) it is not edited or modified in any way;

2) no fee or compensation is charged for these copies; and

3) all copyright notices remain attached.
