
1 Overview

According to Holland (2008) in The First Four Generations of Test Theory, testing as a scientific enterprise is not more than 120 years old. Holland divides this enterprise into four overlapping generations. The first generation, which was influenced by concepts such as error of measurement and correlation that were developed in other fields, focused on test scores and saw developments in the areas of reliability, classical test theory, generalizability theory, and validity. This generation began in the early twentieth century and continues today, but most of its major developments were achieved by 1970. The second generation, which focused on models for item level data, began in the 1940s and peaked in the 1970s but continues into the present as well. The third generation started in the 1970s and continues today. It is characterized by the application of statistical ideas and sophisticated computational methods to item level models, as well as models of sets of items.

The current fourth generation attempts to bridge the gap between the statistician/psychometrician role and the role of other components of the testing enterprise. It recognizes that testing occurs within a larger complex system and that measurement needs to occur within this larger context. In this paper, we will discuss one of Holland’s important contributions to the fourth generation of testing, the notion of tests as both blood tests and contests, and its implications for differential item functioning (DIF), which is a critical statistical procedure for ensuring fair measurement.

While the third generation was marked by statistical and computational advances, the work in this generation was too specialized. It seems as if modeling the item, and indirectly, the test, was the only concern of this generation of model builders. Examinees were needed to produce scores; if unavailable, the model could be used to simulate scores. In fact, simulations were more convenient and less hassle than grappling with unruly uncooperative data. That simulations inform reality is something of a fantasy. Holland (2008) noted that real tests exist in a complex world with test takers, test administrators, test score users, test developers, and legislation and policy issues.

A key feature of Holland’s (2008) fourth generation of test theory is that tests are only a part of a testing program. A test is a single instrument, but a testing program is a whole system of test production, administration, scoring, using, and interpreting test results that repeat in annual or other cycles and in many different sites.

Another aspect of the fourth generation of testing is the difference between what Holland (2008) called tests as blood tests and tests as contests. Users of test results often see tests as measurements in the same way that a blood test is a measurement of some aspect of an individual. In a remark appended to Cattell's (1890) work, Galton wrote:

One of the most important objects of measurement…is to obtain a general knowledge of the capacities of a man by sinking shafts, as it were, at a few critical points. In order to ascertain the best points for the purpose, the sets of measures should be compared with an independent estimate of the man’s powers. (p. 380)

This is a vintage measurement view of testing. But the contest view should never be forgotten when it is relevant. As Holland (2008) noted, high stakes always make the contest perspective relevant. Test takers often see tests as contests in which they can be winners or losers. They want fairness.

The contest and blood test views are sometimes in conflict. We address this conflict in the balance of this paper. In Sect. 14.2, we mention some of Holland’s major contributions to and influences on the fourth generation of testing. In Sect. 14.3, we apply contest/blood test thinking to the area of DIF. Section 14.4 is a recap of previous sections.

2 Briefly, Holland’s Contributions to Differential Item Functioning and Equating

Score equating and DIF are fourth generation activities that have been going on for decades. Holland has been active in both. He coauthored four books on these topics: Holland and Rubin (1982), von Davier, Holland, and Thayer (2004), and Dorans, Pommerich, and Holland (2007) about score linking and equating; and Holland and Wainer (1993) on DIF. The difference in number, 3 to 1, and the fact that DIF is sandwiched in time between the equating books, reveal that Paul was more interested in equating than DIF.

2.1 Equating

Early in the 1980s, Paul and I tried to define the notion of score equatability. DIF (Holland & Wainer, 1993; Zieky, this volume, Chap. 8), renorming the SAT® (Dorans, 2002), 3,000 miles, and a variety of other issues kept us from doing so. In 2000, after Dorans grappled with concording the SAT and ACT (Dorans, Lyu, Pommerich, & Houston, 1997) and Holland chaired the Committee on Equivalency and Linkage of Educational Tests for the National Research Council that produced Uncommon Measures: Equivalence and Linkage Among Educational Tests (Feuer, Holland, Green, Bertenthal, & Hemphill, 1999), we finally got around to equatability again. The end result was Population Invariance and the Equatability of Tests: Basic Theory and the Linear Case (Dorans & Holland, 2000), which made the case for assessing equatability by checking assumptions associated with equating. Holland and Dorans (2006) contains the essence of our collaborative effort on score linking.

2.2 Differential Item Functioning

Shortly after Holland had completed Alderman and Holland (1981), an early foray into an area that would command much of his attention over the next decade, he introduced Dorans to direct standardization (Mosteller & Tukey, 1977). Dorans adapted this approach and introduced the standardization method, which was soon applied to the SAT program to assess item fairness (Dorans & Kulick, 1983, 1986).

A few years later, as noted by Zieky (this volume, Chap. 8), Holland was drawn deeply into the problem of developing an alternative to the so-called Golden Rule method. Dorans was pulled into this work as well, and a close collaboration on DIF issues occurred over the next several years as Holland spearheaded the implementation of Mantel-Haenszel (MH) and standardization procedures here at ETS. The MH approach became an industry-wide standard.

Holland’s book with Wainer (Holland & Wainer, 1993) was the apex of his work with DIF. Holland’s work on DIF was mostly reactive, with some notable exceptions. In Hackett, Holland, Pearlman, and Thayer (1987) and Schmitt, Holland, and Dorans (1993), he illustrated how experimentation could be used to advance our substantive understanding of DIF. As he left the DIF domain, Holland (1994) issued a challenge to the DIF community: DIF is a psychometric procedure that is carried out for contest reasons – the public needs to view test items as fair. We take up that challenge in the next section, which examines DIF from contest and measurement perspectives.

3 True-Score Estimates Are Really Observed Scores

In this section, I briefly describe the standardization method for DIF assessment and what is sometimes called its true-score version, SIBTEST (Shealy & Stout, 1993). Then I explore the fairness of the SIBTEST procedure.

3.1 Standardization

Dorans (1982) reviewed item bias studies that had been conducted on SAT data in the late seventies and concluded that these studies were flawed because DIF was either confounded with lack of model fit or contaminated by impact as a result of fat matching, the practice of grouping scores into broad categories of roughly comparable ability. A new method was needed. As noted above, Dorans and Kulick (1983, 1986) developed the standardization approach after consultation with Holland. The formulas in the following section can be found in these articles and in Dorans and Holland (1993).

3.1.1 Standardization’s Definition of Differential Item Functioning

An item exhibits DIF when the expected performance on an item differs for matched examinees from different groups. Expected performance can be operationalized by nonparametric item-test regressions. Differences in empirical item-test regressions are indicative of DIF.

The first step in the standardization analysis is to use all available data to estimate nonparametric item-test regressions in the reference group and in the focal group. The focal group is the focus of analysis while the reference group serves as a basis for comparison.

Let \( {\varepsilon_f}(Y|X) \) define the empirical item-test regression for the focal group f, and let \( {\varepsilon_r}(Y|X) \) define the empirical item-test regression for the reference group r, where \( Y \) is the item score variable and \( X \) is the matching variable. The definition of null DIF employed by the standardization approach implies that \( {\varepsilon_f}(Y|X) = {\varepsilon_r}(Y|X) \).

The most detailed definition of DIF is at the individual score level, m,

$$ D_m = {\varepsilon_f}(Y|X = m) - {\varepsilon_r}(Y|X = m), $$
(14.1)

where \( {\varepsilon_f}(Y|X = m) \) and \( {\varepsilon_r}(Y|X = m) \) are realizations of the item-test regressions at score level m. The \( D_m \) are the fundamental measures of DIF according to the standardization method. Plots of these differences, as well as plots of \( {\varepsilon_f}(Y|X) \) and \( {\varepsilon_r}(Y|X) \), provide visual descriptions of DIF in fine detail for binary as well as polytomously scored items. For illustrations of nonparametric item-test regressions and differences for an actual SAT item that exhibits considerable DIF, see Dorans and Kulick (1986).
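To make the computation concrete, the following sketch (not from the original chapter) shows how the empirical item-test regressions and the score-level differences \( D_m \) of (14.1) might be computed from item-level data; the simulated data and variable names are illustrative only.

```python
import numpy as np

def item_test_regression(item_scores, matching_scores, score_levels):
    """Empirical item-test regression: mean item score at each matching-score level m."""
    return np.array([item_scores[matching_scores == m].mean()
                     if np.any(matching_scores == m) else np.nan
                     for m in score_levels])

# Illustrative simulated data: a 0/1 item (Y) and a 0-40 raw matching score (X).
rng = np.random.default_rng(0)
score_levels = np.arange(0, 41)
x_f = rng.integers(0, 41, size=2_000)                    # focal-group matching scores
x_r = rng.integers(0, 41, size=8_000)                    # reference-group matching scores
y_f = rng.binomial(1, np.clip(x_f / 40 - 0.05, 0, 1))    # focal-group item scores
y_r = rng.binomial(1, np.clip(x_r / 40, 0, 1))           # reference-group item scores

E_f = item_test_regression(y_f, x_f, score_levels)       # epsilon_f(Y | X = m)
E_r = item_test_regression(y_r, x_r, score_levels)       # epsilon_r(Y | X = m)
D_m = E_f - E_r                                          # Eq. (14.1), one value per score level m
```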

3.1.2 Standardization’s Primary Differential Item Functioning Index

While plots describe DIF directly, a need was identified for a numerical index that targets suspect items for close scrutiny while allowing acceptable items to pass swiftly through the screening process. Standardization has such an index, \( STD \; EIS\hbox{-}DIF \) (Dorans & Kulick, 2006), which uses a weighting function supplied by the standardization group: the weights at each score level are applied to the individual \( D_m \), and the weighted differences are then accumulated across score levels to arrive at a summary item-discrepancy index, defined as:

$$ STD \; EIS\hbox{-}DIF = {\varepsilon_f}(Y) - {\hat{\varepsilon }_f}(Y) = \frac{{\sum\limits_{m = 1}^M {{N_{fm}} * {\varepsilon_f}(Y|X = m)} }}{{\sum\limits_{m = 1}^M {{N_{fm}}} }} - \frac{{\sum\limits_{m = 1}^M {{N_{fm}} * {\varepsilon_r}(Y|X = m)} }}{{\sum\limits_{m = 1}^M {{N_{fm}}} }}, $$
(14.2)

where \( {N_{fm}}/\sum\limits_{m = 1}^M {{N_{fm}}} \) is the weighting factor at score level \( X_m \) supplied by the standardization group to weight differences in item performance between the focal group \( {\varepsilon_f}(Y|X) \) and the reference group \( {\varepsilon_r}(Y|X) \).

In contrast to impact, in which each group has its relative frequency serve as a weight at each score level, standardization uses a standard or common weight on both \( {\varepsilon_f}(Y|X = m) \) and \( {\varepsilon_r}(Y|X = m) \), namely \( {N_{fm}}/\sum\limits_{m = 1}^M {{N_{fm}}} \). The use of the same weight on both \( {\varepsilon_f}(Y|X = m) \) and \( {\varepsilon_r}(Y|X = m) \) is the essence of the standardization approach.

Use of \( {N_{fm}} \) means that \( STD \; EIS\hbox{-}DIF \) equals the difference between the observed performance of the focal group on the item and the predicted performance of selected reference group members who are matched in ability to the focal group members. This difference can be derived very simply; see Dorans and Holland (1993).
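Continuing the illustrative sketch above, the summary index in (14.2) is just a focal-group-weighted average of the score-level differences; the helper below is a hypothetical implementation, not code from the cited articles.

```python
import numpy as np

def std_eis_dif(D_m, N_fm):
    """Eq. (14.2): accumulate the D_m, weighting each by the focal group's
    relative frequency N_fm / sum(N_fm) at that score level."""
    D_m, N_fm = np.asarray(D_m, float), np.asarray(N_fm, float)
    ok = ~np.isnan(D_m)                      # skip levels with no data in one of the groups
    weights = N_fm[ok] / N_fm[ok].sum()
    return float(np.sum(weights * D_m[ok]))

# Usage with the arrays from the previous sketch:
# N_fm = np.array([(x_f == m).sum() for m in score_levels])
# std_eis_dif(E_f - E_r, N_fm)
```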

For standardization, the definition of null-DIF conditions on an observed score,

$$ {\varepsilon_f}(Y|X) = {\varepsilon_r}(Y|X). $$
(14.3)

3.2 SIBTEST: A Model-Based Standardization Approach to Differential Item Functioning

Shealy and Stout (1993) introduced a general model-based approach to assessing DIF and other forms of differential functioning. They cite the standardization approach as a progenitor, but claim that SIBTEST was developed independently of standardization. From a theoretical perspective, SIBTEST is elegant. It sets DIF within a general multidimensional model of item and test performance. Unlike most item response theory (IRT) approaches, which posit a particular parametric form for the item response model (e.g. a two-parameter logistic model), SIBTEST does not specify a functional form. In this sense, it is a nonparametric IRT model, in principle, in which the null definition of standardization is replaced by

$$ {\varepsilon_f}(Y|{T_x}) = {\varepsilon_r}(Y|{T_x}), $$
(14.4)

where \( T_x \) represents a true score for \( X \). As such, it employs a measurement invariance definition of null DIF, while standardization employs a prediction invariance definition (Meredith & Millsap, 1992).

Kelley (1927) provided a framework for true-score theory and introduced his formula relating observed test scores, true scores, and reliability. This research led to classical test theory, which was first codified by Gulliksen (1950) and later given a sound statistical basis by Lord and Novick (1968). Classical test theory decomposes an observed score for the ith person on occasion o, \( X_{io} \), into a systematic component \( T_{xi} \) and an error component \( E_{xio} \),

$$ {X_{io}} = T_{xi} + {E_{xio}}. $$
(14.5)

Note that this definition is at the level of the individual, and o in the classical definition could refer to replications of parallel tests. In this representation, \( T_{xi} \) is defined as the expected value for a single person i across parallel measurements; expectation is over tests,

$$ {T_{xi}} = {\varepsilon_o}({X_{io}}). $$
(14.6)

Holland (Holland & Hoskens, 2003) preferred to think of \( X_{io} \) as representing the score of an individual i from a subpopulation in which all individuals have the same true score or ability level. In this case, \( T_{xi} \) is defined as the expected value on a single test \( X_{io} \) across parallel people from subpopulation o,

$$ {T_{xo}} = {\varepsilon_i}({X_{io}}). $$
(14.7)

When the tests are parallel in one case and the people are parallel in the other, these two expectations yield the same answer: \( {T_{xi}} = {\varepsilon_o}({X_{io}}) = {\varepsilon_i}({X_{io}}) = {T_{xo}} \). Hence alternative conceptualizations of the true score exist: one at the level of the individual (across parallel tests) and one at the level of the test (across parallel people). But neither conceptualization can be realized in practice.

To make SIBTEST practical, Shealy and Stout (1993) resorted to Kelley’s (1927) equation for estimating true scores from observed scores. In essence, SIBTEST replaces the empirical item-test regression used by standardization with an adjusted regression that employs Kelley’s equation. The null definition of DIF for standardization as shown in (14.3) is replaced by this null-DIF hypothesis,

$$ {\varepsilon_f}(Y|X = m) + Ad{j_{mf}} = {\varepsilon_r}(Y|X = m) + Ad{j_{mr}} $$
(14.8)

where

$$ Ad{j_{mf}} = \frac{{\varepsilon_f}(Y|X = m + 1) - {\varepsilon_f}(Y|X = m - 1)}{{\hat{T}_f}(X = m + 1) - {\hat{T}_f}(X = m - 1)}\left( \frac{{\hat{T}_r}(X = m) - {\hat{T}_f}(X = m)}{2} \right) $$
(14.9)

and

$$ Ad{j_{mr}} = - \frac{{\varepsilon_r}(Y|X = m + 1) - {\varepsilon_r}(Y|X = m - 1)}{{\hat{T}_r}(X = m + 1) - {\hat{T}_r}(X = m - 1)}\left( \frac{{\hat{T}_r}(X = m) - {\hat{T}_f}(X = m)}{2} \right). $$
(14.10)

Kelley’s (1927) correction comes into play at this point:

$$ {\hat{T}_g}(X) = re{l_g}(X)*X + (1 - re{l_g}(X))*{\varepsilon_g}(X). $$
(14.11)

This equation produces a subgroup-specific linear transformation of the observed score \( X \). It is not the true score, as defined above, which takes an expectation across parallel people or across parallel test forms. The expectation used by SIBTEST produces a mean for the focal group and a mean for the reference group. In SIBTEST, an observed score on \( X \) is treated differently depending on whether it is obtained by the reference group or the focal group. It is regressed to a different mean. This difference in regressed means leads to a higher item-test regression for the lower scoring group and a lower one for the higher scoring group. For example, SIBTEST’s effect on Black/White DIF would be to reduce the negative DIF against the Black group on the grounds that DIF indicated by standardization is inflated to the extent that the groups differ on the unreliable observed score matching variable.
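The following sketch translates (14.9) through (14.11) into code for a single score level m; it is a schematic reading of the formulas as written here, not the authors' SIBTEST software, and the function and argument names are invented for illustration.

```python
def kelley_tse(x, rel, group_mean):
    """Eq. (14.11): Kelley estimate, a linear shrinkage of the observed score
    toward the group's mean by an amount governed by the reliability."""
    return rel * x + (1.0 - rel) * group_mean

def sibtest_adjustments(m, E_f, E_r, rel_f, rel_r, mean_f, mean_r):
    """Eqs. (14.9) and (14.10): adjustments to the focal- and reference-group
    item-test regressions at interior score level m. E_f and E_r are sequences
    indexed by score level, i.e., the empirical regressions epsilon_g(Y | X = m)."""
    t_f = lambda x: kelley_tse(x, rel_f, mean_f)    # focal-group TSE of the matching score
    t_r = lambda x: kelley_tse(x, rel_r, mean_r)    # reference-group TSE of the matching score
    half_gap = (t_r(m) - t_f(m)) / 2.0              # half the difference in TSEs at X = m
    slope_f = (E_f[m + 1] - E_f[m - 1]) / (t_f(m + 1) - t_f(m - 1))
    slope_r = (E_r[m + 1] - E_r[m - 1]) / (t_r(m + 1) - t_r(m - 1))
    return slope_f * half_gap, -slope_r * half_gap  # (Adj_mf, Adj_mr)

# The null-DIF hypothesis (14.8) then compares E_f[m] + Adj_mf with E_r[m] + Adj_mr.
```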

Is the use of these differential regression corrections, and their effect on the item-test regressions, defensible from a contest point of view? DIF, after all, is a contest activity. We examine this question in the next subsection.

3.2.1 A Dangerous Application

Wainer (2007) cites Kelley’s (1927) equation as a contender for the world’s most dangerous equation. According to Wainer, a dangerous equation is one that people are ignorant of and that has serious implications for a wide variety of applications. Sometimes an equation can be dangerous if it is known and misused. Shealy and Stout (1993) were aware of Kelley’s formula, but they misused it. Estimated true scores are not true scores. Instead, they are linear transformations of observed scores that are regressed toward a mean to a degree that reflects the uncertainty of the prediction. The use of different transformations for the reference and focal groups is tantamount to using subgroup-specific linkings of observed scores that take into account subgroup means and standard deviations.

Shealy and Stout (1993) operate within the classical test theory framework, which is the core of Holland’s first generation of testing (Holland, 2008). Whenever the Kelley correction is used in practice, certain problems arise. First, as noted above, true-score estimates (TSEs) are not true scores. Tucker (1971) addressed this type of distinction in the context of factor scores.

An even more perplexing issue associated with the use of the Kelley correction is the which-group question. Each examinee is a member of a large number of groups. In DIF, race/ethnicity and gender are the groups of interest. For example, one test taker is male and White. Hence, he belongs to the group called White males, the group called male, and the group called White, as well as being a member of the total group that includes both gender groups, all ethnic/racial groups, and those who choose not to identify themselves. As a White male, he has observed scores that can be regressed to the total group mean, the mean of Whites, the mean of males, or the mean of White males. The observed scores of an Asian American woman, on the other hand, could be regressed to the mean of Asian Americans, the mean of women, the mean of Asian American women, or the overall mean.

If SIBTEST regressed to the overall mean, it would be identical to standardization since the Kelley correction is simply a linear transformation of the observed scores and standardization results are invariant with respect to this linear transformation except for some clumping that might be introduced if scores were rounded. In order for SIBTEST to be different from standardization, it has to regress to different means, namely those of the focal and reference group.
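A tiny check of this invariance claim (illustrative only, ignoring rounding): a single, common Kelley transformation relabels the score levels but leaves every matching stratum, and hence every \( D_m \), intact.

```python
rel, grand_mean = 0.90, 520
x = [400, 400, 420, 420, 440]                          # hypothetical matching scores
t = [rel * xi + (1 - rel) * grand_mean for xi in x]    # one common linear transformation
# Two examinees share a matching-score level after the transformation exactly when
# they shared one before it, so the matched groups used to compute D_m are unchanged.
assert all((xi == xj) == (ti == tj)
           for xi, ti in zip(x, t) for xj, tj in zip(x, t))
```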

3.2.2 SIBTEST True-Score Estimates (TSE): An Example

Consider the following example constructed from data in the public domain on a well-known math test. The SAT is a widely used admissions test with widely published statistical properties. In 2005, the SAT-Math mean was 520. I chose 2005 because the mean that year is a reportable score; SAT scores ranged from 200 to 800 in steps of 10 (College Board, 2005). Let’s assume that the reliability of the test \( X \) is the same in both the focal and reference groups.

The leftmost column of Table 14.1 contains labels for each group. Alongside that column are the means for each group. Next come three pairs of columns. Each pair contains TSEs based on Kelley’s formula using a common reliability of 0.90 and the difference between the TSEs and the observed score that appears at the top of each pair of columns. Three observed scores are considered: 420, which is just below the mean score of Black female test takers on SAT-Math; 520, the total group average; and 600, just above the average score for Asian American male test takers.

Table 14.1 True-score estimate (TSE) and difference between true-score estimate and observed score (OS) as a function of observed score and group mean (mean) for a test score reliability of 0.90

For an observed score of 420, using one of the three Black group means (424, 431, or 442) leaves the score basically unchanged (420, 421, or 422), which means a Black examinee with a score of 420 would have an estimated true score close to 420. In contrast, the TSE for an Asian American examinee with a 420 would increase by 15 to 18 points (435, 436, or 438), depending on which of the three Asian American means (566, 580, or 595) were used. If the score of 420 were regressed to the total group mean of 520, all examinees with an observed score of 420 would be regressed to 430, an increase of 10 points, regardless of which group they came from.

On the other hand, for a score of 600, regression to the total group mean of 520 would reduce the observed score by about 8 points to 592. If the subgroup-specific regressions were used instead, Asian American test takers with 600 would be barely affected, while Black examinees with scores of 600 would be pulled down toward 580.

The average score of 520 would be pulled toward 510 for Black examinees and toward 530 for Asian American examinees.

If the reliability of the test decreases, the TSEs are regressed more toward the mean, and if separate regressions are employed, the estimates are pulled even more toward different means.
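The arithmetic behind this discussion can be reproduced directly from Kelley’s formula (14.11); the sketch below uses a reliability of 0.90 and the group means quoted in the text (424, 431, and 442 for the Black groups; 566, 580, and 595 for the Asian American groups; 520 for the total group), so the printed values should match the TSEs described above.

```python
rel = 0.90
group_means = {
    "Black (424)": 424, "Black (431)": 431, "Black (442)": 442,
    "Total group": 520,
    "Asian American (566)": 566, "Asian American (580)": 580, "Asian American (595)": 595,
}
for observed in (420, 520, 600):
    for label, mean in group_means.items():
        tse = rel * observed + (1 - rel) * mean          # Kelley's formula (14.11)
        print(f"OS={observed}  {label:>20}  TSE={tse:5.1f}  TSE-OS={tse - observed:+5.1f}")
# For example, an observed 420 regressed toward the Black mean of 424 stays at about
# 420.4, while the same 420 regressed toward the Asian American mean of 595 rises to
# about 437.5.
```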

In the equal-reliability case, the kernel of the SIBTEST correction for the unreliability of the matching variable is captured in the term \( \left( {\hat{T}_r}(X = m) - {\hat{T}_f}(X = m) \right)/2 \), which is half the difference between the reference and focal group TSEs at score level m. This term is added to the item-test regression for the focal group and subtracted from the one for the reference group. The ultimate effect is that, relative to the observed item-test regressions used by standardization, these adjustments make the item look easier for the lower scoring group and harder for the higher scoring group. Hence a positive DIF item (favoring a lower-scoring focal group, e.g. Black examinees) under standardization would look even more positive under SIBTEST, while a negative DIF item (favoring the higher scoring reference group, e.g. White test takers) would look less negative under SIBTEST. Conversely, a positive DIF item (favoring a higher-scoring focal group, e.g. Asian American examinees) under standardization would look less positive under SIBTEST, while a negative DIF item (favoring a lower scoring reference group, e.g. White test takers) would look more negative. In effect, SIBTEST would suggest less negative DIF for Black examinees and more negative DIF for Asian American examinees.
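A short derivation (not spelled out in the chapter) shows why, with equal reliabilities, this kernel term is the same at every score level: the observed-score terms in Kelley’s formula cancel, leaving only the shrinkage toward the two group means,

$$ {\hat{T}_r}(X = m) - {\hat{T}_f}(X = m) = \left( 1 - rel(X) \right)\left[ {\varepsilon_r}(X) - {\varepsilon_f}(X) \right], $$

so with a reliability of 0.90 the correction applied at every score level is one twentieth of the difference between the reference and focal group means on the matching variable.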

3.2.3 Which Group?

The use of TSEs in place of observed scores produces results that run counter to the purpose of DIF when examined from a contest perspective. For example, a Black examinee obtains a 600 but instead receives a true score estimate of about 580 (rounded estimate). An Asian American examinee keeps his or her score of 600. A second Black examinee keeps his or her score of 420, but a second Asian American test taker with the 420 gets a 440. A third Asian American test taker with a 520 receives a 530, while a third Black test taker with a 520 receives a lower score of 510. In essence, SIBTEST adjusts your score in the direction of the mean of the group you came from before assessing DIF, making an adjustment that seems to run counter to the intent of the DIF analysis.

Take an Asian American female examinee with a score of 420. Because she is in the Asian American group, she gets boosted past 430 towards 440. As a female, she only gets up to 430. What about the Black female test taker with a score of 600? She gets dropped almost to 580 as a member of the Black group and close to 590 as a female. Which TSE is better? One might argue that conditioning on both gender and race is better than using either one alone. Then the Asian American female test taker with a 420 gets close to 440, while the Black female test taker with a 600 gets close to 580. If we had more useful information, we could condition on that, and in an ideal world, we could reduce our uncertainty about the transformed observed score to an acceptably small level. Along the way, we would have a wide variety of estimates to choose from, none of which is a true score in the sense of an expected score over many parallel people as noted in the next section.

3.2.4 Constructing the Perfect True Score Estimate

The example cited above used a reliability of 0.9. While results from SIBTEST and standardization differ here, they don’t differ by much. Most of the literature that shows differences between these methods or between MH and SIBTEST involves tests with lower reliabilities. Standardization suffers when the matching variable is unreliable. SIBTEST attempts to fix the unreliability by regressing scores toward the focal and reference group means, respectively.

As noted earlier, the true score of real interest is the expected value of an examinee’s performance over many parallel forms of a test or the expected value of many parallel people on the test. The approach employed by SIBTEST, using subgroup-specific conversions that regress observed scores toward subgroup means, does not achieve this goal.

As noted earlier, classical test theory decomposes an observed score into a systematic component \( T_{xi} \) and an error component \( E_{xio} \). Holland (Holland & Hoskens, 2003) preferred to think of \( X_{io} \) as representing the score of an individual i from a subpopulation in which all individuals have the same true score or ability level. In that case, \( T_{xo} \) is defined as the expected value on a single test \( X_{io} \) across parallel people from subpopulation o, as shown in (14.7).

SIBTEST uses (14.11) to estimate the true score for a given value of \( X \). Equation (14.11) represents a subgroup-specific linear transformation of the observed score \( X \). The expectation used by SIBTEST produces a mean for the focal group and a mean for the reference group. In SIBTEST, an observed score on \( X \) is treated differently depending on whether it is obtained by a person in the reference group or the focal group: it is regressed to a different mean. It is not regressed to the true score defined in (14.7), which is an expectation across parallel people.

Hence SIBTEST, as operationalized, fails to achieve what it seeks as a measurement model. In addition, it introduces unfairness into a process that is all about fairness. It replaces an unbiased estimate of individual true score with a least squares estimate that depends on group membership. This is akin to regressing ice skaters’ scores, which exhibit some unreliability, towards the mean of ice skaters from their country instead of using their actual ice skating scores.

4 Keeping the Contest in Mind

Holland (2008) noted that the fourth generation of testing is characterized by an emerging view that testing should be aware of multiple perspectives, not all of which are compatible. Two important perspectives in high stakes testing are what he calls the contest and blood test perspectives. This paper has described the tests-as-blood-tests perspective as one that is more aligned with the interests of test users, while the tests-as-contests perspective is aligned with the interests of the test taker. To the extent that testing conditions are poor, such as when tests and anchors are unreliable and when matching variables and anchors are unrepresentative of the items and tests being studied, DIF and equating methods aligned with the contest and blood test perspectives will produce different results.

The methods most aligned with the tests-as-blood-tests perspective will replace data with model assumptions. Each method is based on underlying theories and assumptions that are likely to be incorrect when these methods are applied in these undesirable situations. These methods, with their strong reliance on measurement models to augment weak data, should not be used blindly. Assumptions should be questioned.

The standardization DIF method employs regressions involving observables. It focuses on observed scores and employs regressions that match on an observed score in the same way in both populations of interest. Standardization assesses whether the item-test regressions are the same across focal and reference groups. It is a contest-oriented method. It has problems, however, when the matching variable is unreliable.

The SIBTEST DIF method appears to be more aligned with measurement models (i.e. blood tests). This method assumes that examinee group differences influence DIF or test form difficulty differences more than can be observed in unreliable test scores. The observed data are pulled toward what the measurement model suggests is appropriate. The degree to which this pulling occurs depends on the extent to which these data are unreliable. In the absence of reliable data on the individual, it will presume, for example, that a Black examinee would receive the average score obtained by all Black examinees, and a male examinee would receive the average score obtained by male examinees. SIBTEST regresses observed item data to what would be expected for the focal or reference group on the basis of ample data that show that race and gender are related to item performance. In essence, the SIBTEST method uses a subgroup-specific TSE as a surrogate for the true score that is defined in the classical test theory model.

SIBTEST results differ from standardization results because SIBTEST transforms raw scores differently across the different groups. It starts from the premise that the observed score is not only unreliable but biased against higher scoring groups. Instead of viewing a true score as an expectation over replications of parallel tests or parallel individuals, SIBTEST treats TSE as a prediction problem, introducing bias to reduce mean squared error.

Knowing a person’s gender, race, years of schooling, performance on similar tests, and so on should lead to TSEs with smaller mean squared error than test score alone does. But is it sensible to use this information in a process that exists to demonstrate that items behave consistently across subgroups? The author thinks the answer is no.

One of the requirements of test score equating is that equating functions are invariant across test groups. If X and Y are two parallel tests, the linking relationship between them would be invariant across subgroups. In addition, X = X holds in all subgroups because it is parallel to itself. Likewise, the relationship between the true score on X and X is the same in all subgroups.

When SIBTEST employs a subgroup-specific transformation of X (or Y) toward a different mean, it implicitly states that the relationship of X to itself is subgroup dependent. Subgroup-specific regressions have been rejected as a means of equating tests for over 80 years (Kelley, 1927). Why employ these transformations prior to a DIF analysis? SIBTEST’s use of subgroup-specific regressions seems to run counter to the purpose of producing a fair contest.

Poor reliability leads to poor assessment. SIBTEST does not provide a correct solution to the reliability problem. The solution is to marry measurement with contest. The most direct way of doing this is to ensure that the matching variable is reliable enough. Then the observed score approaches the true score. If a score is not reliable enough to support a DIF analysis, it probably is not reliable enough to be reported.

The fourth generation of testing should fully integrate both the contest and blood test perspectives. As Holland (1994) said in the context of DIF:

…tests are not just measuring instruments…that they are sometimes contests as well is the main reason that we care about fairness…

The measurement view can certainly inform the contest view (and I think that this is important to say to those who only subscribe to the contest view) but neither can replace the other. (p. 29)

The best way to resolve the contest/measurement conflict is not with measurement models that attempt to compensate for poor measurement, but with better measurement. Better measurement should lead to fairer and more useful contests. When the results of statistical procedures based on different perspectives converge, both fairness and measurement are served.