EFFECT OF THE ABILITY DISTRIBUTION SHAPE ON THE GENERALIZED MANTEL-HAENSZEL STATISTICS USED FOR DIF DETECTION

 ABSTRACT: The main objective of this simulation study was to explore the effect of shape of the  distribution on the generalized nominal and ordinal Mantel-Haenszel statistics used for detecting DIF in polytomous items. The variables manipulated were: trait (), distribution shape (normal, positively skewed, and platykurtic),  distribution difference between the reference and the focal group (equal and unequal), sample size (500/ 500 and 500/250 examinees in the reference/focal group), and DIF conditions (No DIF, constant and shift-high DIF patterns). The generalized ordinal Mantel-Haenszel statistic was calculated using integer and log-rank scores. The results show: a) a little impact of the  distribution shape on the performance of all the statistics, and b) the advantages of employing log-rank scores, especially when the items show a shift-high DIF pattern.


Introduction
The Mantel-Haenszel methods constitute one of the most popular nonparametric differential item functioning (DIF) detection procedures.In the case of polytomous items, generalizations of the MH chi-squared statistic 2 MH  have also been used for detecting DIF: the generalized Mantel-Haenszel test -GMH (MANTEL and HAENSZEL, 1959;ZWICK et al., 1993a) and the Mantel test (MANTEL, 1963;ZWICK et al., 1993a).Fidalgo and Madeira (2008) have showed a unified framework for the analysis of DIF using the generalized Mantel-Haenszel statistic proposed by Landis at al. (1978).As is pointed out there, the GMH test and the Mantel test are particular cases of the generalized nominal Mantel-Haenszel (QGMH(1)) statistic and the generalized ordinal Mantel-Haenszel (QGMH(2)) statistic, respectively.In the same article they showed the results of a little simulation study about the effect of the choice of scores assigned to the response variables on the QGMH(2) statistic.They found that the use of log-rank scores instead of the usual integer scores increased the power of QGMH(2) for detecting the shift-high.This topic has received little attention given that studies on DIF have routinely employed integer scores (ANKENMANN at al., 1999;SU and WANG, 2005;WANG and SU, 2004b;ZWICK and THAYER, 1996).
On the other hand, although there is a increased interest in the development of psychometric theories that allow work with nonnormal distributions (SAMEJIMA, 2000;BAZÁN et al., 2006) and research about the influence of the nonnormality over the parameter recovery (KIRISCI et al., 2001;REISE and YU, 1990; VAN DER OORD et al., 2003), there is little research about the effect of the nonnormality on the DIF detection procedures.
Bearing in mind the above, the main goal of the present study is to determine whether the capability of the QGMH(1) and QGMH(2) statistics for DIF detection is affected by the shape of the  distribution and, in the case of QGMH(2), for the choice of scores assigned to the ordinal variable.With this objective a test was constructed to replicate educational tests that contain both dichotomous and polytomous items.

Generalized Mantel-Haenszel statistics
Type Below, we briefly present the generalized Mantel-Haenszel statistics used in this study.The interested reader can find more comprehensive information on these statistics in Fidalgo (2005) and Fidalgo and Madeira (2008).Landis et al. (1978) proposed a generalized MH statistic for the analysis of Q: R × C contingency tables.The data structure for this general contingency table is shown in Table 1.
The standard generalized Mantel-Haenszel is defined by Landis et al. (1978) as: where nh, mh , Vh and Ah are, respectively, the vector of observed frequencies, the vector of expected frequencies, the covariances matrix, and a matrix of linear functions defined in accordance with the alternative hypotheses (H1) of interest.The null hypotheses of noassociation will be tested against different H1: a) general association (both variables are nominal), b) mean score differences (factor is nominal and response ordinal); and c) linear correlation (both variables are ordinal).From Table 1, these vectors and matrices are defined as: where ph* and ph* are, respectively, (R × 1) and (C × 1) vectors with the marginal row proportions (phi = Nhi/ Nh•• ) and the marginal column proportions (phj = Nhj/ Nh•• ),  denoting the Kronecker product multiplication, Dph* is a (C × C) diagonal matrix with elements of the vector ph*on its main diagonal, and Dph* is an (R × R) diagonal matrix with elements of the vector ph* on its main diagonal.
In this study we will use the generalized MH statistics employed to test the general association (QGMH(1) ) and the mean score difference hypotheses (QGMH(2) ).To obtain them we should resolve the equation 1 using different matrices Ah ( ).Briefly, these are.
-QGMH(2) or the Generalized Ordinal MH statistic (GOMH).Here Rh is the same as that used in the previous case and Ch = (ch1, … ,chC) being a (1 × C) vector, where chj is an appropriate score reflecting the ordinal nature of the jth category of response for the hth stratum.Under H0, QGMH(2) has approximately a chi-squared distribution with df = (R-1).For the special case of 2 factor levels, QGMH(2) is identical to the extended MH test proposed by Mantel (1963).
Calculation of the QGMH(2) statistic requires selecting the scores (Ch) that will be applied to the response variable to compute the row mean scores used for comparing the factors across strata.In the DIF literature, integer scores are the most common choice (ANKENMANN et al., 1999;WANG and SU, 2004;ZWICK and THAYER, 1996), although selection of the values of Ch admits different possibilities as, for example, rank scores.3, in stratums where that can occur, a value of 0.5 was added to each column marginal frequency.In the first stratum, as there are observed ties, the corresponding log-rank scores for breaking ties are: 0.190, -0.629, -1.629 and -1.629.
In the present simulation study, we will employ integer and log-rank scores.From the general contingency table shown in Table 1, we obtain log-rank scores using the equation: To avoid zero denominators in Equation 2, in stratums where that can occur, a value of 0.5 was added to each column marginal frequency.In the present study, log-rank scores were calculated in two different ways.In the first of these, ties were broken assigning to ties the average of the values for the corresponding log-rank scores (KOCH at al., 1985).This strategy results in a conservative influence on test statistic.For this reason, in the second way, ties were not broken.In order to show how the log-rank scores were computed, Table 2 shows the frequencies tables with the responses given to a 4-point item that presents a shift-high DIF pattern and the corresponding log-rank scores.These responses were simulated using the partial credit model (PCM) (MASTERS, 1982) described below.

Data generation
Item parameters.An artificial test was constructed having 20 dichotomous items and seven polytomous items with four ordinal response categories.The item parameter values were selected so as to be representative of items found in applied testing setting, and were the same as those used by Chang et al. (1996).The generating parameters of the reference group are presented in the Table 3.The modelling of a mixed format test (20 dichotomous items and 4 polytomous items in the matching test) was intended to resemble the currently common practice of combining multiple-choice and constructed-response items in a single test administration.This type of tests have been used in other simulation studies, for example, see Ankenmann at al. (1999); Su and Wang (2005) and Zwick et al., 1993).
where s is equal to 0.20 and 0.40 under the constant and the shift high DIF pattern, respectively.
 parameter.Normal, positively skewed and platykurtic  distributions shapes were used to generate the data.Under all the distribution shapes, the ability of the reference group was a univariate distribution with mean 0 and standard deviation 1.Moreover, two focal groups were simulated: the first had the same ability distribution as the reference group, and the second had a mean one standard deviation below the reference group mean.It should be noted that we obtain strict skewed (skewness = 0.0, kurtosis = 0.0) and platikurtic distributions (skewness = 0.0, kurtosis = -1.0)using the Fleishman's power function only when Z is a normal variable with mean 0 and standard deviation 1.This is the case for the reference group, and the focal group with the same ability distribution as the reference group.In the other focal group, we used in the power function a normal variable with mean -1 and standard deviation 1.Therefore, in this case, we will obtain non-normal distributions with different values of skewness and kurtosis.Figure 1    Models.Datasets were generated for the 20 dichotomous items from a threeparameter logistic IRT model (3PLM) and from the partial credit model (PCM) (MASTERS, 1982) for the 4-point polytomous items.In the 3PLM, the probability of a correct response on item i, for an examinee with latent trait , is defined by Refere nce (0,1) F ocal (-1 ,1)

Ability
where ai is the item discrimination parameter, bi is the item difficulty parameter, ci is a pseudo guessing parameter, and D is a scaling factor equal to 1.7.
The polytomous items were generated using the PCM.As it is known, in this model, the probability of scoring x on item i with K categories (from k =1 to K), given , is defined by , where bik is the kth item difficulty parameter.By definition Sample size.We selected the group size pairs based on their proximity to real data.The larger sample size (500/500 examinees in the reference/focal groups) is consistent with those found in practice and has been used in numerous simulation studies (PENFIELD and ALGINA (2003); SU and WANG (2005); WANG and SU (2004b); ZWICK et al., 1993).Smaller sample size (500/250) is found in many precalibration stage DIF studies.

Form of calculating the generalized MH statistics
The MH statistics compare two or more groups conditional on a measure of ability evaluated by the test (the matching variable).The majority of the applications use the total score in the test as an estimation of ability.In such cases, in order to avoid contamination of the matching variable by the items with DIF, it is essential to apply this methodology in two stages (HOLLAND andTHAYER, 1985, FIDALGO et al., 2000;WANG andSU, 2004a, 2004b).With the aim of comparing the manipulated variables in an optimum situation, we calculated the generalized MH statistics using for calculation of total test score only the non-DIF items (Items 1 to 24; see Table 2).The item under analysis was always included in the matching variable.Furthermore, the software developed to calculate the generalized MH statistics was programmed to automatically exclude those levels of the matching variable in which there was only an examinee (Nh••=1).Intervals of one unit in the scale of scores were employed for matching examinees.

Design
The following factors were manipulated to study the effect on Type I error rate and power of the DIF-detection statistics:  distribution shape (normal, positively skewed, and platykurtic),  distribution difference between the reference and the focal group (equal and unequal), sample size (500/ 500 and 500/250 examinees in the reference/focal group), and DIF conditions (No DIF, constant and shift-high DIF patterns).

Results
Type I error rates were computed as average number of rejections of the studied items generated under H0 (Non-DIF) over the 1,000 replications.The Bradley (1978) liberal criterion was used in this study to assess the robustness of the MH statistics.A test fulfils his liberal criterion at α = 0.05 if the Type I error rate is between 0.025 and 0.075 (0.5α ≤n≤ 1.5α).The proportion of correctly-identified each DIF item in 1,000 data sets was used as a power estimate.The average Type I error rate and power of the generalized MH statistics under each  distribution shape are given in Table 4 (normal), Table 5 (platykurtic), and Table 6 (positively skewed).
In the case of QGMH(2), the estimations obtained averaging log-rank scores for tied observations are shown in square braked.As it can be seen in Table 4 thought 6, applying log-rank scores as if there were no ties give a more sensitive result than averaging log-rank for ties.Therefore, from now on all mention of the power and Type I error rate will refer to the result of log-rank scores computed without breaking ties.
In our study all three MH statistics yielded Type I error rates for the null-case that fulfilled Bradley's liberal robustness criterion under all the simulated conditions.
As expected, sample size had a great effect on power.The power of all the MH statistics for detecting DIF is very low under the smallest sample size, ranging from .26 to 0.42 (constant DIF) and 0.11 to 0.32 (shift-high DIF), across all  distribution shapes and between groups ability distributions.In the largest sample size (500/500) the power found ranged from 0.37 to 0.57 (constant DIF) and 0.17 to 0.44 (shift-high DIF), across all the conditions (see Table 4, 5 and 6).
An analysis of tables 3 through 5 reveals that the impact of the  distribution shape on the power of QGMH(1) and QGMH(2) is small.Normal and platykurtic distributions have very similar results.On the other hand, the positively skewed distribution shows a small increased in power in relation to the other distributions only when the DIF pattern was shifthigh.This increase ranged from 2% to 8%.This result is explained for the higher number of examinees in the response category where the DIF is present because of the skewed distribution of  compared with the normal and platykurtic distributions (see Figure 1).The results about the DIF patterns corroborate what has been found in the literature comparing and QGMH(2)-integer: a) QGMH(1) had much more power for detecting the shift-high DIF than QGMH(2), and b) QGMH( 2) is more powerful than QGMH(1) for detecting the constant DIF.It should be noted, however, that the difference between QGMH(1) and QGMH(2) for detecting the shift-high DIF was drastically reduced when QGMH(2) was computed using log-rank scores.This finding support the results of Fidalgo and Madeira (2008) Finally, in general, all the statistics yielded more power under the equal ability distribution conditions than under the unequal conditions.Moreover, as it can be seen in Tables 4 through 6, that this difference was considerably higher under the shift-high DIF pattern (always over the 9% across all the simulated conditions and statistics) than under the constant pattern.

Conclusions
The main topics investigated by the present research was: (a) how the  distribution shapes affects the performance of both QGMH(1) and QGMH(2) statistics; and (b) the effect of log-rank score assigned to the response variable on the QGMH(2) statistic.Other variables manipulated in the simulation study were the effect of the different  distribution between groups and the sample size.
One very clear finding is that the statistics applied [QGMH(1), QGMH(2)-INTEGER , QGMH(2)-LOG-RANK] yielded controlled Type I error rates for the null case, fulfilling Bradley's liberal robustness criterion under all the simulated conditions.
A second finding is that the  distribution shapes have a small impact on the performance of the generalized MH statistics used for DIF detection.In this respect the DIF pattern was the most important factor in order to predict the performance of QGMH(1) and QGMH(2).As pointed out in Fidalgo and Madeira (2008), QGMH(2) increases the statistical power with respect to QGMH(1) for detecting that the mean responses differ across the factor levels, whereas QGMH(1) offers the possibility of detecting more complex patterns of association than QGMH(2).Thus, it is small wonder that the QGMH(2) statistic yields higher power than QGMH(1) under the constant DIF pattern whereas QGMH(1) yields higher power that QGMH(2) for detecting the shift-high DIF.
A third finding is that, as expected, all the MH statistics, on increasing the sample size, increased their power under all the simulated conditions.On the other hand, the between groups ability distribution had a differential effect depending on the pattern of DIF.In all cases, detection rates decreased when the ability distributions were unequal.However, when the DIF was shift-high, there was a drastic difference in power between the equal and unequal ability distribution conditions.
Finally, from a practical point of view, the most relevant finding is that, as it was hypothesized, the type of score used to compute QGMH(2) influences its capability for detecting DIF.It should be stressed that the question here is not to choose a score that faithfully describe the "true" distances between ordered categories, but finding a score that allow us to detect the association pattern of interest.Specifically, we have obtained that, when the pattern of DIF is constant, used integer or log-rank scores yield very similar results.However, when the pattern of DIF is shift-high, using log-rank scores offer higher power than the usual integer scores.Moreover, given that the performances of QGMH(2)-LOG-RANK and QGMH(1) are very similar under the shift-high DIF pattern and QGMH(2)-LOG-RANK his fairly superior to QGMH(1) for detecting the constant DIF, QGMH(2) )-LOG-RANK is recommended if a single method is to be used for detecting the DIF patterns simulated.Unfortunately, in real tests we cannot know, in advance, what type of DIF the items have, and since QGMH(1) is capable to detect more complex pattern of association than QGMH(2), QGMH(1) can be a more realistic option (FIDALGO and BARTRAM, 2010;FIDALGO and SCALON, 2012).Os resultados mostram: a) um pequeno impacto da forma de distribuição de  no desempenho de todas as estatísticas e b) as vantagens de empregar pontuações de log-rank, especialmente quando os itens mostram um padrão de DIF com deslocamento elevado.
Fleishman's (1978)  power function Y = a + b + c Z 2 + d Z 3 was used to generate the skewed deviate, were Y is the positively skewed deviate, Z is a standard normal variable, and the rest of the constants were equal to a =0.1736, b =1.1125, c = 0.1736, and d = -0.0503.To generate the platykurtic distribution, the constant values in the Fleishman's power function were set to a = 0.0, b = 1.2210, c = 0.0, and d = -0.0802.Identical values have been used by Kirisci at al. (2001).
show the general shapes of the  distributions used based on 500 θ values.

Figure 1a -Figure 1b -
Figure 1a -Shapes of the θ distributions used to generate the data set (Normal).

Figure 1c -
Figure 1c -Shapes of the θ distributions used to generate the data set (Positively skewed).

Table 1 -
Data structure in the hth stratum _____________________________________________________

Table 2 -
Frequency tables with the responses given by focal and reference groups to a 4-: Log-rank scores are computed using Equation 2. To avoid zero denominators in Equation Note

Table 3 -
Item parameters.The last three items are the studied items(items 25, 26 and 27)

Table 4 -
Average type I error rate and power for the studied items under the normal ability distribution conditions.The estimations in square brackets were obtained

Table 5 -
Average type I error rate and power for the studied items under the platykurtic ability distribution conditions.The estimations in square brackets were obtained

Table 6 -
Average type I error rate and power for the studied items under the skewed ability