Sukirno & Siengthai, S. (2010). The Comparison of Graded Response Model and Classical Test Theory in Human Resource Research: A Model Fitness Test, Research and Practice in Human Resource Management, 18(2), 77-90.

The Comparison of Graded Response Model and Classical Test Theory in Human Resource Research: A Model Fitness Test

Sukirno & Sununta Siengthai


When multiple four-point interval scale items are available, researchers usually assign successive integers to the response categories and then simply sum the raw scores on each item to estimate the true score of each person on the underlying dimension. This approach is usually classified as classical test theory. Classical test theory has often been criticised for its assumptions of equal weight for all items and of equal intervals between ordinal response categories. This article aims to provide evidence on model fitness for the graded response model, an item response theory approach to measuring latent variables, compared with classical test theory in human resource management (HRM) research. This simulation study generated four different sample sizes of 113, 226, 339 and 452 observations based on a study in HRM. The study finds that the graded response model is more precise than classical test theory in estimating statistical parameters when the sample size is large and structural equation modelling is used. Meanwhile, classical test theory remains a reliable statistical tool when regression analysis is used and the sample size is small. Consequently, human resource researchers are encouraged to choose carefully a measurement model that fits their research designs.


The quality of research depends on several factors, one of which is the quality of measurement (Zagorsek 2000). If various aspects of phenomena are not measured properly, wrong conclusions may be drawn about the relationships among them. Accurate and reliable measurement is a principal foundation of any research. In psychological and HRM research, interval scale response sets are ubiquitous (McBride 2001, Edwards 2009). When multiple interval scale items are used, researchers often employ classical test theory (CTT) by assigning successive integers to the response categories (an approach usually called SSI, Sum of Successive Integers) and then simply summing the raw scores on each item to estimate the true score of each person on the underlying dimension. This approach has often been criticised for its assumptions of equal weight for all items and of equal intervals between ordinal response categories (Chan 1996, Wiberg 2004).

Since the beginning of the 1970s, Item Response Theory (IRT) has more or less replaced the role CTT had and is now the major theoretical framework used in this scientific field. IRT presents an excellent methodology for evaluating leadership instruments because, unlike CTT, it does not assume equal precision across the full range of possible test scores (Zagorsek 2000). A statistical procedure proposed within IRT for analysing interval scale data is the graded response model (GRM). The GRM is appropriate where responses are ordered, for example, as in an interval scale or multiple point grading scale, rather than dichotomous, as in the one-, two- and three-parameter models. This model attempts to extract more information from individual responses than whether they are correct or incorrect (McBride 2001).

GRM generally performs best and is the most robust against skewness compared with other approaches (Chan 1996), and it is the item response model most often applied to interval scale data (Lautenschlager, et al. 2006). Despite the theoretical differences between GRM and CTT, there is a lack of empirical knowledge in HRM research about how, and to what extent, GRM and CTT behave differently. This study aims to examine the empirical relationships between IRT and CTT in HRM research and to find which model produces the fittest statistical model based on HRM simulation data.

Theory and Hypotheses

In general terms, measurement is the observation of characteristics and the expression of those observations in numbers or other symbols. Without measurement, there would be no science (Anil 2008). Further, if something exists in nature, it exists in a definite quantity, and if something exists in a definite quantity, it should be measurable. In HRM research, many phenomena that might profit from measurement involve variables that are not directly observable. Hence, examination of the truthfulness of theoretical relevance in social sciences may be based on observation and measurement sensitivity.

The most commonly practised theory in HRM studies is the CTT model. In this model, the degree to which an individual possesses a psychological characteristic is determined by summing that individual's responses to the items of the measurement tool prepared to measure the characteristic. In other words, the raw score of a person is the indicator of the degree to which that person possesses the characteristic (Anil 2008). Reeve (2002) mentions that IRT has a number of advantages over CTT methods in estimating leadership competencies or skills. The application of IRT in survey-based organisational, managerial, leadership and education research has several advantages over classical test theory (Rubio, et al. 2007, Reeve 2002, Zagorsek 2000, Santor & Ramsay 1998).

Further, Santor and Ramsay (1998) explain that the expected score in CTT is computed from the responses to each item, while the IRT estimated score is sensitive to differences among individual response patterns and is a better estimate of the true value on the continuum than the CTT score. The ratings on all variables are grouped and summed to generate a single composite score (Smeenk 2008). The most significant difference between CTT and IRT in the present context concerns the standard error of measurement (Oishi 2005). Whereas the standard error of measurement is assumed to apply to the whole sample in CTT, the standard error of measurement in IRT varies depending on the latent trait score (typically, there is less reliability for those with extreme latent scores).
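The contrast in standard errors can be illustrated numerically. The following is a minimal Python sketch, not the authors' procedure: it assumes the two-parameter logistic form of the graded response model with the scaling constant D = 1.7, and the three items, with their discrimination and location values, are hypothetical. It computes the test information and the resulting standard error at a central and at an extreme trait score.

```python
import math

D = 1.7  # scaling constant (assumed, as is conventional for logistic IRT models)

def boundaries(theta, a, thresholds):
    """Cumulative probabilities of scoring in each category or higher."""
    inner = [1.0 / (1.0 + math.exp(-D * a * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]

def item_information(theta, a, thresholds):
    """Information of one graded item: sum over categories of slope^2 / probability."""
    p_star = boundaries(theta, a, thresholds)
    info = 0.0
    for k in range(len(p_star) - 1):
        p_k = p_star[k] - p_star[k + 1]
        slope = D * a * (p_star[k] * (1 - p_star[k])
                         - p_star[k + 1] * (1 - p_star[k + 1]))
        if p_k > 0:  # guard against numerical underflow at extreme theta
            info += slope ** 2 / p_k
    return info

def standard_error(theta, items):
    """Standard error of the trait estimate: 1 / sqrt(total information)."""
    total = sum(item_information(theta, a, th) for a, th in items)
    return 1.0 / math.sqrt(total)

# Hypothetical four-point items: (discrimination, three location parameters).
items = [(1.2, [-1.0, 0.0, 1.0]),
         (0.8, [-1.5, -0.5, 0.5]),
         (1.5, [-0.5, 0.5, 1.5])]

print(standard_error(0.0, items))  # central trait score
print(standard_error(3.5, items))  # extreme trait score
```

Under these assumed item parameters the standard error is larger at the extreme trait score than at the centre, whereas a CTT analysis would report a single standard error for every respondent.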

Table 1
Comparison of IRT and CTT
Classical Test Theory | Item Response Theory
Measures of precision fixed for all scores | Precision measures vary across scores
Longer scales increase reliability | Shorter, targeted scales can be equally reliable
Test properties are sample dependent | Test properties are sample free
Latent variable level estimated directly from the raw scores | Latent variable level estimated from the raw scores, score patterns and item properties (thresholds and difficulties)
Comparing respondents requires parallel scales | Different scales can be placed on a common metric
Summed scores are on ordinal scale | Scores on interval scale

In CTT, trait scores are obtained by assigning successive integers to the response categories, summing the raw scores on each item, and dividing by the number of items to estimate the trait score of each person. Sekaran (1992) proposed the following simple formula for calculating a trait score under CTT.

$$\bar{x} = \frac{\sum I}{N}$$

where
x̄ = Mean of observed score,
I = Total score on an aspect being measured, and
N = Number of items.
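As an illustration of the CTT scoring rule above, the following minimal Python sketch averages the integer-coded responses of one respondent. The function name and the example responses are hypothetical and are not taken from the study data.

```python
def ctt_trait_score(item_scores):
    """CTT trait score: sum of integer-coded item responses divided by the number of items."""
    if not item_scores:
        raise ValueError("at least one item score is required")
    return sum(item_scores) / len(item_scores)

# Hypothetical respondent answering three items on a scale from 1 = never to 4 = always.
responses = [3, 4, 2]
print(ctt_trait_score(responses))  # 3.0
```

Every item contributes equally to this score, which is the equal-weight assumption for which CTT is criticised.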

IRT was originally developed in order to overcome the problems with CTT. CTT and IRT are widely perceived as representing two very different measurement frameworks. However, few studies have empirically examined the similarities and differences in the parameters estimated using the two frameworks especially in HRM studies. IRT has a great potential for solving many problems in testing and measurement.

IRT is a statistical approach for linking respondents' scores on items to a trait or latent variable. The link is represented by a mathematical equation relating the item scores to the trait. A trait is usually a latent variable measured by several items (Chan 1996). For instance, a managerial performance variable is measured by asking respondents several items about managerial performance in planning, organising, directing and evaluating. This concept is related to the variable concepts used in structural equation modelling (SEM). There are two main types of variable in SEM: measured and latent variables. A measured variable is a variable that can be observed directly and is measurable. Measured variables are also known as observed variables, indicators or manifest variables. A latent variable is a variable that cannot be observed directly and must be inferred from measured variables. Latent variables are implied by the covariance among two or more measured variables. They are also known as factors (i.e., factor analysis), constructs or unobserved variables.

IRT typically has been developed for scaling dichotomous and polytomous data onto an equal interval scale. For dealing with ordinal polytomous data, there are at least three models available. These are partial credit model, rating scale model and graded response model. Polytomous IRT models like the graded response model can handle an unlimited number of score categories for items. Best known of the polytomous response IRT models is GRM developed by Samejima which is a part of IRT (Muraki & Bock 1997). This model is believed to be the best and most robust against skewness compared to other approaches (Chan 1996). The model has been widely used in psychological research to estimate respondent ability based on an interval scale questionnaire. For ordered polytomous data, that is questions with three or more response categories, GRM is most frequently used (Rubio, et al. 2007, Weijters 2006, Zagorsek 2000). Further, for polytomous responses, GRM has fitted the data reasonably well in most studies (Rubio, et al. 2007).

GRM is appropriate when response options for items are sequentially ordered (e.g., interval scale questionnaire) and are most appropriate for attitude and personality measures, but not limited to these domains. The GRM is also a polytomous IRT model which is a materialisation of Thurstone’s method, which was used for measuring attitude by pair comparisons (Hachey 2008). According to the type of scale used and its response format, GRM is often selected. The reasons for this choice instead of others were: 1) GRM was one of the first models for graded polytomous items, 2) it is appropriate for items with different parameters, 3) it is a natural model for rating scales, and 4) more published studies exist on parameterisation over the GRM than any other polytomous models so the conditions for obtaining good estimations are well known (Rubio, et al. 2007).

Samejima fitted a two-parameter logistic function to the probability of obtaining a particular score or a higher score on a rating scale. GRM is based on the logistic function giving the probability that an item response will be observed on the basis of response alternative. The formula for estimating trait scores in GRM is as follows:

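$$P^{*}_{ix}(\theta) = \frac{e^{D a_i (\theta - b_{ix})}}{1 + e^{D a_i (\theta - b_{ix})}}$$

where
$P^{*}_{ix}(\theta)$ = Cumulative boundary function of a trait score,
$\theta$ = Trait score,
$D$ = 1.7 (a constant),
$e$ = 2.71828,
$a_i$ = Item discrimination of item i, and
$b_{ix}$ = Location parameter or extremeness for score x of item i.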

The cumulative boundary for each respondent in the study is based on the way he/she rates items. The trait score ranges from -4 to 4. Items that are rated very low or very high will be treated as unestimated items (outliers) and will be assigned the value 999.000 (Hambleton & Rogers 1989). An item discrimination parameter indicates how well a given item captures the latent trait that it is supposed to measure. This is conceptually equivalent to the item-total score correlation in CTT and the item-factor correlation in factor analysis. The item discrimination parameter differentiates respondents with higher trait scores from those with lower trait scores. Its values usually range from 0 to 2 (Hambleton & Rogers 1989). The steeper the slope, the greater the ability of the item to differentiate between people.

The location parameter indicates the likelihood of passing the item, given different levels of the latent score (Oishi 2005). It is defined for each score X and denotes the point on the θ scale at which the probability of responding at score X or higher is 0.5. It is also called a pseudofactor or extremeness by Brown (2006). Hambleton and Rogers (1989) explained that it is expressed in a z-score metric and usually ranges from -3 to +3, although theoretically its range is unlimited (for instance, for extreme scores). The higher the value of the location parameter, the less likely respondents are to rate items at the maximum of the scale.
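To make the roles of the discrimination and location parameters concrete, the following minimal Python sketch evaluates the two-parameter logistic boundary function with D = 1.7 and converts the boundaries into category probabilities for a four-point item. The function names and all parameter values are hypothetical; this is an illustration of the model form described above, not the estimation routine used in the study.

```python
import math

D = 1.7  # scaling constant given in the text

def boundary_probability(theta, a, b):
    """Probability of responding at a given score or higher (two-parameter logistic)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def category_probabilities(theta, a, thresholds):
    """Probability of each ordered category as the difference of adjacent boundaries.

    thresholds: increasing location parameters, one per boundary
    (a four-point item has three boundaries).
    """
    cumulative = ([1.0]
                  + [boundary_probability(theta, a, b) for b in thresholds]
                  + [0.0])
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]

# Hypothetical item: discrimination 1.2, locations at -1, 0 and 1 on the z-score metric.
probs = category_probabilities(theta=0.0, a=1.2, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Raising a location parameter lowers the probability of reaching that score or higher at a fixed trait score, and raising the discrimination makes the boundary curve steeper around its location.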

Ewing, et al. (2005) argued that the IRT model offers a different perspective on measurement. Magno (2009) compared the CTT and IRT approaches across samples and test forms in chemistry. That study found that 1) IRT estimated parameters do not change across samples, whereas the CTT estimated ones show inconsistencies, and 2) IRT had significantly less measurement error than the CTT approach. A theoretical comparison of different approaches may help researchers decide with which one they feel more comfortable. Given these different ways of assigning the trait score, this study examines the following two hypotheses.

H1: The latent trait scores estimated by CTT and GRM will significantly correlate.

H2: GRM will produce a fitter model in estimating statistical parameters than CTT when an interval scale is employed in the research.


Sample and Site

Based on a small-scale study conducted in Indonesia, four different sample sizes of 113, 226, 339 and 452 observations were generated from four-point interval scales ranging from 1 = never to 4 = always. To maintain the similarity of the data distributions, all samples were generated from a data set whose distribution is normal (see Table 2).

Table 2
One-sample Kolmogorov-Smirnov test for normality
Description | | Job | |
N | | 113 | 113 | 113
Normal parameters | Mean | 4.000 | 4.000 | 4.000
 | Std. deviation | 1.005 | 1.005 | 1.005
Most extreme differences | Absolute | 0.090 | 0.075 | 0.112
 | Positive | 0.087 | 0.075 | 0.112
 | Negative | -0.090 | -0.053 | -0.099
Kolmogorov-Smirnov Z | | 0.959 | 10.193 | 0.793
Asymp. Sig. (2-tailed) | | 0.317 | 0.116 | 0.556
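The quantities reported in Table 2 can be computed as in the following minimal Python sketch. It assumes the reference distribution is a normal distribution with the mean and standard deviation estimated from the sample, and that the Z statistic is the square root of N times the largest absolute difference. The sample values are hypothetical and the function name is illustrative.

```python
import math
from statistics import NormalDist, mean, stdev

def ks_normal(sample):
    """One-sample Kolmogorov-Smirnov differences against a fitted normal distribution."""
    n = len(sample)
    reference = NormalDist(mean(sample), stdev(sample))
    ordered = sorted(sample)
    # Largest amount by which the empirical distribution lies above the normal curve.
    d_plus = max((i + 1) / n - reference.cdf(x) for i, x in enumerate(ordered))
    # Largest amount by which it lies below the normal curve.
    d_minus = max(reference.cdf(x) - i / n for i, x in enumerate(ordered))
    d = max(d_plus, d_minus)
    return {"absolute": d, "positive": d_plus, "negative": -d_minus,
            "z": math.sqrt(n) * d}

# Hypothetical trait scores for twelve respondents.
scores = [2.1, 3.4, 3.9, 4.0, 4.2, 4.4, 4.8, 5.1, 5.3, 5.9, 3.1, 3.7]
print(ks_normal(scores))
```

The sketch stops at the Z statistic; the asymptotic significance value would additionally require the Kolmogorov distribution, which is not implemented here.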

With regard to the normality assumption, Table 3 confirms that even though the sample size increases substantially, the mean and standard deviation values do not change appreciably. Therefore, the characteristics of the distribution remain stable across the sample sizes employed in this study.

Table 3
Descriptive statistics
Variable | N = 113 | | N = 226 | | N = 339 | | N = 452 |
 | Mean | SD | Mean | SD | Mean | SD | Mean | SD
JS_CTT | 3.112 | 0.694 | 3.118 | 0.581 | 3.118 | 0.539 | 3.128 | 0.507
MP_CTT | 2.725 | 0.700 | 2.724 | 0.590 | 2.726 | 0.553 | 2.730 | 0.535
OC_CTT | 2.407 | 0.639 | 2.412 | 0.534 | 2.412 | 0.495 | 2.413 | 0.473
JS_GRM | 4.000 | 1.004 | 4.000 | 1.002 | 4.000 | 1.001 | 4.000 | 1.001
MP_GRM | 4.000 | 1.004 | 4.000 | 1.002 | 4.000 | 1.001 | 4.000 | 1.001
OC_GRM | 4.000 | 1.004 | 4.000 | 1.002 | 4.000 | 1.001 | 4.000 | 1.001

Note: JS_CTT = Job Satisfaction Classical Test Theory, MP_CTT = Managerial Performance Classical Test Theory, OC_CTT = Organisational Commitment Classical Test Theory, JS_GRM = Job Satisfaction Graded Response Model, MP_GRM = Managerial Performance Graded Response Model, and OC_GRM = Organisational Commitment Graded Response Model.


Before applying an IRT model, it is necessary to check whether the latent trait constructs are either strictly unidimensional or, as a practical matter, dominated by a general underlying factor. To test the dimensionality of items, Lautenschlager, et al. (2006) wrote that factor loadings should be at least .40. Inspection of the scree plot for each variable might also be used to assess the dimensionality of the variable. A more appropriate method for assessing the unidimensionality of a test is factor analysis (Hambleton & Rovinelli 1986). A reliability test was conducted to corroborate the consistency of the set of measurements, and an exploratory factor analysis was employed to confirm that the items load on the right components.


A relatively simple model was developed involving the relationship between three variables in HRM research. It consists of two independent variables and one dependent variable. Job satisfaction (JS) and organisational commitment (OC) are treated as the independent variables and managerial performance (MP) is the dependent variable. To simplify parameter estimation for CTT and GRM, only a few items were included in this simulation study. The items with loading values of more than 0.5 were chosen for the further analytical procedures (Hair, et al. 2006).

The short form of the Minnesota Satisfaction Questionnaire (MSQ) adopted by Martin and Roodt (2008) was used to measure job satisfaction. This instrument was chosen for three reasons. First, Dunham and Herman (1975) found that this measure has the highest convergent validity compared with other measures. Second, the MSQ is a more comprehensive measure of specific aspects of job satisfaction. Finally, Scarpello and Campbell (1983) stated that the MSQ has a better ability to predict job satisfaction than other instruments.

The organisational commitment instrument developed by Mowday (1998) was adopted. It comprises eleven items regarding employee commitment, which are elaborated from three characteristics of organisational commitment. These characteristics are (a) strong support of the organisation’s value system, (b) the will to exert considerable effort on behalf of the organisation, and (c) the intention to remain associated with the organisation.

Managerial performance was measured using a self-rating instrument adopted by Leach-López, et al. (2009) from Mahoney, et al. Eight dimensions of performance were employed to measure managerial performance. Govindarajan (1986) stated that the instrument has two advantages compared with other measures: (a) it has satisfactory validity and reliability, and (b) it reveals the dimensions of managerial performance in a more tangible way, thereby eliminating the problems inherent in multidimensional measurement.


Pearson correlation was used to examine whether the latent trait scores estimated by CTT and GRM are correlated. If the correlations are high and the significance values are equal to or less than 0.05, the latent traits from the two measurement models are significantly correlated. Regression and structural equation modelling (SEM) were applied to test the impact of CTT and GRM on model fitness.
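A minimal Python sketch of this correlation check follows. The paired CTT and GRM scores are hypothetical, the function names are illustrative, and the significance test shown is the usual t statistic with n - 2 degrees of freedom rather than output from the software used in the study.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two lists of equal length with at least 3 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r against zero; compare with a t table on n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical trait scores for the same seven respondents under each model.
ctt_scores = [2.00, 2.33, 2.67, 3.00, 3.33, 3.67, 4.00]
grm_scores = [-1.4, -0.9, -0.5, 0.1, 0.4, 1.1, 1.3]

r = pearson_r(ctt_scores, grm_scores)
print(r, t_statistic(r, len(ctt_scores)))
```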

Two statistical approaches, regression and structural equation modelling, are used to examine regression coefficients for each measurement model. Regarding regression, Claudia and Raju (2002) noted that regression coefficients link the items to their latent constructs and that equivalence of the intercepts should be tested as well as the regression coefficients in examining measurement equivalence. In this study, regression parameters such as the correlation (r), R square, adjusted R square and F value are used as a basis for decision making. The higher the values, the fitter the model. Secondly, SEM is used to compare fitness indices between the two models. Because the Chi square test is questionable owing to its sensitivity to sample size and test length (Schlessman 2009), this study employs not only Chi square but also the goodness of fit index (GFI), the adjusted goodness of fit index (AGFI) and the root mean square error of approximation (RMSEA) as indicators of model fitness (Hooper, et al. 2008, Breaux 2004, Joreskog & Sorbom 1989). In general, the smaller the Chi square and RMSEA values and the greater the GFI and AGFI values, the fitter the model.
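The regression parameters used as decision criteria are related to one another through the sample size and the number of predictors. The following minimal Python sketch shows the standard relationships; the R square value is hypothetical, while n = 113 and two predictors (JS and OC) mirror the smallest sample and the model described in this study.

```python
import math

def multiple_r(r_square):
    """Multiple correlation: square root of R square."""
    return math.sqrt(r_square)

def adjusted_r_square(r_square, n, k):
    """R square penalised for the number of predictors k given n observations."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

def f_value(r_square, n, k):
    """Overall F statistic with k and n - k - 1 degrees of freedom."""
    return (r_square / k) / ((1 - r_square) / (n - k - 1))

r_square = 0.25  # hypothetical value, not a result from the study
n, k = 113, 2
print(multiple_r(r_square))               # 0.5
print(adjusted_r_square(r_square, n, k))  # about 0.236
print(f_value(r_square, n, k))            # about 18.33
```

For a fixed R square, the F value grows with the sample size, which is one reason comparisons between measurement models are made within the same sample size.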


Preliminary Analyses

Table 4 indicates that the total variance explained by the three factors is more than 60 per cent and that all factor loading values are more than 0.5, as suggested by Hair, et al. (2006). The Kaiser-Meyer-Olkin Measure of Sampling Adequacy (KMO MSA) is greater than 0.50 for every sample size, so exploratory factor analysis can proceed. The probability associated with the Bartlett test is p < 0.001 in each case, below the required level of significance (0.05). All the Cronbach Alpha coefficients are higher than 0.60, suggesting that the instruments are reliable (Nunnally 1978, Hair, et al. 2006).

Table 4
Results of reliability and factor analysis
Item | N = 113 | | | N = 226 | | | N = 339 | | | N = 452 | |
 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | 3
Job satisfaction
js1 | -0.020 | 0.812 | 0.158 | -0.014 | 0.808 | 0.106 | -0.012 | 0.799 | 0.080 | -0.005 | 0.793 | 0.059
js2 | 0.112 | 0.853 | 0.111 | 0.129 | 0.855 | 0.125 | 0.142 | 0.858 | 0.137 | 0.147 | 0.861 | 0.138
js3 | 0.179 | 0.860 | 0.089 | 0.184 | 0.848 | 0.065 | 0.189 | 0.842 | 0.044 | 0.190 | 0.838 | 0.032
Managerial performance
mp1 | 0.143 | 0.175 | 0.726 | 0.162 | 0.152 | 0.729 | 0.166 | 0.138 | 0.732 | 0.171 | 0.127 | 0.733
mp2 | 0.072 | 0.086 | 0.733 | 0.023 | 0.141 | 0.733 | -0.010 | 0.174 | 0.730 | -0.031 | 0.192 | 0.734
mp3 | 0.140 | 0.078 | 0.816 | 0.143 | -0.011 | 0.819 | 0.142 | -0.060 | 0.818 | 0.139 | -0.092 | 0.817
Organisational commitment
oc1 | 0.832 | 0.054 | 0.031 | 0.836 | 0.097 | 0.007 | 0.837 | 0.121 | -0.001 | 0.839 | 0.140 | -0.001
oc2 | 0.826 | 0.031 | 0.199 | 0.843 | 0.023 | 0.172 | 0.851 | 0.017 | 0.161 | 0.856 | 0.015 | 0.160
oc3 | 0.761 | 0.221 | 0.022 | 0.758 | 0.231 | 0.003 | 0.758 | 0.235 | -0.006 | 0.763 | 0.231 | -0.009
oc4 | 0.758 | 0.019 | 0.227 | 0.714 | -0.001 | 0.321 | 0.690 | -0.010 | 0.373 | 0.672 | -0.015 | 0.410
Alpha | 0.818 | 0.671 | 0.818 | 0.781 | 0.629 | 0.784 | 0.761 | 0.618 | 0.768 | 0.743 | 0.613 | 0.761
Eigen value | 3.469 | 1.866 | 1.373 | 3.416 | 1.813 | 1.476 | 3.384 | 1.803 | 1.528 | 3.377 | 1.818 | 1.551
Variance (%) | 34.686 | 18.663 | 13.735 | 34.160 | 18.128 | 14.758 | 33.844 | 18.028 | 15.275 | 33.771 | 18.181 | 15.514
Total var. explained (%) | 34.686 | 53.349 | 67.084 | 34.160 | 52.288 | 67.046 | 33.844 | 51.872 | 67.147 | 33.771 | 51.952 | 67.466
KMO-MSA | 0.733 | | | 0.721 | | | 0.710 | | | 0.701 | |
Bartlett’s Test | 378.512 | | | 766.257 | | | 1,172.947 | | | 1,609.810 | |
df. | 45 | | | 45 | | | 45 | | | 45 | |
Sig. | 0.000 | | | 0.000 | | | 0.000 | | | 0.000 | |
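The Alpha coefficients in Table 4 follow the usual Cronbach formula, which compares the sum of the item variances with the variance of the summed scale. The following is a minimal Python sketch using hypothetical responses of six respondents to three job satisfaction items; it is not the study data.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of per-item score lists over the same respondents."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha requires at least two items")
    totals = [sum(scores) for scores in zip(*items)]  # summed scale score per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical four-point responses (rows are items, columns are respondents).
js1 = [3, 4, 2, 4, 3, 1]
js2 = [3, 4, 2, 3, 3, 2]
js3 = [4, 4, 1, 3, 2, 2]

alpha = cronbach_alpha([js1, js2, js3])
print(round(alpha, 3))
```

A value above the 0.60 threshold cited in the text would be read as acceptable reliability for the scale.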

The Model Relationship

Table 5 below indicates that the correlation between CTT and GRM is very strong (r > 0.90, p = 0.000). This finding confirms that CTT and GRM produce almost identical latent trait scores. In detail, different sample sizes do not have a significant impact on the correlations. This is consistent with the findings of Reise and Yu (1990), who found that sample size had little impact on estimated person trait level. Nevertheless, this finding contradicts previous research by Simon (2008) and Duhachek and Iacobucci (2004), who revealed that sample size has larger effects on ability estimation, bias for Alpha and standard error. Hence, the first hypothesis, stating that the latent trait scores estimated by CTT and GRM will significantly correlate, is statistically supported.

Table 5
Correlation matrix between CTT and GRM
Sample size | JS | OC | MP
113 | 0.976* | 0.997* | 0.994*
226 | 0.973* | 0.996* | 0.995*
339 | 0.976* | 0.995* | 0.996*
452 | 0.978* | 0.994* | 0.996*
Average | 0.976* | 0.996* | 0.995*

Note. * p = 0.000

This strong correlation between CTT and GRM is consistent with previous findings (Progar, et al. 2008, Wiberg 2004, Stage 2003, Crikickci & Rtali 2002, Fan 1998). These studies corroborated that the CTT and IRT parameters were very comparable and that the CTT and IRT item parameters showed similar invariance properties when estimated across different groups of participants. Recently, Lin (2008) provided related evidence that the CTT approach performed comparably with the IRT approaches in assembling parallel tests. In general, several investigations failed to discredit the CTT framework with regard to its alleged inability to produce invariant item statistics (Progar, et al. 2008, Crikickci & Rtali 2002).

The Fitness Indices Comparison

Statistical parameters generated from regression analysis and structural equation modelling were used to determine how comparable GRM is with CTT. First, regression analysis was run to generate the values of the correlation (r), R square, adjusted R square and F. These results are compared to determine which model shows the best fit parameters.

Based on the regression analysis, CTT shows higher statistical parameters (r, R square, adjusted R square and F value) than GRM when the sample size is small (see Figure 1). Yet the figure shows that the statistical parameters of the two models become somewhat more comparable as the sample size increases. This finding is in line with the previous study by Courville (2004), which found that in large-scale samples the CTT-based and IRT-based estimates were very comparable. In conclusion, regardless of sample size, CTT produces a fitter model than GRM when regression analysis is employed. Therefore, the second hypothesis, stating that GRM will produce a fitter model in estimating statistical parameters than CTT, is unsupported.

Figure 1
Comparison of Regression Parameters

Secondly, structural equation model analysis was conducted to provide model fitness indices based on the CTT and GRM approaches. The fitness comparison between the two models is based on the values of Chi square, GFI, AGFI and RMSEA.
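Of these indices, RMSEA can be computed directly from the Chi square value, its degrees of freedom and the sample size. The following minimal Python sketch uses one commonly cited form of the point estimate; some SEM packages divide by N rather than N - 1, so this should be read as an assumption rather than the exact routine behind the reported values. The Chi square and degrees of freedom in the example are hypothetical.

```python
import math

def rmsea(chi_square, df, n):
    """RMSEA point estimate: sqrt(max(chi_square - df, 0) / (df * (n - 1)))."""
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n greater than 1")
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

# Hypothetical model with Chi square = 16.0 on 4 degrees of freedom, N = 113.
print(round(rmsea(16.0, 4, 113), 4))  # 0.1637
# The same Chi square with a larger sample gives a smaller RMSEA.
print(round(rmsea(16.0, 4, 452), 4))
```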

Figure 2 and Figure 3 compare the fit indices of CTT and GRM for the different sample sizes. When SEM is applied, CTT works better than GRM with a small sample size (N = 113), as the CTT fitness indices are fitter than those of GRM. Therefore, it can be roughly inferred that with a small sample size, CTT will perform better in estimating statistical parameters than GRM. In contrast, when the sample size is bigger, GRM performs better than CTT. In this study, the results suggest that the fit indices of GRM are better than those of CTT when the sample size is 226 or bigger. In this case, GRM overcomes the shortcomings of traditional summative scaling (i.e., CTT) and yields valuable information about the strengths and weaknesses of our measures. Here, GRM proves to be more consistent with the data (Claudia & Raju 2002).

Figure 2
The Comparison of Chi square between CTT and GRM

Figure 3
The Comparison of GFI, AGFI and RMSEA between CTT and GRM

Sung and Kang (2006) mention several parameter estimation models that can be applied to polytomous data, namely the rating scale model (RSM), the partial credit model (PCM), the generalised partial credit model (GPCM), and the graded response model (GRM). In order to obtain the benefits of IRT, it is important to choose an appropriate estimation model which fits the data well (Sung & Kang 2006). Progar, et al. (2008) stressed that parameters of GRM are in general empirically superior to CTT parameters, but only if the appropriate IRT model is used for modelling the data. To check the conclusion drawn from the graphs, Table 6 presents the t-test results for the fitness indices of CTT and GRM.

Table 6
Fit indices comparison between CTT and GRM
Fit indices | Model | Mean | Std. deviation | Std. error mean | t | Sig.
Chi-Square | CTT | 16.686 | 10.360 | 5.180 | 0.118 | 0.910
 | GRM | 15.929 | 7.496 | 3.748 | |
GFI | CTT | 0.965 | 0.005 | 0.002 | 0.396 | 0.706
 | GRM | 0.963 | 0.008 | 0.004 | |
AGFI | CTT | 0.789 | 0.026 | 0.013 | 0.456 | 0.665
 | GRM | 0.777 | 0.045 | 0.022 | |
RMSEA | CTT | 0.227 | 0.023 | 0.011 | -0.461 | 0.661
 | GRM | 0.234 | 0.022 | 0.011 | |
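The t values in Table 6 can be approximately reproduced from the reported means and standard deviations. The following minimal Python sketch assumes an independent-samples t-test with pooled variance and four observations per model (one per sample size), which is consistent with the reported standard errors but is not stated explicitly in the text. Only the t statistic is computed; the significance value would require the t distribution.

```python
import math

def independent_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Pooled-variance t statistic and degrees of freedom from summary statistics."""
    df = n_1 + n_2 - 2
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n_1 + 1 / n_2))
    return (mean_1 - mean_2) / standard_error, df

# Chi-Square row of Table 6: CTT versus GRM, assuming n = 4 fit values per model.
t, df = independent_t(16.686, 10.360, 4, 15.929, 7.496, 4)
print(round(t, 3), df)  # 0.118 6
```

Applying the same function to the other rows gives values close to, but not identical with, the tabulated t values, because the tabulated means and standard deviations are rounded to three decimals.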

In contrast with the rough conclusion based only on the graphical presentation, Table 6 shows that there is no significant difference between CTT and GRM on any fitness index (Sig. > 0.05). The general answer to the research question is that the CTT approach performed as well as or better than the IRT approach. This finding is consistent with previous research in which the item difficulty distribution of a test was specified by CTT and IRT (Lin 2008). In other studies, item parameters estimated within the IRT approach were not superior to the statistics derived within CTT across groups (Stage 2003, Morales 2009). Although classical test theory (CTT) and item response theory (IRT) methods differ in many ways, the results of analyses using the two methods do not show correspondingly large differences. The overall conclusion is that the agreement between the results of the regression and SEM analyses within the two frameworks, CTT and IRT, was reasonably good (Stage 2003). This study encourages researchers to view the IRT model as a parallel approach to the CTT measurement model. Consequently, GRM provides an alternative to the standard model of testing and is a useful supplement to CTT, because it extends the analytic process of measurement before item scores are condensed into new trait scores.

Both models give valuable information and should be included in an analysis of the theory test (Wiberg 2004). Hence, CTT can be seen as an abridged version of IRT modelling (Ewing, et al. 2005).


This research provides evidence on a debated measurement issue regarding the use of GRM and CTT in HRM research. The simulation study yielded three fundamental pieces of evidence on the relationship and model fitness comparison of CTT and GRM. First, the latent trait scores estimated by CTT and GRM are strongly correlated (r > 0.90, p = 0.000). Secondly, CTT yields a fitter model than GRM in regression analysis. Thirdly, GRM produces a fitter model than CTT in estimating statistical parameters of interval scale data when SEM is employed and the sample size is sufficiently large.

Despite the theoretical advantages offered by IRT over the older CTT-based methods of assessing instrument performance, it is rarely used in HRM research. Zagorsek (2000) has identified several reasons behind this phenomenon. First, IRT was developed within the framework of educational testing, so most of the literature and terminology is oriented towards that discipline. Secondly, a major limitation of IRT is the complexity of its mathematical models. Most researchers have been trained in classical test theory and are comfortable with reporting statistics such as summed scale scores, proportions correct and Cronbach alpha values. Thirdly, beyond the mathematical formulas, there are the complexities of the numerous IRT models themselves: under what circumstances IRT is appropriate and which model to choose. Fourthly, there is not even a consensus among researchers as to the definition of measurement and whether IRT models fit that definition. Finally, Reeve (2002) added that the numerous IRT software packages on the market are not user friendly and often yield different results.

Despite these limitations, practical applications of IRT in the field of human resource research should not be neglected. Only accurate instruments that correctly measure a phenomenon can enhance our understanding of the phenomenon. IRT does not make existing psychometric techniques obsolete (Zagorsek 2000). Consequently, implementing IRT and CTT in parallel in a study may allow us not only to obtain deeper insight into the nature and properties of a measurement instrument but also to draw on a variety of statistical tools in our research designs.


The study findings suggest that researchers in HRM should choose carefully the measurement model that best fits their research designs. Moreover, to investigate whether this result can be generalised to other conditions, future research should examine the possible impact of different sample sizes, scale lengths, data distributions, numbers of raters, test lengths and measurement models on fitness indices, as the number of parameters employed in the model may also affect fitness indices.


Sukirno is a lecturer in the Accounting Department at Yogyakarta State University and a PhD candidate at the School of Management, Asian Institute of Technology, Thailand. His research interests are in human resources management, educational measurement, financial accounting and education.


Sununta Siengthai is an associate professor in human resource management at the School of Management, Asian Institute of Technology. She received her MA and PhD degrees from the University of Illinois at Urbana-Champaign, both in the area of labor and industrial relations. Her research interests are in human resources management, industrial relations, measurement and international human resources management.



Anil, D. (2008). The prediction of item parameters based on classical theory and latent trait theory. Journal of Education, 34, 75-85.

Breaux, K. T. (2004). The effect of program commitment on the degree of participative congruence and managerial performance in a budgeting setting. Dissertation, Nicholls State University.

Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York: The Guilford Press.

Claudia P., & Raju, N. S. (2002). A comparison of measurement equivalence methods based on confirmatory factor analysis and item response theory. Paper presented at National Council on Measurement in Education (NCME) Annual Meeting, New Orleans, Louisiana.

Chan, J. C. (1996). Estimating the latent trait from likert type data: a comparison of factor analysis, item response theory, and multidimensional scaling. Dissertation, The University of Texas.

Courville, T. G. (2004). An empirical comparison of item response theory and classical test theory item/person statistics. Dissertation, Texas A&M University.

Crikickci, N., & Rtali, D. (2002). A study of raven standard progressive matrices test’s item measures under classic and item response models: an empirical comparison. Journal of Faculty of Educational Sciences, 35(1/2), 71-79.

Duhachek, A., & Iacobucci, D. (2004). Alpha’s standard error (ASE): An accurate and precise confidence interval estimate. Journal of Applied Psychology, 89(5), 792-808.

Dunham, R. B., & Herman, J. B. (1975). Development of a female faces scale for measuring job satisfaction. Journal of Applied Psychology, 60(5), 629-631.

Edwards, M. C. (2009). An introduction to item response theory using the need for cognition scale. Social and Personality Psychology Compass, 3(4), 507-529.

Ewing, M. T., Salzberger, T., & Sinkovics, R. R. (2005). An alternate approach to assessing cross-cultural measurement equivalence in advertising research. Journal of Advertising, 34(1), 17-36.

Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person parameters. Educational and Psychological Measurement, 58(3), 357-381.

Govindarajan, V. (1986). Decentralisation, strategy, and effectiveness of strategic business units in multibusiness organizations. Academy of Management Review, 11(4), 844-856.

Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis. Singapore: Pearson Prentice Hall Inc.

Hambleton, R. K., & Rovinelli, R. J. (1986). Assessing the dimensionality of a set of test items. Applied Psychological Measurement, 10(3), 287-302.

Hambleton, R. K., & Rogers, H. J. (1989). Detecting potentially biased test items: Comparison of IRT area and Mantel-Haenszel methods. Applied Measurement in Education, 2(4), 313-334.

Hooper, D., Coughlan, J., & Mullen, M. R. (2008). Structural equation modeling: Guidelines for determining model fit. The Electronic Journal of Business Research Methods, 6(1), 53-60.

Joreskog, K., & Sorbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Hillsdale, New Jersey: Scientific Software International.

Lautenschlager, G. J., Meade, A. W., & Kim, S. H. (2006). Cautions regarding sample characteristics when using the graded response model. Paper presented at the 21st Annual Conference of the Society for Industrial and Organizational Psychology, Dallas, Texas.

Leach-López, M. A., Stammerjohan, W. W., & Lee, K. S. (2009). Budget participation and job performance of South Korean managers mediated by job satisfaction and job relevant information. Management Research News, 32(3), 220-238.

Lin, C-J. (2008). Comparisons between classical test theory and item response theory in automated assembly of parallel test forms. Journal of Technology, Learning, and Assessment, 6(8), 1-41.

Magno, C. (2009). Demonstrating the difference between classical test theory and item response theory using derived test data. The International Journal of Educational and Psychological Assessment, 1(1), 1-11.

Martin, A., & Roodt, G. (2008). Perceptions of organisational commitment, job satisfaction and turnover intentions in a post-merger South African tertiary institution. Journal of Industrial Psychology, 34(1), 23-31.

McBride, N. L. (2001). An item response theory analysis of the scales from the international personality item pool and the neo personality inventory-revised. Thesis, Virginia Polytechnic Institute and State University.

Mitchell, R. J. (1993). Path analysis: Pollination. In Design and analysis of ecological experiments. New York: Chapman and Hall, Inc.

Morales, R. A. (2009). Evaluation of mathematics achievement test: A comparison between CTT and IRT. The International Journal of Educational and Psychological Assessment, 1(1), 19-26.

Mowday, R.T. (1998). Reflections on the study and relevance of organisational commitment. Human Resource Management Review, 8(4), 387-401.

Muraki, E., & Bock, D. R. (1997). PARSCALE: IRT item analysis and test scoring for rating-scale data [Computer software]. Chicago: Scientific Software.

Nunnally, J.C. (1978). Psychometric theory. New York: McGraw Hill.

Oishi, S. (2006). The concept of life satisfaction across cultures: An IRT analysis. Journal of Research in Personality, 40(4), 411-423.

Progar, S., Socan, G., & Slovejija, M. P. (2008). An empirical comparison of item response theory and classical test theory. Horizons of Psychology, 17(3), 5-24.

Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.

Reeve, B. B. (2002). An introduction to modern measurement theory. Bethesda, Maryland: National Cancer Institute.

Rubio, V. J., Aguado, D., Hontangas, P. M., & Hernández, J. (2007). Psychometric properties of an emotional adjustment measure: An application of the graded response model. European Journal of Psychological Assessment, 23(1), 39-46.

Santor, D. A., & Ramsay, J. O. (1998). Progress in the technology of measurement: Applications of item response models. Psychological Assessment, 10, 345-359.

Scarpello, V., & Campbell, J. P. (1983). Job satisfaction: Are all the parts there? Personnel Psychology, 36(3), 577-600.

Schlessman, B. R. (2009). Type I error rates and power estimates for multiple item response theory fit indices. Dissertation, Wright State University.

Sekaran, U. (1992). Research methods for business: A skill building approach (2nd ed.). New York: John Wiley and Sons Inc.

Simon, M. K. (2008). Comparison of concurrent and separate multidimensional IRT of item parameters. Thesis, University of Minnesota.

Smeenk, S., Teelken, C., Eisinga, R., & Doorewaard, H. (2008). An international comparison of the effects of HRM practices and organizational commitment on quality of job performances among European university employees. Higher Education Policy, 21, 323-344.

Stage, C. (2003). Classical test theory or item response theory: The Swedish experience. Santiago: Centro de Estudios Públicos.

Sung, H. J., & Kang, T. (2006). Choosing a polytomous IRT model using Bayesian model selection methods. Paper presented at National Council on Measurement in Education Annual Meeting, San Francisco.

Weijters, B. (2006). Response styles in consumer research. Dissertation, Ghent University.

Wiberg, M. (2004). Classical test theory vs. item response theory. An evaluation of the theory test in the Swedish driving-license test. Santiago: Centro de Estudios Públicos.

Zagoršek, H. (2000). Using item response theory to analyze properties of the leadership practices inventory. Working paper No. 147, University of Ljubljana.