# RESEARCH AND PRACTICE IN HUMAN RESOURCE MANAGEMENT

Sarkar, A., Mukhopadhyay, A. R. & Ghosh, S. K. (2011). Comparison of Performance Appraisal Score: A Modified Methodology, Research and Practice in Human Resource Management, 19(2), 92-100.

## Comparison of Performance Appraisal Score: A Modified Methodology

### Abstract

The performance of employees is evaluated annually on a numeric scale to decide promotions and annual increments. Scores are generally normalised so that the performance of individuals belonging to different functions in an organisation can be compared. However, in many situations score normalisation stirs up controversy instead of defusing it. The issues pertaining to the normalisation of scores are discussed and a modified methodology is proposed. The methodology is intended to contribute to the formulation of human resource management policies and their effective execution in contemporary organisations.

### Introduction

In today’s competitive environment employees are the biggest assets of any organisation. However, if they are not evaluated and motivated properly, their effectiveness and efficiency may decline and the asset may turn into a liability. Hence, performance appraisal is an important activity for an organisation: it is one way through which the efforts of employees can be aligned with the aims of the organisation and the employees can be motivated and supported (Desimone, Werner & Harris 2002). Since the economy at present depends heavily on the service industries, which are principally governed by human resources, the importance of an appropriate performance appraisal process is beyond doubt.

Performance appraisal (Murphy & Joyce 2004) may be defined as a structured periodic (generally annual) review of employees by management through various formats. These formats include, but are not limited to, interviews by seniors, fulfilment of agreed targets, and assessment of knowledge through examination. The scores obtained in the various formats are combined after giving appropriate weightage to each factor, and the employees are ranked on the weighted score. This rank is used to determine reward outcomes (i.e., promotions and/or increments). Usually the top ranked employees get the majority of the available merit pay increases, bonuses, and promotions.

On the surface the performance appraisal system appears to be very simple within a function or vertical. However, a problem arises when the scores of various functions are combined and used to compare the performance of employees across functions. In many organisations, it is reported that a particular function gets more rewards than others. This effect could result from differences between the examiners who provide the scores, or from the degree of difficulty of the questions/procedures used for assessment. Such an observation casts serious doubt on an appraisal system and creates controversy within an organisation. In order to avoid the controversy and arrive at an impartial decision, the practice followed by most Human Resource (HR) functions is score normalisation. There are two major categories of normalisation methods: linear and nonlinear (Liping, Yuntao & Yishan 2009). Generally, in performance evaluation, the nonlinear normalisation method is used, which is calculated as in equation 1, where the numerator is the deviation of an individual score from the mean and ‘s’ is the standard deviation.

${Z}_{i}=\frac{{x}_{i}-\overline{X}}{s}$ (1)

As a result, all variables in the data set have equal means (0) and standard deviations (1), but they have different ranges. The basic assumption made here is that the distribution of average, good and excellent performers is identical across functions. Hence, after normalisation, the best or worst 10 per cent can be chosen. Normalisation works well when the scores are in fact normally distributed, but this assumption is generally not verified. When the assumption is incorrect, the normalisation process adds confusion instead of removing it. For example, 50 per cent of the employees will be above average (high performers) when the distribution is normal, whereas only 37 per cent will be above average if the distribution is exponential. This leads to a discrepancy: 50 per cent of the employees are rewarded in one function while only 37 per cent are rewarded in another. As a result, the score normalisation method is likely to increase the controversy. Hence, it is not surprising that many people criticise it as the ‘Dead Man’s Curve’ (Meisler 2003) or the ‘Forced Ranking System’ (Donaldson 2003). In this paper the authors discuss these issues in the context of a large scale organisation and suggest an approach to resolve them.
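The 50 per cent versus 37 per cent contrast can be checked numerically. The following is a minimal sketch, not part of the original study; the variable names are illustrative, and the exponential result uses the standard fact that the share above the mean is exp(-1) whatever the mean is.

```python
import math
from statistics import NormalDist

# Share of a population scoring above its own mean under two shapes.
# Normal: symmetric about the mean, so half of the population lies above it.
normal_above = 1 - NormalDist(mu=0, sigma=1).cdf(0)

# Exponential with mean m: P(X > m) = exp(-m / m) = exp(-1), whatever m is.
exponential_above = math.exp(-1)

print(f"normal: {normal_above:.1%}, exponential: {exponential_above:.1%}")
```

The gap between the two shares is what makes a common "above average" rule reward different proportions of employees in different functions.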

### Nomenclatures

#### Normal Distribution

The normal distribution curve is used as a tool for measuring human capabilities, a practice pioneered by Jack Welch, the former Chief Executive Officer of General Electric. When the distribution curve is drawn it looks like a bell, and it is popularly known as the bell shaped curve. This bell shaped curve is symmetric about its mean. The normal distribution is the most common statistical distribution because approximate normality arises naturally in many physical, biological, and social measurement situations. Many statistical analyses require that the data come from normally distributed populations.

The mean (μ) and the standard deviation (s) are the two parameters that define the normal distribution. The mean is the peak or centre of the bell shaped curve. The standard deviation determines the spread in the data. Approximately 68.27 per cent of observations are within ± 1 standard deviation of the mean; 95.45 per cent are within ± 2 standard deviations of the mean; and 99.73 per cent are within ± 3 standard deviations of the mean (Goon, Gupta & Dasgupta 1998). These properties of the normal distribution are presented in Figure 1.
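The coverage percentages for ± 1, ± 2 and ± 3 standard deviations follow directly from the standard normal cumulative distribution function. A short sketch, assuming only Python's standard library; the helper name `coverage` is illustrative:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def coverage(k: float) -> float:
    """Probability that a normal observation lies within k SDs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/- {k} SD: {coverage(k):.2%}")
```

Because the result depends only on k, the same percentages hold for any mean and standard deviation.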

#### Score Normalisation or Z score

Because functions are appraised in different ways, the score pattern of one function is not the same as that of another. The appraisal scores of employees therefore differ considerably from one function to another in both spread (variability) and location (centring). The process of balancing these differences is called ‘normalisation’.

Converting an observation to a Z value is called normalisation. To normalise an observation in a population, subtract the population mean from the observation of interest and divide the result by the population standard deviation. The result of these operations is the Z value associated with the observation of interest, as shown in equation 2.

${Z}_{i}=\frac{{x}_{i}-\overline{X}}{s}$ (2)

If Xi follows a normal distribution with mean (μ) and standard deviation (s), then Zi also follows a normal distribution with mean (0) and standard deviation (1). This makes the Z scores of various functions comparable.

#### Issues in Normalisation

1. Generally the normal distribution curve is forcibly fitted to the data, and the marks are normalised on that basis to obtain the Z score. However, if the distribution fit fails a statistical test, the Z score calculation and its interpretation become meaningless.
2. The existence of outlier(s) makes the Z score unreliable. Even though the Z score can still be calculated when outliers exist, its distribution will differ from the theoretical Z distribution. Consequently, a comparison based on Z will be incorrect.

#### Outlier, Outlier Identification and its Effect

In statistics, an outlier is an observation that is numerically distant from the rest of the data. Grubbs (1969) defined an outlier as an outlying observation that appears to deviate markedly from the other members of the sample. For instance, suppose the scores of six executives are 9, 10, 10, 10, 11, and 50. The average is 16.7 and the standard deviation is 16.34. The score 50 appears to be quite different from the other observations. If the observation 50 is not considered, the average and standard deviation become 10 and 0.71, respectively. The original Z scores are -0.47, -0.41, -0.41, -0.41, -0.35 and 2.04. On further examination, all the Z scores except one (17 per cent of the observations) are negative, which is a clear violation of the normal distribution, so any comparison based on these scores will be incorrect. After removal of the outlier the Z scores become -1.41, 0, 0, 0 and 1.41, which are consistent with a normal distribution and facilitate comparison.
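The six-executive example can be reproduced in a few lines. This is a sketch using the sample standard deviation (divisor n - 1), which is the convention that matches the figures quoted above; the function name `z_scores` is illustrative.

```python
from statistics import mean, stdev

def z_scores(scores):
    """Standardise scores using the sample mean and sample standard deviation."""
    centre, spread = mean(scores), stdev(scores)
    return [(x - centre) / spread for x in scores]

with_outlier = [9, 10, 10, 10, 11, 50]
without_outlier = [9, 10, 10, 10, 11]

print([round(z, 2) for z in z_scores(with_outlier)])
print([round(z, 2) for z in z_scores(without_outlier)])
```

A single extreme score inflates both the mean and the standard deviation, which is why every other candidate ends up with a negative Z score in the first list.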

Clearly, it is necessary to identify the existence of outliers in a set of data. Outliers in the appraisal scores within a function may be due to the existence of an extraordinary performer or an underperformer. These extreme situations need to be identified and remedial action taken.

A key question is how to find an outlier in a set of observations. Various approaches exist, but the most common one is carried out graphically using the Box Plot (Tukey 1977). In a box plot any observation lying outside the following interval is termed an outlier.

$\left[{Q}_{1}-k\left({Q}_{3}-{Q}_{1}\right),\ {Q}_{3}+k\left({Q}_{3}-{Q}_{1}\right)\right]$

Q1 and Q3 are the first and third quartiles, respectively, and the value of k is generally taken as 1.5. However, the box plot does not test the statistical significance of an outlier.
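The box plot rule can be sketched as follows. Quartile conventions differ between software packages; this example assumes the default ('exclusive') method of Python's `statistics.quantiles`, and the function name `tukey_outliers` is illustrative. The data are the six scores from the example earlier in this section.

```python
from statistics import quantiles

def tukey_outliers(scores, k=1.5):
    """Return the scores lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(scores, n=4)  # first, second and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in scores if x < lower or x > upper]

print(tukey_outliers([9, 10, 10, 10, 11, 50]))
```

With very small samples the quartiles themselves are unstable, so the flagged points should be treated as candidates for inspection, not as confirmed outliers.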

The Grubbs test is based on the assumption of normality. That is, one should first verify that the data can be reasonably approximated by a normal distribution before applying the test (NIST 2003). The test detects one outlier at a time: the outlier is expunged from the data set and the test is repeated until no outliers are detected. However, multiple iterations change the probabilities of detection, and the test should not be used for sample sizes of six or less since it frequently tags most of the points as outliers. Outlier detection is provided in many standard statistical software packages and can also be carried out with online tools.
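One pass of the Grubbs test can be sketched with the standard library alone, provided the Student's t critical value is looked up separately. The ten scores below are hypothetical, the function names are illustrative, and `T_CRIT` is a table value for a two-sided test at the 5 per cent level with n = 10; none of these come from the study data.

```python
import math
from statistics import mean, stdev

def grubbs_statistic(scores):
    """Return (G, suspect): the largest absolute deviation from the mean,
    in sample-SD units, and the score that attains it."""
    centre, spread = mean(scores), stdev(scores)
    suspect = max(scores, key=lambda x: abs(x - centre))
    return abs(suspect - centre) / spread, suspect

def grubbs_critical(n, t_crit):
    """Two-sided critical value of G. t_crit is the upper alpha/(2n) point
    of Student's t with n - 2 degrees of freedom (taken from a table)."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))

scores = [52, 55, 58, 60, 61, 63, 64, 66, 68, 97]  # hypothetical marks
T_CRIT = 3.833  # table value: upper 0.05 / (2 * 10) point, 8 degrees of freedom

g, suspect = grubbs_statistic(scores)
critical = grubbs_critical(len(scores), T_CRIT)
print(suspect, round(g, 3), round(critical, 3), g > critical)
```

If G exceeds the critical value, the suspect score is removed and the pass is repeated on the remaining data, with a new critical value for the reduced sample size.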

#### Distribution Fitting

For a comparison to lead to a valid decision, it is essential that the scores fit a theoretical distribution. Generally, it is assumed that the distribution is normal and no test is conducted. However, if the distribution is different, then the proportions of good, average and poor performers within a function will be different. For example, 50 per cent of the observations will fall below the mean when the distribution is normal. However, the percentage becomes 63.2 per cent if the distribution is exponential.

Generally, it is assumed that the distribution of scores will be normal after removing the outliers. In case the normal distribution assumption fails, one of the following approaches can be used.

1. Transform the data and fit it for normal distribution.
2. Fit other distributional models.

Statistical software provides ‘Normality Test’ options for testing whether the observed scores are normally distributed. The test provides a probability plot and a goodness of fit test, and the distribution is selected based on the p value. Generally, in the normality test, if the p value is more than 0.05, it can be concluded that the observed scores fit the normal distribution reasonably well.

### Methodology

#### Sample and Site

As a part of the new promotion policy, the officers/candidates of different functions in an Indian organisation undergo a written technical test, marked out of 100, in their functional areas. The scores obtained in the test carry 30 per cent weightage for promotion. The test papers are designed by different subject matter experts and, therefore, their difficulty level varies. The difficulty level of a test paper depends on the type of question (subjective, objective or multiple choice) as well as on the time allocated to answer a question. Also, the evaluation of answers to subjective questions depends on the evaluators, and as a result scores vary from one evaluator to another. Attempts have not been made in this paper to standardise these issues. Hence the authors refrained from measuring the difficulty level associated with the scores. The scores obtained by 544 candidates representing 10 functions under this evaluation method have been collected.

#### Procedure

The present practice of comparison followed by many organisations is to combine all the scores of the various functions, irrespective of outliers, and calculate the overall average and standard deviation. Then, assuming a normal distribution, the high performers are selected as those whose scores are more than (average + ‘Z’ times the standard deviation). The ‘Z’ value is the standardised normal variable or the Z score. For example, to identify the top 10 per cent of employees, the Z score is 1.28155.
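The cut-off used in this practice can be sketched as below. The overall average (50.15) and standard deviation (21.65) are the pooled figures reported in Table 1 of this paper; the function name `cutoff_for_top_share` is illustrative.

```python
from statistics import NormalDist

def cutoff_for_top_share(overall_mean, overall_sd, top_share):
    """Return (Z, raw-score cut-off) above which the top share is selected,
    assuming the pooled scores are normally distributed."""
    z = NormalDist().inv_cdf(1 - top_share)
    return z, overall_mean + z * overall_sd

z, cutoff = cutoff_for_top_share(50.15, 21.65, 0.10)
print(f"Z = {z:.5f}, cut-off = {cutoff:.2f}")
```

Under this rule any function whose average is already close to the cut-off will contribute a disproportionate number of selected candidates.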

The difficulty level affects scoring patterns and, in turn, the promotion decision. It is possible that an individual belonging to one function encounters questions with a lower difficulty level, which results in more marks and leads to increments/promotions, while individuals in other function(s) encounter a higher difficulty level. Hence, the issue is the comparison of scores of different functions in the presence of varied difficulty levels of assessment.

In order to overcome the issues in the existing comparison, the procedure followed in this paper is as follows.

1. Identify and remove the outliers from the marks in each function.
2. Fit a normal distribution; if the fit fails, apply the Johnson transformation.
3. Calculate the Z score separately for each function and then pool the Z scores.
4. Transform the Z score to a comparable score using the appropriate change of origin and scale parameters as (comparable score = overall average + Z score × overall standard deviation).
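The effect of standardising within each function before pooling (step 3) can be sketched with hypothetical marks. The example assumes that outliers have already been removed and that each function's marks fit the normal distribution, so the Johnson transformation of step 2 is not shown; the function name and the data are illustrative only.

```python
from statistics import mean, stdev

def pooled_z_scores(scores_by_function):
    """Standardise within each function, then pool and sort by Z score."""
    pooled = []
    for function, scores in scores_by_function.items():
        centre, spread = mean(scores), stdev(scores)
        pooled.extend((function, x, (x - centre) / spread) for x in scores)
    return sorted(pooled, key=lambda row: row[2], reverse=True)

# Hypothetical marks: an easy paper for function A, a hard one for function B.
marks = {"A": [70, 75, 80, 85, 90], "B": [20, 25, 30, 35, 40]}

ranking = pooled_z_scores(marks)
top_two = [(function, score) for function, score, _ in ranking[:2]]
print(top_two)
```

Ranking the raw marks would place both top positions in function A (90 and 85); ranking the pooled Z scores gives one position to the best candidate of each function.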

#### Measures

The comparison of scores depends on the identification of outliers as well as on distribution fitting. Outliers are identified using the Grubbs test (NIST 2003). The distribution fit is verified using a probability plot based method (Ryan & Joiner 1976), where the null hypothesis is that the observations follow a normal distribution and the alternative hypothesis is that they do not. The decision is taken based on the p value: if it exceeds the conventional threshold of 0.05, normality is not rejected, and the higher the p value, the better the fit is taken to be.
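The idea behind a probability plot test can be sketched as the correlation between the ordered scores and the normal scores expected for a sample of that size. This sketch assumes Blom plotting positions, (i - 3/8) / (n + 1/4), and does not compute a p value, which requires tabulated critical values or statistical software; the function name is illustrative and the data are the six-executive example from the outlier section.

```python
import math
from statistics import NormalDist, mean

def probability_plot_correlation(scores):
    """Correlation between ordered scores and their expected normal scores.
    Values close to 1 indicate an approximately straight probability plot."""
    ordered = sorted(scores)
    n = len(ordered)
    normal_scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
                     for i in range(1, n + 1)]
    x_bar, b_bar = mean(ordered), mean(normal_scores)
    sxy = sum((x - x_bar) * (b - b_bar) for x, b in zip(ordered, normal_scores))
    sxx = sum((x - x_bar) ** 2 for x in ordered)
    sbb = sum((b - b_bar) ** 2 for b in normal_scores)
    return sxy / math.sqrt(sxx * sbb)

r_with = probability_plot_correlation([9, 10, 10, 10, 11, 50])
r_without = probability_plot_correlation([9, 10, 10, 10, 11])
print(round(r_with, 2), round(r_without, 2))
```

The correlation rises markedly once the outlier is removed, which mirrors the improvement in fit reported after outlier removal.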

#### Analysis

The descriptive statistics along with Grubbs test and distribution fit are provided in Table 1.

Table 1
Descriptive statistics and outlier test for different functions

| Function | Count | Average | SD | Grubbs test | p of Normal | No. of outliers |
|---|---|---|---|---|---|---|
| Purchase | 45 | 74.62 | 19.65 | Outlier exists | 0.005 | 5 |
| Finance | 58 | 58.17 | 18.76 | Nil | 0.005 | |
| HR | 46 | 26.09 | 14.29 | Outlier exists | 0.005 | 2 |
| Info system | 62 | 37.12 | 16.04 | Nil | 0.010 | |
| Legal | 16 | 63.44 | 11.57 | Nil | 0.234 | |
| Vendor mgt. | 41 | 41.44 | 8.24 | Nil | 0.395 | |
| Pipeline | 49 | 72.86 | 11.02 | Nil | 0.005 | |
| Engg. | 57 | 30.51 | 16.52 | Nil | 0.005 | |
| Manufacturing | 120 | 56.37 | 16.07 | Nil | 0.406 | |
| Retail sales | 50 | 45.24 | 13.62 | Nil | 0.005 | |
| Overall | N = 544 | 50.15 | 21.65 | | | |

Note. HR = Human Resources, Info system = Information system, Vendor mgt = Vendor management, and Engg = Engineer.

Table 1 provides the descriptive statistics, the outlier test and the p value for testing the normal distribution. It can be seen that for the scores of the Legal, Vendor management and Manufacturing functions, the normal distribution fits well as the corresponding p value is greater than 0.05. It is known that the distribution fit fails in the presence of outliers, and this is demonstrated in the case of the Purchase and HR functions. Outliers identified through the Grubbs test have been removed and the normal distribution fit has been tried again; wherever it failed to fit, the data have been transformed using Johnson’s formula (Hanfeng & Grazyna 2001). The results can be seen in Table 2.

Table 2
Distribution fit for different functions after transformations

| Function | Count | Average | SD | p of Normal | Remarks |
|---|---|---|---|---|---|
| Purchase | 40 | 80.78 | 7.044 | 0.290 | Fits normal after outlier removal |
| Finance | 58 | 58.17 | 18.76 | 0.608 | Transformed data fits normal |
| HR | 44 | 23.98 | 10.36 | 0.077 | Fits normal after outlier removal |
| Info system | 62 | 37.12 | 16.04 | 0.755 | Transformed data fits normal |
| Legal | 16 | 63.44 | 11.57 | 0.234 | Fits normal as it is |
| Vendor mgt. | 41 | 41.44 | 8.24 | 0.395 | Fits normal as it is |
| Pipeline | 49 | 72.86 | 11.02 | 0.496 | Transformed data fits normal |
| Engg. | 57 | 30.51 | 16.52 | 0.207 | Transformed data fits normal |
| Manufacturing | 120 | 56.37 | 16.07 | 0.406 | Fits normal as it is |
| Retail sales | 50 | 45.24 | 13.62 | 0.789 | Transformed data fits normal |

Note. HR = Human Resources, Info system = Information system, Vendor mgt = Vendor management, and Engg = Engineer.

Table 2 shows that all the functional scores fit the normal distribution quite well, and hence, the Z score can be calculated for the purpose of comparison. The Z scores thus calculated for each function have been combined. The outlier data have also been transformed to Z scores using the average and standard deviation of the particular function. The overall average mark is found to be 50.15, and the corresponding pooled standard deviation of all functions taken together, after removing the outlier(s), is found to be 12.9244. Using these average and standard deviation values, the comparable score of the participants can be calculated using the following formula.

Comparable Score = 50.15 + 12.9244 × Z score

The Z score and the comparable score are provided to the management for all functions along with the original data. The comparable score of an outlier is adjusted to 100 (for an outperformer whose score rises above 100) or to 0 (for an underperformer whose score falls below zero).
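The transformation and the adjustment of extreme values can be sketched together. The constants 50.15 and 12.9244 are the overall average and pooled standard deviation reported above; the function name and the example Z values are illustrative.

```python
def comparable_score(z, overall_mean=50.15, pooled_sd=12.9244):
    """Map a within-function Z score onto the common 0-100 scale,
    adjusting scores that fall outside the scale to its nearest end."""
    raw = overall_mean + pooled_sd * z
    return min(100.0, max(0.0, raw))

for z in (-5.0, 0.0, 1.28155, 5.0):
    print(z, round(comparable_score(z), 2))
```

Only very extreme Z scores (beyond roughly ± 3.9 with these constants) reach the ends of the scale, so the adjustment affects outliers and leaves ordinary scores unchanged.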

### Results

The top ranks (i.e., 1 to 54, representing 10 per cent of the 544 candidates) obtained by the old method and by the proposed method have been compared and are presented in Table 3.

Table 3
Function wise number of rank holders

| Function | Rank holders (old method) | Rank holders (proposed method) | Total candidates |
|---|---|---|---|
| Purchase | 22 | 4 | 45 |
| Finance | 3 | 7 | 58 |
| HR | 0 | 6 | 46 |
| Info system | 2 | 6 | 62 |
| Legal | 0 | 0 | 16 |
| Vendor mgt. | 0 | 3 | 41 |
| Pipeline | 17 | 2 | 49 |
| Engg. | 0 | 7 | 57 |
| Manufacturing | 10 | 13 | 120 |
| Retail sales | 0 | 6 | 50 |
| All | 54 | 54 | 544 |

Note. HR = Human Resources, Info system = Information system, Vendor mgt = Vendor management, and Engg = Engineer.

Table 3 shows that the old method is biased towards two functions (i.e., Purchase and Pipeline). The association between ‘number of candidates within the top 10 per cent’ and function is checked through the χ² test of independence, and for the old method the null hypothesis of no association is rejected as the p value is found to be ‘< 0.05’. From Table 3, it can be seen that in the old method the number of candidates ranked within the top 10 per cent for Purchase and Pipeline is 39 out of 94 candidates, whereas in the proposed method it is only six. For the proposed method the association is tested in the same way and the p value is found to be 0.859, which illustrates that the proposed method provides ranks that are independent of the functions.
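The direction of both test results can be checked from the counts in Table 3. The sketch below computes the Pearson χ² statistic for a 2 × 10 table (selected versus not selected, by function) and compares it with 16.919, the 5 per cent critical value for 9 degrees of freedom. It does not attempt to reproduce the reported p values: how the authors grouped the cells is not stated, and several expected counts are small, so the approximation is rough. The function and variable names are illustrative.

```python
def chi_square_statistic(selected, totals):
    """Pearson chi-square for a 2 x k table of selected / not selected."""
    rate = sum(selected) / sum(totals)  # overall share selected
    statistic = 0.0
    for s, n in zip(selected, totals):
        expected_in, expected_out = n * rate, n * (1 - rate)
        statistic += (s - expected_in) ** 2 / expected_in
        statistic += ((n - s) - expected_out) ** 2 / expected_out
    return statistic

# Counts from Table 3, in the order the functions are listed there.
totals = [45, 58, 46, 62, 16, 41, 49, 57, 120, 50]
old = [22, 3, 0, 2, 0, 0, 17, 0, 10, 0]
proposed = [4, 7, 6, 6, 0, 3, 2, 7, 13, 6]

CRITICAL_5PCT_DF9 = 16.919  # chi-square table, 9 degrees of freedom
chi_old = chi_square_statistic(old, totals)
chi_proposed = chi_square_statistic(proposed, totals)
print(chi_old > CRITICAL_5PCT_DF9, chi_proposed > CRITICAL_5PCT_DF9)
```

The statistic for the old method lies far above the critical value, while the statistic for the proposed method lies well below it, in line with the conclusions drawn above.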

The apparent reason for this bias is that these two functions have relatively higher average scores than the other functions (74.62 and 72.86, respectively, in Table 1), which can be attributed to the lower difficulty level of their test papers.

### Discussion

Based on the results it can be concluded that the new procedure suggested by the authors is capable of removing the bias towards particular functions when selecting top ranking candidates. The method should help practitioners, as they will be in a position to compare the performances of employees belonging to different functions or verticals in an organisation. However, the method needs to be explained to all the concerned executives of an organisation, so that candidates who score high on the raw scale but are still not selected within the top 10 per cent understand the reason.

The method demonstrates the case where the data (original/transformed) fit the normal distribution. However, the method has limitations. If the distribution of the original or the transformed data does not adequately fit the normal distribution, the methodology will not work, even though the chances of encountering such a situation are very low. In order to handle such a situation effectively, other distributions need to be fitted and the equivalent Z score calculated. The following steps can be employed.

1. Calculate the cumulative probability value of each score using the fitted distribution. The values will lie between 0 and 1.
2. Calculate the Z score by applying the inverse cumulative probability function of the standard normal distribution to the cumulative probability values obtained in the previous step. This will provide the equivalent Z score.
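These two steps can be sketched for one possible non-normal fit. The example assumes, purely for illustration, that an exponential distribution with a mean of 40 marks has been fitted to a function's scores; the function names and the figure of 40 are hypothetical.

```python
import math
from statistics import NormalDist

def exponential_cdf(mean_score):
    """Cumulative distribution function of an exponential fit (step 1)."""
    return lambda x: 1 - math.exp(-x / mean_score)

def equivalent_z(score, fitted_cdf):
    """Standard normal quantile of the score's cumulative probability (step 2).
    The probability must lie strictly between 0 and 1."""
    return NormalDist().inv_cdf(fitted_cdf(score))

cdf = exponential_cdf(40.0)          # hypothetical fitted mean of 40 marks
median_score = 40.0 * math.log(2)    # median of the fitted distribution
print(round(equivalent_z(median_score, cdf), 3),
      round(equivalent_z(40.0, cdf), 3))
```

A score at the median of the fitted distribution maps to an equivalent Z score of zero, while a score at the mean maps to a positive value, because more than half of an exponential population lies below its mean.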

Although these two steps are relatively clear, they require an in-depth knowledge of probability distributions. Human resource executives would benefit from learning the distribution fitting method along with the distributional properties for an appropriate application of the proposed procedure. Alternatively, the organisation could hire statisticians to handle the process. Appropriate statistical software will also be helpful.

### Conclusion

The issues pertaining to the score normalisation method practised nowadays in many organisations have been discussed. A case example was employed to illustrate the procedure. The proposed method can reduce the controversy by taking care of the variation in assessment methods, including the extreme cases of very high performers and very low performers.

The methodology on score normalisation that has been highlighted in this article is intended to give equal opportunity to all functions (e.g., purchase, finance, engineering) of an organisation. In the old methodology bias existed in the form of an association between the rank of an individual and the organisational function of that individual. In the proposed method this bias has been avoided by ensuring independence between the rank of an individual and the organisational function to which he or she belongs. It is expected that the proposed methodology will help reduce the animosity between various organisational functions, and the persons belonging to these functions, when performance is appraised for giving promotions and fixing annual increments. By taking this aspect into account, the policy on human resource management can be suitably modified and propagated.

### Authors

A. Sarkar is a Technical Officer in the Indian Statistical Institute, Mumbai. He has rich experience in the implementation of Quality initiatives (e.g., Six Sigma, Lean Six Sigma, SPC, Design of Experiments) in various organisations over the last two decades. His areas of research interest are issues pertaining to the implementation of Operations Management across any organisation.

Email: sarkar.ahsok@gmail.com

Dr. A. R. Mukhopadhyay is working at present as the Senior Technical Officer in the Indian Statistical Institute, Kolkata, India. He earned his doctorate degree in Quality Engineering (Six Sigma) from Jadavpur University in Kolkata. He has published many articles and papers in different national and international journals of repute. His job consists of teaching, consultancy, training and applied research.

Email: armukherjee@yahoo.co.in

Dr. S. K. Ghosh is a professor in the Mechanical Engineering Department of Jadavpur University. He earned his doctorate in engineering from Jadavpur University. He also acts as the coordinator for the activities of the centre for quality management system at Jadavpur University.

### Acknowledgements

The authors are grateful to the Editors, especially to Dr. Cecil A. Pearson and the Referees for their valuable comments and suggestions to enrich this paper.

### References

Desimone, R. L., Werner, J. M., & Harris, D. M. (2002). Human resource development (3rd ed.). Orlando: Harcourt College Publisher.

Donaldson, C. A. (2003). Performance management-forced ranking. Retrieved Dec 2, 2010 from: http://edweb.sdsu.edu/people/arossett/pie/Interventions/forcedranking_1.htm

Goon, A. M., Gupta, M. K., & Dasgupta, B. (1998). Fundamentals of statistics. Kolkata: World Press.

Grubbs, F. E. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11(1), 1-21.

Hanfeng, C., & Grazyna, K. (2001). Fitting data to the Johnson system. Journal of Statistical Computation and Simulation, 70(1), 21-32.

Liping, Y., Yuntao, P., & Yishan, W. (2009). Research on data normalisation methods in multi-attribute evaluation. Paper presented at International Conference on Computational Intelligence and Software Engineering, Wuhan, China.

Meisler, A. (2003). Dead man’s curve. Workforce Management, 81(7), 44-49.

Murphy, T. H., & Joyce, M. (2004). Performance appraisals. Retrieved Dec 2, 2010 from: http://faculty.ksu.edu.sa/72395/studentsite/Documents/performance appraisal (2).pdf

NIST-SEMATECH (2003). e-handbook of statistical methods. Retrieved Dec 2, 2010 from: http://www.itl.nist.gov/div898/handbook/

Ryan, T. A., & Joiner, B. L. (1976). Normal probability plots and tests for normality. Technical Report, Department of Statistics: The Pennsylvania State University.

Tukey, J. W. (1977). Exploratory data analysis. Boston: Addison Wesley Longman Company.