Statistical calculators

Cohen's Kappa

Detailed description

Cohen’s kappa (Cohen 1960) is a generally robust measure of concordance for dichotomous data. It was originally devised as a measure of "inter-rater" agreement for assessments made on psychometric scales, but it serves equally well for presence/absence data such as those arising from microbiological examination of drinking-water supplies.

Because there is always some uncertainty about how well a limited dataset describes the true situation, it is desirable to test the calculated value of kappa statistically, to allow for the possibility that we obtained an unusually high kappa by chance when the true agreement between the two sets of ratings is not very good. In doing so a precautionary approach is taken. That is, assume that the true value of kappa is in fact less than 0.6, in which case, according to the criteria laid down by Landis and Koch (1977), the agreement would be no better than "moderate".1 If we use our data to test that hypothesis (at the traditional 5% level) and it is rejected, we then have strong grounds for saying that the agreement is at least "substantial".2

Note that there is a view that kappa values of "0.75 or so" can signify excellent agreement (Fleiss 1981, as used by Whyte & Findlay 1985). For typical numbers of data this value of kappa is close to the value that would result in rejection of the tested hypothesis. However, a single cut-off value such as 0.75 does not allow for the increased certainty that tends to come with a larger dataset, whereas this calculator’s testing procedure does.

For further details, see McBride, G.B. (2005). Using Statistical Methods for Water Quality Management: Issues, Problems and Solutions. Wiley, New York, and also Statistical validation criteria for drinking-water microbiological methods.
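For reference, kappa compares the observed proportion of agreement (Po) with the proportion of agreement expected by chance alone (Pe). In standard notation:

kappa = (Po - Pe) / (1 - Pe)

so kappa expresses how far the observed agreement exceeds chance agreement, as a fraction of the maximum possible excess. The counts needed to compute Po and Pe are laid out below.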

Measuring agreement over N pairs of ratings

                      Rater B
                  present   absent
Rater A  present    Npp       Npa
         absent     Nap       Naa

Npp = number of present/present ratings
Npa = number of present/absent ratings
Nap = number of absent/present ratings
Naa = number of absent/absent ratings
giving N = Npp + Npa + Nap + Naa

Maximum possible kappa = 1 (when Npa = Nap = 0), signifying perfect agreement.
Minimum possible kappa lies between -1 and 0 (when Npp = Naa = 0), signifying agreement poorer than can be attributed to chance.
Kappa = 0 signifies that agreement is entirely attributable to chance.
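The calculation behind the calculator can be sketched as follows. This is a minimal illustration in Python of the standard kappa formula given above (observed agreement Po, chance-expected agreement Pe from the raters' marginal totals); it is not the calculator's own code, and the example counts are hypothetical.

```python
def cohens_kappa(Npp, Npa, Nap, Naa):
    """Cohen's kappa for a 2x2 presence/absence table (standard formula)."""
    N = Npp + Npa + Nap + Naa
    Po = (Npp + Naa) / N  # observed proportion of agreement
    # Chance-expected agreement, from each rater's marginal proportions
    Pe = ((Npp + Npa) * (Npp + Nap) + (Nap + Naa) * (Npa + Naa)) / N**2
    return (Po - Pe) / (1 - Pe)

# Hypothetical example counts (not from the text)
print(cohens_kappa(Npp=40, Npa=3, Nap=5, Naa=52))   # approx. 0.84
print(cohens_kappa(Npp=40, Npa=0, Nap=0, Naa=52))   # 1.0 when Npa = Nap = 0
```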

Calculator inputs: the four counts (Npp, Npa, Nap, Naa) and the test value of kappa (must be ≥ 0 and < 1).

1 Landis & Koch (1977) proposed this scale to describe the degree of concordance: 0.21-0.40, "Fair"; 0.41-0.60,"Moderate"; 0.61-0.80, "Substantial"; 0.81-1.00 "Almost perfect".

2 Testing at the 5% level means that it is very unlikely that we would obtain data (by good luck) leading to rejection of the hypothesis if in fact the true kappa value were below 0.6. The protection against this kind of error is reflected in the fact that, in order to reject the hypothesis, we need to obtain a kappa value for our samples some way above 0.6; just how far above depends on the number of data we have. If we took a permissive approach, as opposed to a precautionary approach, the situation would be reversed. That is, we would test the hypothesis that the true value of kappa is greater than 0.6, and we would reject that hypothesis only if the measured value of kappa was some way below 0.6.
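To make the testing procedure concrete, the sketch below shows one common way such a one-sided test can be carried out: a large-sample normal approximation for the sampling distribution of kappa, using a simple approximate standard error. Both the standard-error formula and the counts are illustrative assumptions; the calculator itself (and McBride 2005) may use a different or exact procedure.

```python
import math

def kappa_and_test(Npp, Npa, Nap, Naa, kappa0=0.6, alpha=0.05):
    """One-sided test of H0: true kappa <= kappa0 vs H1: kappa > kappa0,
    using a simple large-sample normal approximation (an assumption here;
    the calculator's own procedure may differ)."""
    N = Npp + Npa + Nap + Naa
    Po = (Npp + Naa) / N
    Pe = ((Npp + Npa) * (Npp + Nap) + (Nap + Naa) * (Npa + Naa)) / N**2
    kappa = (Po - Pe) / (1 - Pe)
    # Approximate large-sample standard error of kappa
    se = math.sqrt(Po * (1 - Po) / (N * (1 - Pe) ** 2))
    z = (kappa - kappa0) / se
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # one-sided p-value, upper tail
    return kappa, p, p < alpha

# Hypothetical counts (not from the text)
kappa, p, reject = kappa_and_test(Npp=40, Npa=3, Nap=5, Naa=52)
print(f"kappa = {kappa:.3f}, one-sided p = {p:.3f}, "
      f"{'reject' if reject else 'cannot reject'} H0: kappa <= 0.6")
```

In keeping with the precautionary approach described above, the hypothesis is rejected (and "substantial" agreement claimed) only when the sample kappa lies sufficiently far above 0.6 for the number of data in hand.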