The Kappa-Statistic Measure Of Interrater Agreement

Unfortunately, the marginal totals may or may not estimate the amount of agreement that would occur by chance under conditions of uncertainty. It is therefore doubtful whether the reduction the kappa statistic applies to the agreement estimate actually represents the amount of chance agreement. In theory, Pr(e) is an estimate of the agreement rate that would occur if the raters guessed on every item, guessed at rates similar to the marginal proportions, and were entirely independent (11). None of these assumptions is justified, and opinions on the use of kappa therefore differ widely among researchers and statisticians.

There are in fact two categories of reliability with respect to data collectors: reliability across multiple data collectors, which is interrater reliability, and the reliability of a single data collector, called intrarater reliability. For a single data collector, the question is this: presented with the same situation and the same phenomenon, will an individual interpret the data the same way and record exactly the same value for the variable each time the data are collected? Intuitively, it might seem that a person would behave the same way toward the same phenomenon every time the data collector observes it. However, research shows this assumption to be wrong. A recent study of intrarater reliability in the evaluation of bone-density X-rays found reliability coefficients ranging from only 0.15 to 0.90 (4). Clearly, researchers are right to consider the reliability of data collection carefully as part of their concern for accurate research results.
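The relationship between observed agreement Pr(a), chance agreement Pr(e), and kappa can be sketched for a 2×2 table. This is a minimal illustration using the standard formula κ = (Pr(a) − Pr(e)) / (1 − Pr(e)); the cell counts below are hypothetical, not from the source:

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 contingency table of two raters.

    a = both raters rate "yes", d = both rate "no" (the agreements);
    b and c count the two kinds of disagreement.
    Rows are rater A, columns are rater B.
    """
    n = a + b + c + d
    pr_a = (a + d) / n  # observed agreement
    # Pr(e): the agreement expected if both raters guessed every item
    # independently, at rates equal to their marginal proportions.
    pr_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (pr_a - pr_e) / (1 - pr_e)

# Hypothetical counts, for illustration only.
print(round(cohens_kappa(40, 9, 6, 45), 2))  # 0.7
```

Note that Pr(e) is computed entirely from the marginal proportions, which is exactly why the text above questions whether it truly measures chance agreement.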

So far, the discussion has assumed that the majority of raters are correct, that the minority raters are wrong in their scores, and that all raters made a deliberate choice when rating. Jacob Cohen recognized that this assumption could be wrong. Indeed, he explicitly stated that "in the typical situation, there is no criterion for the 'correctness' of judgments" (5). Cohen raised the possibility that, for at least some of the variables, none of the raters was sure of the score to enter and that they simply made random guesses. In that case, the agreement reached is a spurious agreement. Cohen's kappa was designed to address this concern. As Marusteri and Bacarea noted (9), there is never 100% certainty about research results, even when statistical significance is achieved. Statistical results from testing hypotheses about the relationship between independent and dependent variables become meaningless if the raters were inconsistent in scoring those variables.
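Cohen's concern can be illustrated with a small simulation (the guessing rate of 0.60 and the sample size are arbitrary choices for this sketch, not values from the source): two raters who guess "yes" independently at similar rates will agree on many items by sheer chance, yet kappa correctly comes out near zero:

```python
import random

random.seed(1)
n = 10_000
# Both raters guess "yes" independently at roughly similar rates --
# pure guessing, no real judgment involved.
rater1 = [random.random() < 0.60 for _ in range(n)]
rater2 = [random.random() < 0.60 for _ in range(n)]

agree = sum(x == y for x, y in zip(rater1, rater2)) / n
p1, p2 = sum(rater1) / n, sum(rater2) / n
chance = p1 * p2 + (1 - p1) * (1 - p2)   # Pr(e) from the marginals
kappa = (agree - chance) / (1 - chance)

print(f"raw agreement: {agree:.2f}")   # high, despite pure guessing
print(f"kappa:         {kappa:.2f}")   # close to zero
```

The raw agreement here is spurious in exactly Cohen's sense: it reflects the raters' guessing rates, not any shared judgment about the cases.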

If agreement is less than 80%, more than 20% of the data being analyzed are erroneous. With a reliability of only 0.50 to 0.60, it must be understood that 40% to 50% of the data being analyzed are erroneous. When kappa is below 0.60, the confidence interval around the obtained kappa is so wide that one must assume that about half of the data could be incorrect (10). Clearly, statistical significance means little when the results being tested contain this much error. Suppose you are analyzing data on a group of 50 people applying for a grant. Each application was read by two readers, and each reader said either "yes" or "no" to the proposal. Suppose the counts of agreements and disagreements are as follows, with A and B denoting the readers: the cells on the main diagonal of the table (a and d) count the agreements, and the off-diagonal cells (b and c) count the disagreements. Historically, percent agreement (number of agreeing ratings / total number of ratings) was used to determine interrater reliability.
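The grant-reader scenario can be worked through numerically. The source does not give the actual cell counts, so the values of a, b, c, and d below are hypothetical numbers chosen to sum to the 50 applications; the sketch contrasts the historical percent-agreement measure with kappa:

```python
# Hypothetical 2x2 table for the 50 grant applications (the source
# does not give the actual counts): rows = reader A, columns = reader B.
a, b, c, d = 20, 5, 10, 15           # a and d agree; b and c disagree
n = a + b + c + d                    # 50 applications

percent_agreement = (a + d) / n      # the historical measure
p_yes_a, p_yes_b = (a + b) / n, (a + c) / n
chance = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)
kappa = (percent_agreement - chance) / (1 - chance)

print(f"percent agreement: {percent_agreement:.2f}")  # 0.70
print(f"kappa:             {kappa:.2f}")              # 0.40
```

With these counts, percent agreement (0.70) looks respectable, but once the agreement expected from the readers' marginal rates is subtracted, kappa drops to 0.40, below the 0.60 threshold discussed above.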