Attribute agreement analysis can be a great tool for detecting sources of inaccuracy in a bug tracking system, but it should be used with care, consideration, and minimal complexity, if it is used at all. The best approach is to audit the database first and then use the audit results to perform a focused, well-designed analysis of repeatability and reproducibility. If repeatability is the main problem, evaluators are confused or undecided about certain criteria. If reproducibility is the problem, evaluators hold strong opinions about certain conditions, but those opinions differ from one another. If the problems show up across many evaluators, they are systemic or procedural; if they involve only a few evaluators, they may simply require some individual attention. In either case, training or job aids can be targeted at specific individuals or at all evaluators, depending on how many evaluators are assigning attributes imprecisely.

If the audit is carefully planned and designed, it may reveal enough about the causes of the accuracy problems to justify a decision not to run an attribute agreement analysis at all. Where the audit does not provide sufficient information, an attribute agreement analysis supports a more detailed investigation that can guide better-targeted training and changes to the measurement system.

A bug tracking system that tracks errors in processes (or even products), in a database sophisticated enough to capture where an error occurred in addition to the nature of the error, can provide powerful insights. It can be very helpful for finding and prioritizing potential improvement opportunities.
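The repeatability/reproducibility distinction can be sketched with a few lines of code. In this minimal example, the evaluator names, bug categories, and ratings are entirely hypothetical; repeatability is taken as each evaluator's agreement with their own earlier classifications, and reproducibility (in simplified form) as the agreement between two evaluators on the same items:

```python
# Hypothetical data: two evaluators classify the same six bug reports
# twice each. Evaluator names and categories are illustrative only.
ratings = {
    "evaluator_A": (["UI", "logic", "UI", "data", "logic", "UI"],
                    ["UI", "logic", "UI", "data", "UI", "UI"]),
    "evaluator_B": (["UI", "logic", "data", "data", "logic", "UI"],
                    ["UI", "logic", "data", "data", "logic", "UI"]),
}

def agreement(seq1, seq2):
    """Fraction of items classified identically in the two sequences."""
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)

# Repeatability: does each evaluator agree with their own earlier ratings?
repeatability = {name: agreement(t1, t2) for name, (t1, t2) in ratings.items()}

# Reproducibility (simplified): do the evaluators agree with each other
# on their first-trial ratings?
reproducibility = agreement(ratings["evaluator_A"][0],
                            ratings["evaluator_B"][0])

print(repeatability)    # evaluator_A disagrees with themselves on one report
print(reproducibility)  # the two evaluators disagree on one report
```

A full attribute agreement study would use far more scenarios and typically a kappa statistic rather than raw percent agreement, but the structure of the comparison is the same.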
But is the data trustworthy? Does the bug tracking system provide the right information? Here, a repeatability assessment is used to illustrate the idea; the same reasoning applies to reproducibility. The point is that detecting differences in an attribute analysis requires many samples, and even doubling the sample size from 50 to 100 does not make the test much more sensitive. Of course, the difference that needs to be detected depends on the situation and on the level of risk the analyst is willing to accept in the decision, but the reality is that with 50 scenarios it will be difficult for an analyst to conclude that there is a statistically significant difference between the repeatability of two evaluators with match rates of 96 percent and 86 percent.
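To see why 50 scenarios is usually too few, consider a two-proportion z-test (used here purely as an illustrative approximation) on the 96 percent versus 86 percent match rates mentioned above:

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with a pooled variance estimate.
    Returns the z statistic and the approximate p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p

# 48 of 50 matches (96%) versus 43 of 50 matches (86%):
z, p = two_proportion_z_test(48, 50, 43, 50)
print(f"z = {z:.2f}, p = {p:.3f}")  # p exceeds 0.05: not significant at the usual level
```

Despite a 10-point gap in match rates, the p-value stays above the conventional 0.05 threshold, so an analyst working with 50 scenarios could not reasonably call the two evaluators different.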