A typical problem when benchmarking a clinical system is establishing a ground truth. Consider a clinical decision support system that assists a physician in interpreting radiology images (e.g., by recognizing tumors). The only intuitive way to evaluate the performance of such a system is to test it on a set of labeled images, or in other words, to ask questions we already know the answers to. However, creating this ground truth requires us to rely on the very process we are attempting to improve, namely the “manual” analysis by a physician.
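To make the circularity concrete, a benchmark of this kind typically reduces to comparing the system's outputs against physician-provided labels. The following is a minimal sketch under that assumption; the names (`predictions`, `physician_labels`) are hypothetical, and note that every "error" counted here is an error only relative to the physician's judgment:

```python
def sensitivity_specificity(predictions, physician_labels):
    """Score boolean tumor predictions against physician labels.

    The physician labels serve as 'ground truth' -- but they come from
    the same manual process the system is meant to improve on.
    """
    pairs = list(zip(predictions, physician_labels))
    tp = sum(1 for p, l in pairs if p and l)        # both say tumor
    tn = sum(1 for p, l in pairs if not p and not l)  # both say no tumor
    fp = sum(1 for p, l in pairs if p and not l)    # "false" positive -- per the referee
    fn = sum(1 for p, l in pairs if not p and l)    # "missed" tumor -- per the referee
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

# Hypothetical usage with four images:
sens, spec = sensitivity_specificity(
    predictions=[True, True, False, False],
    physician_labels=[True, False, False, True],
)
```

If the system actually outperforms the physician, some of its correct detections will land in the `fp` bucket, so the metrics understate its true quality.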
This scenario occurs frequently wherever we try to recognize patterns. Another example is the research-oriented extraction of knowledge from patient records, where we might attempt to recognize adverse drug effects or develop best practices. A note might indicate a negative impact on the patient's health, yet fail to be recognized as such by the medical expert acting as referee for the benchmark, due to its complexity, illegibility, or counterintuitive nature.
There are methods to dampen this effect, such as increasing the number and competence of the referees or implementing a round of reconsideration of results (e.g., the physicians can be confronted a second time with images that were recorded as false positives during the automated tumor detection), but these methods are often expensive and time-consuming or, in the worst case, simply not available.
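One common way to combine both mitigations is consensus labeling: several referees label each case, the majority vote becomes the ground truth, and weakly agreed cases are flagged for a reconsideration round. A minimal sketch, with all names (`consensus_label`, `referee_votes`) being hypothetical:

```python
from collections import Counter

def consensus_label(referee_votes, min_agreement=2):
    """Derive a ground-truth label by majority vote over referee labels.

    Returns the winning label and a flag marking cases where agreement
    is too weak, so they can be sent back for a reconsideration round.
    """
    counts = Counter(referee_votes)
    label, votes = counts.most_common(1)[0]
    needs_review = votes < min_agreement
    return label, needs_review

# Hypothetical usage: three referees label one radiology image.
label, needs_review = consensus_label(["tumor", "tumor", "no tumor"])
# -> ("tumor", False); a three-way split would set needs_review = True
```

The cost mentioned above shows up directly here: each additional referee multiplies the labeling effort, and every flagged case requires another expert session.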
We therefore have to keep in mind that when dealing with a highly complicated and often intuition-driven field such as medicine, we must constantly account for possible human error, and that there is always the possibility that we have done our job well enough to outperform the quality of the human decision. Or, to put it less optimistically: sometimes even great results can become a problem.