The new decade opened with some intriguing news: the journal Nature reported that artificial intelligence was better at identifying breast cancers on mammograms than radiologists. Researchers at Google Health teamed up with academic medical centers in the United States and Britain to train an AI system using tens of thousands of mammograms.
But even the best artificial intelligence system can’t fix the uncertainties surrounding early cancer diagnosis.
To understand why, it helps to have a sense of how AI systems learn. In this case, the system was trained with images labeled as either “cancer” or “not cancer.” From them, it learned to deduce features from the images — such as shape, density and edges — that are associated with the cancer label.
Thus, the process is wholly dependent on starting with data that are correctly labeled. In the AI mammography study, the initial diagnoses were determined by a pathologist who examined breast biopsy specimens under a microscope after an abnormal mammogram. In other words, the pathologist determined whether the mammogram showed cancer or not.
Unfortunately, this pathologic standard is problematic. Over the last 20 years there has been a growing recognition that screening mammography has led to substantial overdiagnosis — the detection of abnormalities that meet the pathological definition of cancer, yet are not destined to ever cause symptoms or death.
Furthermore, pathologists can disagree about who has breast cancer, even when presented with the same biopsy specimens under the microscope. The problem is far smaller for large, obvious cancers and far greater for small (even microscopic), early-stage cancers. That’s because there is a gray area between cancer and not cancer. This has important implications for AI technology used for cancer screening.
AI systems will undoubtedly be able to consistently find subtle abnormalities on mammograms, which will lead to more biopsies. This will require pathologists to make judgments on subtler irregularities that may be consistent with cancer under the microscope, but may not represent disease destined to cause symptoms or death. In other words, reliance on pathologists for the ground truth could lead to an increase in cancer overdiagnosis.
The problem is not confined to breast cancer. Overdiagnosis and disagreement over what constitutes cancer are also problems in melanoma, prostate cancer and thyroid cancer. AI systems are already being developed for screening skin moles for melanoma and are likely to be employed for other cancers as well.
In a piece for the New England Journal of Medicine last month, we proposed a better way of deploying AI in cancer detection. Why not make use of the information contained in pathological disagreement? We suggested that each biopsy used in training AI systems be evaluated by a diverse panel of pathologists and labeled with one of three distinct categories: unanimous agreement of cancer, unanimous agreement of not cancer, and disagreement as to the presence of cancer. This intermediate category of disagreement would not only help researchers understand the natural history of cancer, but could also be used by clinicians and patients to investigate less invasive treatment for “cancers” in the gray area.
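For readers who think in code, the three-category labeling can be pictured with a short Python sketch. It is an illustration only, not code from the proposal: the names PanelLabel and label_biopsy are hypothetical, and treating each pathologist's read as a simple yes/no vote is an assumption made to keep the example small.

```python
from enum import Enum
from typing import Sequence


class PanelLabel(Enum):
    """Hypothetical names for the three training labels described above."""
    CANCER = "unanimous agreement of cancer"
    NOT_CANCER = "unanimous agreement of not cancer"
    DISAGREEMENT = "disagreement as to the presence of cancer"


def label_biopsy(votes: Sequence[bool]) -> PanelLabel:
    """Collapse one panel's reads of a biopsy into a single label.

    Each vote is True if that pathologist called the specimen cancer.
    Assumption: reads are binary; a real panel might record more nuance.
    """
    if not votes:
        raise ValueError("at least one pathologist read is required")
    if all(votes):
        return PanelLabel.CANCER
    if not any(votes):
        return PanelLabel.NOT_CANCER
    # Any split vote, however lopsided, lands in the gray-area category.
    return PanelLabel.DISAGREEMENT


# Made-up panel reads for three biopsies, purely for illustration.
panel_reads = {
    "biopsy_a": [True, True, True],
    "biopsy_b": [False, False, False],
    "biopsy_c": [True, False, True],
}
labels = {name: label_biopsy(votes) for name, votes in panel_reads.items()}
```

In this sketch a single dissenting read is enough to move a biopsy into the disagreement category, which matches the requirement that the two cancer labels be unanimous. A training set labeled this way would give an AI system three classes to learn rather than two.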
The problem of observer disagreement is not confined to pathologists; it also exists with radiologists reading mammograms. That’s the problem AI is trying to solve. Yet, while the notion of disagreement may be unsettling, disagreement also provides important information: Patients diagnosed with an early-stage cancer should be more optimistic about their prognoses if there were some disagreement about whether cancer was present, rather than all pathologists agreeing it was obviously cancer.
Artificial intelligence can’t resolve the ambiguities surrounding early cancer diagnosis, but it can help illuminate them. And illuminating these gray areas is the first step in helping patients and their doctors respond wisely to them. We believe that training AI to recognize an intermediate category would be an important advance in the development of this technology.
Adewole S. Adamson is a dermatologist and assistant professor of internal medicine at Dell Medical School at the University of Texas at Austin. H. Gilbert Welch is a senior researcher in the Center for Surgery and Public Health at Brigham and Women’s Hospital in Boston and author of “Should I Be Tested for Cancer? Maybe Not and Here’s Why.”