Deep learning in breast cancer screening
Breast cancer is the most common form of cancer among women worldwide, and its incidence is rising. When mammography was introduced in the 1980s, mortality rates decreased by 30% to 40%. Today all women in Sweden between 40 and 74 years of age are invited to screening every 18 to 24 months. All women attending screening are examined with mammography using two views of each breast, the mediolateral oblique (MLO) view and the craniocaudal (CC) view, producing four images in total. The screening process is the same for all women and is based purely on age, not on other risk factors for developing breast cancer.
Although the introduction of population-based breast cancer screening has been a great success, there are still problems with interval cancers (IC) and large screen-detected cancers (SDC), which are associated with increased morbidity and mortality. For a good prognosis, it is important to detect a breast cancer early, before it has spread to the lymph nodes, which usually means that the primary tumor is still small. To improve early detection, we need to individualize the screening program and be flexible about screening intervals and modalities depending on the individual breast cancer risk and mammographic sensitivity. In Sweden, at present, the only modality in the screening process is mammography, which is excellent for a majority of women but not for all.
The severe shortage of breast radiologists is another pressing problem. As their expertise is in such demand, it is important to use their time as efficiently as possible. This means that they should spend most of their time on difficult cases and less time on easily assessed mammograms of healthy women. One challenge is to determine which women are at high risk of being diagnosed with aggressive breast cancer, to delineate the low-risk group, and to manage these different groups of women appropriately. In studies II to IV we have analysed how we can address these challenges by using deep learning techniques.
In study I, we described the cohort from which the study populations for studies II to IV were derived (as well as study populations in other publications from our research group). This cohort is called the Cohort of Screen-Aged Women (CSAW) and contains all 499,807 women invited to breast cancer screening within Stockholm County between 2008 and 2015. We also described the future potential of the dataset, as well as the case-control subset of annotated breast tumors and healthy mammograms. This study was presented orally at the annual meeting of the Radiological Society of North America in 2019.
In study II, we analysed how a deep learning risk score (DLrisk score) performs compared with breast density measurements for predicting future breast cancer risk. We found that the odds ratios (OR) and areas under the receiver operating characteristic curve (AUC) were higher for the age-adjusted DLrisk score than for dense area and percentage density. The values were: DLrisk score, OR 1.56 and AUC 0.65; dense area, OR 1.31 and AUC 0.60; percentage density, OR 1.18 and AUC 0.57 (P < .001 for differences between all AUCs). The false-negative rates, in terms of missed future cancers, were also lower for the DLrisk score: 31%, 36%, and 39% for DLrisk score, dense area, and percentage density, respectively. This difference was most distinct for more aggressive cancers.
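As a rough illustration of the two metrics compared in study II, the following minimal sketch computes an AUC and a false-negative rate from risk scores. It is not the analysis code used in the study: the function names, the cutoff, and the example scores are all hypothetical, and the false-negative rate is assumed here to mean the share of future cancer cases scoring below a chosen cutoff.

```python
from typing import Sequence


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a randomly chosen case scores higher than a randomly
    chosen control, counting ties as one half (the Mann-Whitney form of AUC)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def false_negative_rate(case_scores: Sequence[float], cutoff: float) -> float:
    """Share of future cancer cases whose score falls below the cutoff
    (assumed definition: cases the score would not have flagged as high risk)."""
    missed = sum(1 for score in case_scores if score < cutoff)
    return missed / len(case_scores)


# Made-up scores for illustration only; not data from the thesis.
example_cases = [0.9, 0.8, 0.4]
example_controls = [0.1, 0.5, 0.3, 0.8]
example_auc = auc(example_cases, example_controls)
example_fnr = false_negative_rate(example_cases, cutoff=0.5)
```

An AUC of 0.5 corresponds to a score that does not separate cases from controls at all, which gives some context for the reported values of 0.57 to 0.65.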
In study III, we analyzed the potential cancer yield when using a commercial deep learning software for triaging screening examinations into two work streams – a ‘no radiologist’ work stream and an ‘enhanced assessment’ work stream, depending on the output score of the AI tumor detection algorithm. We found that the deep learning algorithm was able to independently declare the 60% of mammograms with the lowest scores as “healthy” without missing any cancer. In the enhanced assessment work stream, when including the 5% of women with the highest AI scores, the potential additional cancer detection was 53 (27%) of 200 subsequent ICs and 121 (35%) of 347 next-round screen-detected cancers.
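The triaging rule described for study III can be sketched as a simple ranking of examinations by AI score. This is a minimal sketch under stated assumptions, not the commercial software or the study's simulation code: the function name, the stream labels, and the "standard" label for examinations in neither stream are hypothetical.

```python
from typing import List, Sequence


def triage(scores: Sequence[float], low_percent: int = 60,
           high_percent: int = 5) -> List[str]:
    """Assign each examination to a work stream based on its AI score rank.

    The lowest `low_percent` of scores go to 'no_radiologist', the highest
    `high_percent` go to 'enhanced_assessment', and everything in between is
    labeled 'standard' (an assumed label for the remaining examinations).
    """
    n = len(scores)
    # Indices of examinations ordered from lowest to highest score.
    order = sorted(range(n), key=lambda i: scores[i])
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_low = n * low_percent // 100
    n_high = n * high_percent // 100

    streams = ["standard"] * n
    for i in order[:n_low]:
        streams[i] = "no_radiologist"
    for i in order[n - n_high:]:
        streams[i] = "enhanced_assessment"
    return streams


# Made-up scores for 20 examinations; not data from the thesis.
example_scores = [i / 20 for i in range(20)]
example_streams = triage(example_scores)
```

The percentages reported above are consistent with the counts: 53 of 200 is 26.5% and 121 of 347 is about 34.9%, which round to 27% and 35%.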
In study IV, we analyzed different principles for choosing the threshold for the continuous abnormality score when introducing a deep learning algorithm for assessment of mammograms in a clinical prospective breast cancer screening study. The deep learning algorithm was intended to act as a third independent reader making binary decisions in a double-reading environment (ScreenTrust CAD). We found that the choice of abnormality threshold has important consequences. If the aim is to have the algorithm work at the same sensitivity as a single radiologist, a marked increase in abnormal assessments must be accepted (abnormal interpretation rate 12.6%). If the aim is to have the combined readers work at the same sensitivity as before, the consequence is a lower sensitivity for the AI than for the radiologists (abnormal interpretation rate 7.0%). This study was presented as a poster at the annual meeting of the Radiological Society of North America in 2021.
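The trade-off in study IV, where a higher target sensitivity forces a lower threshold and therefore more abnormal assessments, can be illustrated with a minimal sketch. It assumes that an examination is flagged as abnormal when its score is at or above the threshold; the function names and the example scores are hypothetical and do not reproduce the calibration procedure of the study.

```python
import math
from typing import Sequence


def threshold_for_sensitivity(cancer_scores: Sequence[float],
                              target_sensitivity: float) -> float:
    """Highest threshold at which at least `target_sensitivity` of the known
    cancers have a score at or above the threshold."""
    ranked = sorted(cancer_scores, reverse=True)
    # Number of cancers that must be flagged to reach the target.
    needed = max(1, math.ceil(target_sensitivity * len(ranked)))
    return ranked[needed - 1]


def abnormal_interpretation_rate(all_scores: Sequence[float],
                                 threshold: float) -> float:
    """Share of all examinations flagged as abnormal at this threshold."""
    flagged = sum(1 for score in all_scores if score >= threshold)
    return flagged / len(all_scores)


# Made-up scores for illustration only; not data from the thesis.
cancer_scores = [0.9, 0.7, 0.4, 0.2]
healthy_scores = [0.8, 0.3, 0.1, 0.05, 0.6, 0.2]
all_scores = cancer_scores + healthy_scores

# A stricter sensitivity target yields a lower threshold ...
strict_threshold = threshold_for_sensitivity(cancer_scores, 0.75)
relaxed_threshold = threshold_for_sensitivity(cancer_scores, 0.5)
# ... and a higher share of examinations flagged as abnormal.
strict_rate = abnormal_interpretation_rate(all_scores, strict_threshold)
relaxed_rate = abnormal_interpretation_rate(all_scores, relaxed_threshold)
```

In this toy example the stricter target flags more examinations than the relaxed one, which mirrors the direction of the difference between the 12.6% and 7.0% abnormal interpretation rates reported above.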
In conclusion, we have addressed some of the challenges and possibilities of using deep learning techniques to make breast cancer screening programs more individualized and efficient. Given the limitations of retrospective studies, there is now a need for prospective clinical studies of deep learning in mammography screening.
List of scientific papers
I. Karin Dembrower, Peter Lindholm, Fredrik Strand. A Multi-million Mammography Image Dataset and Population-Based Screening Cohort for the Training and Evaluation of Deep Neural Networks – the Cohort of Screen-Aged Women (CSAW). Journal of Digital Imaging. 2020, vol 33, pp. 408-413.
https://doi.org/10.1007/s10278-019-00278-0
II. Karin Dembrower, Yue Liu, Hossein Azizpour, Martin Eklund, Kevin Smith, Peter Lindholm, Fredrik Strand. Comparison of a deep learning risk score and standard mammographic density score for breast cancer risk prediction. Radiology. 2020, vol 294:2, pp. 265-272.
https://doi.org/10.1148/radiol.2019190872
III. Karin Dembrower, Erik Wåhlin, Yue Liu, Mattie Salim, Kevin Smith, Peter Lindholm, Martin Eklund, Fredrik Strand. Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload: a retrospective simulation study. The Lancet Digital Health. 2020, vol 2:9, pp. 468-474.
https://doi.org/10.1016/S2589-7500(20)30185-0
IV. Karin Dembrower, Mattie Salim, Martin Eklund, Peter Lindholm, Fredrik Strand. Implications for downstream workload and sensitivity based on calibrating an AI CAD algorithm by standalone-reader or combined reader sensitivity matching. [Manuscript]
History
Defence date
2022-04-01
Department
- Department of Physiology and Pharmacology
Publisher/Institution
Karolinska Institutet
Main supervisor
Lindholm, Peter
Co-supervisors
Smith, Kevin; Eklund, Martin; Strand, Fredrik
Publication year
2022
Thesis type
- Doctoral thesis
ISBN
978-91-8016-533-4
Number of supporting papers
4
Language
- eng