Review
Abstract
Background: Several studies have explored the predictive performance of machine learning–based breast cancer risk prediction models and have reached conflicting conclusions. Thus, the performance of the current machine learning–based breast cancer risk prediction models and their benefits and weaknesses need to be evaluated for the future development of feasible and efficient risk prediction models.
Objective: The aim of this review was to assess the performance and the clinical feasibility of the currently available machine learning–based breast cancer risk prediction models.
Methods: We searched for papers published until June 9, 2021, on machine learning–based breast cancer risk prediction models in PubMed, Embase, and Web of Science. Studies describing the development or validation of models for predicting future breast cancer risk were included. The Prediction Model Risk of Bias Assessment Tool (PROBAST) was used to assess the risk of bias and the clinical applicability of the included studies. The pooled area under the curve (AUC) was calculated using the DerSimonian and Laird random-effects model.
Results: A total of 8 studies with 10 data sets were included. Neural network was the most common machine learning method for the development of breast cancer risk prediction models. The pooled AUC of the machine learning–based optimal risk prediction model reported in each study was 0.73 (95% CI 0.66-0.80; approximate 95% prediction interval 0.56-0.96), with a high level of heterogeneity between studies (Q=576.07, I2=98.44%; P<.001). Head-to-head comparisons of the 2 types of models trained on the same data set showed that machine learning models performed slightly better than traditional risk factor–based models in predicting future breast cancer risk. The pooled AUC of the neural network–based risk prediction models was higher than that of the nonneural network–based optimal risk prediction models (0.71 vs 0.68). Subgroup analysis showed that models incorporating imaging features had a higher pooled AUC than models without imaging features (0.73 vs 0.61; Pheterogeneity=.001). The PROBAST analysis indicated that many machine learning models had a high risk of bias and poorly reported calibration analyses.
Conclusions: Our review shows that the current machine learning–based breast cancer risk prediction models have some technical pitfalls and that their clinical feasibility and reliability are unsatisfactory.
doi:10.2196/35750
Introduction
Of all the cancers worldwide among women, breast cancer shows the highest incidence and mortality [
]. Early access to effective diagnostic and treatment services after breast cancer screening could have reduced breast cancer mortality by 25%-40% over the last several decades [ , ]. The development and implementation of risk-based breast cancer control and prevention strategies can have great potential benefits and important public health implications. Moreover, modeling evaluations suggest that a risk-based breast cancer control and prevention strategy is more effective and efficient than conventional screening [ , ]. A prerequisite for the implementation of personalized risk-adapted screening intervals is accurate breast cancer risk assessment [ ]. Models with high sensitivity and specificity can enable screening programs to focus more intensive efforts on high-risk groups while minimizing overtreatment for the rest. Currently, the US breast cancer screening guidelines use breast cancer risk assessments to inform the clinical course, thereby targeting the high-risk population with earlier detection and fewer screening harms (eg, false-positive results, overdiagnosis, overtreatment, increased patient anxiety) [ ]. Nevertheless, there is no standardized approach for office-based breast cancer risk assessment worldwide.

Traditional risk factor–based models such as the Gail, BRCAPRO, Breast Cancer Surveillance Consortium, Claus, and Tyrer-Cuzick models have been well validated and are commonly used in clinical practice. However, these models, developed by logistic regression or Cox regression or presented as risk scoring systems, have low discrimination accuracy, with an area under the receiver operating characteristic curve (AUC) between 0.53 and 0.64 [
- ]. These models also show bias when applied to minority populations and vary greatly in terms of the patients included, methods of development, predictors, outcomes, and presentations [ - ]. Other risk prediction models that incorporated genetic risk factors are best suited only for specific clinical scenarios and may have limited applicability in certain types of patients [ ]. Recently, with growing cross-disciplinary research between artificial intelligence and medicine, the development and validation of breast cancer risk prediction models based on machine learning algorithms has become a research focus. Machine learning algorithms provide an alternative approach to standard prediction modeling, which may address the current limitations and improve the prediction accuracy of breast cancer susceptibility [ , ]. Mammography is the most commonly used method for breast cancer screening or early detection. Machine learning models suggest that mammographic images contain risk indicators that, together with germline genetic data, can be used to improve and strengthen the existing risk prediction models [ ]. Some studies claim that machine learning–based breast cancer risk prediction models are better than regression method–based models [ , ], but 1 study reported the opposite result [ ]. These conflicting conclusions prompted us to review the performance and the weaknesses of machine learning–based breast cancer risk prediction models. Therefore, this systematic review and meta-analysis aims to assess the performance and clinical feasibility of the currently available machine learning–based breast cancer risk prediction models.

Methods
Study Protocol
This systematic review and meta-analysis was performed according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analysis) statement [
], the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies, and the prediction model performance guidelines [ , ].

Literature Search Strategy
Papers on machine learning–based breast cancer risk prediction models were searched in PubMed, Embase, and Web of Science by using the terms “machine learning OR deep learning” AND “mammary OR breast cancer OR carcinoma OR tumor OR neoplasm” AND “risk assessment OR risk prediction” published until June 9, 2021, and limited to papers published in English. The complete search strategy is detailed in
. Reviews in this field and the reference lists of the original papers were also checked manually to identify any missed studies.

Inclusion and Exclusion Criteria
Studies describing the development or validation of models for predicting future breast cancer risk were included in our study. The inclusion criteria were as follows: (1) breast cancer risk prediction model developed using a machine learning algorithm, (2) mean follow-up period for cohort studies longer than 1 year, and (3) future breast cancer risk as the assessment outcome. The exclusion criteria were as follows: (1) reviews, conference papers, editorials, or abstract-only publications; (2) original full text unavailable or information incomplete; and (3) no AUC or C-statistic with its 95% CI reported. When papers included the same population, the study with the larger sample size or longer follow-up period was included.
Data Extraction and Study Quality
Two researchers independently collected data on the first author, publication year, geographic region, study design, study population, sample size, study period, age of participants, time point for breast cancer risk prediction, name of the risk prediction model, number of participants and cancer cases in test data set, input risk factors, development and verification methods, and AUC with its 95% CI. The Prediction Model Risk of Bias Assessment Tool (PROBAST) was used to assess the risk of bias (ROB) and the clinical applicability of the included studies [
, ]. Any discrepancies were resolved by consensus or by consulting the corresponding author.

Statistical Analyses
The discrimination value was assessed by the AUC, which measures the ability of a machine learning risk prediction model to distinguish the women who will and will not develop breast cancer. An AUC of 0.5 was considered to indicate no discrimination, whereas 1.0 indicated perfect discrimination. We calculated the pooled AUC of the risk models by using DerSimonian and Laird’s random-effects model [
]. A head-to-head performance comparison of the studies that developed machine learning models and those that developed traditional risk factor–based models can help us understand the performance gain of utilizing machine learning methods in the same experimental setting. The Q test and I2 value were employed to evaluate the heterogeneity among the studies; high values (I2>40% or a significant Q test with P<.05) indicated high levels of inconsistency and heterogeneity. We also calculated an approximate 95% prediction interval (PI) to depict the extent of between-study heterogeneity [ ]. Sensitivity analysis was performed to assess the influence of each study on the pooled effects by omitting each study in turn. Publication bias was assessed by visual inspection of funnel plot asymmetry and by the Egger regression test. Pooled effects were also adjusted using the Duval and Tweedie trim-and-fill method [ , ]. All statistical meta-analyses of the predictive performance were performed using the MedCalc statistical software version 20 (MedCalc Ltd).
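To make the pooling step concrete, the following is a minimal sketch of DerSimonian and Laird random-effects pooling of AUCs, including the Q statistic, I2, and the approximate 95% PI. It is illustrative only: the standard errors are back-calculated from the reported 95% CIs, the function name and example call are ours, the example inputs are the optimal machine learning model AUCs reported by the included studies, and the published analysis was run in MedCalc, whose defaults may differ slightly, so the pooled value is expected to be close to, but not exactly, the reported 0.73.

```python
import numpy as np
from scipy import stats

def dersimonian_laird_pool(auc, lower, upper, alpha=0.05):
    """Pool study-level AUCs with a DerSimonian and Laird random-effects model.

    auc, lower, upper: per-study AUCs and their 95% CI bounds.
    Returns the pooled AUC, its CI, Q, I2, tau2, and an approximate
    95% prediction interval (t distribution with k-2 df, per Riley et al).
    """
    y = np.asarray(auc, dtype=float)
    se = (np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)) / (2 * 1.96)
    v = se ** 2                      # within-study variances from the 95% CIs
    k = len(y)

    # Fixed-effect step: inverse-variance weights, Q statistic, I2
    w = 1.0 / v
    y_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fe) ** 2)
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0

    # Method-of-moments estimate of the between-study variance tau2
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects pooling
    w_star = 1.0 / (v + tau2)
    pooled = np.sum(w_star * y) / np.sum(w_star)
    se_pooled = np.sqrt(1.0 / np.sum(w_star))
    z = stats.norm.ppf(1 - alpha / 2)
    ci = (pooled - z * se_pooled, pooled + z * se_pooled)

    # Approximate 95% prediction interval for the AUC in a new setting (needs k >= 3)
    t = stats.t.ppf(1 - alpha / 2, df=k - 2)
    half = t * np.sqrt(tau2 + se_pooled ** 2)
    pi = (pooled - half, pooled + half)

    p_q = 1 - stats.chi2.cdf(q, df=k - 1)
    return {"pooled": pooled, "ci": ci, "pi": pi, "Q": q, "P_Q": p_q, "I2": i2, "tau2": tau2}

# Optimal machine learning model AUCs (footnote f in the model characteristics table)
result = dersimonian_laird_pool(
    auc=[0.76, 0.78, 0.79, 0.889, 0.638, 0.613, 0.66, 0.72, 0.725, 0.70],
    lower=[0.73, 0.76, 0.75, 0.875, 0.577, 0.579, 0.64, 0.67, 0.689, 0.60],
    upper=[0.80, 0.80, 0.82, 0.903, 0.699, 0.647, 0.67, 0.76, 0.759, 0.79],
)
print(result)
```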
Results

Eligible Papers and Study Characteristics
A total of 937 papers were identified; 8 studies with 10 data sets met our inclusion criteria and were finally included in the meta-analysis (
) [ , - , - ]. The primary characteristics of the included studies are summarized in . In total, 218,100 patients were included in this review. Most of these patients were from America and Europe; only 1 data set’s participants were from Taiwan, China. Six studies [ , , , - ] predicted short-term (≤5 year) breast cancer risk, while 2 studies [ , ] predicted long-term (future or lifetime) risk. The characteristics and performance of the machine learning–based breast cancer risk prediction models are summarized in . Most of the machine learning prediction models were development models; only 1 study [ ] used 3 different ethnic groups for external validation. Neural network was the most common machine learning method for the development of breast cancer risk prediction models. Only 1 neural network–based model incorporated genetic risk factors [ ], and 6 neural network–based models incorporated imaging features [ , , , ].

Study ID | Study design | Study population, geographic location | Sample size | Age (years) | Study period | Breast cancer risk | Participants in test data set (n) | Cancers in test data set (n) |
Yala et al [ ], 2021 | Retrospective study | Massachusetts General Hospital, USA | 70,972 | 40-80 | 2009-2016 | 5 years | 7005 | 588 |
Yala et al [ ], 2021 | Retrospective study | Cohort of Screen-Aged Women, Karolinska University Hospital, Sweden | 7353 | 40-74 | 2008-2016 | 5 years | 7353 | 1413 |
Yala et al [ ], 2021 | Retrospective study | Chang Gung Memorial Hospital, Taiwan | 13,356 | 40-70 | 2010-2011 | 5 years | 13,356 | 244 |
Ming et al [ ], 2020 | Retrospective study | Oncogenetic Unit, Geneva University Hospital, Switzerland | 45,110 | 20-80 | 1998-2017 | Lifetime | 36,146 | 4911 |
Portnoi et al [ ], 2019 | Retrospective study | A large tertiary academic medical center, Massachusetts General Hospital, USA | 1183 | 40-80 | 2011-2013 | 5 years | 1164 | 96 |
Stark et al [ ], 2019 | Prospective study | Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial data set, USA | 64,739 | 50-78 | 1993-2001 | 5 years | 12,948 | 269 |
Dembrower et al [ ], 2020 | Retrospective study | Cohort of Screen-Aged Women, Karolinska University Hospital, Sweden | 14,034 | 40-74 | 2008-2015 | Future | 2283 | 278 |
Arefan et al [ ], 2020 | Retrospective case-control study | Health Insurance Portability and Accountability Act–compliant cohort, USA | 226 | 41-89 | 2013 | Short-term | 226 | 113 |
Tan et al [ ], 2013 | Retrospective study | University of Oklahoma Medical Center, USA | 994 | —a | 2006 | 12-36 months | 994 | 283 |
Saha et al [ ], 2019 | Retrospective study | Duke University School of Medicine, USA | 133 | 27-76 | 2004-2013 | 2 years | 133 | 46 |
aNot available.
Study ID, model name | Statistical method | Development/validation model | Model input parameters | Incorporation of imaging features | AUCa (95% CI) | ||||||
Yala et al [ ], 2021
Tyrer-Cuzick modelb | Logistic regression | —c | Age, weight, height, menarche age, given birth, menopause status, hormone replacement therapy usage, BRCA gene, ovarian cancer, breast biopsy, family history, hormonal factors | No | 0.62 (0.59-0.66) | ||||||
Radiologist BI-RADSd modele | Logistic regression | Development model | Mammographic features | Yes | 0.62 (0.60-0.65) | | | | | |
Image- and heatmaps model | Convolutional neural network | Development model | — | Yes | 0.64 (0.60-0.68) | ||||||
Image-only deep learning model | Convolutional neural network | Development model | Mammographic features | Yes | 0.73 (0.70-0.77) | | | | | |
Hybrid deep learning model | Convolutional neural network | Development model | Age, weight, height, menarche age, given birth, menopause status, hormone replacement therapy usage, BRCA gene, ovarian cancer, breast biopsy, family history, hormonal factors | Yes | 0.72 (0.69-0.76) | ||||||
Mirai without risk factors modelf | Convolutional neural network | Development model | Mammographic features | Yes | 0.76 (0.73-0.79) | ||||||
Mirai with risk factors model | Convolutional neural network | Development model | Age, weight, height, menarche age, given birth, menopause status, hormone replacement therapy usage, BRCA gene, ovarian cancer, breast biopsy, family history, hormonal factors | Yes | 0.76 (0.73-0.80) | ||||||
Yala et al [ ], 2021
Image-only deep learning model | Convolutional neural network | Validation model | Mammographic features | Yes | 0.71 (0.69-0.73) | | | | | |
Mirai without risk factors modelf | Convolutional neural network | Validation model | Mammographic features | Yes | 0.78 (0.76-0.80) | ||||||
Yala et al [ ], 2021
Image-only deep learning model | Convolutional neural network | Validation model | Mammographic features | Yes | 0.70 (0.66-0.73) | | | | | |
Mirai without risk factors modelf | Convolutional neural network | Validation model | Mammographic features | Yes | 0.79 (0.75-0.82) | ||||||
Ming et al [ ], 2020
BOADICEAg model | Logistic regression | — | Family pedigree, age, age at menarche, age at first live birth, parity, age at menopause, Ashkenazi Jewish ancestry, ovarian, prostate, pancreatic, contralateral, and lung/bronchus cancer diagnosis and age of onset, estrogen receptor status, progesterone receptor status, HER2 status, and BRCA1/BRCA2 germline pathogenic variant | No | 0.639h | | | | | |
Machine learning-Markov Chain Monte Carlo generalized linear mixed model | Markov Chain Monte Carlo | Development model | Family pedigree, age, age at menarche, age at first live birth, parity, age at menopause, Ashkenazi Jewish ancestry, ovarian, prostate, pancreatic, contralateral, and lung/bronchus cancer diagnosis and age of onset, estrogen receptor status, progesterone receptor status, HER2 status, and BRCA1/BRCA2 germline pathogenic variant | No | 0.851 (0.847-0.856) | | | | | |
Machine learning-adaptive boosting modele,f | Adaptive boosting | Development model | Family pedigree, age, age at menarche, age at first live birth, parity, age at menopause, Ashkenazi Jewish ancestry, ovarian, prostate, pancreatic, contralateral, and lung/bronchus cancer diagnosis and age of onset, estrogen receptor status, progesterone receptor status, HER2 status, and BRCA1/BRCA2 germline pathogenic variant | No | 0.889 (0.875-0.903) | | | | | |
Machine learning-random forest model | Random forest | Development model | Family pedigree, age, age at menarche, age at first live birth, parity, age at menopause, Ashkenazi Jewish ancestry, ovarian, prostate, pancreatic, contralateral, and lung/bronchus cancer diagnosis and age of onset, estrogen receptor status, progesterone receptor status, HER2 status, and BRCA1/BRCA2 germline pathogenic variant | No | 0.843 (0.838-0.849) | | | | | |
Portnoi et al [ ], 2019
Traditional risk factors logistic regression modele | Logistic regression | Development model | Age, weight, height, breast density, age at menarche, age at first live birth, menopause, hormone replacement therapy usage, had gene mutation, had ovarian cancer, had breast biopsy, number of first-degree relatives who have had breast cancer, race/ethnicity, history of breast cancer, and background parenchymal enhancement on magnetic resonance images | No | 0.558 (0.492-0.624) | ||||||
Magnetic resonance image-deep convolutional neural network modelf | Convolutional neural network | Development model | Full-resolution magnetic resonance images | Yes | 0.638 (0.577-0.699) | ||||||
Tyrer-Cuzick modelb | Logistic regression | — | Age, weight, height, breast density, age at menarche, age at first live birth, menopause, hormone replacement therapy usage, had gene mutation, had ovarian cancer, had breast biopsy, number of first-degree relatives who have had breast cancer, and race/ethnicity, and history of breast cancer | No | 0.493 (0.353-0.633) | ||||||
Stark et al [ ], 2019
Feed-forward artificial neural network model | Artificial neural network | Development model | Age, age at menarche, age at first live birth, number of first-degree relatives who have had breast cancer, race/ethnicity, age at menopause, an indicator of current hormone usage, number of years of hormone usage, BMI, pack years of cigarettes smoked, years of birth control usage, number of live births, an indicator of personal prior history of cancer | No | 0.608 (0.574-0.643) | | | | | |
Logistic regression modele,f | Logistic regression | Development model | Age, age at menarche, age at first live birth, number of first-degree relatives who have had breast cancer, race/ethnicity, age at menopause, an indicator of current hormone usage, number of years of hormone usage, BMI, pack years of cigarettes smoked, years of birth control usage, number of live births, an indicator of personal prior history of cancer | No | 0.613 (0.579-0.647) | | | | | |
Gaussian naive Bayes model | Gaussian naive Bayes | Development model | Age, age at menarche, age at first live birth, number of first-degree relatives who have had breast cancer, race/ethnicity, age at menopause, an indicator of current hormone usage, number of years of hormone usage, BMI, pack years of cigarettes smoked, years of birth control usage, number of live births, an indicator of personal prior history of cancer | No | 0.589 (0.555-0.623) | | | | | |
Decision tree model | Decision tree | Development model | Age, age at menarche, age at first live birth, number of first-degree relatives who have had breast cancer, race/ethnicity, age at menopause, an indicator of current hormone usage, number of years of hormone usage, BMI, pack years of cigarettes smoked, years of birth control usage, number of live births, an indicator of personal prior history of cancer | No | 0.508 (0.496-0.521) | | | | | |
Linear discriminant analysis model | Linear discriminant analysis | Development model | Age, age at menarche, age at first live birth, number of first-degree relatives who have had breast cancer, race/ethnicity, age at menopause, an indicator of current hormone usage, number of years of hormone usage, BMI, pack years of cigarettes smoked, years of birth control usage, number of live births, an indicator of personal prior history of cancer | No | 0.613 (0.579-0.646) | | | | | |
Support vector machine model | Support vector machine | Development model | Age, age at menarche, age at first live birth, number of first-degree relatives who have had breast cancer, race/ethnicity, age at menopause, an indicator of current hormone usage, number of years of hormone usage, BMI, pack years of cigarettes smoked, years of birth control usage, number of live births, an indicator of personal prior history of cancer | No | 0.518 (0.484-0.551) | | | | | |
Breast Cancer Risk Prediction Tool modelb | Logistic regression | — | Age, age at menarche, age at first live birth, number of first-degree relatives who have had breast cancer, race/ethnicity, age at menopause, an indicator of current hormone usage, number of years of hormone usage, BMI, pack years of cigarettes smoked, years of birth control usage, number of live births, an indicator of personal prior history of cancer | No | 0.563 (0.528-0.597) | | | | | |
Dembrower et al [ ], 2020
Deep learning risk score model | Deep neural network | Development model | Mammographic images, the age at image acquisition, exposure, tube current, breast thickness, and compression force | Yes | 0.65 (0.63-0.66) | ||||||
Dense area modelb,e | Logistic regression | Development model | Mammographic features | Yes | 0.58 (0.57-0.60) | ||||||
Percentage density modelb | Logistic regression | Development model | Mammographic features | Yes | 0.54 (0.52-0.56) | ||||||
Deep learning risk score + dense area + percentage density modelf | Deep neural network | Development model | Mammographic images, the age at image acquisition, exposure, tube current, breast thickness, and compression force | Yes | 0.66 (0.64-0.67) | ||||||
Arefan et al [ ], 2020
End-to-end convolutional neural network model using GoogLeNet | Convolutional neural network | Development model | Imaging features of the whole-breast region | Yes | 0.62 (0.58-0.66) | ||||||
End-to-end convolutional neural network model using GoogLeNet | Convolutional neural network | Development model | Imaging features of the dense breast region only | Yes | 0.67 (0.61-0.73) | ||||||
GoogLeNet combining a linear discriminant analysis model | Linear discriminant analysis | Development model | Imaging features of the whole-breast region | Yes | 0.64 (0.58-0.70) | ||||||
GoogLeNet combining a linear discriminant analysis modele,f | Linear discriminant analysis | Development model | Imaging features of the dense breast region only | Yes | 0.72 (0.67-0.76) | ||||||
Area-based percentage breast density modelb | Logistic regression | Development model | Percentage breast density | Yes | 0.54 (0.49-0.59) | ||||||
Tan et al [ ], 2013
Support vector machine classification modele,f | Support vector machine classification | Validation model | Age, family history, breast density, mean pixel value difference, mean value of short run emphasis; maximum value of short run emphasis, standard deviation of the r-axis cumulative projection histogram, standard deviation of the y-axis cumulative projection histogram, median of the x-axis cumulative projection histogram, mean pixel value, mean value of short run low gray-level emphasis, and median of the x-axis cumulative projection histogram | Yes | 0.725 (0.689-0.759) | ||||||
Saha et al [ ], 2019
Mean reader scores modelb | Logistic regression | Development model | — | Yes | 0.59 (0.49-0.70) | ||||||
Median reader scores modelb | Logistic regression | Development model | — | Yes | 0.60 (0.51-0.69) | ||||||
Machine learning model 1 | Machine learning logistic regression | Development model | Magnetic resonance image background parenchymal enhancement features were based on the fibroglandular tissue mask on the fat saturated sequence | Yes | 0.63 (0.52-0.73) | ||||||
Machine learning model 2e,f | Machine learning logistic regression | Development model | Magnetic resonance image background parenchymal enhancement features were based on the fibroglandular tissue segmentation using the non–fat-saturated sequence | Yes | 0.70 (0.60-0.79) |
aAUC: area under the curve.
bTraditional risk factor–based optimal breast cancer risk prediction model.
cNot available.
dBI-RADS: Breast Imaging Reporting and Data System.
eNonneural network–based optimal breast cancer risk prediction model.
fMachine learning–based optimal breast cancer risk prediction model.
gBOADICEA: Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm.
h95% CI not available.
Study Quality
PROBAST was used to assess the quality of the included studies in terms of both ROB and clinical applicability. All 8 studies demonstrated a low applicability risk; only 1 of the papers had low ROB [
], indicating that most machine learning models have technical pitfalls ( ). The other 7 studies had a high ROB, mostly in the analysis domain, for the following reasons: (1) no information was provided on how continuous or categorical predictors were handled, or they were handled unreasonably; (2) complexities in the data were not assessed in the final analysis; (3) model calibration was not assessed or lacked a standardized evaluation; (4) the calculation formulas of the predictors and their weights were not reported in the final model; and (5) an insufficient number of participants was used to develop the models. The details are shown in . Only 3 neural network–based models were developed with bootstrapping and cross-validation to evaluate the discrimination ability of the prediction model, whereas the other machine learning models and regression models were developed using random or nonrandom data splits.

Study | Risk of bias: participants | Risk of bias: predictors | Risk of bias: outcome | Risk of bias: analysis | Applicability: participants | Applicability: predictors | Applicability: outcome | Overall risk of bias | Overall applicability |
Yala et al [ ], 2021 | LRa | LR | LR | HRb | LR | LR | LR | LR | LR |
Ming et al [ ], 2020 | LR | HR | LR | HR | LR | LR | LR | HR | LR |
Portnoi et al [ ], 2019 | LR | LR | LR | HR | LR | LR | LR | HR | LR |
Stark et al [ ], 2019 | LR | LR | LR | HR | LR | LR | LR | HR | LR |
Dembrower et al [ ], 2020 | LR | LR | LR | HR | LR | LR | LR | HR | LR |
Arefan et al [ ], 2020 | LR | LR | LR | HR | LR | LR | LR | HR | LR |
Tan et al [ ], 2013 | LR | LR | LR | HR | LR | LR | LR | HR | LR |
Saha et al [ ], 2019 | LR | LR | LR | HR | LR | LR | LR | HR | LR |
aLR: low risk.
bHR: high risk.
Predictive Performance
The pooled AUC of the machine learning–based optimal breast cancer risk prediction model reported in each included study was 0.73 (95% CI 0.66-0.80; approximate 95% PI 0.56-0.96), with a high level of heterogeneity between studies (Q=576.07, I2=98.44%; P<.001) (
). We also performed metaregression, and the results showed that the heterogeneity remained high and essentially unchanged. Sensitivity analysis showed that the pooled AUC and 95% CI were not significantly altered by the omission of any single data set, ranging from 0.72 (95% CI 0.67-0.76; approximate 95% PI 0.60-0.85) to 0.75 (95% CI 0.68-0.82; approximate 95% PI 0.57-0.98) ( ). The head-to-head comparison of the performance difference between the 2 types of models trained on the same data set showed that the pooled AUC of the machine learning prediction models (0.69, 95% CI 0.63-0.74; approximate 95% PI 0.57-0.83; A) was higher than that of the traditional risk factor–based models, which ranged from 0.56 (95% CI 0.55-0.58; approximate 95% PI 0.51-0.62) to 0.58 (95% CI 0.57-0.59; approximate 95% PI 0.51-0.62) (all Pheterogeneity<.001) ( B-3E).

The pooled AUC of the neural network–based breast cancer risk prediction models was 0.71 (95% CI 0.65-0.77; approximate 95% PI 0.57-0.87; Q=131.42; I2=95.43%; P<.001) ( A), which was higher than that of the nonneural network–based optimal risk prediction models (0.68, 95% CI 0.56-0.81; approximate 95% PI 0.53-0.81; Q=1268.99; I2=99.45%; P<.001) ( B). When stratified by the incorporation of imaging features, the pooled AUC was 0.73 (95% CI 0.67-0.79) for models incorporating imaging features and 0.61 (95% CI 0.57-0.64) for models without imaging features (Pheterogeneity=.001) ( ). Subgroup analysis also showed that the pooled AUC of models without genetic risk factors was not significantly lower than that of models incorporating genetic risk factors (0.71 vs 0.76; Pheterogeneity=.12) ( ). Our results also showed that models predicting short-term (≤5 year) breast cancer risk had a slightly higher pooled AUC than those predicting long-term risk (0.72 vs 0.66), although the difference was not significant (Pheterogeneity=.10) ( ).

The funnel plot indicated that there was no publication bias, with an Egger regression coefficient of –3.85 (P=.46) (
). According to the trim-and-fill method, 2 studies had to be trimmed, and the adjusted pooled AUC was 0.75 (95% CI 0.69-0.82) after trimming ( ).

Model, subgroup | Area under the curve (95% CI) | Pheterogeneity value |
Model with/without imaging features | | .001 |
Model incorporated with imaging features | 0.73 (0.67-0.79) | ||
Model not incorporated with imaging features | 0.61 (0.57-0.64) | ||
Model with/without genetic risk factors | | .12 |
Model incorporated with genetic risk factors | 0.76 (0.73-0.80) | ||
Model not incorporated with genetic risk factors | 0.71 (0.65-0.77) | ||
Model prediction of risk | | .10 |
Model predicting short-term risk | 0.72 (0.65-0.78) | ||
Model predicting long-term risk | 0.66 (0.64-0.67) |
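The Pheterogeneity values in the table above compare the pooled AUCs between 2 subgroups. As an illustration, the following is a minimal sketch of how such a comparison can be computed from the pooled estimates and their 95% CIs; it is not the MedCalc implementation used in the analysis, it assumes the subgroups are independent and the CIs are symmetric, and the helper names are ours.

```python
import numpy as np
from scipy import stats

def se_from_ci(lower, upper, z=1.96):
    """Back-calculate a standard error from a symmetric 95% CI."""
    return (upper - lower) / (2 * z)

def subgroup_difference_p(auc1, ci1, auc2, ci2):
    """Two-sided P value for the difference between two independent pooled AUCs.

    With 2 subgroups, this z-test is equivalent to the Q-based test for
    subgroup differences used by standard meta-analysis software.
    """
    se1 = se_from_ci(*ci1)
    se2 = se_from_ci(*ci2)
    z = (auc1 - auc2) / np.sqrt(se1 ** 2 + se2 ** 2)
    return 2 * (1 - stats.norm.cdf(abs(z)))

# Values from the subgroup table above: models with vs without imaging features
p_imaging = subgroup_difference_p(0.73, (0.67, 0.79), 0.61, (0.57, 0.64))
print(f"P for heterogeneity (imaging features): {p_imaging:.3f}")  # approximately .001
```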
Discussion
Principal Findings
In this meta-analysis of 8 studies, the pooled AUC of machine learning–based breast cancer risk prediction models was 0.73 (95% CI 0.66-0.80). Head-to-head comparisons of the 2 types of models trained on the same data set showed that machine learning models performed slightly better than traditional risk factor–based models in predicting future breast cancer risk. Machine learning approaches have the potential to achieve better accuracy and to incorporate different types of information, including traditional risk factors, imaging features, genetic data, and clinical factors. However, of note, the predictive ability of the machine learning models showed substantial heterogeneity among the studies included in this review.
Machine learning is a data-driven approach; by employing a variety of statistical, probabilistic, and optimization techniques, it can learn from past examples, detect hard-to-discern patterns in large and noisy data sets, and model nonlinear and more complex relationships [
]. This capability of machine learning algorithms offers a possibility for the investigation and development of risk prediction and diagnostic prediction models in cancer research [ ]. It is evident that the use of machine learning methods can improve our understanding of cancer occurrence and progression [ , ]. Thus, machine learning–based breast cancer risk prediction models with improved discriminatory power can stratify women into different risk groups, which is useful for guiding personalized breast cancer screening choices and achieving a good balance between risk benefit and cost benefit in breast cancer screening.

In our stratified analysis, neural network–based breast cancer risk prediction models incorporating imaging features showed superior performance. This result suggests that the incorporation of imaging inputs in machine learning models can deliver more accurate breast cancer risk prediction. Previous breast cancer risk assessments have already recognized the importance of imaging features in mammography [
, ], but existing models were based on an underlying pattern assessed visually by radiologists, with the whole mammogram subjectively summarized as a density score used as the model input [ ]. A single density score is unlikely to take full advantage of the imaging features, and other human-specified features may not capture all the risk-relevant information in the image. However, the flexibility of neural networks might allow the extraction of more information from both finer patterns and overall image characteristics, which can improve the accuracy of the prediction models.

The findings in this study showed that neural network–based models predicting short-term (≤5 year) breast cancer risk had slightly better discriminatory accuracy than models predicting long-term risk, although the confidence intervals overlapped. Improvements in public health literacy and the popularization of healthy lifestyles have given women more opportunities over their lifetimes to participate in breast cancer prevention and screening and to modify identified modifiable risk factors associated with breast cancer. Unlike many currently known risk factors that remain constant, short-term risk factors may change over time, and the cumulative effect of these changes may reduce the incidence of breast cancer. Therefore, predicting the long-term risk of breast cancer from such risk factors may be unreliable and may lead to a high probability of false-positive recall.
Model Reliability and Clinical Feasibility
Our study revealed several issues regarding machine learning model reliability. The PROBAST analysis indicated that machine learning models have technical pitfalls. First, most machine learning models did not report sufficient statistical analysis information, and only a few studies [
, ] provided the details needed for model reproduction. Second, many machine learning models had poorly reported calibration analyses, which makes the assessment of their utility problematic and may lead to inaccurate evaluation of future breast cancer risk. Third, only 1 study [ ] reported machine learning models that were externally validated in different ethnic populations. Six neural network–based models incorporated many complex imaging features, which may make it impossible for clinicians or public health physicians to calculate breast cancer risk with these models quickly and conveniently by hand. This may also be why few studies have carried out external validation of the machine learning models. Owing to the complexity of machine learning algorithms, many studies included many different types of predictors in model construction, which may lead to overfitting of the machine learning models [ ]. However, only a few development studies [ , , ] reported the details of their predictor selection processes, which may lower the clinical feasibility of the machine learning models.

Limitations
This review had several limitations. First, most of the included studies [
, - ] did not provide the expected/observed ratio or other indicators that could evaluate the calibration of the risk prediction models; therefore, this meta-analysis could not comprehensively review the calibration of the machine learning–based breast cancer risk prediction models. Second, substantial heterogeneity was present in this systematic review, which impeded us from making further rigorous comparisons. The heterogeneity could be partially explained, but not markedly diminished, by differences in the risk prediction time frame and in whether imaging features and genetic risk factors were incorporated. The results of this meta-analysis should therefore be interpreted carefully within this context. Third, the pooled results for the machine learning prediction models were based largely on included studies with a high ROB [ - , - ]. These studies were rated as having a high ROB because complexities in the data were not assessed or because the calculation formulas of the predictors and their weights were not reported in the final model; these parameters, the so-called "black boxes," are almost never presented in the original studies. Moreover, we performed a fair head-to-head comparison of the performance difference between the 2 types of models trained on the same data set, and the results showed that machine learning models had a slight advantage in predicting future breast cancer risk. Lastly, we mainly focused on the statistical measures of model performance and did not discuss how to meta-analyze clinical measures of performance such as net benefit. Hence, further research on how to meta-analyze net benefit estimates is needed.

Conclusions
In summary, machine learning–based breast cancer risk prediction models performed slightly better than traditional risk factor–based models in predicting future breast cancer risk in head-to-head comparisons under the same experimental settings. However, machine learning–based breast cancer risk prediction models had some technical pitfalls, and their clinical feasibility and reliability were unsatisfactory. Future research that obtains individual participant data would be worthwhile to investigate in more detail how machine learning models perform across different populations and subgroups. We also suggest that such models could be implemented alongside breast cancer screening programs to help develop optimal screening strategies, especially screening intervals.
Acknowledgments
The authors thank Professor Yuan Wang, who provided suggestions for the analysis and for editing this manuscript. This study was supported by the National Natural Science Foundation of China (grants 71804124, 71904142, 72104179).
Authors' Contributions
QZ and YW conceptualized the data. Shu Li and YJ curated the data. YG performed the formal analysis and wrote the original draft. YG, Shu Li, and LZ performed the methodology. SS and XX administered the project. Shuqian Li supervised this study. YG, Shu Li, and HY reviewed and edited the manuscript. All authors read and agreed to the published version of the manuscript.
Conflicts of Interest
None declared.
Search strategy.
DOCX File , 17 KB
Details of risk of bias and the clinical applicability of the included studies.
DOCX File , 16 KB
Sensitivity analysis of the pooled area under the curve of the machine learning–based breast cancer risk prediction models.
DOCX File , 15 KB
Funnel plot of the discrimination of (A) machine learning–based breast cancer risk prediction model and (B) funnel plot adjusted by the trim-and-fill method. AUC: area under the curve.
PNG File , 40 KB

References
- Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2018 Nov;68(6):394-424 [FREE Full text] [CrossRef] [Medline]
- Lauby-Secretan B, Scoccianti C, Loomis D, Benbrahim-Tallaa L, Bouvard V, Bianchini F, International Agency for Research on Cancer Handbook Working Group. Breast-cancer screening--viewpoint of the IARC Working Group. N Engl J Med 2015 Jun 11;372(24):2353-2358. [CrossRef] [Medline]
- Massat NJ, Dibden A, Parmar D, Cuzick J, Sasieni PD, Duffy SW. Impact of Screening on Breast Cancer Mortality: The UK Program 20 Years On. Cancer Epidemiol Biomarkers Prev 2016 Mar;25(3):455-462. [CrossRef] [Medline]
- Mühlberger N, Sroczynski G, Gogollari A, Jahn B, Pashayan N, Steyerberg E, et al. Cost effectiveness of breast cancer screening and prevention: a systematic review with a focus on risk-adapted strategies. Eur J Health Econ 2021 Nov;22(8):1311-1344. [CrossRef] [Medline]
- Arnold M, Pfeifer K, Quante AS. Is risk-stratified breast cancer screening economically efficient in Germany? PLoS One 2019;14(5):e0217213 [FREE Full text] [CrossRef] [Medline]
- Brentnall AR, Cuzick J, Buist DSM, Bowles EJA. Long-term Accuracy of Breast Cancer Risk Assessment Combining Classic Risk Factors and Breast Density. JAMA Oncol 2018 Sep 01;4(9):e180174 [FREE Full text] [CrossRef] [Medline]
- Yala A, Mikhael PG, Strand F, Lin G, Smith K, Wan Y, et al. Toward robust mammography-based models for breast cancer risk. Sci Transl Med 2021 Jan 27;13(578):eaba4373. [CrossRef] [Medline]
- Wang X, Huang Y, Li L, Dai H, Song F, Chen K. Assessment of performance of the Gail model for predicting breast cancer risk: a systematic review and meta-analysis with trial sequential analysis. Breast Cancer Res 2018 Mar 13;20(1):18 [FREE Full text] [CrossRef] [Medline]
- Amir E, Evans DG, Shenton A, Lalloo F, Moran A, Boggis C, et al. Evaluation of breast cancer risk assessment packages in the family history evaluation and screening programme. J Med Genet 2003 Nov;40(11):807-814 [FREE Full text] [CrossRef] [Medline]
- Brentnall AR, Harkness EF, Astley SM, Donnelly LS, Stavrinos P, Sampson S, et al. Mammographic density adds accuracy to both the Tyrer-Cuzick and Gail breast cancer risk models in a prospective UK screening cohort. Breast Cancer Res 2015 Dec 01;17(1):147 [FREE Full text] [CrossRef] [Medline]
- Meads C, Ahmed I, Riley RD. A systematic review of breast cancer incidence risk prediction models with meta-analysis of their performance. Breast Cancer Res Treat 2012 Apr;132(2):365-377. [CrossRef] [Medline]
- Tice JA, Cummings SR, Smith-Bindman R, Ichikawa L, Barlow WE, Kerlikowske K. Using clinical factors and mammographic breast density to estimate breast cancer risk: development and validation of a new predictive model. Ann Intern Med 2008 Mar 04;148(5):337-347 [FREE Full text] [CrossRef] [Medline]
- Gail MH, Costantino JP, Pee D, Bondy M, Newman L, Selvan M, et al. Projecting individualized absolute invasive breast cancer risk in African American women. J Natl Cancer Inst 2007 Dec 05;99(23):1782-1792. [CrossRef] [Medline]
- Matsuno RK, Costantino JP, Ziegler RG, Anderson GL, Li H, Pee D, et al. Projecting individualized absolute invasive breast cancer risk in Asian and Pacific Islander American women. J Natl Cancer Inst 2011 Jun 22;103(12):951-961 [FREE Full text] [CrossRef] [Medline]
- Boggs DA, Rosenberg L, Adams-Campbell LL, Palmer JR. Prospective approach to breast cancer risk prediction in African American women: the black women's health study model. J Clin Oncol 2015 Mar 20;33(9):1038-1044 [FREE Full text] [CrossRef] [Medline]
- Kim G, Bahl M. Assessing Risk of Breast Cancer: A Review of Risk Prediction Models. J Breast Imaging 2021;3(2):144-155 [FREE Full text] [CrossRef] [Medline]
- Obermeyer Z, Emanuel EJ. Predicting the Future - Big Data, Machine Learning, and Clinical Medicine. N Engl J Med 2016 Sep 29;375(13):1216-1219 [FREE Full text] [CrossRef] [Medline]
- Dreiseitl S, Ohno-Machado L. Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 2002;35(5-6):352-359 [FREE Full text] [CrossRef] [Medline]
- Ming C, Viassolo V, Probst-Hensch N, Dinov ID, Chappuis PO, Katapodi MC. Machine learning-based lifetime breast cancer risk reclassification compared with the BOADICEA model: impact on screening recommendations. Br J Cancer 2020 Sep;123(5):860-867 [FREE Full text] [CrossRef] [Medline]
- Portnoi T, Yala A, Schuster T, Barzilay R, Dontchos B, Lamb L, et al. Deep Learning Model to Assess Cancer Risk on the Basis of a Breast MR Image Alone. AJR Am J Roentgenol 2019 Jul;213(1):227-233. [CrossRef]
- Stark GF, Hart GR, Nartowt BJ, Deng J. Predicting breast cancer risk using personal health data and machine learning models. PLoS One 2019;14(12):e0226765 [FREE Full text] [CrossRef] [Medline]
- Moher D, Liberati A, Tetzlaff J, Altman DG, PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ 2009 Jul 21;339:b2535 [FREE Full text] [CrossRef] [Medline]
- Moons KGM, de Groot JAH, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med 2014 Oct;11(10):e1001744 [FREE Full text] [CrossRef] [Medline]
- Debray TPA, Damen JAAG, Snell KIE, Ensor J, Hooft L, Reitsma JB, et al. A guide to systematic review and meta-analysis of prediction model performance. BMJ 2017 Jan 05;356:i6460 [FREE Full text] [CrossRef] [Medline]
- Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, PROBAST Group†. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies. Ann Intern Med 2019 Jan 01;170(1):51-58 [FREE Full text] [CrossRef] [Medline]
- Moons KGM, Wolff RF, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies: Explanation and Elaboration. Ann Intern Med 2019 Jan 01;170(1):W1-W33 [FREE Full text] [CrossRef] [Medline]
- DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986 Sep;7(3):177-188. [CrossRef] [Medline]
- Riley RD, Higgins JPT, Deeks JJ. Interpretation of random effects meta-analyses. BMJ 2011 Feb 10;342:d549. [CrossRef] [Medline]
- Duval S, Tweedie R. Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics 2000 Jun;56(2):455-463. [CrossRef] [Medline]
- Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ 1997 Sep 13;315(7109):629-634 [FREE Full text] [CrossRef] [Medline]
- Dembrower K, Liu Y, Azizpour H, Eklund M, Smith K, Lindholm P, et al. Comparison of a Deep Learning Risk Score and Standard Mammographic Density Score for Breast Cancer Risk Prediction. Radiology 2020 Feb;294(2):265-272. [CrossRef] [Medline]
- Arefan D, Mohamed AA, Berg WA, Zuley ML, Sumkin JH, Wu S. Deep learning modeling using normal mammograms for predicting breast cancer risk. Med Phys 2020 Jan;47(1):110-118 [FREE Full text] [CrossRef] [Medline]
- Tan M, Zheng B, Ramalingam P, Gur D. Prediction of near-term breast cancer risk based on bilateral mammographic feature asymmetry. Acad Radiol 2013 Dec;20(12):1542-1550 [FREE Full text] [CrossRef] [Medline]
- Saha A, Grimm LJ, Ghate SV, Kim CE, Soo MS, Yoon SC, et al. Machine learning-based prediction of future breast cancer using algorithmically measured background parenchymal enhancement on high-risk screening MRI. J Magn Reson Imaging 2019 Aug;50(2):456-464 [FREE Full text] [CrossRef] [Medline]
- Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Inform 2007 Feb 11;2:59-77 [FREE Full text] [Medline]
- Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J 2015;13:8-17 [FREE Full text] [CrossRef] [Medline]
- Hutt S, Mihaies D, Karteris E, Michael A, Payne AM, Chatterjee J. Statistical Meta-Analysis of Risk Factors for Endometrial Cancer and Development of a Risk Prediction Model Using an Artificial Neural Network Algorithm. Cancers (Basel) 2021 Jul 22;13(15):3689 [FREE Full text] [CrossRef] [Medline]
- Tan M, Zheng B, Leader JK, Gur D. Association Between Changes in Mammographic Image Features and Risk for Near-Term Breast Cancer Development. IEEE Trans Med Imaging 2016 Jul;35(7):1719-1728. [CrossRef]
- Sun Z, Dong W, Shi H, Ma H, Cheng L, Huang Z. Comparing Machine Learning Models and Statistical Models for Predicting Heart Failure Events: A Systematic Review and Meta-Analysis. Front Cardiovasc Med 2022;9:812276 [FREE Full text] [CrossRef] [Medline]
Abbreviations
AUC: area under the curve
PI: prediction interval
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analysis
PROBAST: Prediction Model Risk of Bias Assessment Tool
ROB: risk of bias
Edited by A Mavragani, H Bradley; submitted 17.12.21; peer-reviewed by A Clift, A Spini; comments to author 10.05.22; revised version received 17.06.22; accepted 25.11.22; published 29.12.22
Copyright©Ying Gao, Shu Li, Yujing Jin, Lengxiao Zhou, Shaomei Sun, Xiaoqian Xu, Shuqian Li, Hongxi Yang, Qing Zhang, Yaogang Wang. Originally published in JMIR Public Health and Surveillance (https://publichealth.jmir.org), 29.12.2022.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.