%0 Journal Article
%@ 2369-2960
%I JMIR Publications
%V 11
%N 
%P e67840
%T Identifying Data-Driven Clinical Subgroups for Cervical Cancer Prevention With Machine Learning: Population-Based, External, and Diagnostic Validation Study
%A Lu,Zhen
%A Dong,Binhua
%A Cai,Hongning
%A Tian,Tian
%A Wang,Junfeng
%A Fu,Leiwen
%A Wang,Bingyi
%A Zhang,Weijie
%A Lin,Shaomei
%A Tuo,Xunyuan
%A Wang,Juntao
%A Yang,Tianjie
%A Huang,Xinxin
%A Zheng,Zheng
%A Xue,Huifeng
%A Xu,Shuxia
%A Liu,Siyang
%A Sun,Pengming
%A Zou,Huachun
%K cervical cancer
%K human papillomavirus
%K screening
%K machine learning
%K cervical tumor
%K cancer
%K carcinoma
%K tumor
%K malignant
%K ML
%K phenomapping strategy
%K logistic regression
%K regression
%K population-based
%K validation study
%K cancer prevention
%K validity
%K usability
%K algorithm
%K surveillance
%K electronic health record
%K EHR
%D 2025
%7 19.3.2025
%9 
%J JMIR Public Health Surveill
%G English
%X Background: Cervical cancer remains a major global health issue. Personalized, data-driven cervical cancer prevention (CCP) strategies tailored to phenotypic profiles may improve prevention and reduce disease burden. Objective: This study aimed to identify subgroups with differential cervical precancer or cancer risks using machine learning, validate subgroup predictions across datasets, and propose a computational phenomapping strategy to enhance global CCP efforts. Methods: We explored the data-driven CCP subgroups by applying unsupervised machine learning to a deeply phenotyped, population-based discovery cohort. We extracted CCP-specific risks of cervical intraepithelial neoplasia (CIN) and cervical cancer through weighted logistic regression analyses providing odds ratio (OR) estimates and 95% CIs. We trained a supervised machine learning model and developed pathways to classify individuals before evaluating its diagnostic validity and usability on an external cohort. Results: This study included 551,934 women (median age, 49 years) in the discovery cohort and 47,130 women (median age, 37 years) in the external cohort. Phenotyping identified 5 CCP subgroups, with CCP4 showing the highest carcinoma prevalence. CCP2–4 had significantly higher risks of CIN2+ (CCP2: OR 2.07 [95% CI: 2.03‐2.12], CCP3: 3.88 [3.78‐3.97], and CCP4: 4.47 [4.33‐4.63]) and CIN3+ (CCP2: 2.10 [2.05‐2.14], CCP3: 3.92 [3.82‐4.02], and CCP4: 4.45 [4.31‐4.61]) compared to CCP1 (P<.001), consistent with the direction of results observed in the external cohort. The proposed triple strategy was validated as clinically relevant, prioritizing high-risk subgroups (CCP3–4) for colposcopies and scaling human papillomavirus screening for CCP1–2. Conclusions: This study underscores the potential of leveraging machine learning algorithms and large-scale routine electronic health records to enhance CCP strategies. By identifying key determinants of CIN2+/CIN3+ risk and classifying 5 distinct subgroups, our study provides a robust, data-driven foundation for the proposed triple strategy. This approach prioritizes tailored prevention efforts for subgroups with varying risks, offering a novel and scalable tool to complement existing cervical cancer screening guidelines. Future work should focus on independent external and prospective validation to maximize the global impact of this strategy.
%R 10.2196/67840
%U https://publichealth.jmir.org/2025/1/e67840
%U https://doi.org/10.2196/67840