%0 Journal Article
%@ 2369-2960
%I JMIR Publications
%V 11
%N
%P e67840
%T Identifying Data-Driven Clinical Subgroups for Cervical Cancer Prevention With Machine Learning: Population-Based, External, and Diagnostic Validation Study
%A Lu,Zhen
%A Dong,Binhua
%A Cai,Hongning
%A Tian,Tian
%A Wang,Junfeng
%A Fu,Leiwen
%A Wang,Bingyi
%A Zhang,Weijie
%A Lin,Shaomei
%A Tuo,Xunyuan
%A Wang,Juntao
%A Yang,Tianjie
%A Huang,Xinxin
%A Zheng,Zheng
%A Xue,Huifeng
%A Xu,Shuxia
%A Liu,Siyang
%A Sun,Pengming
%A Zou,Huachun
%K cervical cancer
%K human papillomavirus
%K screening
%K machine learning
%K cervical tumor
%K cancer
%K carcinoma
%K tumor
%K malignant
%K ML
%K phenomapping strategy
%K logistic regression
%K regression
%K population-based
%K validation study
%K cancer prevention
%K validity
%K usability
%K algorithm
%K surveillance
%K electronic health record
%K EHR
%D 2025
%7 19.3.2025
%9
%J JMIR Public Health Surveill
%G English
%X Background: Cervical cancer remains a major global health issue. Personalized, data-driven cervical cancer prevention (CCP) strategies tailored to phenotypic profiles may improve prevention and reduce disease burden. Objective: This study aimed to identify subgroups with differential cervical precancer or cancer risks using machine learning, validate subgroup predictions across datasets, and propose a computational phenomapping strategy to enhance global CCP efforts. Methods: We explored the data-driven CCP subgroups by applying unsupervised machine learning to a deeply phenotyped, population-based discovery cohort. We extracted CCP-specific risks of cervical intraepithelial neoplasia (CIN) and cervical cancer through weighted logistic regression analyses providing odds ratio (OR) estimates and 95% CIs. We trained a supervised machine learning model and developed pathways to classify individuals before evaluating its diagnostic validity and usability on an external cohort. Results: This study included 551,934 women (median age, 49 years) in the discovery cohort and 47,130 women (median age, 37 years) in the external cohort. Phenotyping identified 5 CCP subgroups, with CCP4 showing the highest carcinoma prevalence. CCP2–4 had significantly higher risks of CIN2+ (CCP2: OR 2.07 [95% CI: 2.03‐2.12], CCP3: 3.88 [3.78‐3.97], and CCP4: 4.47 [4.33‐4.63]) and CIN3+ (CCP2: 2.10 [2.05‐2.14], CCP3: 3.92 [3.82‐4.02], and CCP4: 4.45 [4.31‐4.61]) compared to CCP1 (P<.001), consistent with the direction of results observed in the external cohort. The proposed triple strategy was validated as clinically relevant, prioritizing high-risk subgroups (CCP3-4) for colposcopies and scaling human papillomavirus screening for CCP1-2. Conclusions: This study underscores the potential of leveraging machine learning algorithms and large-scale routine electronic health records to enhance CCP strategies. By identifying key determinants of CIN2+/CIN3+ risk and classifying 5 distinct subgroups, our study provides a robust, data-driven foundation for the proposed triple strategy. This approach prioritizes tailored prevention efforts for subgroups with varying risks, offering a novel and scalable tool to complement existing cervical cancer screening guidelines. Future work should focus on independent external and prospective validation to maximize the global impact of this strategy.
%R 10.2196/67840
%U https://publichealth.jmir.org/2025/1/e67840
%U https://doi.org/10.2196/67840