TY  - JOUR
AU  - Lu, Zhen
AU  - Dong, Binhua
AU  - Cai, Hongning
AU  - Tian, Tian
AU  - Wang, Junfeng
AU  - Fu, Leiwen
AU  - Wang, Bingyi
AU  - Zhang, Weijie
AU  - Lin, Shaomei
AU  - Tuo, Xunyuan
AU  - Wang, Juntao
AU  - Yang, Tianjie
AU  - Huang, Xinxin
AU  - Zheng, Zheng
AU  - Xue, Huifeng
AU  - Xu, Shuxia
AU  - Liu, Siyang
AU  - Sun, Pengming
AU  - Zou, Huachun
PY  - 2025
DA  - 2025/3/19
TI  - Identifying Data-Driven Clinical Subgroups for Cervical Cancer Prevention With Machine Learning: Population-Based, External, and Diagnostic Validation Study
JO  - JMIR Public Health Surveill
SP  - e67840
VL  - 11
KW  - cervical cancer
KW  - human papillomavirus
KW  - screening
KW  - machine learning
KW  - cervical tumor
KW  - cancer
KW  - carcinoma
KW  - tumor
KW  - malignant
KW  - ML
KW  - phenomapping strategy
KW  - logistic regression
KW  - regression
KW  - population-based
KW  - validation study
KW  - cancer prevention
KW  - validity
KW  - usability
KW  - algorithm
KW  - surveillance
KW  - electronic health record
KW  - EHR
AB  - Background: Cervical cancer remains a major global health issue. Personalized, data-driven cervical cancer prevention (CCP) strategies tailored to phenotypic profiles may improve prevention and reduce disease burden. Objective: This study aimed to identify subgroups with differential cervical precancer or cancer risks using machine learning, validate subgroup predictions across datasets, and propose a computational phenomapping strategy to enhance global CCP efforts. Methods: We explored the data-driven CCP subgroups by applying unsupervised machine learning to a deeply phenotyped, population-based discovery cohort. We extracted CCP-specific risks of cervical intraepithelial neoplasia (CIN) and cervical cancer through weighted logistic regression analyses providing odds ratio (OR) estimates and 95% CIs. We trained a supervised machine learning model and developed pathways to classify individuals before evaluating its diagnostic validity and usability on an external cohort. Results: This study included 551,934 women (median age, 49 years) in the discovery cohort and 47,130 women (median age, 37 years) in the external cohort. Phenotyping identified 5 CCP subgroups, with CCP4 showing the highest carcinoma prevalence. CCP2–4 had significantly higher risks of CIN2+ (CCP2: OR 2.07 [95% CI: 2.03‐2.12], CCP3: 3.88 [3.78‐3.97], and CCP4: 4.47 [4.33‐4.63]) and CIN3+ (CCP2: 2.10 [2.05‐2.14], CCP3: 3.92 [3.82‐4.02], and CCP4: 4.45 [4.31‐4.61]) compared to CCP1 (P<.001), consistent with the direction of results observed in the external cohort. The proposed triple strategy was validated as clinically relevant, prioritizing high-risk subgroups (CCP3-4) for colposcopies and scaling human papillomavirus screening for CCP1-2. Conclusions: This study underscores the potential of leveraging machine learning algorithms and large-scale routine electronic health records to enhance CCP strategies. By identifying key determinants of CIN2+/CIN3+ risk and classifying 5 distinct subgroups, our study provides a robust, data-driven foundation for the proposed triple strategy. This approach prioritizes tailored prevention efforts for subgroups with varying risks, offering a novel and scalable tool to complement existing cervical cancer screening guidelines. Future work should focus on independent external and prospective validation to maximize the global impact of this strategy.
SN  - 2369-2960
UR  - https://publichealth.jmir.org/2025/1/e67840
UR  - https://doi.org/10.2196/67840
DO  - 10.2196/67840
ID  - info:doi/10.2196/67840
ER  - 