This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
Sexually transmitted infections (STIs) pose a significant public health challenge in the United States. Traditional surveillance systems are adversely affected by data quality issues, underreporting of cases, and reporting delays, resulting in missed opportunities to respond to trends in disease prevalence. Search engine data can potentially provide an efficient and economical enhancement to the surveillance reporting systems established for STIs.
We aimed to develop and train a predictive model using reported STI case data from Chicago, Illinois, and to investigate the model’s predictive capacity, timeliness, and ability to target interventions to subpopulations using Google Trends data.
Deidentified STI case data for chlamydia, gonorrhea, and primary and secondary syphilis from 2011-2017 were obtained from the Chicago Department of Public Health. The data set included race/ethnicity, age, and birth sex. Google Correlate was used to identify the top 100 correlated search terms with “STD symptoms,” and an autocrawler was established using Google Health Application Programming Interface to collect the search volume for each term. Elastic net regression was used to evaluate prediction accuracy, and cross-correlation analysis was used to identify timeliness of prediction. Subgroup elastic net regression analysis was performed for race, sex, and age.
For gonorrhea and chlamydia, actual and predicted STI values correlated moderately in 2011 (chlamydia:
Integrating nowcasting with Google Trends in surveillance activities can potentially enhance the prediction and timeliness of outbreak detection and response as well as target interventions to subpopulations. Future studies should prospectively examine the utility of Google Trends applied to STI surveillance and response.
Gonorrhea, chlamydia, and syphilis continue to pose a significant public health challenge with approximately 3.7 million new diagnoses each year in the United States [
National surveillance relies on mandatory laboratory and case reporting, a system that produces data that are often incomplete and limited in scope [
Although the internet is not a “new” technology, it is relatively new to the surveillance of infectious diseases. In 2004, Eysenbach coined the term “infodemiology” to describe the distribution of determinants of information on the internet and “infoveillance” to refer to syndromic surveillance of disease via the internet [
We build on our previous study, which found that Google search trend volume was positively and statistically significantly correlated with reported annual rates of STIs at the state level [
A deidentified data set containing weekly STI case data for chlamydia, gonorrhea, and primary and secondary syphilis for 2011-2017 was obtained from the CDPH. The Institutional Review Board at the CDPH reviewed and approved the project proposal as exempt. Gonorrhea and primary and secondary syphilis have been nationally notifiable infections since 1944, and chlamydia, since 2010 [
The performance ability and predictive accuracy of the model rely on the selection of the search terms. To account for the breadth of terms that can possibly be used, the top 100 related search terms were obtained from Google Correlate using the initial term “std symptoms” [
We used Google Health Application Programming Interface (API) (https://trends.google.com/trends/) data [
The weekly case counts for each STI were non-negative count variables. We applied Poisson regression modeling, as dictated by the outcome distribution and in consideration of the Google Trends data [
As the number of search query terms increases and exceeds the number of observations (in this case, the number of weeks), the model is affected by the curse of dimensionality, also known as the small-n, large-p problem. In addition, many of the query volumes may be zero because many queries are irrelevant (ie, assuming sparsity). Regularized regression schemes, such as lasso and elastic net, can address this problem [
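One common way to write a Poisson model with an elastic net penalty is sketched below. The notation is illustrative and is not taken from the study's code: y_t is assumed to be the case count in week t, x_{tj} the search volume of term j in week t, n the number of weeks, p the number of terms, λ the overall penalty strength, and α the mix between the lasso (α = 1) and ridge (α = 0) penalties.

```latex
\begin{aligned}
y_t &\sim \mathrm{Poisson}(\mu_t), \qquad \log \mu_t = \beta_0 + \sum_{j=1}^{p} \beta_j x_{tj}, \\
(\hat\beta_0, \hat\beta) &= \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{t=1}^{n}\bigl(y_t \log \mu_t - \mu_t\bigr)
  + \lambda\Bigl(\alpha \lVert\beta\rVert_1 + \frac{1-\alpha}{2}\lVert\beta\rVert_2^2\Bigr)
\end{aligned}
```

The L1 term can set the coefficients of irrelevant search terms exactly to zero, which is what keeps the fit usable when p exceeds n.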
We used the default parameter setting (10-fold cross-validation) of the elastic net implementation in MATLAB 2017b to select the best regularization parameter lambda (λ) [
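The following is a minimal Python sketch of selecting λ by k-fold cross-validation for an elastic net fit. It is not the MATLAB implementation used in the study: for brevity it swaps in squared-error loss in place of the Poisson deviance, fits by plain coordinate descent, and uses randomly shuffled folds. All function names (`fit_elastic_net`, `cv_select_lambda`, and so on) are hypothetical.

```python
import random

def soft_threshold(z, g):
    """Shrink z toward zero by g; values within [-g, g] become exactly 0."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def fit_elastic_net(X, y, lam, alpha=0.5, n_iter=500):
    """Coordinate descent for squared-error loss with an elastic net penalty.

    Minimizes (1/2n) * sum((y - b0 - X.beta)^2)
              + lam * (alpha * |beta|_1 + (1 - alpha) / 2 * |beta|_2^2).
    Assumption: squared error stands in for the Poisson deviance.
    """
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    col_ms = [sum(c * c for c in col) / n for col in cols]
    b0 = sum(y) / n
    beta = [0.0] * p
    resid = [yi - b0 for yi in y]
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            # Correlation of column j with the residual that excludes beta[j].
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            denom = col_ms[j] + lam * (1 - alpha)
            new = soft_threshold(rho, lam * alpha) / denom if denom > 0 else 0.0
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                beta[j] = new
        # Re-center the intercept so the residuals average to zero.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, beta

def predict(b0, beta, X):
    """Linear prediction for each row of X."""
    return [b0 + sum(b * x for b, x in zip(beta, row)) for row in X]

def cv_select_lambda(X, y, lambdas, alpha=0.5, k=10, seed=0):
    """Return the candidate lambda with the lowest k-fold squared error."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        err = 0.0
        for fold in folds:
            held = set(fold)
            train = [i for i in idx if i not in held]
            b0, beta = fit_elastic_net(
                [X[i] for i in train], [y[i] for i in train], lam, alpha
            )
            preds = predict(b0, beta, [X[i] for i in fold])
            err += sum((y[i] - p) ** 2 for i, p in zip(fold, preds))
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

With one row of `X` per week and one column per search term, `cv_select_lambda(X, y, candidate_lambdas)` returns the penalty to use for the final fit. Randomly shuffled folds ignore the time ordering of weekly data, so a blocked or rolling split may be preferable in a prospective setting.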
To estimate the potential time advantage of using the internet-based search terms, we applied cross-correlation analysis. Search terms were filtered by applying cross-correlation analysis to estimate the temporal relationship between the STI cases and the internet search volume derived from each term. The results were obtained as product-moment correlations between the 2 time series. The advantage of using cross-correlation is that it accounts for time dependence between 2 time series variables. The time dependence between 2 variables is termed the lag, which indicates the degree and direction of the association. A lag of –1 or +1 implies that the Google Trends data are shifted backward or forward by 1 week relative to the CDPH data. Cross-correlation analysis also reduces spurious correlations in the subsequent regression analysis by excluding irrelevant internet search trends. Search terms that lacked statistically significant correlations or a definite time lag or lead pattern were excluded. We measured our performance by using the following metrics between the predicted and actual STI counts: Pearson correlation
From 2011 to 2017, there were 170,368 reported cases of chlamydia, 65,224 reported cases of gonorrhea and 4278 reported cases of syphilis (
Demographic characteristics and number of laboratory-confirmed reported cases of chlamydia, gonorrhea, and syphilis by year for Chicago, IL.
Characteristics | Chlamydia, n (%) | Gonorrhea, n (%) | Syphilis, n (%)
Median age (years) | 22 | 23 | 31
Sex | | |
Male | 66512 (33.63) | 36282 (55.74) | 2426 (74.39)
Female | 131255 (66.36) | 28800 (44.26) | 835 (25.61)
Race | | |
White | 68687 (36.55) | 20117 (31.82) | 1965 (42.31)
Black | 107631 (57.27) | 41003 (64.86) | 2246 (48.36)
Other | 11594 (6.17) | 2096 (3.31) | 433 (9.32)
Year | | |
2011 | 27686 (75.07) | 8533 (23.14) | 658 (1.70)
2012 | 27729 (73.23) | 9551 (25.22) | 585 (1.50)
2013 | 27325 (74.18) | 8889 (24.23) | 618 (1.60)
2014 | 26990 (76.07) | 7845 (22.11) | 643 (1.80)
2015 | 28256 (76.67) | 7840 (21.27) | 758 (2.05)
2016 | 29776 (71.87) | 10836 (26.15) | 813 (1.96)
2017 | 30292 (70.75) | 11730 (27.40) | 788 (1.84)
We evaluated the predictions of STI cases from the search terms for 7 consecutive annual periods from 2011 to 2017 using elastic net regression.
Model prediction performance.
Year | Gonorrhea: correlation | Gonorrhea: MAE (%) | Chlamydia: correlation | Chlamydia: MAE (%) | Primary and secondary syphilis: correlation | Primary and secondary syphilis: MAE (%)
2011 | 0.72 | 12.56 | 0.65 | 36.12 | 0.70 | 2.50
2012 | 0.86 | 11.56 | 0.85 | 25.34 | 0.79 | 1.55
2013 | 0.88 | 19.04 | 0.94 | 37.98 | 0.77 | 2.24
2014 | 0.82 | 10.28 | 0.92 | 20.01 | 0.56 | 2.27
2015 | 0.85 | 8.27 | 0.87 | 23.27 | 0.55 | 3.02
2016 | 0.89 | 7.95 | 0.93 | 17.04 | 0.70 | 2.45
2017 | 0.90 | 10.23 | 0.91 | 22.26 | 0.79 | 1.94
MAE: mean absolute error.
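The two metrics reported in the table above can be computed along the following lines. This is a minimal sketch: the function names and the example counts are hypothetical, and because the text does not state how the MAE is converted to a percentage, the sketch returns the MAE in raw case units.

```python
import math

def pearson(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted counts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical weekly counts, for illustration only.
actual = [210, 225, 198, 240]
predicted = [205, 230, 200, 228]
r = pearson(actual, predicted)
mae = mean_absolute_error(actual, predicted)
```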
Graphical comparison between actual and predicted number of gonorrhea cases for 2017.
Graphical comparison between the actual and predicted number of chlamydia cases for 2017.
Graphical comparison between the actual and predicted number of syphilis cases for 2017.
Following the same elastic net regression procedures, we developed separate models for the race (Black vs Nonblack), sex (male vs female), and age (<30 vs ≥30 years) subgroups. The subgroup models performed well across all years and all diseases (gonorrhea, chlamydia, and syphilis), showing high correlation values and low MAEs (see
Subgroup (race) prediction performance for gonorrhea.
Year | Black: correlation | Black: MAE (%) | Nonblack: correlation | Nonblack: MAE (%)
2011 | 0.89 | 6.41 | 0.92 | 2.67
2012 | 0.85 | 8.81 | 0.89 | 3.70
2013 | 0.92 | 11.54 | 0.93 | 5.15
2014 | 0.82 | 6.22 | 0.85 | 6.56
2015 | 0.90 | 4.54 | 0.84 | 5.86
2016 | 0.84 | 5.49 | 0.93 | 1.94
2017 | 0.92 | 4.6 | 0.91 | 2.71
MAE: mean absolute error.
Subgroup (gender) prediction performance for gonorrhea.
Year | Male: correlation | Male: MAE (%) | Female: correlation | Female: MAE (%)
2011 | 0.88 | 4.73 | 0.92 | 2.67
2012 | 0.93 | 4.0 | 0.89 | 3.70
2013 | 0.96 | 5.25 | 0.93 | 5.15
2014 | 0.94 | 4.26 | 0.85 | 6.56
2015 | 0.93 | 3.94 | 0.84 | 5.86
2016 | 0.92 | 5.70 | 0.88 | 4.12
2017 | 0.91 | 6.13 | 0.90 | 3.92
MAE: mean absolute error.
Subgroup (age) prediction performance for gonorrhea.
Year | Less than 30 years: correlation | Less than 30 years: MAE (%) | 30 years and above: correlation | 30 years and above: MAE (%)
2011 | 0.83 | 9.15 | 0.81 | 3.17
2012 | 0.91 | 7.67 | 0.81 | 3.13
2013 | 0.97 | 7.59 | 0.86 | 4.42
2014 | 0.90 | 7.25 | 0.80 | 2.87
2015 | 0.98 | 2.30 | 0.87 | 2.78
2016 | 0.98 | 1.98 | 0.91 | 3.56
2017 | 0.91 | 7.22 | 0.83 | 3.92
MAE: mean absolute error.
First, we conducted a cross-correlation analysis to identify the temporal relationship between the STI data and search terms (ie, a lag or lead pattern).
Cross-correlation coefficients of reported cases of chlamydia using search term trend data for 2017^a.
Search terms | Lag –1 week | Lag 0 weeks | Lag +1 week
does chlamydia | –0.34^b ( | 0.01 | 0.001
std symptoms in women | –0.31^b ( | –0.04 | 0.16
gonorrhea | –0.28^b ( | 0.07 | 0.11
samuel l jackson movies | 0.29^b ( | –0.18 | 0.04
after period | –0.31^b ( | 0.09 | –0.04
treatment for chlamydia | 0.00 | –0.35^b ( | 0.06
a black eye | –0.11 | –0.28^b ( | 0.12
std | 0.09 | 0.33^b ( | 0.12
two weeks | –0.04 | 0.28^b ( | 0.08
crabs std | 0.03 | –0.36^b ( | 0.003
feel pregnant | –0.12 | –0.28^b ( | –0.15
bleeding after period | –0.09 | –0.32^b ( | –0.005
wine while pregnant | –0.10 | –0.24 | 0.29^b (
talk to women | –0.06 | 0.09 | –0.30^b (
^a Only significant cross-correlation coefficient values are shown in this table.
^b Values indicate the maximum cross-correlation coefficient.
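Lagged coefficients such as those in the table above can be computed along the following lines. This is a minimal sketch with hypothetical function names. The sign convention is an assumption: at lag k, the case count in week t is paired with the search volume in week t + k, so a peak at lag –1 means that searches lead cases by 1 week.

```python
import math

def pearson(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # treat a constant series as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def cross_correlation(cases, searches, lag):
    """Correlate cases[t] with searches[t + lag] over the overlapping weeks."""
    n = len(cases)
    pairs = [(cases[t], searches[t + lag]) for t in range(n) if 0 <= t + lag < n]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

def strongest_lag(cases, searches, lags=(-1, 0, 1)):
    """Return (lag, coefficient) with the largest absolute correlation."""
    scored = [(lag, cross_correlation(cases, searches, lag)) for lag in lags]
    return max(scored, key=lambda item: abs(item[1]))

# Hypothetical series in which searches lead cases by exactly 1 week.
searches = [5, 1, 4, 2, 8, 3, 7, 6]
cases = [0] + searches[:-1]
lag, coef = strongest_lag(cases, searches)
```

A term would then be retained only if its strongest coefficient is statistically significant and shows a definite lag or lead pattern; the significance test itself is not sketched here.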
A separate regression analysis including only those search terms that coincided with and preceded the STI data by 2 weeks (ie, based on the results of the cross-correlation analysis in
Correlations between actual and predicted cases of STIs for 2015-2017.
Year | Gonorrhea: correlation | Gonorrhea: MAE (%) | Chlamydia: correlation | Chlamydia: MAE (%) | Primary and secondary syphilis: correlation | Primary and secondary syphilis: MAE (%)
2015 | 0.60 | 12.35 | 0.66 | 32.22 | 0.59 | 2.94
2016 | 0.67 | 13.97 | 0.77 | 28.75 | 0.57 | 2.95
2017 | 0.46 | 21.49 | 0.65 | 37.98 | 0.52 | 2.71
MAE: mean absolute error.
We performed a series of analyses to determine the predictive ability and timeliness of Google Trends data for STI cases, as well as their performance in subgroups. The models performed consistently well across all diseases and time periods, showing moderate-to-high predictive power and low-to-moderate error. Applied nowcasting does not need to perform perfectly but must be reliable and consistent to inform disease control and response. As illustrated by analyses of Google Trends for influenza surveillance, the predictive performance of search volume may vary by disease, by location, and over time [
Our models were able to nowcast within a 1-week time frame, a substantial improvement over the delays observed with traditional STI surveillance data. Further work is needed to determine thresholds for response, including what level of increase in case volume warrants a public health response and at what intensity. For example, a jurisdiction may decide that 10%, 20%, and 30% increases in search trend volumes trigger a low-intensity response (eg, provider awareness), public awareness, and active screening campaigns, respectively. Each of these thresholds and response activities needs to be refined by local health departments based on epidemiologic trends and health department resources; given the opportunity for real-time surveillance, and thus for timely decision making and response, these efforts are urgent.
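The tiered example above could be encoded as a simple lookup, sketched below. The 10%, 20%, and 30% cut-offs come from the illustrative example in the text and are not validated thresholds; the function names and the "routine monitoring" fallback label are hypothetical.

```python
# Illustrative tiers from the example in the text, highest threshold first.
# A local health department would replace these with its own cut-offs.
RESPONSE_TIERS = [
    (30.0, "active screening campaign"),
    (20.0, "public awareness"),
    (10.0, "low-intensity response (eg, provider awareness)"),
]

def percent_increase(baseline, current):
    """Percentage change of current search volume relative to baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (current - baseline) / baseline

def response_for(increase_pct):
    """Return the most intensive response whose threshold is met."""
    for threshold, action in RESPONSE_TIERS:
        if increase_pct >= threshold:
            return action
    return "routine monitoring"  # hypothetical label for no triggered tier
```

For instance, a rise in search volume from a baseline of 100 to 125 is a 25% increase, which falls in the public awareness tier under these example cut-offs.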
The ability to target prevention and control efforts to impacted subgroups is of great utility for public health efforts. Our subgroup models performed as well as or better than the aggregate models, demonstrating the ability to monitor trends in subgroups. These analyses were limited to the data available from our local health department; future studies should be conducted to refine and enhance subgroup performance. For example, control techniques may be influenced by outbreaks in specific neighborhoods; therefore, developing models for geographical subgroups (eg, community area and zip code) would be beneficial. Further, the analyses were conducted on retrospective data and used final cleaned surveillance data sets; future studies should be conducted prospectively in real time.
The search terms that most strongly correlated with the case counts for all 3 diseases were “std symptoms,” “gonorrhea in men,” “yellow discharge,” “white creamy discharge,” “week pregnant,” and “white discharge.” All of these terms appear to be related to STI symptoms and are likely to be generated by those exposed to STIs (or cases). In a previous study, we established that those exposed to STIs are likely generating symptom-related search terms; we compared 2 different sexually active populations and found that compared to the student sample, a greater proportion of the clinical sample used the term “STD symptoms” or conducted symptom-related searches (47% vs 17%,
Google Trends supports credibility and transparency because these data are openly available, and our analyses are replicable by other investigators (see
An evaluation of 8 state-wide health systems in North Carolina compared International Classification of Diseases-9 codes to a broad range of reported cases of notifiable communicable diseases, and showed that completeness of reporting ranged from 0% to 82% depending on the particular disease [
The results of this study must be interpreted with the following limitations in mind. This study used STI case data from 1 jurisdiction in the United States; thus, it is unknown how nowcasting will function in other jurisdictions with different disease trends and search trend volumes. Future studies should include a representative selection of jurisdictions with high disease burdens and internet penetration. Our subgroup analysis was limited by the characteristics included in the STI case data available from the health department and did not include indicators such as zip code, community area, and socioeconomic status; therefore, we were not able to determine how an analysis with these characteristics would affect the targeting of resources. Finally, our study used the Google Health API, access to which is currently limited by Google to approved research institutions, thus limiting replication of the results to those who have API access.
This is the first study to examine the utility of Google Trends search volumes to predict STI cases at a city level. Future studies should replicate procedures for other US jurisdictions and prospectively examine model performance while developing tolerance levels for false positives. Integrating nowcasting with Google Trends in surveillance activities can potentially enhance the timeliness of outbreak detection and response.
Subgroup analysis.
MATLAB code.
R code.
API: application programming interface
CDPH: Chicago Department of Public Health
MAE: mean absolute error
STI: sexually transmitted infection
This work was supported by a pilot grant awarded by the Northwestern University Clinical and Translational Sciences Institute (#UL1TR001422).
None declared.