Improved Real-Time Influenza Surveillance: Using Internet Search Data in Eight Latin American Countries

Background
Novel influenza surveillance systems that leverage Internet-based real-time data sources, including Internet search frequencies, social-network information, and crowd-sourced flu surveillance tools, have shown improved accuracy over the past few years in data-rich countries like the United States. These systems not only track flu activity accurately, but also report flu estimates a week or more ahead of the publication of reports produced by healthcare-based systems, such as those implemented and managed by the Centers for Disease Control and Prevention. Previous work has shown that novel flu surveillance systems, like Google Flu Trends (GFT), have not yet delivered acceptable flu estimates in developing countries in Latin America.

Objective
The aim of this study was to show that recent methodological improvements in the use of Internet search engine information to track diseases can lead to improved retrospective flu estimates in multiple countries in Latin America.

Methods
A machine learning-based methodology that uses flu-related Internet search activity and historical information to monitor flu activity, named ARGO (AutoRegression with Google search), was extended to generate flu predictions for 8 Latin American countries (Argentina, Bolivia, Brazil, Chile, Mexico, Paraguay, Peru, and Uruguay) for the period from January 2012 to December 2016. These retrospective (out-of-sample) influenza activity predictions were compared with historically observed suspected flu cases in each country, as reported by FluNet, an influenza surveillance database maintained by the World Health Organization. For a baseline comparison, retrospective (out-of-sample) flu estimates were produced for the same time period using autoregressive models that only leverage historical flu activity information.

Results
Our results show that ARGO-like models outperform autoregressive models in 6 of the 8 countries in the 2012-2016 time period. Moreover, ARGO significantly improves on historical flu estimates produced by the now-discontinued GFT for 2012-2015, the period for which GFT information is publicly available.

Conclusions
We demonstrate here that a self-correcting machine learning method, leveraging Internet-based disease-related search activity and historical flu trends, has the potential to produce reliable and timely flu estimates in multiple Latin American countries. This methodology may prove helpful to local public health officials who design and implement interventions aimed at mitigating the effects of influenza outbreaks. Our methodology generally outperforms both the now-discontinued tool GFT and autoregressive methodologies that exploit only historical flu activity to produce future disease estimates.


Model
An ARGO (AutoRegressive model with GOogle search queries as exogenous variables) model was developed for each country to predict NPS activity one week ahead of FluNet's surveillance reports. The training set for each model consisted of the most recent 104 weeks of data prior to the desired weekly estimation. The use of dynamic time windows allows ARGO to re-calibrate to the most recent data and reduce overshooting.
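As a rough illustration of the dynamic training window described above, the following sketch enumerates, for each target week, the indices of the 104 preceding weeks that would form its training set. The function name `rolling_training_windows` and the zero-based week indexing are assumptions made for this example only.

```python
def rolling_training_windows(n_weeks, window=104):
    """Yield (training_indices, target_index) pairs for a rolling window.

    For each target week t, the training set is the `window` most recent
    weeks strictly before t, that is, weeks t - window through t - 1.
    Weeks are indexed from 0 (an assumption of this sketch).
    """
    for t in range(window, n_weeks):
        yield range(t - window, t), t


# Toy usage: a 110-week series yields 6 out-of-sample target weeks.
windows = list(rolling_training_windows(n_weeks=110))
first_train, first_target = windows[0]
last_train, last_target = windows[-1]
```

Because each window ends at the week immediately before the target, the model never sees the value it is asked to estimate, which is what makes the resulting estimates out-of-sample.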

Formulation
The ARGO model assumes that search frequencies for disease-related Google queries rise when the disease has a higher impact, such as when more people are infected or experience symptoms. NPS is modeled using the observed Google search frequencies and past flu case reports as follows:

$$y_t = \mu_y + \sum_{j \in J} \alpha_j y_{t-j} + \sum_{k \in K} \beta_k X_{k,t} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$

where:
• $\hat{y}_t$ is our estimate of NPS at time $t$ (the fitted value of $y_t$)
• $y_{t-j}$ is the NPS observed at time $t - j$
• $J$ is the set of autoregressive lags
• $K$ is the set of Google query terms
• $X_{k,t}$ is the Google search frequency of term $k$ at time $t$
• $\mu_y$ is an intercept term.
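To make the formulation concrete, the sketch below evaluates the deterministic part of the model for one week, given already-fitted coefficients. The function name `argo_estimate`, the dictionary-based data layout, the query term `"gripe"`, and all numeric values are hypothetical and chosen only for illustration.

```python
def argo_estimate(mu, alpha, beta, y_history, x_t):
    """Return the estimate for week t from fitted coefficients.

    mu        -- intercept term
    alpha     -- dict mapping lag j to its autoregressive coefficient
    beta      -- dict mapping query term k to its coefficient
    y_history -- observed NPS values up to week t - 1 (most recent last)
    x_t       -- dict mapping query term k to its search frequency at week t
    """
    # y_history[-j] is the value observed j weeks before the target week.
    autoregressive_part = sum(a * y_history[-j] for j, a in alpha.items())
    search_part = sum(b * x_t[k] for k, b in beta.items())
    return mu + autoregressive_part + search_part


# Toy usage with made-up coefficients and a made-up query term.
estimate = argo_estimate(
    mu=1.0,
    alpha={1: 0.5, 2: 0.25},
    beta={"gripe": 2.0},
    y_history=[4.0, 8.0, 10.0],
    x_t={"gripe": 3.0},
)
```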
The endogenous and exogenous variable coefficients $\alpha_j$, $j \in J$, and $\beta_k$, $k \in K$, were fitted using multivariable linear regression with L1 regularization (LASSO), with 10-fold cross-validation used to determine the level of regularization. The regression was re-trained on a weekly basis, updating the input features with the newest available data from FluNet and Google for the next prediction. This approach recalibrates the regression coefficients each week, adjusting the weight of each variable according to its predictive ability over the training set.
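The effect of L1 regularization can be seen in a minimal coordinate-descent sketch such as the one below. It is not the implementation used in the study: the solver, the function names `soft_threshold` and `lasso_fit`, the toy data, and the hand-picked penalty `lam` are all assumptions of this example, and the 10-fold cross-validation used to select the penalty is omitted.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_fit(X, y, lam, n_sweeps=200):
    """Minimize (1/2n) * sum((y - b0 - X b)^2) + lam * sum(|b|).

    X is a list of feature rows, y a list of targets. The intercept b0 is
    not penalized. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for i in range(n):
                # Prediction for row i using every feature except j.
                partial = intercept + sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
        # Re-fit the unpenalized intercept given the current coefficients.
        intercept = sum(
            y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)
        ) / n
    return intercept, beta


# Toy data: y depends only on the first feature; the second is irrelevant.
X = [[0.0, 1.0], [1.0, -1.0], [2.0, 1.0], [3.0, -1.0], [4.0, 1.0], [5.0, -1.0]]
y = [1.0 + 2.0 * row[0] for row in X]
intercept, beta = lasso_fit(X, y, lam=0.5)
```

In this toy setting the penalty shrinks the coefficient of the informative feature and sets the coefficient of the irrelevant feature to zero, which is the variable-selection behavior that motivates using LASSO when many lags and query terms compete as predictors.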

Benchmarks
Two benchmarks were used to assess ARGO's performance. These are:

Autoregressive model
An autoregressive model uses as input only past values (i.e., lags) from the NPS time series. Using the previous notation, the estimate is then

$$\hat{y}_t = \mu_y + \sum_{j \in J} \alpha_j y_{t-j}$$

For each country, we fit an L1-regularized multivariable linear model with 52 time lags, denoted AR52.
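One way to assemble the inputs for such a lag-only model is sketched below. The function name `lagged_design` and the row layout (most recent lag first) are assumptions of this example; the resulting rows could be passed to any L1-regularized linear regression routine.

```python
def lagged_design(series, n_lags=52):
    """Build (X, y) for an autoregressive model with `n_lags` lags.

    Row t of X holds series[t-1], series[t-2], ..., series[t-n_lags],
    and the matching target is series[t]. The first `n_lags` weeks have
    no complete lag history and therefore produce no row.
    """
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append([series[t - j] for j in range(1, n_lags + 1)])
        y.append(series[t])
    return X, y


# Toy usage with 2 lags instead of 52 to keep the output readable.
X_toy, y_toy = lagged_design([0, 1, 2, 3, 4, 5], n_lags=2)
```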

Google Flu Trends
Weekly flu activity values from Google Flu Trends (GFT) were collected from January 2, 2011 through August 9, 2015 (when GFT was discontinued). Google did not scale the values to any known official flu curve, so we rescaled the values to the FluNet NPS using a linear regression over the full study period.
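The rescaling step can be sketched as a one-predictor least-squares fit. The text states only that a linear regression was used, so the choice of ordinary least squares with an intercept, the function names, and the toy numbers below are assumptions of this example.

```python
def fit_linear_rescaling(gft, nps):
    """Fit nps ~ intercept + slope * gft by ordinary least squares."""
    n = len(gft)
    mean_x = sum(gft) / n
    mean_y = sum(nps) / n
    sxx = sum((x - mean_x) ** 2 for x in gft)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gft, nps))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rescale(gft, slope, intercept):
    """Map raw GFT values onto the NPS scale."""
    return [intercept + slope * x for x in gft]


# Toy usage with made-up values that lie exactly on a line.
gft_raw = [1.0, 2.0, 3.0, 4.0]
nps_obs = [12.0, 22.0, 32.0, 42.0]
slope, intercept = fit_linear_rescaling(gft_raw, nps_obs)
gft_rescaled = rescale(gft_raw, slope, intercept)
```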
The models were used to retrospectively simulate real-time estimates of influenza NPS from January 1, 2012 through December 25, 2016. In the case of Brazil, NPS data were only available until October 9, 2016. Two accuracy metrics were used for model comparison: (1) root mean square error (RMSE) and (2) Pearson correlation. Both were calculated between each model's estimates and the FluNet NPS annually, over the entire prediction period, and over the sub-period when GFT was published (January 1, 2012 to August 9, 2015). In addition, we calculated the inverse of the mean square error ratio between ARGO and AR (hereafter the efficiency metric) for the whole study period. Weeks for which NPS activity was not available in a given country were removed prior to computation of the metrics.
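The accuracy metrics can be sketched with the standard definitions of RMSE and Pearson correlation. Two points are assumptions of this example: missing weeks are represented as `None`, and the efficiency metric is read as the AR mean square error divided by the ARGO mean square error, so that values above 1 favor ARGO. All function names and numbers are illustrative.

```python
import math


def drop_missing(estimates, observed):
    """Keep only the weeks for which an observed value is available."""
    pairs = [(e, o) for e, o in zip(estimates, observed) if o is not None]
    return [e for e, _ in pairs], [o for _, o in pairs]


def mse(estimates, observed):
    """Mean square error between estimates and observations."""
    return sum((e - o) ** 2 for e, o in zip(estimates, observed)) / len(observed)


def rmse(estimates, observed):
    """Root mean square error."""
    return math.sqrt(mse(estimates, observed))


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def efficiency(argo, ar, observed):
    """MSE of AR divided by MSE of ARGO (interpretation assumed here)."""
    return mse(ar, observed) / mse(argo, observed)


# Toy usage: the third week has no observed value and is removed.
observed = [10.0, 20.0, None, 40.0]
argo_est, obs_clean = drop_missing([11.0, 19.0, 30.0, 41.0], observed)
ar_est, _ = drop_missing([12.0, 18.0, 30.0, 42.0], observed)
argo_rmse = rmse(argo_est, obs_clean)
argo_vs_ar = efficiency(argo_est, ar_est, obs_clean)
```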

Peaks and Onsets
The following metrics were defined and used to evaluate the ability of a model to correctly predict the timing of peaks and onsets of an epidemic outbreak:
1. Onset timing: distance, in number of weeks, between the ground truth's outbreak onset and the model's estimated onset.
2. Peak timing: distance, in number of weeks, between the ground truth's outbreak peak and the model's estimated peak: $\Delta P = P_{\text{model}} - P_{\text{observed}}$

Influenza outbreak onsets were empirically identified for each country as follows:
1. For a given country, we extracted each of the five years' epidemic outbreaks, consisting of 52 data points per year (see Figure S2).
2. On a single 52-week interval, we created a new curve by summing the five years of epidemic data.
3. Using the resulting curve, we defined the threshold that separates low activity from outbreak activity as the value of the curve at week n, where n is the week at which the cumulative sum of the curve from week 1 to week n reaches 10% of its total over the 52 weeks. The threshold was then normalized (divided by 5) so that it could be applied to each individual flu season.
4. Finally, for each outbreak, we identified the onset week as the first week when this threshold was crossed.
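The onset procedure above can be sketched as follows. Several details are assumptions of this example rather than statements from the text: week n is taken as the first week whose cumulative sum is at least 10% of the total, "crossed" is read as strictly exceeding the threshold, weeks are numbered from 1, and the peak is taken as the week with the largest value. The function names and the shortened toy seasons are illustrative.

```python
def onset_threshold(seasons, fraction=0.10):
    """Return the per-season threshold from a list of equal-length seasons."""
    n_weeks = len(seasons[0])
    # Step 2: sum all seasons week by week onto a single interval.
    summed = [sum(season[w] for season in seasons) for w in range(n_weeks)]
    total = sum(summed)
    # Step 3: find the first week where the running sum reaches the fraction.
    cumulative = 0.0
    for value in summed:
        cumulative += value
        if cumulative >= fraction * total:
            # Normalize by the number of seasons (5 in the study).
            return value / len(seasons)
    raise ValueError("cumulative sum never reached the requested fraction")


def onset_week(season, threshold):
    """Step 4: first week (1-based) whose value exceeds the threshold."""
    for w, value in enumerate(season):
        if value > threshold:
            return w + 1
    return None


def peak_week(season):
    """Week (1-based) with the largest value in the season."""
    return max(range(len(season)), key=lambda w: season[w]) + 1


def timing_difference(model_week, observed_week):
    """Signed distance in weeks: model minus observed."""
    return model_week - observed_week


# Toy usage: two 5-week seasons stand in for five 52-week seasons.
toy_seasons = [[0, 1, 2, 5, 2], [0, 2, 4, 3, 1]]
threshold = onset_threshold(toy_seasons)
onsets = [onset_week(season, threshold) for season in toy_seasons]
peaks = [peak_week(season) for season in toy_seasons]
```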

Time series and coefficient heatmaps
The following set of figures displays the NPS curve as reported by the World Health Organization (black), along with the NPS estimates generated by ARGO (red), AR52 (dotted gray), and GFT (blue). Below the NPS curve are 2 error curves, which display the prediction error ($\text{Error}_t = \hat{y}_t - y_t$) and the percent error relative to the NPS value ($\%\text{Error}_t = \frac{\text{Error}_t}{y_t} = \frac{\hat{y}_t - y_t}{y_t}$). Finally, we show heatmaps representing the values of the ARGO coefficients involved in each weekly prediction.
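The two error curves can be computed as in the sketch below. The percent error is returned as a fraction, following the definition in the text, and weeks with an observed value of zero are reported as `None` because the ratio is undefined there; that handling, the function name, and the toy numbers are assumptions of this example.

```python
def error_curves(estimates, observed):
    """Return (errors, relative_errors) for paired weekly values."""
    errors, relative_errors = [], []
    for est, obs in zip(estimates, observed):
        err = est - obs
        errors.append(err)
        # The relative error is undefined when the observed value is zero.
        relative_errors.append(err / obs if obs != 0 else None)
    return errors, relative_errors


# Toy usage with made-up estimates and observations.
errors, relative_errors = error_curves([12.0, 8.0, 5.0], [10.0, 10.0, 0.0])
```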