Gastroenteritis Forecasting Assessing the Use of Web and Electronic Health Record Data With a Linear and a Nonlinear Approach: Comparison Study

doi:10.2196/34982

Original Paper

¹Department of Pediatrics, Harvard Medical School, Boston, MA, United States

²Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States

³Institut national de la santé et de la recherche médicale U1099, Rennes, France

⁴Laboratoire Traitement du Signal et de l'Image, Université de Rennes 1, Rennes, France

⁵Centre de Données Cliniques, Centre Hospitalier Universitaire Rennes, Rennes, France

⁶Harvard T.H. Chan School of Public Health, Boston, MA, United States

⁷Machine Intelligence Group for the Betterment of Health and the Environment, Network Science Institute, Northeastern University, Boston, MA, United States

⁸Faculté de médecine, Université de Rennes 1, Rennes, France

⁹Institut de Recherche Mathématique de Rennes, Rennes, France

¹⁰Institut national de la santé et de la recherche médicale CIC 1414, Université de Rennes 1, Rennes, France

*these authors contributed equally

Corresponding Author:

Canelle Poirier, PhD

Computational Health Informatics Program

Boston Children's Hospital

300 Longwood Avenue

Boston, MA, 02115

United States

Phone: 1 617 355 6000

Email: canelle.poirier@outlook.fr

Background: Disease surveillance systems capable of producing accurate real-time and short-term forecasts can help public health officials design timely public health interventions to mitigate the effects of disease outbreaks in affected populations. In France, existing clinic-based disease surveillance systems produce gastroenteritis activity information that lags real time by 1 to 3 weeks. This temporal data gap prevents public health officials from having a timely epidemiological characterization of this disease at any point in time and thus leads to the design of interventions that do not take into consideration the most recent changes in dynamics.

Objective: The goal of this study was to evaluate the feasibility of using internet search query trends and electronic health records to predict acute gastroenteritis (AG) incidence rates in near real time, at the national and regional scales, and for long-term forecasts (up to 10 weeks).

Methods: We present 2 different approaches (linear and nonlinear) that produce real-time estimates, short-term forecasts, and long-term forecasts of AG activity at 2 different spatial scales in France (national and regional). Both approaches leverage disparate data sources that include disease-related internet search activity, electronic health record data, and historical disease activity.

Results: Our results suggest that all data sources contribute to improving gastroenteritis surveillance for long-term forecasts with the prominent predictive power of historical data owing to the strong seasonal dynamics of this disease.

Conclusions: The methods we developed could help reduce the impact of the AG peak by making it possible to anticipate increased activity by up to 10 weeks.

JMIR Public Health Surveill 2023;9:e34982


Keywords

infectious disease; acute gastroenteritis; modeling; modeling disease outbreaks; machine learning; public health; machine learning in public health; forecasting; digital data

Background

Acute gastroenteritis (AG) is a major public health problem worldwide [1]. Commonly defined as diarrhea or vomiting in the past 24 hours [2], AG is one of the main causes of morbidity and mortality among young people and causes up to 2.5 million deaths per year in children aged <5 years around the world [3]. Although it is generally a mild disease, its morbidity and economic burden are high [4]. In France, there are >21 million episodes of AG each year [5]. Although AG episodes occur throughout the year, there is a winter peak, mainly owing to norovirus and rotavirus [6,7]. During these peaks, the increase of visits to general practitioners and emergency or pediatric departments causes health care system disruptions [8].

Disease surveillance systems capable of producing accurate real-time and short-term forecasts can help public health officials design timely public health interventions to mitigate the effects of disease outbreaks in affected populations. In France, all acute diarrhea cases seen during medical appointments are reported weekly by volunteer outpatient health care providers. An estimation of AG incidence rate is then computed, at the national or regional scale, by considering the number of sentinel physicians and the medical density of the area of interest [9]. However, data collection, processing, aggregation, and distribution processes introduce up to 3 weeks of delay in the availability of AG activity information. This temporal data gap prevents public health officials from having a timely perspective about AG activity and thus leads to the design of interventions that do not take into consideration the most recent changes in disease dynamics. Therefore, there is a growing interest in finding new ways to mitigate this information gap [10,11].

To alleviate this time lag, several studies have proposed approaches to produce accurate and reliable real-time disease activity estimates, for example, to monitor influenza [11-14]. For AG, studies have focused on identifying the clinical characteristics of the disease. Norovirus and rotavirus are the viruses responsible for most gastroenteritis outbreaks [6,7,15-18]. This disease has a strong wintertime seasonality, but this seasonality could be affected by climate change, which would affect norovirus transmission, the host’s susceptibility to norovirus infection, and the resistance of norovirus to environmental conditions. This may cause large oscillations in the number of cases per year [6,7]. AG remains a major cause of hospitalizations, especially for children, and the use of a vaccine could help to decrease the impact of the disease [16,18]. Some research teams have assessed the correlation between data sources (eg, drug reimbursement data and emergency department visits) and general practitioner visits for AG [3,19]. Other studies have shown a significant correlation between internet search query trends and AG incidence rates in different locations such as the United States, Mexico, the United Kingdom, and France [20,21]. However, to the best of our knowledge [22], none have proposed a feasible methodology to forecast AG activity. Through this study, we investigated the challenges of achieving this and proposed a reliable forecasting approach.

State of the Art

Existing forecasting systems for other disease outbreaks, such as influenza, include statistical models that leverage information available in near real time [11-14]. One of the first and most prominent examples is Google Flu Trends [23], a web-based service operated by Google. Created in 2009, the platform used the volume of selected Google search terms to estimate influenza activity in real time. However, the web service was discontinued after several prediction errors, which were attributed to changes in people’s search behavior caused by the exceptional nature of a pandemic or by the announcement of a pandemic that ultimately did not materialize [24]. Following this, some authors updated the Google Flu Trends algorithm to improve influenza forecasting by including data from the Google Correlate and Google Trends web services and from other sources, for instance, historical influenza information [11]. The internet is not the only data source that can be used to produce information in real time. With the widespread adoption of patient electronic health records (EHRs), hospitals also generate a huge amount of data. Bouzillé et al [25] showed that EHRs are strongly correlated with influenza incidence rates. Some authors proposed statistical models using EHRs to predict influenza incidence rates in real time [12,26]. In addition, other studies showed that internet users’ searches were strongly correlated with influenza epidemics and other diseases, including AG [8,21].

In this study, we evaluated the feasibility of using internet search query trends and EHR to predict AG incidence rates in near real time, at the national and regional scales, and for long-term forecasts (up to 10 weeks). We used 2 different methods—a linear approach using Elastic Net and a nonlinear approach using random forest (RF). In addition, as AG outbreaks cause disruptions in hospitals and emergency departments, we estimated AG incidence rates at the level of emergency departments and hospital stays.

Variables to Be Predicted

National Level

We obtained the national (Metropolitan France) acute diarrhea weekly incidence rates (per 100,000 inhabitants) from the French Sentinel network [27], from January 2008 to March 2018. We retrieved these data in April 2018.

Regional Level

We obtained the regional (Brittany region) acute diarrhea incidence rates (per 100,000 inhabitants) from the French Sentinel network [27], from January 2008 to March 2018. We chose the Brittany region because we used EHR data from a hospital in Brittany. We retrieved these data in April 2018.

Predictive Variables

Web Data

We obtained the weekly frequency of the 100 most correlated French queries from Google Correlate [28]. For each signal to be predicted (national and regional levels), we retrieved Google Correlate data for the period from January 2008 to March 2018. As our prediction period is from May 2014 to February 2018, the correlation was calculated only on the preceding period, from January 2008 to April 2014. All signals were normalized to obtain mean 0 and SD 1 before calculating the correlation. The purpose of this correlation step was to select the most appropriate queries for predicting outbreaks without prior knowledge [29]. The most correlated queries obtained for the national and regional levels can differ because the weekly incidence rates for France and Brittany are different.

Clinical Data

We used data from the clinical data warehouse (CDW) of Rennes University Hospital (France), called entrepôt de données de l’HÔPital (eHOP). This CDW includes structured (laboratory test results, prescriptions, and International Statistical Classification of Diseases and Related Health Problems 10th Revision diagnoses) and unstructured (discharge letters, pathology reports, and operative reports) patient data from 1.2 million inpatients and outpatients and 45 million documents. To identify patients with specific criteria, eHOP has its own search engine, which can query unstructured data with keywords or structured data with codes based on terminologies.

First, to retrieve clinical data connected with AG, we performed different full-text queries (related to gastroenteritis, its symptoms, virus, or treatments). These queries returned all documents matching the search criteria (often, several documents for 1 patient and 1 stay). Then, for each week, we kept the oldest document for each patient and hospital stay, and we calculated the number of hospital stays with at least one document mentioning the keyword contained in the query. As we used 19 keywords, we obtained 19 variables from the eHOP CDW.
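The weekly counting step can be sketched as follows. This is a minimal illustration, not the eHOP implementation: the tuple layout (patient_id, stay_id, document_date), the use of ISO weeks, and the reading that a stay is counted once in each week in which it has a matching document are all assumptions.

```python
from collections import defaultdict
from datetime import date


def weekly_stay_counts(documents):
    """Count, per ISO week, the hospital stays with at least one matching document.

    `documents` is an iterable of (patient_id, stay_id, document_date) tuples
    returned by one full-text query (hypothetical layout).
    """
    # Keep only the oldest matching document of each (patient, stay) pair
    # within each week, so that a stay is counted once per week.
    oldest = {}
    for patient_id, stay_id, doc_date in documents:
        iso = doc_date.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        key = (week, patient_id, stay_id)
        if key not in oldest or doc_date < oldest[key]:
            oldest[key] = doc_date

    counts = defaultdict(int)
    for week, _patient, _stay in oldest:
        counts[week] += 1
    return dict(counts)


# Toy documents, invented for illustration only.
docs = [
    ("p1", "s1", date(2015, 1, 5)),
    ("p1", "s1", date(2015, 1, 7)),   # same stay, same week: not counted twice
    ("p2", "s2", date(2015, 1, 6)),
    ("p1", "s1", date(2015, 1, 13)),  # same stay, following week
]
counts = weekly_stay_counts(docs)
```

Running one such count per keyword would yield one weekly time series per keyword, that is, 19 series for 19 keywords.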

Then, we built a database containing the time series constructed from the structured data (total n=1,335,347 time series). As with Google Correlate, we calculated the Pearson correlation between both national and regional incidence rates and the time series from the database, and we retained the 100 most correlated signals. As our prediction period is from May 2014 to February 2018, we calculated the correlation between January 2008 and April 2014.
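A minimal sketch of this selection step is given below, assuming that each candidate time series is already aligned week by week with the incidence signal and restricted to the selection period. The function names and the toy series are hypothetical; only the standardization (mean 0, SD 1), the Pearson correlation, and the top-k ranking follow the description in the text.

```python
import heapq
from statistics import mean, pstdev


def zscore(series):
    """Rescale a series to mean 0 and SD 1 (population SD)."""
    mu = mean(series)
    sd = pstdev(series)
    if sd == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(x - mu) / sd for x in series]


def pearson(x, y):
    """Pearson correlation, computed as the mean product of z-scores."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)


def top_correlated(target, candidates, k=100):
    """Return the names of the k candidate series most correlated with target."""
    scores = {name: pearson(target, values) for name, values in candidates.items()}
    return heapq.nlargest(k, scores, key=scores.get)


# Toy data, invented for illustration only.
incidence = [10, 20, 40, 80, 40, 20, 10, 5]
signals = {
    "signal_a": [1, 2, 4, 8, 4, 2, 1, 1],
    "signal_b": [8, 4, 2, 1, 2, 4, 8, 9],
    "signal_c": [2, 3, 5, 9, 6, 3, 2, 1],
}
selected = top_correlated(incidence, signals, k=2)
```

Restricting the inputs to January 2008 to April 2014 keeps the selection independent of the weeks that are later predicted.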

Overall, we obtained 119 variables (n=19, 16% from the full-text queries and n=100, 84% from the most correlated variables of the structured data). The 100 most correlated variables can be different for national and regional levels. We retrieved EHR data for the period from January 2008 to March 2018 in April 2018. All these data could be extracted in real time if needed.

Historical Data

We used the incidence rates for the previous 52 weeks as predictive variables, for both national and regional levels.
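As an illustration, the 52 historical variables can be built as lagged copies of the incidence series. The sketch below is a simplification: it assumes that the value of the previous week is already available, so it ignores the reporting delay of the surveillance data, and the function name is hypothetical.

```python
def lag_features(series, n_lags=52):
    """Build (features, target) pairs from a weekly series.

    For each week t >= n_lags, the features are the n_lags previous values,
    most recent first: [y[t-1], y[t-2], ..., y[t-n_lags]].
    """
    rows = []
    for t in range(n_lags, len(series)):
        features = [series[t - i] for i in range(1, n_lags + 1)]
        rows.append((features, series[t]))
    return rows


# Toy series with 3 lags instead of 52, for readability.
toy_rows = lag_features(list(range(10)), n_lags=3)
```

The first 52 weeks of the series produce no training row because their full history is not observed.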

Ethics Approval

This study was approved by the local ethics committee of the Rennes Academic Hospital (approval number 16.69).

Statistical Models

Linear Approach

To minimize the negative effects of using a large number of input variables, potentially including redundant information, we used Elastic Net, a regularized multivariate regression methodology that can identify parsimonious models [30]. Elastic Net combines the strengths of Lasso and Ridge regressions, which makes it possible to perform variable selection among highly correlated variables [31,32]. We performed the Elastic Net regression analysis using the caret package in R (R Foundation for Statistical Computing) and its model-fitting function with the glmnet method [33,34]. We fixed the mixing coefficient λ=0.5 (the parameter that glmnet calls alpha) to give the same importance to the Ridge and Lasso penalties.
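To make the role of the two penalties concrete, the following sketch fits an Elastic Net by cyclic coordinate descent in pure Python. It is not the caret or glmnet code used in the study. It assumes the glmnet-style objective (1/2n)·RSS + strength·(mixing·L1 + (1 − mixing)/2·L2), centered inputs without an intercept, and hypothetical parameter names (`strength`, `mixing`).

```python
def soft_threshold(value, threshold):
    """Shrink value toward 0 by threshold (the Lasso proximal operator)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_penalty(coefs, strength, mixing=0.5):
    """Penalty term: strength * (mixing * L1 + (1 - mixing) / 2 * squared L2)."""
    l1 = sum(abs(c) for c in coefs)
    l2 = sum(c * c for c in coefs)
    return strength * (mixing * l1 + (1.0 - mixing) / 2.0 * l2)


def fit_elastic_net(X, y, strength, mixing=0.5, n_iter=200):
    """Cyclic coordinate descent; assumes centered y and centered columns of X."""
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    residual = list(y)  # residual = y - X @ coefs, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            denom = sum(v * v for v in col) / n + strength * (1.0 - mixing)
            if denom == 0.0:
                continue  # empty column and no Ridge term: leave coefficient at 0
            # Correlation of column j with the partial residual (column j excluded).
            rho = sum(v * (r + v * coefs[j]) for v, r in zip(col, residual)) / n
            new = soft_threshold(rho, strength * mixing) / denom
            delta = new - coefs[j]
            if delta != 0.0:
                residual = [r - v * delta for r, v in zip(residual, col)]
                coefs[j] = new
    return coefs


# Toy centered data: y depends only on the first column.
X = [[-1.0, 1.0], [0.0, -2.0], [1.0, 1.0]]
y = [-2.0, 0.0, 2.0]
unpenalized = fit_elastic_net(X, y, strength=0.0)
shrunk = fit_elastic_net(X, y, strength=0.5)
zeroed = fit_elastic_net(X, y, strength=100.0)
```

With `mixing=0.5`, the Lasso part can set coefficients exactly to 0 while the Ridge part stabilizes the estimates of correlated variables, which is the behavior sought when many redundant predictors are supplied.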

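The formulation of our model is the following:

$$y_T = \sum_{i=1}^{52} \alpha_i \, y_{t-i} + \sum_{j=1}^{100} \beta_j \, G_{j,t} + \sum_{k=1}^{119} \gamma_k \, E_{k,t} + \epsilon_T$$

Here, $y_T$ denotes AG incidence rate at time T=t, t+1, t+2, t+3 (for the different levels of prediction), $y_{t-i}$ denotes historical variables, $G_{j,t}$ denotes Google data, $E_{k,t}$ denotes EHR data, and $\epsilon_T$ denotes residuals.

For a given week, we needed to find the parameters, α=(α₁,..α₅₂), β=(β₁,..β₁₀₀), and γ=(γ₁,..γ₁₁₉), that minimize the following:

$$\sum_{T} \Big( y_T - \sum_{i=1}^{52} \alpha_i \, y_{t-i} - \sum_{j=1}^{100} \beta_j \, G_{j,t} - \sum_{k=1}^{119} \gamma_k \, E_{k,t} \Big)^2 + \lambda_1 \big( \lVert \alpha \rVert_1 + \lVert \beta \rVert_1 + \lVert \gamma \rVert_1 \big) + \lambda_2 \big( \lVert \alpha \rVert_2^2 + \lVert \beta \rVert_2^2 + \lVert \gamma \rVert_2^2 \big)$$

Here, the sum runs over the weeks of the training data set, and $\lambda_1$ and $\lambda_2$ are hyperparameters of the Elastic Net regression that weight the Lasso and Ridge penalties, respectively. We used 10-block cross-validation to optimize the hyperparameters. All parameters (α=[α₁,..α₅₂], β=[β₁,..β₁₀₀], and γ=[γ₁,..γ₁₁₉]) were dynamically retrained every week on an expanding window that used all data available up to that week. In this way, the size of our training data set increased every week. For example, to predict the first week of January 2015, our training data set ranged from January 2008 to the last week of December 2014. To predict the first week of January 2016, our training data set ranged from January 2008 to the last week of December 2015. We obtained estimates from May 2014 to February 2018.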
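The weekly retraining scheme can be sketched as an expanding-window loop. The sketch is model agnostic: `fit` and `predict` are hypothetical callables standing in for any regression model, and a simple historical-mean model is used here only to keep the example self-contained.

```python
def expanding_window_forecasts(X, y, first_test_index, fit, predict):
    """Refit a model for each test week on all earlier weeks, then predict it."""
    forecasts = []
    for t in range(first_test_index, len(y)):
        model = fit(X[:t], y[:t])  # the training set grows by one week each time
        forecasts.append(predict(model, X[t]))
    return forecasts


# Placeholder model (assumption): predict the mean of the training targets.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_row):
    return model


# Toy data, invented for illustration only.
toy_X = [[0.0]] * 5
toy_y = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_forecasts = expanding_window_forecasts(toy_X, toy_y, 2, fit_mean, predict_mean)
```

Because each model only sees weeks that precede the week being predicted, the resulting estimates are out of sample.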

Nonlinear Approach

RF is a nonlinear machine learning approach based on the construction of multiple decision trees using the general bootstrap aggregating technique (known as bagging) [35]. We used this method as it showed good performance in short-term forecasting even when it is compared with other machine learning approaches such as support vector machine or neural network or a traditional approach such as autoregressive integrated moving average [36,37].

With RF, the AG incidence rate estimates are obtained by averaging the estimates of the individual decision trees:

$$\hat{y}_T = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_T^{(b)}$$

Here, $\hat{y}_T$ denotes the estimate of the AG incidence rate $y_T$ at time T=t, t+1, t+2, t+3 (for the different levels of prediction), B denotes the number of decision trees, and $\hat{y}_T^{(b)}$ denotes the AG incidence rate estimate obtained with the decision tree b. We used the R package, randomForest [38], to create our RF models. The hyperparameters corresponding to the number of decision trees and the number of variables randomly sampled at each split were optimized on a training data set from January 2008 to May 2014. Then, as with the Elastic Net model, RF was dynamically recalibrated for every new week of prediction by incorporating all the data available. We obtained estimates from May 2014 to February 2018.
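The averaging over bootstrap-trained trees can be illustrated with a deliberately reduced example. The sketch below implements plain bagging of depth-1 regression trees (stumps); it is not a full random forest, because it grows no deep trees and does not sample a random subset of variables at each split, and it is unrelated to the randomForest package. All function names are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y):
    """Fit a depth-1 regression tree: the single split with the lowest squared error."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    if best is None:  # no valid split: predict the sample mean everywhere
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if x[feature] <= threshold else right_mean


def fit_bagged_stumps(X, y, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees


def predict_bagged(trees, x):
    """Ensemble estimate: the mean of the per-tree estimates."""
    return mean(predict_stump(tree, x) for tree in trees)


# Toy data, invented for illustration only.
X = [[0], [1], [2], [3], [10], [11], [12], [13]]
y = [1, 1, 1, 1, 5, 5, 5, 5]
trees = fit_bagged_stumps(X, y, n_trees=25, seed=1)
low, high = predict_bagged(trees, [0]), predict_bagged(trees, [13])
```

The number of trees (`n_trees` here) corresponds to B in the averaging formula; in a real random forest, the number of variables tried at each split is a second hyperparameter.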

Contribution of Each Data Source

In addition, to assess the contribution of each individual data source and of their combinations, we built Elastic Net and RF models using the following predictive variables:

AG incidence rates for the previous 52 weeks (baseline model, called the autoregressive model of order 52 [AR(52)] in the following sections)
Google data
EHR data
Google data and AR(52)
EHR data and AR(52)
Google data and EHR data
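The comparison above amounts to fitting the same model on different subsets of variable blocks. A minimal sketch of that bookkeeping is shown below; the block names and the helper function are hypothetical, and the last entry is the full model that combines all 3 data sources.

```python
# Hypothetical block names; each one stands for a group of predictive variables.
FEATURE_SETS = {
    "AR(52)": ("history",),
    "Google": ("google",),
    "EHR": ("ehr",),
    "Google + AR(52)": ("google", "history"),
    "EHR + AR(52)": ("ehr", "history"),
    "Google + EHR": ("google", "ehr"),
    "Google + EHR + AR(52)": ("google", "ehr", "history"),  # full model
}


def build_features(week_blocks, model_name):
    """Concatenate the variable blocks used by one model for one week."""
    row = []
    for block in FEATURE_SETS[model_name]:
        row.extend(week_blocks[block])
    return row


# Toy week with 2 variables per block instead of 52, 100, and 119.
week = {"history": [120.0, 110.0], "google": [0.3, 0.7], "ehr": [14.0, 9.0]}
ehr_ar_row = build_features(week, "EHR + AR(52)")
```

Keeping the model and the evaluation fixed while only the feature set changes is what allows differences in accuracy to be attributed to the data sources.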

Evaluation

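To assess the performance of our models, we compared our estimates with the real incidence rates from the Sentinel network. We calculated the root mean squared error (RMSE) and the Pearson correlation coefficient (ρ) for our test period starting from May 2014 to February 2018. The model yielding the most accurate estimates is the one with the highest correlation and the lowest error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( \hat{y}_t - y_t \right)^2}$$

$$\rho = \frac{\sum_{t=1}^{n} \left( \hat{y}_t - \bar{\hat{y}} \right) \left( y_t - \bar{y} \right)}{\sqrt{\sum_{t=1}^{n} \left( \hat{y}_t - \bar{\hat{y}} \right)^2} \, \sqrt{\sum_{t=1}^{n} \left( y_t - \bar{y} \right)^2}}$$

Here, n is the number of weeks in the test period, $\hat{y}_t$ is the predicted value for the week t, $\bar{\hat{y}}$ is the mean of predicted values, $y_t$ is the real value for the week t, and $\bar{y}$ is the mean of real values.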
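Both metrics are short to compute. The sketch below uses only the Python standard library and assumes that the predicted and observed series are aligned week by week; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def rmse(predicted, observed):
    """Root mean squared error between two aligned series."""
    return sqrt(mean((p - o) ** 2 for p, o in zip(predicted, observed)))


def pearson(predicted, observed):
    """Pearson correlation coefficient between two aligned series."""
    mp, mo = mean(predicted), mean(observed)
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_o = sum((o - mo) ** 2 for o in observed)
    return cov / sqrt(var_p * var_o)


# Toy values, invented for illustration only.
observed = [100.0, 150.0, 220.0, 180.0]
predicted = [110.0, 140.0, 200.0, 190.0]
error = rmse(predicted, observed)
correlation = pearson(predicted, observed)
```

The two metrics are complementary: the correlation is insensitive to a constant bias or scaling of the estimates, whereas the RMSE penalizes both.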

Comparison With Influenza

As we used a method developed for influenza outbreaks, we compared the results obtained for AG with those obtained for influenza. The aim was to determine whether external data sources are as relevant for AG as for influenza. We started by comparing the stationarity and the seasonality of both time series by calculating the following:

1. The autocorrelation function (ACF), which measures the autocorrelation between $y_t$ and $y_{t-h}$:

$$\rho(h) = \frac{\gamma(h)}{\gamma(0)}$$

where $\gamma(h) = \mathrm{cov}(y_t, y_{t-h})$

2. The partial ACF (PACF), which measures the autocorrelation between $y_t$ and $y_{t-h}$ after removing the autocorrelation between the intermediate variables $y_{t-1}, \ldots, y_{t-h+1}$:

$$r(h) = \mathrm{corr}(y_t, y_{t-h} \mid y_{t-1}, \ldots, y_{t-h+1})$$
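Sample versions of both functions can be computed as sketched below. The ACF uses the usual biased autocovariance estimator (division by n), and the PACF is obtained from the ACF with the Durbin-Levinson recursion; both choices are assumptions, because the text does not state which estimators were used.

```python
def acf(series, max_lag):
    """Sample autocorrelations rho(0), ..., rho(max_lag)."""
    n = len(series)
    mu = sum(series) / n
    dev = [v - mu for v in series]
    gamma0 = sum(d * d for d in dev) / n
    result = []
    for h in range(max_lag + 1):
        gamma_h = sum(dev[t] * dev[t - h] for t in range(h, n)) / n
        result.append(gamma_h / gamma0)
    return result


def pacf(series, max_lag):
    """Sample partial autocorrelations r(0), ..., r(max_lag) via Durbin-Levinson."""
    rho = acf(series, max_lag)
    result = [1.0]
    phi_prev = []  # phi_prev[j] holds phi_{k-1, j+1}
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi.append(phi_kk)
        result.append(phi_kk)
        phi_prev = phi
    return result


# Toy series with a period of 4 weeks, invented for illustration only.
toy = [0.0, 1.0, 0.0, -1.0] * 10
toy_acf = acf(toy, 4)
toy_pacf = pacf(toy, 4)
```

For a weekly series with annual seasonality, a pronounced ACF value around lag 52 is the signature that historical data alone carry substantial predictive power.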

Then, we compared the accuracy of the estimates for forecasts of up to 10 weeks obtained with Elastic Net and RF models that used either only historical data or a combination of Google, EHR, and historical data.

Overview

First, we studied the impact of each data source for short-term forecasts with the 2 different approaches already used to predict influenza outbreaks—a linear approach with the Elastic Net model and a nonlinear approach with an RF model.

Then, we analyzed the AG and influenza time series, especially the seasonality, to better understand the differences between the 2 diseases.

Finally, we compared AG and influenza results obtained for long-term forecasts with the 2 approaches, and we assessed the impact of external data sources to increase the accuracy of our estimates.