This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.

Although dynamic models are increasingly used by decision makers as a source of insight to guide interventions in order to control communicable disease outbreaks, such models have long suffered from a risk of rapid obsolescence due to failure to keep updated with emerging epidemiological evidence. The application of statistical filtering algorithms to high-velocity data streams has recently demonstrated effectiveness in allowing such models to be automatically regrounded by each new set of incoming observations. The attractiveness of such techniques has been enhanced by the emergence of a new generation of geospatially specific, high-velocity data sources, including daily counts of relevant searches and social media posts. The information available in such electronic data sources complements that of traditional epidemiological data sources.

This study aims to evaluate the degree to which the predictive accuracy of pandemic projection models regrounded via machine learning in daily clinical data can be enhanced by extending such methods to leverage daily search counts.

We combined a previously published influenza A (H1N1) pandemic projection model with the sequential Monte Carlo technique of particle filtering to reground the model by using confirmed incident case counts and search volumes. The effectiveness of particle filtering was evaluated using a norm discrepancy metric via predictive and dataset-specific cross-validation.

Our results suggested that despite the data quality limitations of daily search volume data, the predictive accuracy of dynamic models can be strongly elevated by inclusion of such data in filtering methods.

The predictive accuracy of dynamic models can be notably enhanced by tapping a readily accessible, publicly available, high-velocity data source. This work highlights a low-cost, low-burden avenue for strengthening model-based outbreak intervention response planning using public electronic datasets.

The capacity to accurately project communicable disease outbreak evolution is of great value in public health planning for prevention and control strategies. Use of such information can inform resource allocation, including surge-capacity planning and planning of the timing of outbreak response immunization campaigns, and, when applied across distinct scenarios, provide a basis for evaluating tradeoffs between intervention strategies. Although dynamic models are increasingly widely used to conduct such scenario projection, the construction of such models for new and rapidly evolving pathogens commonly faces significant barriers due to uncertainties regarding important factors governing the natural history of the disease, such as duration of latent, incubation, and infectious phases; the probability of asymptomatic carriage; rates of waning immunity; contact rates; and per-discordant-contact transmission probabilities. Moreover, even the most intricate models face strict limitations in their ability to project evolution of factors treated as stochastic, such as weather-related variables and the timing of arrival of exogenous infections due to global travel. Using computational statistical estimation methods such as sequential Monte Carlo techniques, in recent years, researchers have contributed approaches to elevate the predictive accuracy of dynamic transmission models by updating their state estimates at the time of appearance of each new observation. The predictive accuracy of methods has thus far been evaluated purely in the context of models that make use of traditional surveillance data sets, such as laboratory and clinically confirmed case reports [

Although such traditional surveillance data sets offer high-quality, rich information about individuals who present for medical care, they suffer from notable shortcomings, including delayed reporting and a failure to include counts of infective individuals who choose not to present for care. In a separate stream of work from the dynamic modeling work noted above, researchers have in recent years sought to compensate for the limitations of traditional epidemiological data sources more generally by exploiting information related to online communication behavior, particularly the growing tendency of many users to search, post, and tweet about their illnesses. Specifically, such researchers have assessed the health insights that can be gained from public health surveillance applications employing a variety of online sources of information.

A prominent line of this work has focused on time sequences of search query volumes, such as those previously captured in Google Flu Trends (GFT) [

An important subset of research in this area has leveraged data obtained from Google to develop statistical forecast models and evaluated the degree to which GFT data in combination with statistical models can support accurate predictions [

The prediction of epidemic outbreaks by dynamic models often involves significant error and generally needs to consider both underlying dynamics and noise related to both measurement and process evolution. Although older techniques based on Kalman Filtering and variants [

Epstein et al explored the effect of adaptive behaviors such as social distancing based on fear and contact behavior in models of epidemic dynamics. They used nonlinear dynamic systems and agent-based computation and integrated disease and fear of the disease contagion processes. Based on their models, individuals anxious (“scared”) about or infected by a pathogen can transfer fear through contact with other individuals who are not scared, and scared individuals may isolate themselves, thereby influencing the contact rate dynamic, which is a key parameter in governing outbreak evolution. The authors studied flight as a behavioral response and concluded that even small levels of fear-inspired flight can have a dramatic impact on spatiotemporal epidemic dynamics [

Although high-velocity search volume and social media data share a temporal perspective with transmission models, data drawn from such internet series have not, to our knowledge, previously been used as a source of information for filtering (via recurrent regrounding) compartmental transmission models with the arrival of new data.

In this work, we sought to address that gap by combining the transmission model from the study by Epstein et al [

In the first stage of characterization of the particle filtered model, we present the formulation of the existing Epstein compartmental model from a previous study [

The model's states are Susceptible (S), Infected with fear (I_{F}), Infected with pathogen (I_{P}), Infected with pathogen and fear (I_{FP}), Removed due to fear (R_{F}), Removed due to fear and pathogen (R_{FP}), and Recovered (R). λ_{F} is the (hazard) rate of removal due to self-isolation of those in fear only, λ_{P} refers to the rate of recovery from infection with pathogen, λ_{FP} represents the rate of removal due to self-isolation of the infected who are also afraid, and _{FP}) to persons in at-risk category

When adapting the model, we took advantage of the previously demonstrated [

The parameter f_{P} is the fraction of people who are reported to public health authorities when emerging from the latent state; it is both uncertain and evolving over time. Likewise, the fraction of people becoming afraid who search Google upon infection, named the fraction of Google search incidents (f_{F}), is further treated as a dynamic uncertain parameter.

System dynamics model.

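Other parameters also treated as stochastic are the contact rate (c), the rate of removal to self-isolation from fear (λ_{F}), and the removal rate from those with fear who are also infected (λ_{FP}). To support this, such dynamic parameters are associated with state variables evolving over time according to stochastic differential equations. Because variable c takes only positive values, we log-transformed it and described its evolution according to Brownian motion as:

$$d\ln(c) = s_{c}\,dW_{t}$$

where dW_{t} is the increment of a standard Wiener process, following a normal distribution with mean of 0 and variance of 1 per unit time. Thus, d ln(c) follows a normal distribution with mean of 0 and variance of s_{c}^{2} per unit time. We also performed a log-transform on λ_{F}; the stochastic differential equation of λ_{F} is formulated as:

$$d\ln(\lambda_{F}) = s_{\lambda F}\,dW_{t}$$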

The initial values of c and λ_{F} are drawn uniformly from the intervals between 0 and 100 per day and between 0.4 and 1 per day, respectively. The SDs s_{c} and s_{λF} were both selected to be 1.

In contrast, reflecting the fact that f_{P} and f_{F} represent fractions, such parameters were logit-transformed, with the initial value for each varying between 0 and 0.2. We described the stochastic differential equations of fractions f_{P} and f_{F} according to Brownian motion as:

$$d\,\mathrm{logit}(f_{P}) = s_{fP}\,dW_{t}$$

$$d\,\mathrm{logit}(f_{F}) = s_{fF}\,dW_{t}$$

Within the model, the parameter f_{P} is multiplied by inflows to state variables Infective (I_{P}) and Scared Infective (I_{FP}) to account for fractional actual reporting. Similarly, the parameter f_{F} is multiplied by inflows to state variables Scared (I_{F}), Scared Infective (I_{FP}), Removed due to Fear and Infection (R_{FP}), and Removed due to Fear (R_{F}) to account for the fraction of the scared population that actually searches.

We treated λ_{F}_{P} as:

We treated λ'_{FP} as a fraction and performed a logit-transform on it. This parameter varies over the range from 0 to 1, and the dynamic process for λ'_{FP} is similar to that of f_{P} and f_{F}, specifically,

$$d\,\mathrm{logit}(\lambda'_{FP}) = s_{\lambda' FP}\,dW_{t}$$

The SDs s_{fP}, s_{fF}, and s_{λ'FP} are selected to be 5, 5, and 1, respectively. The initial values of f_{P}, f_{F}, and λ'_{FP} are drawn from the intervals [0, 0.2), [0, 0.2), and [0, 0.5), respectively.
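The random walks on transformed parameters described above can be sketched in Python as follows. This is a minimal illustration rather than the study's implementation. It assumes an Euler-Maruyama discretisation with a hypothetical time step `dt`, writes the five stochastic parameters as `c`, `lambda_F`, `f_P`, `f_F`, and `lambda_FP_prime`, and uses the SDs (1, 1, 5, 5, 1) and initial intervals stated in the text. The helper names are introduced purely for illustration.

```python
import math
import random

EPS = 1e-12  # keeps logit() finite if a fraction sits at exactly 0 or 1


def logit(p):
    p = min(max(p, EPS), 1.0 - EPS)
    return math.log(p / (1.0 - p))


def inv_logit(x):
    # Numerically stable logistic function.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def step_log_walk(value, sd, dt, rng):
    """Random-walk step on ln(value); a positive value stays positive."""
    return value * math.exp(sd * math.sqrt(dt) * rng.gauss(0.0, 1.0))


def step_logit_walk(value, sd, dt, rng):
    """Random-walk step on logit(value); the result stays within [0, 1]."""
    return inv_logit(logit(value) + sd * math.sqrt(dt) * rng.gauss(0.0, 1.0))


def initial_parameters(rng):
    """Draw initial values uniformly from the intervals stated in the text."""
    return {
        "c": rng.uniform(0.0, 100.0),        # contact rate, per day
        "lambda_F": rng.uniform(0.4, 1.0),   # per day
        "f_P": rng.uniform(0.0, 0.2),
        "f_F": rng.uniform(0.0, 0.2),
        "lambda_FP_prime": rng.uniform(0.0, 0.5),
    }


def step_parameters(params, dt, rng):
    """Advance all five stochastic parameters by one step of length dt."""
    return {
        "c": step_log_walk(params["c"], 1.0, dt, rng),
        "lambda_F": step_log_walk(params["lambda_F"], 1.0, dt, rng),
        "f_P": step_logit_walk(params["f_P"], 5.0, dt, rng),
        "f_F": step_logit_walk(params["f_F"], 5.0, dt, rng),
        "lambda_FP_prime": step_logit_walk(params["lambda_FP_prime"], 1.0, dt, rng),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    params = initial_parameters(rng)
    for _ in range(100):  # 100 steps with an assumed dt of 0.01 day
        params = step_parameters(params, 0.01, rng)
    print(params)
```

Working on the log and logit scales means each parameter respects its natural range (positive rates, fractions between 0 and 1) without any explicit clipping of the random walk itself.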

By applying random walks to these parameters, a more accurate estimate was achieved during model simulation. As such, in our model, each particle at each point in time is associated with all state variables and state variables associated with stochastic parameters (S, I_{F}, I_{P}, I_{FP}, R_{F}, R_{FP}, R, c, f_{P}, f_{F}, λ_{F}, and λ'_{FP}) (

Parameters used in the model.

Parameter name | Notation | Value for Quebec | Value for Manitoba | Unit |
Probability of infection transmission given exposure | β | 0.04 | 0.04 | Unit |
Probability of fear transmission given exposure | α | 0.02 | 0.02 | Unit |
Mean latent time | τ | 3 | 3 | Day |
Mean time to recovery | μ | 7 | 7 | Day |
Total population of province | | 7843475 | 1214403 | Person |
Rate of recovery from fear | | 0.2 | 0.2 | One per day |
Rate of removal to self-isolation from fear | λ_{F} | Dynamic | Dynamic | One per day |
Fraction of mean time to recovery of going from “Scared Infected” to “Recovered” via “Removed Due to Fear & Infection” | λ'_{FP} | Dynamic | Dynamic | Unit |
Rate of removal to self-isolation from fear and pathogen | λ_{FP} | | | One per day |
Rate of recovery from infection with pathogen | λ_{P} | | | One per day |
Rate of recovery from removal due to fear and infection | λ'_{P} | | | One per day |

We evaluated the prediction of the above-described dynamic model assisted by particle filtering against two publicly available empirical datasets. The first was from Manitoba Health - Healthy Living and Seniors and included daily laboratory-confirmed case counts of pandemic H1N1 influenza for the period of October 6, 2009, through January 18, 2010, for the province of Manitoba [

In addition to the daily clinical case count data noted above, we obtained normalized daily Google search counts from Google Trends and weekly normalized data from GFT for Manitoba and Quebec during the second pandemic wave. Reflecting the linguistic differences between the two provinces, the search terms used for each were distinct. In Manitoba, we used the search terms “flu” and “H1N1,” while for Quebec, we used “flu,” “Influenza A virus sub-type H1N1,” “h1n1 vaccination,” “ah1n1,” “ah1n1 vaccin,” “grippe,” and “grippe ah1n1,” which were the most frequent search queries related to this topic suggested by Google during that period.

When defining the likelihood function for observing empirical data, given the state of a given particle, the exact variant of the likelihood used varied across the three scenarios examined. The first scenario evaluated the impact of assuming a likelihood formulation that considered purely clinical data, termed L_{infection with pathogen}. The likelihood used in the second scenario considered only the likelihood of observing the empirical Google search counts for the appropriate province in light of the count of individuals posited to be currently in fear within the model, denoted as L_{infection with fear}.

Following several past contributions [_{t} and _{t} represent observed individuals per day and particle-posited daily rate (count per day) of new cases, respectively:

_{Pt} and _{Ft} represent number of laboratory-confirmed incident cases reported for day _{P} and _{F} follow _{Pt} is a fraction of the flow of new cases of infection and _{Ft} is a fraction of the flow of new cases of scared. The dispersion parameter _{infection with pathogen} (_{P}) was considered as 40, while _{infection with fear} (_{F}) was considered as 25. This reflects the larger noise that we believed to be associated with Google search data, in light of the fact that a larger dispersion parameter leads to a more narrowly dispersed distribution.
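The role of the dispersion parameter can be illustrated with a short Python sketch. The references to a dispersion parameter above suggest a negative binomial observation model. The parameterisation below, with success probability p = i/(i + r) so that the mean equals the particle-posited rate i, is an assumption made for illustration and is not confirmed by the surviving text. The function name and argument names are hypothetical.

```python
import math


def neg_binomial_log_likelihood(observed, expected, dispersion):
    """Log-probability of `observed` under a negative binomial distribution.

    Assumed parameterisation: mean = expected, variance =
    expected + expected**2 / dispersion, so a larger dispersion
    parameter gives a more narrowly dispersed distribution.
    """
    if expected <= 0.0:
        # Degenerate case: a zero rate can only produce zero observations.
        return 0.0 if observed == 0 else float("-inf")
    r = dispersion
    p = expected / (expected + r)
    log_coeff = (
        math.lgamma(observed + r) - math.lgamma(r) - math.lgamma(observed + 1)
    )
    return log_coeff + observed * math.log(p) + r * math.log(1.0 - p)


if __name__ == "__main__":
    # Same observation and posited rate, evaluated with the two dispersion
    # values quoted in the text (40 for clinical data, 25 for search data).
    for r in (40.0, 25.0):
        print(r, neg_binomial_log_likelihood(30, 25.0, r))
```

Under this parameterisation the variance is i + i²/r, which matches the statement that the smaller value of 25 used for search data corresponds to a noisier observation process than the value of 40 used for clinical data.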

The third scenario considered a total likelihood function L_{T} consisting of a combination of L_{infection with pathogen} and L_{infection with fear}. For defining the total likelihood function, the simplifying assumption was made that deviations with respect to one measure were independent of the other, and thus, the total multivariate likelihood function could be treated as the product of two univariate likelihood functions, given as L_{T} = L_{infection with pathogen} × L_{infection with fear}.

The purpose of running this third scenario was to compare the effectiveness of a univariate likelihood function with that of the multivariate likelihood function, when evaluated in terms of a calculated discrepancy of model predictions against the epidemiologically confirmed case count.
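The way a product of two likelihoods feeds into a particle filter update can be sketched as follows. This is a generic illustration under stated assumptions, not the study's code: per-particle log-likelihoods are taken as given inputs, multinomial resampling is assumed (the text does not say which resampling scheme was used), and all function names are hypothetical.

```python
import math
import random


def combine_log_likelihoods(ll_clinical, ll_search):
    """Per-particle log of the product likelihood (a sum in log space)."""
    return [a + b for a, b in zip(ll_clinical, ll_search)]


def normalised_weights(log_weights):
    """Convert log-weights to weights summing to 1, guarding against underflow."""
    top = max(log_weights)
    if top == float("-inf"):
        # Every particle has zero likelihood; fall back to uniform weights.
        return [1.0 / len(log_weights)] * len(log_weights)
    raw = [math.exp(lw - top) for lw in log_weights]
    total = sum(raw)
    return [w / total for w in raw]


def resample(particles, weights, rng):
    """Multinomial resampling: draw len(particles) particles with replacement.

    The returned list shares references with the input; copy mutable
    particles before evolving them further.
    """
    return rng.choices(particles, weights=weights, k=len(particles))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy example: each particle is just a posited daily rate, and the
    # log-likelihoods are stand-ins peaked at a rate of 30.
    particles = [rng.uniform(0.0, 100.0) for _ in range(1000)]
    ll_clinical = [-0.5 * ((x - 30.0) / 5.0) ** 2 for x in particles]
    ll_search = [-0.5 * ((x - 30.0) / 8.0) ** 2 for x in particles]
    weights = normalised_weights(combine_log_likelihoods(ll_clinical, ll_search))
    survivors = resample(particles, weights, rng)
    print(sum(survivors) / len(survivors))
```

Summing log-likelihoods rather than multiplying raw likelihoods avoids numerical underflow when either term is very small, while remaining equivalent to the product form under the independence assumption.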

The three scenarios noted above were conducted using particle filtering, employing 1000 particles. For each such scenario, reflective of the need to make decisions in light of uncertainty about the evolution of an unfolding outbreak, in which only information about time points up to the present is available, we sought to examine the impact of right censoring the empirical data at certain time point

To judge the accuracy of particle filter ^{2} norm of the difference between sampled particles (reporting rate coefficient × [infected state+scared infected state]) and reported case count observations calculated after time _{f} is the final observation time, and

In this work, for each scenario (each associated with a particular likelihood function), we plotted the graphs associated with

Empirical data (red and magenta points) superimposed on samples (blue and green) from the model-generated distribution of particles for the model output of the count of reported cases (left panel) and number of searches (right panel) using two likelihood functions, T*=30 for Manitoba.

Empirical data (red and magenta points) superimposed on samples (blue and green) from the model-generated distribution of particles for the model output of the count of reported cases (left panel) and number of searches (right panel) using two likelihood functions, T*=30 for Quebec.

In this configuration, particle filtering was performed using L_{infection with pathogen} as the sole likelihood function.

Empirical data (red and magenta points) superimposed on samples (blue and green) from the model-generated distribution of particles for the model output of the count of reported cases (left panel) and count of searches (right panel) using the likelihood function associated with clinical data alone, T*=30 for Manitoba.

Empirical data (red and magenta points) superimposed on samples (blue and green) from the model-generated distribution of particles for the model output of the count of reported cases (left panel) and count of searches (right panel) using the likelihood function associated with clinical data alone, T*=30 for Quebec.

In this configuration, particle filtering was performed using L_{infection with fear} as the sole likelihood function.

Empirical data (red and magenta points) superimposed on samples (blue and green) from the model-generated distribution of particles for the model output of the count of reported cases (left panel) and count of searches (right panel) when using the likelihood function associated with search volume data alone, T*=30 for Manitoba.

Empirical data (red and magenta points) superimposed on samples (blue and green) from the model-generated distribution of particles for the model output of the count of reported cases (left panel) and count of searches (right panel) when using the likelihood function associated with search volume data alone, T*=30 for Quebec.

Discrepancies associated with different scenarios and

Scenario | 25 | 30 | 35 | 40 | 45 | 50 |
Google Likelihood | 7,846,178 | 2,896,092 | 1,941,998 | 695,330 | 192,819 | 13,569 |
Clinical Likelihood | 956,021 | 604,749 | 564,651 | 469,159 | 106,307 | 3275 |
Two Likelihoods | 545 | 361 | 174 | 158 | 60 | 12 |

Google Likelihood | 577,919,468 | 437,577,329 | 290,486,216 | 108,993,972 | 29,645,905 | 9,179,791 |
Clinical Likelihood | 31,571,941 | 3,544,611 | 461,804 | 55,938 | 4862 | 751 |
Two Likelihoods | 535,927 | 17,386 | 8338 | 3322 | 1071 | 336 |
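A discrepancy of the kind tabulated above can be computed along the following lines. The exact formula did not survive in the text, so this sketch encodes one plausible reading as an assumption: for each sampled particle trajectory, take the Euclidean norm of the difference between predicted reported cases (reporting fraction × [infected + scared infected]) and observed case counts over the days after the censoring time, then average across particles. The function name, argument names, and zero-based day indexing are hypothetical.

```python
import math


def discrepancy(predicted_by_particle, observed, t_star):
    """Mean over particles of the L2 norm of prediction error after t_star.

    predicted_by_particle: list of per-particle lists of predicted reported
        case counts, one value per day (same length as `observed`).
    observed: list of observed daily case counts.
    t_star: index of the last day whose observation was used for filtering;
        only later days contribute to the discrepancy.
    """
    norms = []
    for trajectory in predicted_by_particle:
        squared_error = sum(
            (trajectory[t] - observed[t]) ** 2
            for t in range(t_star + 1, len(observed))
        )
        norms.append(math.sqrt(squared_error))
    return sum(norms) / len(norms)


if __name__ == "__main__":
    observed = [1, 2, 3, 4]
    predictions = [
        [1, 2, 3, 4],  # matches the observations exactly after t_star
        [1, 2, 3, 8],  # off by 4 on the final day
    ]
    print(discrepancy(predictions, observed, t_star=1))  # prints 2.0
```

Because only days after the censoring time contribute, the metric measures out-of-sample predictive error rather than fit to the data already used for regrounding.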

In this contribution, we investigated the predictive accuracy gains from applying particle filtering using both traditional and search volume data to estimate latent states of a compartmental transmission model (including time evolution of stochastic parameters involved in that model). The capacity to perform this estimation then provides support for projection and scenario evolution using the model.

To be able to use search data effectively when particle filtering a transmission model, we found it helpful to move beyond the traditional scope of compartmental transmission models and to adopt a more articulated model of the outbreak, reflecting the fact that causal drivers promoting Web searches are not restricted to stages in the natural history of infection, but are additionally driven by factors with distinct but coupled dynamics, such as fluctuations in perceived risk on the part of the population. Responsive to this consideration, we have adapted a previously published model with an explicit consideration of the coupled dynamics of fear and pathogen. Although there are challenges associated with assessing perceived risk and anxiety on the part of the population during an outbreak, we found here that projection of outbreak dynamics can be materially enhanced through inclusion of a surprisingly accessible source of data: Daily relative search query volumes for defined geographic regions on the widely used Google search engine. The reliable and timely public availability of such data across many areas of the world raises the prospects for significantly enhancing effective outbreak projection using combinations of dynamic modeling and machine learning techniques such as the particle filter.

The work presented here has significant limitations. Although search trend data provide some indication of topic-specific interest over time in a defined spatial region, from the standpoint of “big data,” they are often available only with modest (daily) temporal resolution and frequently coarse geographic resolution. They are also affected by many unobserved confounders. Such search trend data are further limited by providing little sense of the count of distinct users and no sense of the longitudinal progression of a single user. In this regard, the Google search query volume time series compare unfavorably to the richness of information present in other publicly available types of online data, such as region-specific Twitter feeds.

In addition to the shortcomings in the data sources employed, there are notable methodological limitations of our study. The likelihood function employing two distinct data sources was simplistic in its design, merely serving to multiply each of the dataset-specific likelihood functions. The use of a random walk during particle filtering for no fewer than five distinct parameters likely contributes to a rapid divergence in the model’s estimates, compared to the behavior observed in previous particle filtered models of influenza [

Such limitations point to natural avenues for future work. We expect that the prospects for the sorts of projections explored here will be significantly elevated by combining such data with other public data sources containing distinct sources of information, such as daily or finer resolution time series from Twitter and Tumblr. We further expect the accuracy of the projections to be improved by more powerful machine learning techniques, such as PMCMC techniques, ensemble techniques supporting inclusion of multiple models, and potentially PMCMC techniques employing multiple models using reversible-jump MCMC strategies.

Pandemic forecasting is important for public health policy making because it supports judicious planning of resource allocation. Official statistics typically capture only subsets of the epidemiological burden (eg, the subset of individuals who engage in care seeking). Prospects for rapid use of such data to understand outbreak evolution are often further handicapped by reporting delays and a lack of capacity to project epidemiological case count time series forward. Traditional outbreak data have been complemented in recent years by high-resolution data sets from public social media such as Twitter and Tumblr, and by time series provided by the Google search application programming interface via Google Trends and Google Flu Trends, which can be retrieved programmatically and analyzed over time. The results presented in this work suggest that, when combined with traditional epidemiological data sources, social media–driven data sets, machine learning, and dynamic modeling can offer powerful tools for anticipating the future evolution of infectious disease outbreaks and assessing intervention tradeoffs, particularly for emerging pathogens.

GFT: Google Flu Trends

PMCMC: Particle Markov Chain Monte Carlo

None declared.