This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
Over one-third of the population of Havelock North, New Zealand, approximately 5500 people, were estimated to have been affected by campylobacteriosis in a large waterborne outbreak. Cases reported through the notifiable disease surveillance system (notified case reports) are inevitably delayed by several days, resulting in slowed outbreak recognition and delayed control measures. Early outbreak detection and magnitude prediction are critical to outbreak control. It is therefore important to consider alternative surveillance data sources and evaluate their potential for recognizing outbreaks at the earliest possible time.
The first objective of this study is to compare and validate the selection of alternative data sources (general practice consultations, consumer helpline, Google Trends, Twitter microblogs, and school absenteeism) for their temporal predictive strength for Campylobacter cases during the Havelock North outbreak. The second objective is to examine spatiotemporal clustering of data from alternative sources to assess the size and geographic extent of the outbreak and to support efforts to attribute its source.
We combined measures derived from alternative data sources during the 2016 Havelock North campylobacteriosis outbreak with notified case report counts to predict suspected daily Campylobacter case counts up to 5 days before cases reported in the disease surveillance system. Spatiotemporal clustering of the data was analyzed using Local Moran’s I statistics to investigate the extent of the outbreak in both space and time within the affected area.
Models that combined consumer helpline data with autoregressive notified case counts had the best out-of-sample predictive accuracy for 1 and 2 days ahead of notified case reports. Models using Google Trends and Twitter typically performed the best 3 and 4 days before case notifications. Spatiotemporal clusters showed spikes in school absenteeism and consumer helpline inquiries that preceded the notified cases in the city primarily affected by the outbreak.
Alternative data sources can provide earlier indications of a large gastroenteritis outbreak compared with conventional case notifications. Spatiotemporal analysis can assist in refining the geographical focus of an outbreak and can potentially support public health source attribution efforts. Further work is required to assess the role of such surveillance data sources and methods in routine public health practice.
In August 2016, Havelock North, one of the 5 cities in the Hawke’s Bay region, New Zealand, was the site of a large waterborne outbreak of Campylobacter infection. This outbreak began on August 8, but a large number of cases were not known to the national notifiable disease surveillance system until August 14. By that time, more than a third of Havelock North residents had been infected with Campylobacter. This event led to serious interruption of daily life in the area and large economic costs [
The surveillance for notifiable diseases in New Zealand is predominantly passive, with laboratories and physicians notifying their local public health service through submission to the national notifiable disease surveillance system, EpiSurv [
Interest in considering alternative data sources for early prediction of such outbreaks was motivated by previously published work reporting on the use of data from internet search engines [
This study revisits the Havelock North Campylobacter outbreak to examine signals present in data sources that were not available to the public health team during the response. By analyzing temporal and spatiotemporal patterns in these alternative data sources, the study assesses the relative effectiveness and sensitivity of different data sources in detecting the outbreak earlier. First, we aim to assess the temporal predictive strength of modeled combinations of measures from the following daily alternative data sources: GP consultations, consumer health helpline calls, Google Trends, Twitter microblogs, and school absenteeism records. These models will be measured by the time gained (up to 5 days ahead) compared with the cases notified in the existing disease surveillance system, using multiple evaluation metrics. Second, we will examine city-level spatiotemporal patterns in measures from alternative data sources relative to notified case counts to identify clusters and outliers in both space and time over the outbreak period.
The study protocol was approved by the Health and Disability Ethics Committee, New Zealand, under the protocol number NZ/1/6350114. The Twitter data used in this study were obtained under the Twitter terms and conditions and in agreement with its public privacy settings.
For the greater area affected by the outbreak (Hawke's Bay), we collected daily data for the entire 2016 calendar year from the data sources described in
Description of data sources used in analysis.
Source | Fields of interest | Data level used in analysis | Counts | References |
Notified case count (New Zealand surveillance database EpiSurv) | Date of onset, testing, and notification for confirmed and probable cases of campylobacteriosis | Aggregated by notification date and city of residence in Hawke's Bay | 1345 | Ministry of Health New Zealand [ |
General practice consultations (HealthStat) | Visits for gastrointestinal complaints | Individual with visit date, age, and sex, for entire Hawke's Bay District Health Board area only | 772 | Cumming J and Gribben B [ |
Consumer helpline (HealthLine) calls | Consumer calls concerning gastrointestinal complaints | Individual with call date, age, sex, and residential city in Hawke's Bay | 1196 | St George IM and Cullen MJ [ |
Google Trends | User queries with keywords for gastrointestinal complaints | Normalized counts aggregated by date, query keyword, and Google Trends normalized count for entire Hawke's Bay District Health Board area only | Not applicable | Google Trends [ |
Twitter microblogs (from Gnip Historical PowerTrack service) | Tweets with keywords for gastrointestinal complaints | Individual tweets geocoded to cities in Hawke's Bay | 191 | Gnip [ |
School absenteeism records (from individual schools) | Absence owing to illness or any valid reason | Aggregated by schools for the 5 schools providing data; areas represented: Havelock North, Napier, and Hastings | 23,836 | Ministry of Education, New Zealand [ |
We extracted confirmed and suspected cases of campylobacteriosis in Hawke's Bay from EpiSurv [
Daily data on consultations with GPs were collected through HealthStat. This system automatically monitors the number of people who consult primary care medical practitioners based on automated extracts of GP-coded data from computerized practice management systems [
Consumer helpline data were collected from HealthLine, which is a free national 24-hour 0800 telephone health advice service funded by the New Zealand Ministry of Health [
Google Trends provides a time series index of the volume of queries users enter into Google in a given geographic area [
Twitter is a free social networking and microblogging service that enables millions of users to send and read each other's tweets: short, 140-character messages. Registered users collectively send more than 200 million tweets a day. Twitter accounts are public by default and visible to all (even to unregistered visitors using the Twitter website). Users can restrict their accounts to private, in which case their content is visible only to approved followers.
In a previous study, we obtained Twitter data from Gnip, Twitter's licensed data provider, through its Historical PowerTrack service [
Twitter feeds were classified with a supervised machine learning classifier built using the Naïve Bayes algorithm in Python. A total of 10,000 random tweets were manually labeled as (1) gastrointestinal illness, (2) other infectious illness, or (3) irrelevant. A tweet was labeled "gastrointestinal illness" when its content described a recent account of infectious gastrointestinal illness, "infectious illness" when it described a recent account of another infectious illness, and "irrelevant" when it fit neither of the other 2 categories. This training set was used to train the classifier, which was then applied to the complete Twitter data. The classifier was evaluated on 1000 randomly selected and manually labeled tweets that were not included in the training set. Precision, recall, and F1 scores were calculated to evaluate its performance. Precision is the proportion of observations predicted as relevant that are truly relevant, recall is the proportion of truly relevant observations that are predicted as relevant, and F1 is the harmonic mean of precision and recall [
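The classification step described above can be sketched as follows. This is a toy illustration of the approach (bag-of-words features with a multinomial Naïve Bayes classifier via scikit-learn), not the authors' actual pipeline; the example tweets and labels are invented.

```python
# Toy sketch of the tweet-classification step: bag-of-words features
# with a multinomial Naive Bayes classifier. The tweets and labels
# below are invented; the study used 10,000 manually labeled tweets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

LABELS = ["gastrointestinal illness", "other infectious illness", "irrelevant"]

train_texts = [
    "been vomiting all night, awful stomach bug",
    "terrible diarrhoea today, staying home",
    "stomach cramps and nausea since yesterday",
    "coughing and sneezing, think I caught the flu",
    "fever and sore throat, off to the doctor",
    "great rugby game in Napier this weekend",
    "new cafe opened in Havelock North",
]
train_labels = [0, 0, 0, 1, 1, 2, 2]  # indices into LABELS

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

# classify a new, unseen tweet
pred = int(clf.predict(["awful diarrhoea and vomiting since last night"])[0])
```

On a held-out labeled set, precision, recall, and F1 could then be computed with `sklearn.metrics.precision_recall_fscore_support`.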
We collected school absenteeism data from 5 schools in Hawke’s Bay: 2 from Havelock North, 2 from Hastings, and 1 from Napier. These included 4 primary schools and 1 secondary school. Primary school data had a reason for absence code, so we included data for codes related to illness and/or any justified absence. Absenteeism codes are listed in
A daily time series of counts from each of the previously mentioned data sources was constructed. For the school data set, days falling within school holidays were removed from the analysis. In all data sources, missing values were estimated by interpolation of the surrounding observations. These adjustments were made to reduce the impact of missing data on the analysis.
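The missing-data step might look like this in pandas. The exact interpolation method is not stated in the text, so linear interpolation between neighboring observations is assumed here, and the counts are invented.

```python
# Sketch of the missing-data handling: gaps in a daily count series
# are filled by interpolating between neighboring observations.
# Linear interpolation is an assumption; the counts are invented.
import numpy as np
import pandas as pd

dates = pd.date_range("2016-08-08", periods=7, freq="D")
helpline_calls = pd.Series([4, np.nan, 10, np.nan, np.nan, 22, 25],
                           index=dates, dtype=float)

# fill the three missing days by linear interpolation
filled = helpline_calls.interpolate(method="linear")
```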
To assess whether the selected data sources could have predicted this Campylobacter outbreak earlier, we used the Pearson correlation to measure the association between daily counts of these alternative surveillance measures and daily counts of notified cases. Correlations were calculated for the notified case count with the alternative measure on the same day, as well as with up to a 10-day negative lag for each alternative measure (ie, correlating the notified case count on day t with the alternative measure on day t−10, t−9, etc;
Correlation and lagged transformed correlation of alternative predictors with notified case counts of campylobacteriosis.
Data source | Number of days that the alternative measure is lagged before notified case counts |
 | 0 days | −1 day | −2 days | −3 days | −4 days | −5 days | −6 days | −7 days | −8 days | −9 days | −10 days |
GPa consultations | 0.5b | 0.43b | 0.39b | 0.26b | 0.17b | 0.14b | 0.09 | 0.05 | 0.04 | 0.01 | 0.01 |
Consumer helpline | 0.44b | 0.59b | 0.67b | 0.64b | 0.55b | 0.37b | 0.2b | 0.12b | 0.1 | 0.07 | 0.07 |
Google Trends | 0.13b | 0.16b | 0.22b | 0.22b | 0.21b | 0.17b | 0.21b | 0.21b | 0.16b | 0.08 | 0.02 |
Twitter microblogs | 0.11b | 0.21b | 0.31b | 0.25b | 0.21b | 0.07 | 0 | −0.01 | 0 | −0.03 | 0 |
School absenteeism | 0.3b | 0.48b | 0.64b | 0.7b | 0.52b | 0.35b | 0.21b | 0.2b | 0.17b | 0.18b | 0.15b |
aGP: general practice.
bStatistically significant correlation coefficient >0.1.
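The lagged-correlation screening behind the table above can be sketched with pandas `shift`. The series below are synthetic, constructed so that the alternative measure leads the case counts by exactly 2 days; with real data, the correlations would simply be read off for each lag.

```python
# Sketch of the lagged-correlation screening: correlate notified case
# counts on day t with an alternative measure on day t-k, k = 0..10.
# Synthetic data: the "helpline" signal leads the cases by 2 days.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 60
helpline = pd.Series(rng.poisson(5, n).astype(float))
cases = helpline.shift(2)           # cases mirror the helpline 2 days later
cases.iloc[:2] = helpline.iloc[:2]  # fill the first two days

# Pearson correlation at each negative lag (NaNs from shifting are
# dropped pairwise by Series.corr)
lagged_corr = {k: cases.corr(helpline.shift(k)) for k in range(11)}
best_lag = max(lagged_corr, key=lagged_corr.get)
```

By construction, the strongest correlation appears at a 2-day lag.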
To forecast daily suspected cases of campylobacteriosis, a collection of multivariable autoregressive integrated moving average (ARIMA) models was constructed. Such models have been found to be a good tool for predicting communicable disease incidence [
These models used the negative lagged (day −1 to day −10) daily counts for each alternative measure (
Models were thus evaluated for their predictive performance during the test period from July 31 to August 30, 2016. For each model, we report 3 evaluation metrics: the Pearson correlation (ρ), the root mean square error (RMSE), and the relative root mean square error (rRMSE) of the predictions. ρ is a measure of the linear dependence between two variables during a period, RMSE is a measure of the difference between the predicted and true values, and rRMSE expresses this difference as a percentage of the mean observed value. The equations for these measures are given below:

$$\rho = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2}$$

$$\mathrm{rRMSE} = \frac{\mathrm{RMSE}}{\bar{y}} \times 100$$

where $y_i$ denotes the observed value of the notified Campylobacter cases at time $t_i$, $x_i$ denotes the value predicted by a model at time $t_i$, $\bar{x}$ and $\bar{y}$ denote the corresponding means over the test period, and $n$ is the number of observations.
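These three metrics can be written out directly in NumPy. Normalizing rRMSE by the mean observed count is our assumption about the "percent difference" definition in the text.

```python
# The three evaluation metrics written out directly in NumPy. The
# normalization of rRMSE by the mean observed count is an assumption.
import numpy as np

def rmse(y_obs, y_pred):
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_obs) ** 2)))

def rrmse(y_obs, y_pred):
    # RMSE expressed as a percentage of the mean observed value
    return 100.0 * rmse(y_obs, y_pred) / float(np.mean(y_obs))

def pearson(y_obs, y_pred):
    return float(np.corrcoef(np.asarray(y_obs, float),
                             np.asarray(y_pred, float))[0, 1])

# invented observed and predicted daily case counts
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
```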
Sources that included city-level locations (notified cases, school absenteeism, consumer helpline, and Twitter feeds) were used for spatiotemporal analysis. To understand the spatial and temporal trends of the event data, we broke them up into a series of time snapshots, using the space-time cube method [
We used a Local Outlier Analysis tool in ArcGIS (Esri) to identify locations that were statistically different from their neighbors in both space and time. This tool generates Anselin Local Moran’s I [
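The Local Moran's I statistic itself is straightforward to compute. The following is a minimal NumPy sketch (not the ArcGIS implementation, which additionally derives Z scores from permutations) on a hypothetical line of 5 locations with a row-standardized adjacency weight matrix; a high value surrounded by low neighbors yields a negative I, ie, a "high-low" spatial outlier.

```python
# Pedagogical sketch of Anselin's Local Moran's I with row-standardized
# neighbor weights. ArcGIS additionally computes permutation-based
# Z scores, which are omitted here.
import numpy as np

def local_morans_i(x, w):
    """x: values per location; w: row-standardized spatial weights."""
    x = np.asarray(x, float)
    z = x - x.mean()            # deviations from the mean
    m2 = (z ** 2).mean()        # second moment
    return z / m2 * (w @ z)     # I_i = (z_i / m2) * sum_j w_ij z_j

# 5 hypothetical locations on a line; neighbors are adjacent cells
adj = np.array([[0, 1, 0, 0, 0],
                [1, 0, 1, 0, 0],
                [0, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], float)
w = adj / adj.sum(axis=1, keepdims=True)  # row-standardize

x = [1, 1, 10, 1, 1]  # a single spike at location 2
I = local_morans_i(x, w)
```

Here location 2 (value 10 amid 1s) gets a negative I, flagging it as a spatial outlier, while the end locations, similar to their neighbors, get positive values.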
All alternative surveillance measures correlated significantly with notified Campylobacter cases on the same day. Many of these alternative surveillance measures also demonstrated strong correlations when lagged 1 to 8 days before notified cases. Indeed, the correlation ranged from 0.14 to 0.43 for up to 5 days of lag for GP consultations, 0.12 to 0.67 for up to 7 days of lag for consumer helpline inquiries, 0.16 to 0.22 for up to 8 days of lag for Google Trends, 0.21 to 0.31 for up to 4 days of lag for Twitter, and 0.15 to 0.7 for up to 10 days of lag for school absenteeism (
The final ARIMA models and the covariates of alternative data sources with their in-sample error measure of RMSE are summarized in
Autoregressive integrated moving average models with time-lagged covariates used with alternative data sources for forecasting 1 to 5 days ahead.
Alternative data source and forecast step | Time-lagged covariates, daysa | ARIMAb orderc | RMSEd

1 day | 1 to 10 | 3,0,1 | 1.01
2 days | 2 to 10 | 2,0,0 | 1.04
3 days | 3 to 10 | 2,0,0 | 1.04
4 days | 4 to 10 | 2,0,0 | 1.05
5 days | 5 to 10 | 2,0,0 | 1.06

1 day | 1, 2, 3, 4, 5, 6, 7, 8, 10 | 3,0,2 | 1.08
2 days | 2, 3, 5, 6, 7, 8, 10 | 3,0,2 | 1.08
3 days | 3, 4, 5, 6, 7, 8, 10 | 3,0,2 | 1.08
4 days | 4, 6, 7, 8, 9, 10 | 3,0,2 | 1.09
5 days | 6, 7, 8, 9, 10 | 3,0,2 | 1.09

1 day | 1 to 10 | 2,0,0 | 1.07
2 days | 2 to 10 | 2,0,0 | 1.08
3 days | 3 to 10 | 2,0,0 | 1.08
4 days | 4 to 10 | 2,0,0 | 1.08
5 days | 5 to 10 | 2,0,0 | 1.08

1 day | 1 to 10 | 4,0,1 | 1.07
2 days | 2 to 10 | 5,0,2 | 1.08
3 days | 3 to 10 | 3,0,2 | 1.08
4 days | 4 to 10 | 2,0,2 | 1.09
5 days | 5 to 10 | 2,0,2 | 1.09

1 day | 1 to 10 | 5,1,3 | 0.94
2 days | 2 to 10 | 5,1,3 | 0.94
3 days | 3 to 10 | 5,1,3 | 0.94
4 days | 4 to 10 | 5,0,2 | 1.09
5 days | 5 to 10 | 5,0,2 | 1.09
aLagged covariates refer to the time-lagged independent variables of the alternative data source.
bARIMA: autoregressive integrated moving average.
cARIMA order (p,d,q) refers to the number of autoregressive terms, degree of differencing, and moving average components of the model.
dRMSE: root mean square error.
eGP: general practice.
We produced predictions for 1 to 5 days ahead during the outbreak (ie, the testing period) using the models in
Actual notified case counts and prediction results 1 to 5 days ahead for all developed models, with their prediction errors based on relative root mean square error. The best model performance with the lowest prediction error (relative root mean square error) in each time series is shown as a bold line. ABS: absenteeism; AR: autoregressive; CHL: consumer helpline; GP: general practice; GT: Google Trends.
Root mean square error, relative root mean square error, and Pearson correlation for 1-, 2-, 3-, 4-, and 5-day ahead predictions during the test period (August 2016).
Model | 1 Day | | | 2 Days | | | 3 Days | | | 4 Days | | | 5 Days | |
 | RMSEa | rRMSEb | ρc | RMSE | rRMSE | ρ | RMSE | rRMSE | ρ | RMSE | rRMSE | ρ | RMSE | rRMSE | ρ
ARd | 15.28 | 46.9 | 0.917 | 23.73 | 72.8 | 0.76 | 33.9 | 105.3 | 0.82 | 38.85 | 119.2 | 0.20 | 67.57 | 202 | 0.65
AR+CHLe | …f | …f | …f | …f | …f | …f | 39.74 | 123.5 | 0.79 | 38.14 | 117 | 0.28 | 68.51 | 204.8 | 0.64
AR+GPg | 15.71 | 48.2 | 0.901 | 23.77 | 72.9 | 0.75 | 31.55 | 98 | 0.84 | 39.59 | 121.4 | 0.21 | 63.21 | 189 | 0.66
AR+GTh | 12.9 | 39.6 | 0.933 | 22.5 | 69 | 0.76 | …f | …f | …f | 37.84 | 116.1 | 0.21 | …f | …f | …f
AR+Twitter | 11.61 | 35.6 | 0.951 | 22.67 | 69.5 | 0.80 | 35.63 | 110.7 | 0.81 | …f | …f | …f | 80.83 | 241.7 | 0.62
AR+ABSi | 4.74 | 14.5 | 0.989 | 15.97 | 49 | 0.89 | 38.68 | 120.2 | 0.81 | 47.26 | 145 | 0.28 | 71.5 | 213.8 | 0.65
aRMSE: root mean square error.
brRMSE: relative root mean square error.
cρ: Pearson correlation.
dAR: autoregressive.
eCHL: consumer helpline.
fBest performing model for a particular day on basis of the rRMSE.
gGP: general practice.
hGT: Google Trends.
iABS: school absenteeism.
As seen in the evaluation metric values in
The out-of-sample (ie, using the data for the testing period) prediction with the best performing models for the 1, 2, 3, 4, and 5 days ahead time horizons and their prediction errors are shown in
The daily estimations of the best performing models (lowest relative root mean square error) and their prediction errors during the testing period (August 2016). AR: autoregressive; CHL: consumer helpline; GT: Google Trends.
The summarized cluster types in notified case counts, consumer helpline inquiries, and school absenteeism in Hastings and Havelock North are shown in
Cluster types in notified case counts, consumer helpline inquiries, and school absenteeism in Hastings and Havelock North. A high-high cluster refers to high values surrounded by high values, a high-low cluster to high values surrounded by low values, a low-high cluster to low values surrounded by high values, and a low-low cluster to low values surrounded by low values. Multiple Types refers to multiple cluster-type designations (ie, high-high, low-low, high-low, and low-high) over the time period.
The prevalence of the designation Multiple Types did not illuminate trends or clusters in the data set. Therefore, we examined daily Local Moran’s I to compare the clustering between 2 cities during the outbreak (
Daily Local Moran’s I in school absenteeism, consumer helpline inquiries, and notified case counts in Havelock North and Hastings cities in August 2016.
Date | Havelock North | | | Hastings | |
 | School absenteeism | Consumer helpline | Notified case count | School absenteeism | Consumer helpline | Notified case count
 | Moran's I value (Z score) | Moran's I value (Z score) | Moran's I value (Z score) | Moran's I value (Z score) | Moran's I value (Z score) | Moran's I value (Z score)
August 4, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.03 (−0.16) | 0.04 (−0.23) | 0.08 (−0.29) |
August 5, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.04 (−0.23) | 0.07 (−0.29) | 0.09 (−0.32) |
August 6, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) |
August 7, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) |
August 8, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) |
August 9, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) |
August 10, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.04 (−0.19) | 0.03 (−0.1) | 0.09 (−0.29) |
August 11, 2016 | − |
− |
0 (0.01) | 0.03 (−0.15) | 0.01 (−0.1) | 0.08 (−0.29) |
August 12, 2016 | −0.40 (−0.23) | −0.77 (−0.29) | 0 (−0.32) | 0.04 (−0.23) | 0.03 (−0.29) | 0.09 (−0.32)
August 13, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) |
August 14, 2016 | −1.62 (7.08)a | −1.92 (6.71)a | −…b | … | −… | −…
August 15, 2016 | −1.62 (−0.23) | −1.92 (−0.29) | −2.17 (−0.32) | 0.03 (−0.17) | −0.01 (−0.04) | 0.56 (0.89)
August 16, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.03 (−0.16) | 0 (−0.04) | 1.20 (1.37) |
August 17, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.02 (−0.15) | 0 (0.03) | 1.20 (0.89)
August 18, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.02 (−0.11) | 0 (0.03) | 0.31 (0.35) |
August 19, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.03 (−0.23) | −0.01 (−0.29) | −0.11 (−0.32) |
August 20, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) |
August 21, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.03 (−0.13) | 0.01 (−0.04) | −0.08 (0.25) |
August 22, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.02 (−0.17) | 0 (−0.04) | −0.05 (−0.19) |
August 23, 2016 | −0.10 (0.45) | −0.11 (0.37) | −0.11 (0.34) | 0.03 (−0.18) | 0 (−0.1) | −0.02 (0.13)
August 24, 2016 | 0.21 (0.46) | 0.14 (0.37) | 0.12 (0.34) | 0.03 (−0.16) | 0.02 (−0.16) | −0.03 (−0.23) |
August 25, 2016 | 0.14 (0.3) | 0.14 (0.37) | 0.23 (0.68) | 0.03 (−0.16) | 0.04 (−0.23) | 0.06 (−0.29) |
August 26, 2016 | −0.07 (−0.23) | −0.11 (−0.29) | −0.22 (−0.32) | 0.04 (−0.23) | 0.07 (−0.29) | 0.09 (−0.32) |
August 27, 2016 | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) |
August 28, 2016 | −0.05 (0.2) | −0.01 (0.04) | −0.11 (0.34) | 0.04 (−0.19) | 0.03 (−0.1) | 0.03 (−0.1) |
August 29, 2016 | −0.05 (−0.23) | −0.01 (−0.29) | −0.11 (−0.32) | 0.04 (−0.23) | 0.03 (−0.29) | 0.03 (−0.32) |
August 30, 2016 | −0.02 (0.11) | −0.11 (0.37) | 0.05 (−0.16) | 0.05 (−0.23) | 0.08 (−0.29) | 0.10 (−0.32) |
aA negative Moran's I value with a corresponding Z score greater than 1.96 indicates a statistically significant spatial outlier.
bFirst day when the data source shows a spatial outlier.
The results show that alternative surveillance data sources can be used to predict an increase in notified Campylobacter cases up to 5 days before the outbreak would be detected via the notifiable disease surveillance system. Importantly, models that relied solely on available time-lagged notified case data were found to be no better than the models based on alternative data sources in predicting near–real-time Campylobacter cases. This finding further underscores the need for alternative real-time data sources such as consumer helpline and Google Trends.
Models that relied on consumer helpline calls provided 1 to 2 days of lead time before an increase in notified cases and consistently performed well, with low error rates. This finding suggests that consumer helpline data have potential utility for earlier detection of outbreaks of acute gastroenteritis. Qualitatively, this result is consistent with our expectations, as the consumer helpline and GP consultations are well-established services for those seeking medical attention in New Zealand [
The web data sources (Google Trends and Twitter) were found to be good estimators of Campylobacter cases, even earlier than consumer helpline data. For example, Google Trends reduced the prediction error by less than 6% compared with the next-best model (ie, with GP consultations) for the 3-day-ahead prediction, as shown in
As seen in prediction studies for other diseases [
Spatiotemporal analysis was also retrospectively able to confirm the area impacted by the outbreak. Havelock North and Hastings followed the same clustering in notified case counts and consumer helpline inquiries, whereas Hastings, which was not in the area most affected by the outbreak, had early peaks in consumer helpline inquiries and school absenteeism but fewer overall helpline calls and cases. Aggregating the time series data at the city level may immediately give indications of potential clusters, such as the one identified in Havelock North by Local Moran’s I statistics. In particular, primary clusters in school absenteeism and consumer helpline inquiries started on August 11, which was 3 days before the same type of cluster was found in notified case counts and a day earlier than actual public health response actions were initiated. Used prospectively, such spatiotemporal analysis could identify clusters and outbreaks earlier in their course than notification data [
There are limitations in our approach from inherent biases in the alternative data sources. Users of any of these services are not representative of the general population or those at risk of exposure to pathogens. Google search patterns and care seeking may reflect media coverage and situational awareness rather than the actual impact of the outbreak. Local media in regions with a large outbreak may react differently than the regions where these diseases are fewer in number. Thus, media attention has the potential to dramatically influence our daily predictions [
We used the correlation of keywords with notified cases to filter Google Trends data and to classify tweets, which improved the predictive values of these data sources. However, neither of these data sources can distinguish people who search or tweet because of awareness from those with infection. In addition, the static assessment of the predictive power of the included keywords can impose some limitations. Self-correcting keyword selection by dynamically reassessing the predictive power of each input variable, as discussed by McGough et al [
As mentioned in the Results section, there was insufficient Twitter data to use in the spatiotemporal analysis. However, tweets were only queried in English. With an already low tweet volume, capturing other languages such as Māori might be needed to refine models in the future. Furthermore, we relied on Twitter-generated coordinate information to capture local data. To overcome this limitation, future work could explore ways to geocode the data using location information in the tweet text [
We are not advocating alternative data sources to replace traditional methods, but rather to complement them. For example, in the Havelock North outbreak, public health officials still required information that suggested an outbreak source (positive bacterial test from local water supply) to start control activities (boil water notice and chlorination of drinking water supply). Early signals from social media and HealthLine calls could have triggered efforts to investigate potential outbreak sources earlier. However, nontraditional surveillance carries with it the workload required to interpret and respond to signals, which can be extensive, as others have noted [
This study shows a number of improvements over previous methodologies using monthly or weekly data from alternative sources to predict disease incidence in the community [
In addition to internet search volumes, some studies have used time-lagged data from Twitter to predict the incidence of diseases such as Zika [
Dong et al [
This study has further demonstrated that alternative surveillance data sources can identify large outbreaks of gastrointestinal illness a few days earlier than traditional surveillance methods. The lead time gained depends on the nontraditional surveillance data source used, with onset of symptoms quickly stimulating Google and Twitter activity followed soon after by calls to consumer health helplines, days off from school, and GP consultations.
Such alternative data sources also need to be combined with suitable analytic methods that can be run routinely and easily to identify potential outbreaks, so that they can be further investigated and acted on if control measures are needed. This research has identified models with autoregressive information as promising approaches for the analysis of a set of alternative data sources. However, for waterborne outbreaks such as that in Havelock North, measures of drinking water supply and weather conditions could be included as further data sources for disease surveillance.
This study used the traditional ARIMA models to assess the efficiency of using alternative data sources for the early prediction of a large Campylobacter outbreak. The development of further machine learning models using other techniques to validate the results of this study will be useful. For example, deep learning–based algorithms have been found to increase the performance of traditional time series forecasting methods [
The Havelock North outbreak was very large. The signal produced in data sources was therefore easier to detect than would be the case in a smaller outbreak where the signal-to-noise ratio would be lower. It would be useful to repeat the study with outbreaks of smaller magnitude and in different settings to determine whether similar findings apply.
There are multiple operational questions that would need to be resolved before any of the methods identified here could be introduced for routine use by public health agencies in New Zealand or elsewhere. In particular, it is important to identify the range of conditions or syndromes for which early detection is important for guiding effective public health action. It is also important to consider the volume of false positives that might be generated and the additional resources required to investigate and rule them out. The range of surveillance modalities also needs to be considered: specific forms of environmental surveillance, such as improved monitoring of drinking water quality and meteorological data, may be more effective in preventing disease than focusing on early indicators of illness. Resource issues will also need to be considered, which might favor systems that already operate on a real-time basis (eg, consumer calls to HealthLine).
This study presents several important conclusions. We tested the use of data from alternative sources in predictive models and showed that they could have provided earlier detection of the Havelock North outbreak. Given the need for early intervention to curb disease transmission, our model predictions could fill a critical time gap in existing surveillance based on notification of cases of disease. These notifications inevitably do not appear until a few days after the occurrence of a communicable disease outbreak. Our results show that models that combine consumer helpline data with autoregressive information of notified case counts performed best for predictions 1 and 2 days ahead, whereas models using Google and Twitter data performed best for predictions 3 and 4 days ahead, although with lower prediction accuracy. Spatiotemporal clusters showed an earlier spike in school absenteeism and consumer helpline inquiries when compared with the notified case counts in the city primarily affected by the outbreak, which suggests that spatiotemporal modeling of alternative data sources could help to identify and locate outbreaks earlier in their development. The methods presented here can potentially be expanded to other regions in the country to signal changes in disease incidence for public health decision makers. However, before doing that, a number of key questions will need to be systematically investigated to establish the practical role of these methods and how they could be most effectively integrated into routine public health practice.
Symptoms classified as gastrointestinal illness in HealthLine calls.
Keywords used to collect Google Trends data.
Correlation and cross correlation of key words in Google Trends with the notified case counts.
Gnip Query to collect Twitter data.
Codes used to collect absenteeism data from primary schools.
AR: autoregressive
ARIMA: autoregressive integrated moving average
CHL: consumer helpline
GP: general practice
GT: Google Trends
RMSE: root mean square error
rRMSE: relative root mean square error
SARIMA: seasonal autoregressive integrated moving average
This work was supported by the Health Research Council, New Zealand. The authors gratefully acknowledge the help of the Ministry of Health, New Zealand; participating schools; and Hawkes Bay District Health Board staff in making available the Twitter and school absenteeism data essential to this research.
None declared.