This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
West Nile virus is an arbovirus responsible for an infection that tends to peak during the late summer and early fall. Tools monitoring Web searches are emerging as powerful sources of data, especially concerning infectious diseases such as West Nile virus.
This study aimed to explore the potential predictive power of West Nile virus–related Web searches.
Different novel data streams, including Google Trends, WikiTrends, YouTube, and Google News, were used to extract search trends. Data regarding West Nile virus cases were obtained from the Centers for Disease Control and Prevention. Data were analyzed using regression, time series analysis, structural equation modeling, and clustering analysis.
In the regression analysis, an association between Web searches and “real-world” epidemiological figures was found. The best seasonal autoregressive integrated moving average model with explicative variable (SARIMAX) was found to be (0,1,1)×(0,1,1)_4. Using data from 2004 to 2015, we were able to predict data for 2016. From the structural equation modeling, the consumption of West Nile virus–related news fully mediated the relation between Google Trends and the consumption of YouTube videos, as well as the relation between the latter variable and the number of West Nile virus cases. Web searches fully mediated the relation between epidemiological figures and the consumption of YouTube videos, as well as the relation between epidemiological data and the number of accesses to the West Nile virus–related Wikipedia page. In the clustering analysis, the consumption of news was most similar to the Web searches pattern, which was less close to the consumption of YouTube videos and least similar to the behavior of accessing West Nile virus–related Wikipedia pages.
Our study demonstrated an association between epidemiological data and search patterns related to the West Nile virus. Based on this correlation, further studies are needed to examine the practicality of these findings.
West Nile virus, first isolated in Uganda in 1937, is a widely distributed arbovirus belonging to the
West Nile virus was detected in North America for the first time in August 1999 during an outbreak that occurred in College Point, Queens, in New York City. A cluster of human encephalitis cases, all residing in the same 16-square-mile area, was identified by Drs Deborah Asnis (a local physician based in Queens), Marcelle Layton, and Annie Fine (of the New York City Department of Health) [
The widespread resurgence of human West Nile virus disease in 2012 following several years of relatively low incidence has highlighted the continued public health hazard posed by West Nile virus, and has emphasized the need for more accurate predictive models of when and where new West Nile virus outbreaks will occur.
Web-based tools are emerging as remarkable sources of data, especially for infectious diseases, by enabling Web search monitoring in real time and potentially capturing epidemiologically relevant information [
Little is known about West Nile virus–related digital behavior. To the best of our knowledge, only a few authors have investigated this topic. Carneiro and Mylonakis [
Bragazzi and collaborators [
However, the potential predictive power of West Nile virus–related Web searches has not yet been explored. To fill this gap in knowledge, we conducted this study.
West Nile virus–related data were retrieved, downloaded, and analyzed from several novel data streams, including Google Trends, WikiTrends, YouTube, and Google News, as well as from epidemiological repositories.
Google Trends (an open source tool) was mined from inception (2004) to 2015, by searching for West Nile virus in the United States and using the “search topic” option. This strategy enables one to systematically collect all the searches related to a given keyword or list of keywords (in this case, West Nile virus), including synonyms and related terms, not just the precise string of characters typed by users [
Epidemiological data related to West Nile virus cases in the United States were obtained from the Centers for Disease Control and Prevention (CDC) website and the bulletins of the
Data generated by the novel data streams were retrieved and downloaded up to 2015, starting from 2004 for Google Trends and from 2008 for the other open source tools. All data were analyzed on a trimester basis. To detect a potential association with “real-world” epidemiological figures, regression analyses (with time as the confounding variable) were carried out. Furthermore, the data generated by the novel data streams were modeled as a time series and analyzed using time series analyses. In particular, a seasonal autoregressive integrated moving average model with explicative variable (SARIMAX) was used. By visually inspecting the autocorrelogram and the partial autocorrelogram based on the autocorrelation and partial autocorrelation function, respectively,
Regression and clustering analyses were performed using SPSS version 24.0 (IBM Corp, Armonk, NY, USA), whereas the SARIMAX models and the structural equation modeling were carried out with XLSTAT (Addinsoft, Paris, France). A
Visual inspection of novel data streams–based data showed that each tool captured a specific digital behavior, generating specific curves which were not perfectly superimposable (
Temporal pattern of searching behavior related to the West Nile virus in the United States, as captured by four different novel data streams: Google Trends, WikiTrends, Google News, and YouTube. RSV: relative search volume (expressed as percentage).
Concerning temporal trends, only the pattern of West Nile virus–related Web searches reproduced the epidemiological trend well, with most Google queries concentrated in August. By contrast, the number of accesses to the West Nile virus–related Wikipedia page (as captured by WikiTrends) and the consumption of YouTube videos also exhibited high search volumes during winter months compared to Google Trends (
Regression analyses showed a significant correlation between real-world epidemiological data and novel data streams-generated figures only for Google Trends data (
Seasonal pattern of searching behavior related to the West Nile virus in the United States, as captured by four different novel data streams: Google Trends, WikiTrends, Google News, and YouTube. RSV: relative search volume (expressed as percentage).
Regression analyses to detect potential association between novel data streams (Google Trends, WikiTrends, Google News, and YouTube) and real-world epidemiological figures.
Source | Regression coefficient | SE | 95% CI | t | P value |
Intercept | 3327.876 | 966.213 | 1380.603, 5275.150 | 3.444 | .001 | |
Trimester | –0.531 | 1.535 | –3.624, 2.561 | –0.346 | .73 | |
Year | –1.653 | 0.481 | –2.622, –0.684 | –3.438 | .001 | |
West Nile virus cases | 0.014 | 0.001 | 0.011, 0.017 | 9.629 | <.001 | |
Intercept | –7402.427 | 4130.337 | –15863.039, 1058.184 | –1.792 | .08 | |
Trimester | 1.715 | 4.288 | –7.068, 10.498 | 0.400 | .69 | |
Year | 3.694 | 2.053 | –0.512, 7.900 | 1.799 | .08 | |
West Nile virus cases | –0.003 | 0.005 | –0.012, 0.007 | –0.566 | .58 | |
Intercept | 8875.080 | 3807.169 | 1076.447, 16673.712 | 2.331 | .03 | |
Trimester | 1.478 | 3.952 | –6.618, 9.574 | 0.374 | .71 | |
Year | –4.407 | 1.893 | –8.284, –0.530 | –2.328 | .03 | |
West Nile virus cases | –0.001 | 0.004 | –0.010, 0.008 | –0.237 | .81 | |
Intercept | 3297.355 | 3606.758 | –4090.754, 10685.464 | 0.914 | .37 | |
Trimester | 3.454 | 3.744 | –4.216, 11.124 | 0.923 | .36 | |
Year | –1.622 | 1.793 | –5.295, 2.051 | –0.904 | .37 | |
West Nile virus cases | –0.002 | 0.004 | –0.010, 0.007 | –0.453 | .65 |
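The regression layout in the table above (an intercept, two time covariates, and the number of cases as predictors of search volume) can be illustrated with a minimal ordinary least squares sketch. The original analysis was run in SPSS; the pure-Python solver below, the function names, and the toy numbers are all illustrative assumptions and do not reproduce the study's data or estimates.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(predictors, response):
    """Least squares coefficients [intercept, b1, b2, ...] via normal equations."""
    x = [[1.0] + list(row) for row in predictors]
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data: three years of quarterly observations.
quarters = [1, 2, 3, 4] * 3
years = [0] * 4 + [1] * 4 + [2] * 4          # years since the first year
cases = [10, 50, 900, 120, 20, 80, 1500, 200, 15, 40, 600, 90]
# Synthetic search volume built from known coefficients (no noise).
rsv = [2.0 + 0.5 * q - 1.0 * yr + 0.01 * c
       for q, yr, c in zip(quarters, years, cases)]

predictors = [[q, yr, c] for q, yr, c in zip(quarters, years, cases)]
coefficients = ols(predictors, rsv)
print([round(v, 3) for v in coefficients])
```

Because the toy response is generated without noise, the fit recovers the coefficients used to build it. With real data, standard errors, confidence intervals, and P values would also be needed, as reported in the table.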
Correlation between real-world epidemiological figures of West Nile virus (WNV) cases and digital searches. RSV: relative search volume (expressed as percentage).
A Google Trends–based autocorrelogram and partial autocorrelogram are reported in
Concerning structural equation modeling, the consumption of West Nile virus–related news fully mediated the relationship between Google Trends and the consumption of YouTube videos, as well as the relation between this latter variable and the number of West Nile virus cases. Web searches as captured by Google Trends fully mediated the relation between West Nile virus cases and the consumption of YouTube videos, as well as the relation between epidemiological data and the number of accesses to the West Nile virus–related Wikipedia page as captured by WikiTrends (
Clustering analysis showed that the consumption of news was most similar to the Web searches pattern (as captured by Google Trends), which was less close to the consumption of YouTube videos and least similar to accessing the West Nile virus–related Wikipedia page as captured by WikiTrends (as can be seen by the dendrogram in
Autocorrelogram and partial autocorrelogram of West Nile virus–related search volumes generated on Google Trends.
Descriptive statistics of the Google Trends–generated data concerning Web queries related to the West Nile virus.
Lag | Autocovariance | Autocorrelation | SE | 95% CI | Partial autocorrelation | SE | 95% CI |
0 | 430.22 | 1.00 | 0.00 | Ref^a | 1.00 | 0.00 | Ref |
1 | 53.83 | 0.13 | 0.14 | –0.27, 0.274 | 0.13 | 0.14 | –0.28, 0.28 |
2 | –60.95 | –0.14 | 0.14 | –0.27, 0.271 | –0.16 | 0.14 | –0.28, 0.28 |
3 | 7.85 | 0.02 | 0.14 | –0.27, 0.268 | 0.06 | 0.14 | –0.28, 0.28 |
4 | 166.97 | 0.38 | 0.14 | –0.27, 0.265 | 0.37 | 0.14 | –0.28, 0.28 |
5 | 3.73 | 0.01 | 0.13 | –0.26, 0.262 | –0.11 | 0.14 | –0.28, 0.28 |
6 | –66.73 | –0.16 | 0.13 | –0.26, 0.259 | –0.06 | 0.14 | –0.28, 0.28 |
7 | –19.78 | –0.05 | 0.13 | –0.26, 0.256 | –0.04 | 0.14 | –0.28, 0.28 |
8 | 114.81 | 0.27 | 0.13 | –0.25, 0.253 | 0.14 | 0.14 | –0.28, 0.28 |
9 | –13.47 | –0.03 | 0.13 | –0.25, 0.250 | –0.08 | 0.14 | –0.28, 0.28 |
10 | –66.81 | –0.16 | 0.13 | –0.25, 0.247 | –0.04 | 0.14 | –0.28, 0.28 |
11 | –30.33 | –0.07 | 0.12 | –0.24, 0.243 | –0.04 | 0.14 | –0.28, 0.28 |
12 | 67.99 | 0.16 | 0.12 | –0.24, 0.240 | 0.01 | 0.14 | –0.28, 0.28 |
13 | –27.40 | –0.06 | 0.12 | –0.24, 0.237 | –0.07 | 0.14 | –0.28, 0.28 |
14 | –45.30 | –0.11 | 0.12 | –0.23, 0.233 | 0.02 | 0.14 | –0.28, 0.28 |
15 | –22.67 | –0.05 | 0.12 | –0.23, 0.230 | –0.02 | 0.14 | –0.28, 0.28 |
16 | 56.78 | 0.13 | 0.12 | –0.23, 0.226 | 0.03 | 0.14 | –0.28, 0.28 |
17 | –15.62 | –0.04 | 0.11 | –0.22, 0.223 | –0.01 | 0.14 | –0.28, 0.28 |
^a Ref: reference.
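The quantities in the table above follow from standard definitions: the autocorrelation at a given lag is the autocovariance at that lag divided by the lag-0 autocovariance (for lag 1, 53.83 / 430.22 ≈ 0.13). The sketch below computes these values and the usual white-noise limit 1.96/√n. The function names and the toy series are illustrative assumptions. Assuming 48 quarterly observations (2004 to 2015), the limit is about 0.283, which is consistent with the ±0.28 partial autocorrelation bounds reported.

```python
import math


def autocovariance(series, lag):
    """Biased sample autocovariance at the given lag (divides by n)."""
    n = len(series)
    mean = sum(series) / n
    return sum((series[t] - mean) * (series[t + lag] - mean)
               for t in range(n - lag)) / n


def autocorrelation(series, lag):
    """Sample autocorrelation: autocovariance scaled by the lag-0 value."""
    return autocovariance(series, lag) / autocovariance(series, 0)


def white_noise_bound(n, z=1.96):
    """Approximate 95% limit for the autocorrelation of white noise."""
    return z / math.sqrt(n)


# Hypothetical toy series: 48 quarters with a peak every fourth point.
toy = [5, 10, 60, 8] * 12

acf_lag4 = autocorrelation(toy, 4)
bound = white_noise_bound(len(toy))
print(round(acf_lag4, 3), round(bound, 3))
```

A spike at lag 4 that exceeds the limit, as in the table (0.38 at lag 4), is the kind of pattern that motivates a seasonal term with a period of 4.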
The outcome of the best seasonal autoregressive integrated moving average model with explicative variable (SARIMAX) forecasting the West Nile virus in the United States using Google Trends-generated data. RSV: relative search volume (expressed as percentage).
Parameters of the best seasonal autoregressive integrated moving average model with explicative variable (SARIMAX) for forecasting West Nile virus in the United States using Google Trends–generated data.
Parameter | Value | Hessian SD | 95% CI | Asymptotic SD | 95% CI |
Constant | 4.261 | Ref^a | Ref | Ref | Ref |
West Nile virus cases | 0.022 | 0.055 | –0.086, 0.130 | Ref | Ref |
MA(1)^b | –0.867 | 0.101 | –1.065, –0.670 | 0.124 | –1.110, –0.624 |
SMA(1)^c | 0.672 | 0.120 | 0.436, 0.907 | 0.150 | 0.379, 0.965 |
^a Ref: reference.
^b MA: nonseasonal moving average component.
^c SMA: seasonal moving average component.
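For orientation, a (0,1,1)×(0,1,1)_4 model with one explicative variable can be written in one common textbook form as shown below. This is a sketch under stated assumptions: software packages differ in how the constant and the explicative variable enter the model and in the sign convention for moving average terms, so the reported values (constant, coefficient for West Nile virus cases, MA(1), SMA(1)) map onto c, β, θ₁, and Θ₁ only under the convention of the software used. Here y_t denotes the Google Trends search volume in quarter t, x_t the number of West Nile virus cases, B the backshift operator, and ε_t white noise.

$$
(1 - B)(1 - B^{4})\, y_t = c + \beta\, x_t + (1 + \theta_1 B)(1 + \Theta_1 B^{4})\, \varepsilon_t
$$

Expanding both operators gives

$$
y_t - y_{t-1} - y_{t-4} + y_{t-5} = c + \beta\, x_t + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \Theta_1 \varepsilon_{t-4} + \theta_1 \Theta_1 \varepsilon_{t-5}
$$

The left-hand side shows the effect of one nonseasonal and one seasonal difference at a period of 4; the right-hand side shows the nonseasonal and seasonal moving average terms.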
Structural equation model showing the interplay between the different novel data streams concerning West Nile virus–related searching behavior: (a) not adjusted and (b) adjusted for time as confounding variable.
Dendrogram analysis of the four novel data streams (Google Trends, Google News, YouTube, and WikiTrends). Units are arbitrary.
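A dendrogram such as the one described above is the output of hierarchical (agglomerative) clustering. The source does not state which distance measure or linkage rule was used, so the sketch below assumes a correlation-based distance (1 minus Pearson's r) with average linkage. The series values and function names are hypothetical and chosen only to illustrate the procedure.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def average_linkage(series):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    names = list(series)
    dist = {frozenset((p, q)): 1.0 - pearson(series[p], series[q])
            for i, p in enumerate(names) for q in names[i + 1:]}
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[frozenset((p, q))]
                         for p in clusters[i] for q in clusters[j]]
                d = sum(pairs) / len(pairs)  # average pairwise distance
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical toy series (eight quarters each), not the study's data.
streams = {
    "google_trends": [5, 10, 60, 8, 6, 12, 70, 9],
    "google_news": [4, 9, 55, 10, 5, 11, 65, 8],
    "youtube": [20, 15, 40, 30, 25, 18, 45, 28],
    "wikitrends": [30, 25, 28, 35, 32, 27, 26, 33],
}

merges = average_linkage(streams)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

The order of the merges and the distances at which they occur are what a dendrogram displays: streams joined early at small distances behave most alike.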
Currently, arboviruses are re-emerging infectious agents. This is not a new phenomenon—it has been happening for centuries—but today arboviral re-emergence and dispersion are more rapid and geographically extensive mainly due to globalization and to arthropod adaptation to its effects [
In the existing scholarly literature, different predictive models of West Nile virus have been reported. Kala and colleagues [
In our work, we exploited a variable (West Nile virus–related digital behavior), which has so far not been used in predicting West Nile virus epidemiology. We explored different novel data streams (Google Trends, WikiTrends, Google News, and YouTube) concerning seeking behavior and we were able to find a statistically significant association between epidemiological figures and digital behavior only in the case of Web searches as captured by Google Trends. Furthermore, we computed the best SARIMAX model for the period of 2004 to 2015, and we were able to forecast data related to 2016. Moreover, structural equation modeling and clustering analysis have enabled us to capture the complex interplay between the different novel data streams and the West Nile virus–related digital seeking behavior.
Although our experience suggests that Google Trends is useful for predicting West Nile virus, this work should be considered a pilot study. The model needs to be made more accurate and reliable, possibly by incorporating other variables (eg, environmental, socioeconomic, and ecological ones). This is of fundamental importance when designing and implementing a digital system for West Nile virus surveillance, which could complement the classical one or those currently under experimentation [
Our study has some limitations that should be recognized. Some of the novel data streams used provide users with relative, normalized figures, and not with raw, absolute data, thus hindering further mathematical processing and statistical analysis. Another drawback is that Google Trends captures only a portion of the entire population, namely the percentage of people using Google as their preferred search engine (although Google is the most commonly used search engine worldwide). Furthermore, we did not perform a content analysis of the West Nile virus–related material; from the existing literature, it is known to be of rather poor quality and to exhibit some degree of inconsistency [
Statistically significant temporal correlations between West Nile virus epidemiological data and Google Trends suggest the feasibility of exploiting Google Trends as an internet-based monitoring tool. This is timely and of crucial importance given the recent re-emergence of arboviral infections. Workers in the field of public health and health authorities should be aware of the public interest and reaction to West Nile virus outbreaks in terms of Web searches. They could exploit the new information and communication technologies both for performing real-time monitoring of new population-based epidemic events and for carrying out a content analysis of the available online material, promptly replying to public concerns and correcting prejudices and inaccurate and misleading reports by disseminating high-quality information. However, based on the previously mentioned limitations of this paper, further studies are warranted to make our model more useful and practical.
Different tested seasonal autoregressive integrated moving average models with explicative variable (SARIMAX) for forecasting West Nile virus in the United States using Google Trends–generated data.
AIC: Akaike information criterion
SARIMAX: seasonal autoregressive integrated moving average model with explicative variable
None declared.