This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
Extensive discussion and research in recent years have drawn on data collected through search queries submitted via the Internet. It has been shown that overall activity on the Internet is related to the number of cases during an infectious disease outbreak.
The aim of the study was to define a similar correlation between data from Google Trends and data collected by the official authorities of Greece and Europe by examining the development and the spread of seasonal influenza in Greece and Italy.
We used multiple regressions of the terms submitted in the Google search engine related to influenza for the period from 2011 to 2012 in Greece and Italy (sample data for 104 weeks for each country). We then used the autoregressive integrated moving average statistical model to determine the correlation between the Google search data and the real influenza cases confirmed by the aforementioned authorities. Two methods were used: (1) a flu score was created for the case of Greece and (2) comparison of data from a neighboring country of Greece, which is Italy.
The results showed a significant correlation that can help predict the spread and the peak of seasonal influenza using data from Google searches. The correlation for Greece for 2011 and 2012 was .909 and .831, respectively, and the correlation for Italy for 2011 and 2012 was .979 and .933, respectively. The prediction of the peak was quite precise, providing a forecast before the peak reaches the population.
We can create an Internet surveillance system based on Google searches to track influenza in Greece and Italy.
Syndromic surveillance systems refer to monitoring of infectious diseases through data collection from various sources. This is accomplished by setting indicators and methods and publishing reports for early detection of an infectious disease. The desired result is to minimize the extensive spread in the population and take precautionary measures.
These systems operate at both national and international levels and provide useful data and guidelines for dealing with outbreaks of different pathogens and infections. Influenza is considered an important example for syndromic surveillance and response.
At the international level, the World Health Organization (WHO) has launched the Global Influenza Program [
Special monitoring and laboratories projects are used, such as the Global Influenza Surveillance and Response System [
In Europe, the competent public health authorities use the European Influenza Surveillance Network (EISN) [
As described in the Fact Sheet no. 211 (revised March 2003) of WHO [
The corresponding health costs are high. In the United States alone, large sums of money are directed to influenza treatment and hospitalization. The annual cost for the United States [
Web data are now frequently used in research, and studies show that the Internet may be an alternative source of data that indicate the development of a syndromic disease through search engine queries. Eysenbach [
Syndromic surveillance relies on the real-time use of information about the population to identify health issues of concern, and it is the current tool used by public health authorities to address them before they become epidemics. Consequently, a syndromic surveillance system implements a variety of outbreak detection algorithms, requiring a good understanding of the strengths and limitations of various detection techniques and their applicability. For example, Ping et al used data which were available via the Web and from physicians’ databases [
Our group has conducted similar research for another infectious disease (scarlet fever) in the United Kingdom. We used Web data [
Hopkins University researchers in Baltimore, United States, find
According to the CDC’s definition of fever and cough or sore throat, the researchers used the CDC’s traditional surveillance methods reporting system from January 25, 2009 to October 18, 2009 and an ED electronic reporting system (from October 18, 2009 to October 3, 2010).
Google Flu Trends weekly data were collected for Baltimore, Maryland. They also collected data from ED, CDC-reported standardized influenza-like illness (ILI) data, and influenza data confirmed by laboratories.
The data were analyzed separately for adult and pediatric cases and correlated to the Google data using cross-correlation functions. The conclusions of this study were that city-level Google Flu Trends shows strong correlation with influenza cases and EDs’ ILI visits, validating its use as an ED surveillance tool. Google Flu Trends correlated with several pediatric ED crowding measures and those for low-acuity adult patients.
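To illustrate what correlating two weekly series at different lags involves, the following is a minimal sketch in Python. It is not the cited study's code: the function names and the weekly values are hypothetical, and the lagged Pearson coefficient is used here as a simple stand-in for a cross-correlation function.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(search, cases, lag):
    """Correlate search[t] with cases[t + lag] for a non-negative lag in weeks."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(search, cases)
    return pearson(search[:-lag], cases[lag:])


# Hypothetical weekly values: cases follow searches one week later.
search = [10, 20, 40, 80, 40, 20, 10, 5]
cases = [3, 5, 10, 20, 40, 20, 10, 5]
best_lag = max(range(0, 3), key=lambda k: lagged_correlation(search, cases, k))
```

In this constructed example the correlation is highest at a lag of 1 week, which is the kind of lead time that makes search data interesting for early detection.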
Two other research studies were conducted using the autoregressive integrated moving average (ARIMA) model, by Dugas et al (
Recent research [
In 2017, several studies [
In this study, we examine the development and the spread of seasonal influenza in Greece and Italy. Our goal is to define the correlation (and finally accomplish prediction patterns) between data from Google Trends and data collected by the Hellenic Center for Disease Control and Prevention (KEELPNO) [
For the purposes of this study, we used datasets as follows:
Weekly data for ILI from the sentinel system of KEELPNO for the years 2011 to 2012 (105 weeks), for which we could find data. In Greece, through the sentinel system, the influenza activity is monitored on a weekly basis. This system consists of three basic networks: (1) from selected health units of the largest social security organization (IKA) [
We also used data from Italy for the same period and compared the results between the two countries. To get the ILI rates for Italy, we used the weekly reports from ECDC [
We used weekly data from the Google Trends for Greece and Italy, using C# programming code (see
Google Trends analyzes [
To calculate the popularity of a searched term among users in a certain geographical location (eg, country) and in a certain period, Trends examines a percentage of all searches for the specified term within the same time and location parameters. The results are then shown on a graph plotted on a scale from 0 to 100. The same information is also displayed graphically by the geographic heat map.
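The scaling described above can be sketched as follows. This is an illustration only: the exact computation used by Google Trends is not given here, so the function `scale_to_100`, the rounding step, and the weekly counts are all assumptions made for the example.

```python
def scale_to_100(term_counts, total_counts):
    """Scale weekly search shares so that the busiest week equals 100."""
    shares = [t / n for t, n in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    return [round(100 * s / peak) for s in shares]


# Hypothetical weekly counts for one term and for all searches in a country.
term_counts = [50, 120, 300, 150]
total_counts = [100000, 100000, 120000, 100000]
index = scale_to_100(term_counts, total_counts)
```

The point of the sketch is that the index is relative: each week is expressed as a share of all searches and then rescaled against the highest share in the period, so a value of 100 marks the week of greatest relative interest rather than a raw search count.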
In our case, we must deal with the problem that for the term
The solution to this problem is to perform searches for separate keywords related to the term
Our next task was to determine whether we can use one of these or all together, creating a
The results of the multiple regressions are shown in
Regressions of separate keywords (year 2011).
Keyword | English term | Variable | R | R² | Standard error | Constant | Coefficient |
γριπη | Influenza | x1 | .888765 | .789903 | 0.00996 | .008855 | .539551 |
πυρετος | Fever | x2 | .655427 | .429584 | 0.01641 | −.03361 | 2.747708 |
βηχας | Cough | x3 | .658775 | .433984 | 0.01634 | −.01984 | 2.03161 |
πονοκεφαλος | Headache | x4 | .007578 | .000057 | 0.02172 | .019887 | .034104 |
πονολαιμος | Sore throat | x5 | .242802 | .058953 | 0.02107 | .01486 | .22728 |
φαρυγγιτιδα | Pharyngitis | x6 | .340787 | .116136 | 0.02042 | .005607 | .708454 |
αντιβιωση | Antibiotics | x7 | .327644 | .107351 | 0.02052 | −.01148 | 1.596785 |
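For readers who want to see how quantities like those in the table above arise, the following is a minimal sketch of a single-keyword regression in Python. It assumes, for illustration, that the weekly ILI rate is the dependent variable and one keyword's search index is the predictor; the function name and the data are hypothetical and are not the study's values.

```python
from math import sqrt


def simple_regression(x, y):
    """Ordinary least squares fit of y = constant + coefficient * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    coefficient = sxy / sxx
    constant = mean_y - coefficient * mean_x
    r = sxy / sqrt(sxx * syy)
    return {"constant": constant, "coefficient": coefficient, "r": r, "r_squared": r * r}


# Hypothetical weekly values: a keyword's search index and an ILI rate.
keyword_index = [10, 20, 30, 40, 50]
ili_rate = [12, 19, 33, 41, 50]
fit = simple_regression(keyword_index, ili_rate)
```

In a single-predictor regression, the coefficient of determination is the square of the correlation coefficient, which is why the fourth and fifth numeric columns of each table row are related in that way.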
As shown in
In our case, we decided to combine all the keywords and create a flu score.
To obtain the data from Google Trends, we used Visual Studio 2012 Ultimate and Visual C# as the programming language.
Our goal was to construct prediction models based on the ARIMA model, previously used by other researchers, in specific data from Greece and Italy for influenza, describing two cases: using a flu score for Greece and without it for Italy.
We examined two different cases, both based on ARIMA models. The underlying assumption is that we can create a flu score from different keywords that people search for on the Internet. This score aggregates the separate keywords. In statistical terms, the score is the average of the values of all the separate keywords, as shown in the following
The first case assumes that this score can be created and used (the case of Greece), whereas the other assumes that there is enough and reliable data from the Internet that can be safely used. In that case, we used data from a neighboring country using the keyword
After creating the flu score, we used the model ARIMA (1, 0, 0) [
Lags of the differenced series appearing in the forecasting equation are called autoregressive terms, lags of the forecast errors are called moving average terms, and a time series, which needs to be differenced to be made stationary, is said to be an integrated version of a stationary series. The ARIMA models are, in theory, the most general class of models for forecasting a time series, which can be stationarized by transformations such as differencing and logging.
A nonseasonal ARIMA model is classified as an ARIMA (p, d, q) model, where p is the number of autoregressive terms, d is the number of nonseasonal differences, and q is the number of lagged forecast errors in the prediction equation.
In more detail, the above parameters can be analyzed as follows:
p stands for the number of autoregressive orders in the model. Autoregressive orders specify which previous values from the series are used to predict current values. For example, an autoregressive order of 2 specifies that the value of the series two time periods in the past be used to predict the current value.
d specifies the order of differencing applied to the series before estimating models. Differencing is necessary when trends are present (series with trends are typically nonstationary and ARIMA modeling assumes stationarity) and is used to remove their effect. The order of differencing corresponds to the degree of series trend—first-order differencing accounts for linear trends, second-order differencing accounts for quadratic trends, and so on.
Finally, q is the number of moving average orders in the model. Moving average orders specify how deviations from the series mean for previous values are used to predict current values. For example, moving average orders of 1 and 2 specify that deviations from the mean value of the series from each of the last two time periods be considered when predicting current values of the series.
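The roles of p and d described above can be made concrete with a short sketch in Python. The helper names, the parameter values, and the series are hypothetical; the sketch shows differencing and a one-step autoregressive forecast only, and it does not estimate any parameters.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing (the 'd' in ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return list(series)


def ar_predict(history, constant, phis):
    """One-step forecast from the last p values (the 'p' in ARIMA)."""
    p = len(phis)
    if len(history) < p:
        raise ValueError("history is shorter than the autoregressive order")
    recent = history[::-1][:p]  # most recent value first
    return constant + sum(phi * value for phi, value in zip(phis, recent))


# Hypothetical series with a linear trend: one difference makes it constant.
trend = [2, 5, 8, 11, 14]
once = difference(trend, d=1)      # [3, 3, 3, 3]
twice = difference(trend, d=2)     # [0, 0, 0]

# Hypothetical AR(1) parameters: next = 1.0 + 0.5 * last value.
next_value = ar_predict([4.0, 6.0, 10.0], constant=1.0, phis=[0.5])  # 6.0
```

With d = 0, as in the ARIMA (1, 0, 0) model used in this study, the series is modeled without any differencing.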
Flu-score equation.
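The equation image itself is not reproduced in this text. Based on the description given earlier (the score is the average of the values of the separate keywords), one consistent way to write it is shown below. It assumes the seven keyword variables x1 to x7 listed in the regression table; the week index t and the name FluScore are notation introduced here for illustration.

```latex
\[
  \mathrm{FluScore}_t = \frac{1}{7} \sum_{i=1}^{7} x_{i,t}
\]
```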
In our model, we do not use nonseasonal differences, as we examine a single period of a year, which means there is no seasonality inside the same year, and the peak occurs only once. This model is a special case of an ARIMA model (autoregressive moving average [ARMA] model).
As this model combines autoregression (AR) and moving averages (MA), mathematically, it can be expressed as seen in
First, we create an estimation (base) model for the year 2011. If we assume that the parameters, the constant, and the errors of the estimate for the year 2011 are the same as for the year 2012, by downloading the values from the Google Trends, we build a model for the year 2012. This means that we tried to forecast the influenza ILI rates of 2012 having the knowledge of only the Google Trends data.
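The idea of reusing the base-year parameters with only search data as input can be sketched as follows. This is an assumption-laden illustration rather than the model as fitted in the study: it assumes an AR(1) structure with a single external regressor, and the parameter names (`constant`, `phi`, `beta`), their values, and the search index values are all hypothetical.

```python
def forecast_with_fixed_parameters(search_index, constant, phi, beta, start_value):
    """Recursive forecast: y[t] = constant + phi * y[t-1] + beta * search_index[t]."""
    forecasts = []
    previous = start_value
    for x in search_index:
        current = constant + phi * previous + beta * x
        forecasts.append(current)
        previous = current  # feed the forecast back in; no observed ILI is used
    return forecasts


# Hypothetical parameters "estimated" on the base year and reused unchanged.
params = {"constant": 1.0, "phi": 0.5, "beta": 0.2}
# Hypothetical search index for the first weeks of the forecast year.
next_year_index = [10, 20, 40]
predicted = forecast_with_fixed_parameters(next_year_index, start_value=4.0, **params)
```

Because each forecast is fed back as the previous value, the sketch uses no observed ILI rates from the forecast year, which mirrors the stated aim of forecasting from Google Trends data alone.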
The second method addresses the situation in which a flu score is not needed because Google Trends provides a sufficient volume of search data for the term
The test case was the data for the year 2011, building a base model, as in the previous method.
The next task was to use the same parameters of the ARIMA (1, 0, 0) model (constant, φ, θ, and ε of year 2011) for the year 2012 (from the first to the last week of 2012) to develop a forecast.
As shown in the Results Section, the case of Italy is different from the Greek one because the official data from ECDC do not exist from a specific week until the end of the year 2012. Nevertheless, using the ARIMA model, we can predict the entire time series—that means the peak and the spread of influenza based on data from Google Trends.
Generally, the methodology could be summarized as shown in
Autoregression equation.
Moving averages equation.
Autoregressive integrated moving average (ARIMA) equation.
Flow diagram of methodology.
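The equation images named in the captions above are not reproduced in this text. In standard textbook notation, and using the constant, φ, θ, and ε mentioned in the Methods, the three model forms can be written as below. The symbols X_t, c, and μ are conventional choices assumed here and may differ from the notation of the original figures.

```latex
\begin{align}
  \text{AR}(p):\quad & X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t \\
  \text{MA}(q):\quad & X_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} \\
  \text{ARMA}(p,q):\quad & X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
\end{align}
```

For the ARIMA (1, 0, 0) case, with p = 1 and no differencing or moving average terms, the general form reduces to \(X_t = c + \varphi_1 X_{t-1} + \varepsilon_t\).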
Autoregression and moving averages models may be used to correlate data and build prediction models. The methods described above were developed to address cases in which Google data are insufficient or completely missing, or in which the statistical correlation is below .90. This correlation is denoted by the Pearson
In the first method, combining the most relevant keywords in Greek language to the term
For the year 2011, the
The results for the year 2011 are shown in
The horizontal axis in
Some interesting remarks can be mentioned about the above estimation:
The predicted development of the disease is almost the same as the real one.
The predicted peak appears with a delay of 1 week after the actual one in early February in the 6th week instead of the 5th week of the real cases.
The predicted peak is 73.35, very close to the real peak, which is 76.98. The difference is below 5% (4.71%).
It takes 4 weeks to reach the maximum value from the baseline of 20 ILI cases (from the first week to the 5th week for the real values and from the second week to the 6th week for the predicted values).
The above estimation is very good, and we tested the same model with the same parameters to establish a prediction model for the next year.
For the year 2012, the
The correlation coefficient (
Some interesting aspects of the forecast for the year 2012 are the following:
The forecast of the peak of the seasonal flu is almost accurate, considering that during the year 2011 (from the first to the last week of 2011), the peak arrived very early in February (5th week), whereas in 2012 (from the first to the last week of 2012), there is a significant difference, as the peak arrived later in the 9th week.
It takes the same time (as for the year 2011) for the peak to appear (4 weeks above the baseline of 20).
The forecast model predicts almost accurately the year 2012. The predicted value is 76.61, and the real value is 73.26, which means that the difference is below 5% (4.58%).
The predicted peak is shown 1 week earlier.
The development of the curve after the peak also follows closely the observed data.
The prediction in this case rests on the observation that, in both years, the maximum activity of the disease appears exactly 4 weeks after the value rises above 20 ILI cases, which is the baseline of influenza activity, as previously mentioned.
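The baseline rule described above can be expressed as a short sketch in Python. The function names and the weekly values are hypothetical; only the baseline of 20 and the 4-week lead come from the text, and the rule is a pattern observed over two seasons rather than a general law.

```python
def first_week_above(series, baseline):
    """Return the 1-based week in which the series first exceeds the baseline."""
    for week, value in enumerate(series, start=1):
        if value > baseline:
            return week
    return None


def expected_peak_week(series, baseline=20, lead_weeks=4):
    """Apply the rule of thumb: peak = first week above baseline + lead_weeks."""
    start = first_week_above(series, baseline)
    return None if start is None else start + lead_weeks


# Hypothetical weekly ILI values that first exceed 20 in week 5.
weekly_ili = [8, 11, 14, 18, 23, 35, 52, 68, 73, 60]
crossing = first_week_above(weekly_ili, 20)   # 5
peak_week = expected_peak_week(weekly_ili)    # 9
```

A series that never exceeds the baseline yields no expected peak, which corresponds to a season in which activity stays at baseline level.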
The final point of the forecast will be the assumption of an early detection.
From
Finally, the comparison of real and predicted (ILI per 1000 people) cases is shown in
In the case of Italy, we can see that for the year 2011, the coefficient
The result of the ARIMA model is shown in
Estimation for Greece using the autoregressive integrated moving average (ARIMA) model (year 2011).
The prediction for Greece (year 2012).
Rising from 20 to maximum.
Value | Year | Week above 20 | Week of the peak |
>20 | 2012 | 5th | 9th |
>20 | 2011 | 1st | 5th |
Prediction of the real cases for year 2012 (peak).
Predicted value | Real value | Difference | Difference (%) |
76.61 | 73.26 | 3.35 | 4.58 |
The prediction of ILI rates for Italy, using the ARIMA model, is very good, and the correlation coefficient
The outcome of the model indicates that there is a strong and significant statistical correlation between the Google searches made by Italians for the word
The main results of this estimation can be summarized as follows:
The development of the disease is almost the same after the peak.
The estimation model predicts the highest value 1 week later.
The baseline is 500 ILI cases (per 100,000 people).
It takes 4 weeks to reach the maximum value (1st to 5th week).
The real maximum value is 1102.1, whereas the estimated value is 1013.05. The difference is −89.05 (−8.08%).
The prediction results for the year 2012 are as follows:
For the year 2012, the coefficient
The prediction of this model for the year 2012 is shown in
As shown in
The statistical correlation is above .90. The correlation coefficient
The main results of this prediction model are as follows:
There is a lot of missing official data for ILI cases after the 16th week (mid-April).
Despite the missing values, the prediction of the peak and the size of the peak are very good. The predicted value is 1037.461, and the real peak is 947. The difference is 90.46 (9.55%). The predicted peak occurs 1 week later than the real peak (6th week instead of 5th week).
There is another peak toward the end of the year, similar to that of the 6th week. The baseline value before this peak arrives is 500 ILI cases; it occurs at the end of the year, starting 5 weeks earlier (46th week).
The predicted baseline is 400 ILI cases, above which it takes 4 weeks for the peak to arrive.
The summary of the results of ARIMA (1, 0, 0) model for both Greece and Italy cases is shown in
Autoregressive integrated moving average (ARIMA) model for Italy (year 2011).
Autoregressive integrated moving average (ARIMA) model. Prediction for Italy (year 2012).
Summary of the results.
Country | Year | Predicted peak | Real peak | Difference | Difference (%) | Weeks to reach the peak |
Greece | 2011 | 73.35 | 76.98 | −3.63 | −4.71 | 4 |
Greece | 2012 | 76.61 | 73.26 | 3.35 | 4.58 | 4 |
Italy | 2011 | 1013.05 | 1102.1 | −89.05 | −8.08 | 4 |
Italy | 2012 | 1037.461 | 947 | 90.461 | 9.55 | 4 |
Italy | 2012 | 1037.461 | | | | 5 |
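The Difference and Difference (%) columns of the summary table above can be recomputed from the predicted and real peaks. The following sketch in Python assumes the percentage is taken relative to the real peak; the function name is hypothetical, and the input values are those reported in the table.

```python
def peak_difference(predicted, real):
    """Return (absolute difference, percentage difference relative to the real peak)."""
    diff = predicted - real
    return diff, 100 * diff / real


# Predicted and real peaks as reported in the summary table.
peaks = {
    ("Greece", 2011): (73.35, 76.98),
    ("Greece", 2012): (76.61, 73.26),
    ("Italy", 2011): (1013.05, 1102.1),
    ("Italy", 2012): (1037.461, 947),
}
differences = {key: peak_difference(p, r) for key, (p, r) in peaks.items()}
```

Under that assumption, the recomputed percentages agree with the tabulated ones to within about 0.01 percentage points; the small remaining gaps for Greece are consistent with rounding in the second decimal.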
Autoregression and moving averages models may be used to correlate data and build prediction models. The methods described above were developed to use the data from Google searches found in the Google Trends system with the help of ARIMA models. The first method is used when searches for the term
The early detection of a future influenza pandemic activity is a key issue for all public health authorities [
Similar studies, mentioned in the Introduction, were conducted by scientists who used Internet data to make predictions and estimations for infectious diseases. Different models were used to detect and predict the outbreak of seasonal diseases. These studies focused on countries such as the United States, Sweden, and the United Kingdom, or on countries in Asia. Our research is the first that examines a serious infectious disease such as influenza in small countries such as Greece and Italy. We consider the ARIMA model, already used by other scientists, very effective, and we used it to make estimations and predictions for the spread and the peak of seasonal influenza in Greece and Italy.
The main limitations are as follows:
Analysis based on Google searches requires Google data to exist. This is possible only when people can search the Internet, and it also requires a sufficient extent of Internet penetration and use in the specific country. Although Internet use has risen continuously, it remains important that Internet speeds are fast enough and that people are familiar with Google services.
Another aspect is the language used. The keyword
The popularity and publicity regarding infectious diseases. The influenza disease can be safely used, as it is a very common disease among many countries of the world. Nevertheless, if there is a need for examination and study for another disease with less popularity, the first method will be possibly the only solution, when a researcher wishes to analyze data from Google Trends, specifically in smaller countries.
Despite the above limitations, other similar studies have shown that the Google Trends system can be safely used. In general, an Internet surveillance system can be an alternative to the official sentinel systems for monitoring and evaluating the development of infectious diseases.
There is a lot of discussion about the usability of Google Flu Trends, a service which was provided by Google. It has been found [
The outlook of testing different systems and generally the use of Internet surveillance systems is very important. This does not mean that monitoring systems based on Internet surveillance should totally substitute the traditional systems, but they can be certainly used on a supplementary basis.
Besides the above remark, a monitoring system based on Internet data may take advantage of the same definitions, methods, and indicators created and proposed by international and national organizations. Consequently, this means that this kind of system can have a great contribution to coordinate the different national monitoring systems.
The official definitions of diseases and the proposed specific indicators are made to coordinate the national systems. In Europe, ECDC monitors the levels of influenza activity in European countries reported by EISN members during the influenza season. The levels are based on the following three assessments or indicators [
An indicator of the overall intensity of influenza activity in the country
An indicator of the geographical spread of influenza in the country
An indicator of trend in ILI or acute respiratory infection (ARI) sentinel consultations in the country compared with the previous week
The main three indicators concern the overall intensity of influenza activity, the geographical spread of influenza, and the trend of the disease. These indicators can be described as follows:
The intensity of influenza activity is based on the overall level of clinical influenza activity in the country (or region). Each country assesses the intensity of clinical activity based on the historical data at its disposal. Some countries have historical data that date back over 30 years (eg, the United Kingdom [England] and the Netherlands), whereas others have data that date back over shorter periods of time (eg, Ireland). Some networks can establish numeric thresholds that define the different intensity levels of clinical influenza activity.
The EISN intensity definitions are denoted as low, medium, high, very high, and unknown.
The baseline influenza activity is the level at which clinical influenza activity remains throughout the summer and most of the winter. Usually, there is a 6- to 12-week period in winter when the level of clinical influenza activity rises above the baseline threshold, but in an occasional winter, activity never rises above the baseline level.
Each country defines the geographical spread of influenza according to the definitions outlined below. The definitions are based on those used by the WHO global influenza surveillance system—FluNet [
ILI: influenza-like illness
ARI: acute respiratory infection
Country: countries may be made up of one or more regions
Region: the population under surveillance in a defined geographical subdivision of a country. A region should not (generally) have a population of less than 5 million unless the country is large with geographically distinct regions
The geographical spread is reported as no activity, sporadic, local outbreak, regional activity, or widespread activity.
Trend is reported by the countries as increasing, stable, or decreasing. Trend is a comparison of the level of ILI or ARI sentinel consultations during 1 week with the previous week.
Outside the influenza season, when ILI and ARI rates are at baseline level, increasing or decreasing trends are not informative.
Increasing: evidence that the level of respiratory disease activity is increasing compared with the previous week.
Stable: evidence that the level of respiratory disease activity is unchanged compared with the previous week.
Decreasing: evidence that the level of respiratory disease activity is decreasing compared with the previous week.
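The trend indicator defined above compares one week with the previous week, which can be sketched in Python as follows. The function name, the optional `tolerance` parameter, and the consultation rates are hypothetical; the official definitions speak of evidence of change and do not prescribe a numeric threshold.

```python
def weekly_trend(previous_rate, current_rate, tolerance=0.0):
    """Classify the week-on-week change in the ILI or ARI consultation rate."""
    change = current_rate - previous_rate
    if change > tolerance:
        return "increasing"
    if change < -tolerance:
        return "decreasing"
    return "stable"


# Hypothetical consultation rates for consecutive weeks.
rates = [12.0, 12.0, 15.5, 21.0, 19.0]
trends = [weekly_trend(a, b) for a, b in zip(rates, rates[1:])]
```

An Internet-based system could apply the same classification to a search-derived series, which is one way it could reuse the indicators of the traditional systems.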
The usability of the aforementioned definitions and indicators indicates that an Internet surveillance system may be a useful tool for coordinating the different national systems that are currently used.
In terms of government spending, we mentioned in the Introduction the huge costs connected to influenza through absenteeism, influenza complications, hospital stays, and deaths. We believe that early detection could provide useful means and tools for prevention, reducing overall spending but, above all, addressing public health issues concerning influenza tracking, monitoring, and treatment. Many studies in various universities and research centers have been conducted to indicate and propose the extensive use of the Internet to meet the requirements for successful monitoring of epidemics and for creating an Internet surveillance system in an inexpensive way.
Finally, the main conclusions of this study can be summarized as follows:
There is a significant statistical correlation between the ILI rates of Greece and Italy and the searches made in the Google search engine.
We can use the ARIMA statistical model for estimations and to create prediction rules and patterns for influenza in Greece and Italy based on searches made in Google search.
By using Google Trends, we can predict the maximum point of influenza 4 weeks before it arrives.
Google Trends can be a useful source of data. In cases of insufficient data or with low correlation of Google searches to the real cases for a single word (influenza) for a specific location (country) and for a certain period (year), a combined flu score can be created based on Google searches made by people with keywords related to the symptoms of the disease. When sufficient and reliable data volume of a keyword exists, we can still use ARIMA models for forecasts.
An Internet surveillance system can be an alternative that operates as a supplementary system, and it can use the same official definitions and indicators as the traditional systems to help coordinate national monitoring systems across Europe.
On the basis of Google search data, an Internet system can contribute to lowering costs by helping governments to prevent severe influenza outbreaks and manage their operational public health plans.
The term
Programming codes.
AR: autoregression
ARI: acute respiratory infection
ARIMA: autoregressive integrated moving average
ARMA: autoregressive moving average
CDC: Centers for Disease Control and Prevention
ED: emergency department
EISN: European Influenza Surveillance Network
EU: European Union
ILI: influenza-like illness
KEELPNO: Hellenic Center for Disease Control and Prevention
MA: moving averages
NUTS: classification of territorial units for statistics
WHO: World Health Organization
We acknowledge KEELPNO for providing useful data on influenza epidemics. We also acknowledge Dr Kassiani Golfinopoulou from the competent Infectious Disease Department of KEELPNO for general help in interpreting the concepts and terms, and Mrs Agoritsa Bakka and Dr Christos Hadjichristodoulou for supervising the entire manuscript.
LS performed the experiments and wrote the programming code, MA examined and edited the code, and EG edited the text. The conception and plan of the research are part of LS's ongoing PhD work, carried out under the supervision of MA and EG.
None declared.