This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
With internet penetration and use constantly expanding, the vast amount of information generated online can be used to better assess issues in the US health care system. Google Trends, a popular tool in big data analytics, has been widely used to examine interest in various medical and health-related topics and has shown great potential in forecasting, prediction, and nowcasting. As empirical relationships between online queries and human behavior have been shown to exist, there is a new opportunity to explore behavior toward asthma, a common respiratory disease.
This study aimed to forecast online behavior toward asthma and to examine the correlations between queries and reported cases in order to explore the possibility of nowcasting asthma prevalence in the United States using online search traffic data.
Applying Holt-Winters exponential smoothing to Google Trends time series from 2004 to 2015 for the term “asthma,” forecasts for online queries at state and national levels are estimated from 2016 to 2020 and validated against available Google query data from January 2016 to June 2017. Correlations among yearly Google queries and between Google queries and reported asthma cases are examined.
Our analysis shows that search queries exhibit seasonality within each year and the relationships between each 2 years’ queries are statistically significant (
Online behavior toward asthma can be accurately predicted, and significant correlations between online queries and reported cases exist. This method of forecasting Google queries can be used by health care officials to nowcast asthma prevalence by city, state, or nationally, subject to future availability of daily, weekly, or monthly data on reported cases. This method could therefore be used for improved monitoring and assessment of the needs surrounding the current population of patients with asthma.
Health informatics is the field where information technology, computer science, social sciences, and health care meet [
According to Gu et al [
Over the last few years during the integration of the health pillar in smart cities, where big data is being continuously gathered and analyzed [
What has been of notable popularity in big data analytics is the analysis of online search queries [
In this study, online queries for the term “asthma” in the United States were analyzed in order to explore the possibility of nowcasting (ie, forecasting the present) asthma prevalence using Google Trends. Asthma was selected because it is a common chronic respiratory disease characterized by exacerbations, also known as asthma attacks; therefore, the reported cases are bound to show seasonality as well as constant interest.
Asthma is a chronic condition characterized by airway inflammation and hyper-responsiveness that causes airways to constrict in response to exercise, infection, exposure to allergens, and occupational exposures [
Asthma presents with coughing, wheezing, and chest tightness that seem to be worse during the night and early mornings. These symptoms, along with a family history of asthma or atopic dermatitis, can prompt investigations to confirm an asthma diagnosis. Exacerbation of normal asthma symptoms is more common in patients with uncontrolled asthma or in high-risk patients [
The management of asthma usually involves the use of several inhalers, leading to a rather complicated treatment regimen that makes patient compliance difficult because it interferes with patients’ daily activities. Poor compliance can lead to increased morbidity as well as increased cost of treatment [
Asthma has several social complications such as limiting patients’ activity levels [
Google Trends data have been previously shown to be valid by many studies [
Monthly time series from Google Trends for the keyword “asthma” from 2004 to 2015 in the United States and by individual state were used. The data were normalized by Google and downloaded in .csv format on July 7, 2017, between 12:47 and 13:02 for the United States and on July 18 between 14:03 and 14:33 for each of the 50 states and the District of Columbia. The data adjustment procedure is reported by Google as follows [
The seasonality of asthma queries was explored, followed by estimation of the forecasts for online interest in the term from 2016 through 2020 for the country as well as for each state. The additive method for Holt-Winters exponential smoothing was employed, using the statistical programming language R. The Holt-Winters equations [
In order to further elaborate on the seasonality, the Pearson correlations for Google Trends data for the term “asthma” between each 2 years from 2004 to 2015 in the United States were calculated. Finally, the Pearson correlations between Google queries and the National Health Interview Survey (NHIS) prevalence data [
Asthma is not included in the list of diseases with a Centers for Disease Control and Prevention (CDC) surveillance case definition, defined as “a set of uniform criteria used to define a disease for public health surveillance. Surveillance case definitions enable public health officials to classify and count cases consistently across reporting jurisdictions. They provide uniform criteria of national notifiable infectious and non-infectious conditions for reporting purposes” [
In 2011, the BRFSS changed its weighting methodology in addition to also including mobile phone respondents. Therefore, any comparisons between years before and after 2011 should be carefully interpreted. In this study, no such comparisons are made, as each year’s online queries are compared with the respective year’s asthma reported cases, thus including no cross-year comparisons. For this study, we used the CDC definition of asthma prevalence, based on affirmative responses to the following NHIS questions: (adults) “Have you ever been told by a doctor or other health professional that you had asthma?” and “Do you still have asthma?” and (children) “Has a doctor or other professional ever told you that [sample child] had asthma?” and “Does [sample child] still have asthma?” [
Equations for Holt-Winters exponential smoothing, where
Out of the 50 states and District of Columbia, 29 fall into the 81 to 100 group, 21 in the 61 to 80 group, only 1 (Oregon) in the 41 to 60 group, and none in the 21 to 40 and 0 to 20 groups. This classification indicates that the examined term is of high interest to the population of the United States. The detailed data for
There has been a significant increase in searches for the term “asthma” in the states from 2004 to 2015, with the lowest count of states in the 81 to 100 group being in 2007 and the highest in 2012. The top asthma-related queries in the United States from January 2004 to December 2015 include “allergy asthma” (100), “asthma symptoms” (45), “asthma attack” (35), “what is asthma” (25), “asthma inhaler” (20), “asthma children” (15), “exercise asthma” (15), “asthma medications” (10), and “allergy and asthma center” (10).
As is evident, online behavioral changes toward the term “asthma” depict behavior toward said disease. The next steps are to examine if forecasting online interest in the United States is possible and identify existing relationships between online search traffic data and reported asthma cases.
Online interest by state in the term "asthma" from 2004 to 2015.
Monthly changes in online interest in the term "asthma" from 2004 to 2015.
Weekly changes in online interest in the term "asthma" for each year from 2004 to 2015.
Online interest by state in the term "asthma" per year from 2004 to 2015.
The smoothing parameters for the additive Holt-Winters exponential smoothing with trend and additive seasonal component are α=.33, β*=0, and γ=.65. The estimated values for the coefficients for the level, trend and season are as follows: a=69.54, b=–.07, s1=–.94, s2=1.44, s3=3.37, s4=7.84, s5=2.51, s6=–5.68, s7=–8.51, s8=–7.20, s9=1.89, s10=4.67, s11=1.11, and s12=–3.53.
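To make the reported coefficients concrete, the following minimal sketch turns them into point forecasts. It assumes the coefficients follow the convention of R's `HoltWinters` output, in which the h-step-ahead additive forecast is the level plus h times the trend plus the seasonal term for that month, with s1 applying to the first forecast month (January 2016 here). The function name `holt_winters_forecast` is hypothetical and is not taken from the study's code.

```python
# Coefficients reported in the text for the national series (fit on 2004 to 2015).
LEVEL = 69.54   # a
TREND = -0.07   # b
SEASONAL = [-0.94, 1.44, 3.37, 7.84, 2.51, -5.68,
            -8.51, -7.20, 1.89, 4.67, 1.11, -3.53]  # s1..s12


def holt_winters_forecast(h, level=LEVEL, trend=TREND, seasonal=SEASONAL):
    """Additive h-step-ahead point forecast: level + h * trend + seasonal term.

    Assumption: s1 applies to the first forecast month, s2 to the second,
    and the seasonal terms repeat with a period of len(seasonal) months.
    """
    if h < 1:
        raise ValueError("h must be a positive number of months ahead")
    return level + h * trend + seasonal[(h - 1) % len(seasonal)]


if __name__ == "__main__":
    # Print forecasts for a few horizons (1 = first month after the fit).
    for h in (1, 4, 7, 12, 13):
        print(f"h={h:2d}: {holt_winters_forecast(h):.2f}")
```

Under these assumptions, the first forecast month works out to 69.54 − 0.07 − 0.94 = 68.53, and the same seasonal term recurs 12 months later with a slightly lower trend contribution.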
In order to elaborate on the robustness of the forecasting model, the estimated values are validated against the available Google queries for the term “asthma” from January 2016 to June 2017, as is shown in
It is therefore suggested that the online behavior exhibits seasonality and can be predicted. The last step in exploring if nowcasting of asthma prevalence in the United States is possible using Google Trends is to examine the correlations between Google Trends data and reported lifetime and current asthma.
As shown in
Google Trends (2004 to 2015) versus forecasts (2005 to 2020) in the United States.
Google Trends (2004 to 2015) versus forecasts (January 2016 to June 2017) in the United States.
Pearson correlations between each 2 years’ normalized Google asthma queries in the United States from 2004 to 2015.
Year | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 |
2005 | .89 | — | — | — | — | — | — | — | — | — | — |
2006 | .86 | .89 | — | — | — | — | — | — | — | — | — |
2007 | .77 | .85 | .77 | — | — | — | — | — | — | — | — |
2008 | .94 | .93 | .81 | .78 | — | — | — | — | — | — | — |
2009 | .79 | .76 | .64 | .89 | .80 | — | — | — | — | — | — |
2010 | .88 | .94 | .87 | .82 | .92 | .81 | — | — | — | — | — |
2011 | .94 | .93 | .85 | .87 | .93 | .91 | .93 | — | — | — | — |
2012 | .88 | .90 | .85 | .81 | .90 | .82 | .98 | .91 | — | — | — |
2013 | .84 | .87 | .72 | .89 | .90 | .93 | .89 | .92 | .90 | — | — |
2014 | .75 | .82 | .68 | .77 | .87 | .78 | .82 | .83 | .86 | .92 | — |
2015 | .86 | .85 | .69 | .86 | .92 | .93 | .88 | .92 | .90 | .98 | .93 |
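A lower-triangular table of this kind can be produced by splitting a monthly series into calendar years and correlating each pair of years. The sketch below illustrates the idea under stated assumptions: the input series is synthetic (a fixed seasonal shape plus a slow decline and a small deterministic wobble), not the study's Google Trends data, and the helper names `pearson` and `yearly_correlations` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def yearly_correlations(monthly, first_year):
    """Correlate the 12 monthly values of every pair of calendar years."""
    if len(monthly) % 12 != 0:
        raise ValueError("expected whole years of monthly values")
    years = {first_year + i: monthly[12 * i:12 * i + 12]
             for i in range(len(monthly) // 12)}
    # Keep only (later, earlier) pairs, i.e. the lower triangle of the table.
    return {(later, earlier): pearson(years[later], years[earlier])
            for later in years for earlier in years if earlier < later}


# Synthetic stand-in for a 2004 to 2015 monthly series (NOT the study data).
SHAPE = [0, 2, 4, 7, 3, -5, -8, -7, 2, 5, 1, -4]
synthetic = [80 - 2 * i + SHAPE[m] + 0.3 * ((7 * i + 3 * m) % 5 - 2)
             for i in range(12) for m in range(12)]

table = yearly_correlations(synthetic, first_year=2004)

if __name__ == "__main__":
    for pair in [(2005, 2004), (2015, 2004), (2015, 2014)]:
        print(pair, round(table[pair], 2))
```

Because a Pearson correlation is unaffected by a constant shift within a year, a declining overall level does not lower the year-to-year coefficients; only differences in the within-year seasonal shape do.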
Total lifetime and current asthma National Health Interview Survey (2004 to 2015) and Behavioral Risk Factor Surveillance System (2004 to 2014) prevalence data.
Year | Lifetime asthma (NHIS^a) | Current asthma (NHIS^a) | Asthma hits^c (NHIS^a) | Lifetime asthma (BRFSS^b) | Current asthma (BRFSS^b) | Asthma hits^c (BRFSS^b) |
2004 | 30,189 | 20,545 | 81.41 | 33,084,183 | 20,422,385 | 83.17 |
2005 | 32,621 | 22,227 | 79.58 | 30,661,476 | 19,453,974 | 80.33 |
2006 | 34,132 | 22,876 | 72.58 | 35,107,599 | 22,853,570 | 73.92 |
2007 | 34,008 | 22,879 | 65.66 | 36,832,798 | 23,556,048 | 68.17 |
2008 | 38,450 | 23,333 | 65.00 | 38,050,505 | 24,521,005 | 66.92 |
2009 | 39,930 | 24,567 | 65.83 | 38,033,371 | 24,051,245 | 67.92 |
2010 | 39,191 | 25,710 | 61.41 | 39,005,338 | 25,069,373 | 62.83 |
2011 | 39,504 | 25,943 | 64.58 | 34,759,106 | 22,605,961 | 66.42 |
2012 | 39,982 | 25,553 | 65.91 | 39,085,744 | 25,954,771 | 67.67 |
2013 | 37,328 | 22,648 | 65.25 | 41,030,777 | 26,227,484 | 67.00 |
2014 | 40,461 | 24,009 | 66.58 | 40,706,401 | 26,957,918 | 68.75 |
2015 | 40,153 | 24,633 | 68.16 | — | — | — |
^a NHIS: National Health Interview Survey.
^b BRFSS: Behavioral Risk Factor Surveillance System.
^c Values slightly vary due to the different time frame: 2004 to 2015 for NHIS and 2004 to 2014 for BRFSS.
To further explore the relationships between online searches and asthma prevalence in the United States, data on the yearly cases of lifetime and current asthma for all ages from the NHIS prevalence data from 2004 to 2015 [
The Pearson correlations of the annual NHIS prevalence data with the annual averages of the normalized Google Trends data from 2004 to 2015 show high correlations between lifetime asthma (
Although statistically significant, all Pearson correlations are negative, and lag analysis should be employed to identify the time interval of response between asthma online interest and case reporting or vice versa. Although Google Trends data for the term “asthma” in the United States over the examined period are monthly, the data on lifetime and current asthma are yearly; until weekly or monthly data are available, further analysis cannot be done.
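The sign and significance of the annual correlations can be checked directly from the NHIS columns of the prevalence table. The sketch below is an illustration only: it assumes a standard two-sided t test of the correlation coefficient with n − 2 degrees of freedom, which may differ from the exact procedure used in the study, and the function names are hypothetical.

```python
from math import sqrt

# Annual values copied from the NHIS columns of the prevalence table, 2004 to 2015.
LIFETIME = [30189, 32621, 34132, 34008, 38450, 39930,
            39191, 39504, 39982, 37328, 40461, 40153]
CURRENT = [20545, 22227, 22876, 22879, 23333, 24567,
           25710, 25943, 25553, 22648, 24009, 24633]
HITS = [81.41, 79.58, 72.58, 65.66, 65.00, 65.83,
        61.41, 64.58, 65.91, 65.25, 66.58, 68.16]

# Two-sided 5% critical value of Student's t with 10 degrees of freedom (n = 12).
T_CRITICAL_DF10 = 2.228


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


if __name__ == "__main__":
    for label, cases in (("lifetime", LIFETIME), ("current", CURRENT)):
        r = pearson(cases, HITS)
        t = t_statistic(r, len(HITS))
        verdict = "significant" if abs(t) > T_CRITICAL_DF10 else "not significant"
        print(f"{label}: r={r:.2f}, t={t:.2f}, {verdict} at the 5% level")
```

With only 12 annual observations, such a test says nothing about timing; it cannot distinguish a lagged response from a shared long-term trend, which is why shorter-interval case data would be needed for lag analysis.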
To show that nowcasting asthma prevalence in the United States using Google queries is feasible, the forecasting methodology was also applied to each of the 50 states and the District of Columbia, where it exhibits good forecasting results.
According to the results, online interest in Alaska, Nebraska, New Hampshire, Oklahoma, and Tennessee exhibits increasing forecast trends from 2016 to 2020. On the contrary, online interest in Delaware, Kansas, Oregon, and Virginia exhibits decreasing forecast trends from 2016 to 2020. Overall, the states of Arizona, California, Connecticut, Florida, Georgia, Illinois, Indiana, Maryland, Michigan, Missouri, New Jersey, New York, North Carolina, Pennsylvania, Texas, and Washington show high interest in the term “asthma” throughout the examined period, while in Hawaii and Wyoming, interest is low. Virginia is the only state where online interest exhibits very significant variations from 2004 to 2016.
Our study indicates that analysis of online behavior toward asthma by state can assist with nowcasting asthma prevalence. Since search queries and reporting of asthma are shown to correlate in the United States, if short-interval data (eg, weekly or monthly) were made available, a robust nowcasting model could be developed.
Google Trends (2004 to 2015) versus forecasts (2005 to 2020) in California.
Google Trends (2004 to 2015) versus forecasts (2005 to 2020) in Texas.
Google Trends (2004 to 2015) versus forecasts (2005 to 2020) in Florida.
Google Trends (2004 to 2015) versus forecasts (2005 to 2020) in New York.
In addressing integration of smart health into smart city management, monitoring of search traffic data could be useful in predictions and nowcastings, as has also been suggested by previous work on the subject. This study shows that online interest can be predicted nationally and by state. Therefore, governments, policy makers, and health care officials have the ability to use these data to better address the responsiveness of the US health care system at national, regional, state, or even city level in order to nowcast asthma prevalence. Google Trends also provides detailed regional US data, and this method can be applied in other countries as well.
Empirical relationships between Google Trends and human behavior have been suggested; therefore, nowcasting asthma prevalence in the United States is possible using online search traffic data, subject to availability of daily, weekly, or monthly data. In this study, it was shown that online search traffic data are highly correlated between each 2 years during the examined period and that Google Trends data are correlated with reported cases of lifetime and current asthma in the United States from 2004 to 2015.
After analyzing changes in online interest in the United States over the examined period, the next step was to identify any seasonal similarities between each 2 years’ (monthly) search queries. As the hits between each 2 years from 2004 to 2015 on the term “asthma” were highly correlated, the seasonal effect was evident; using Holt-Winters exponential smoothing, 5-year forecasts for online interest in the term from 2016 to 2020 nationally and in each state were estimated. Validated against available data from January 2016 to June 2017, the forecasts were well fitted and accurately approximated the actual Google Trends data for the same period, suggesting seasonal behavioral changes over the course of a year can be accurately predicted using the proposed method. Google Trends data are correlated with reported cases of lifetime and current asthma, and thus nowcasting asthma prevalence in the United States is suggested to be possible using online search traffic data. As the calculated correlations are negative at this point and there is a lag between internet queries and asthma reporting and vice versa, short-interval data (eg, monthly, weekly, and daily—not available at this point) are required in order to identify said lag.
This study has limitations. It cannot be assumed that each hit corresponds to an asthma case and vice versa because hits could also be attributed to academic or research reasons or general interest in the subject, and they could be influenced by news reports or social media. Queries related to asthma could also be influenced by factors such as changes in health insurance and weather or environmental conditions that trigger similar symptoms. This is a general limitation when examining online queries, despite the empirical relationships that have been shown to exist between Google Trends and health data.
The sample is not representative, although as internet penetration increases, so does the possibility of higher volumes of online queries being related to asthma cases. Additionally, nowcasting asthma prevalence using online search queries is not possible at this point because the available data on reported lifetime and current asthma are yearly. If monthly, weekly, or daily data on past asthma prevalence were available and the correlations between search traffic data and reported asthma were validated, the possibility of nowcasting asthma could be further explored.
This study has not accounted for state-by-state confounders that could influence search patterns, such as the socioeconomic status and demographics of different states that might be relevant to asthma prevalence, as this exceeds the scope of this paper. The latter, along with the impact of socioeconomic and cultural differences on asthma reporting and online search patterns, are of interest for further investigation. In addition, more search terms related to asthma symptoms such as “breathlessness” and “wheezing” could be included in future research on asthma monitoring in the United States.
The findings of this study support previous work on the subject and highlight the value of online data in health and medical informatics. Google Trends data have been shown to be useful and valuable in the monitoring, surveillance, or prediction of epidemics and outbreaks [
Internet behavior can be measured by infodemiology metrics as information patterns and population health are related [
State data tables.
Google Trends (2004 to 2015) versus forecasts (2005 to 2020) by state.
BRFSS: Behavioral Risk Factor Surveillance System
CDC: Centers for Disease Control and Prevention
NHIS: National Health Interview Survey
None declared.