The Causality Inference of Public Interest in Restaurants and Bars on COVID-19 Daily Cases in the US: A Google Trends Analysis

The COVID-19 coronavirus pandemic has affected virtually every region of the globe. At the time of conducting this study, the number of daily cases in the United States is more than any other country, and the trend is increasing in most of its states. Google trends provide public interest in various topics during different periods. Analyzing these trends using data mining methods might provide useful insights and observations regarding the COVID-19 outbreak. The objective of this study was to consider the predictive ability of different search terms (i.e., bars and restaurants) with regards to the increase of daily cases in the US. We considered the causation of two different search query trends, namely restaurant and bars, on daily positive cases in top-10 states/territories of the United States with the highest and lowest daily new positive cases. In addition, to measure the linear relation of different trends, we used Pearson correlation. Our results showed for states/territories with higher numbers of daily cases, the historical trends in search queries related to bars and restaurants, which mainly happened after re-opening, significantly affect the daily new cases, on average. California, for example, had most searches for restaurants on June 7th, 2020, which affected the number of new cases within two weeks after the peak with the P-value of .004 for Granger's causality test. Although a limited number of search queries were considered, Google search trends for restaurants and bars showed a significant effect on daily new cases for regions with higher numbers of daily new cases in the United States. We showed that such influential search trends could be used as additional information for prediction tasks in new cases of each region. This prediction can help healthcare leaders manage and control the impact of COVID-19 outbreaks on society and be prepared for the outcomes.

Due to the rapid spreading of this virus, finding effective reasons can play a significant role in prevention policies. Using data mining and time series analysis methods, it is possible to investigate the impact of different phenomena on time series data. In economics, as an example, there are different studies that model the temporal relationship of two or more time series (e.g., the relationship between oil and gold price) using the same methods [4].
Google search trends can be useful for reflecting public interests/concerns during different periods [5]. During the COVID-19 outbreak, different studies have investigated the correlation of web-based data and cases of this virus. Kutlu et al. [6] investigated the correlation of dermatological diseases obtained by specific Google search trends with the COVID-19 outbreak. In addition, Google trends have been utilized to predict and monitor COVID-19 cases around the world [5], [7], [8], [9], [10], [11]. Multiple studies analyzed the data related to the US to correlate the search trends and COVID-19 cases [12], [13], [14], [15], [16], [17]. However, these studies did not consider the predictive ability of search trends on future confirmed cases.
In this paper, we considered the causality effect and predictive ability of search terms related to bars and restaurants on the daily new cases of the US in different regions. Along with the linear correlation analysis between search trends and COVID-19 cases, we have utilized the statistical causality methods to investigate the influential confidence of these methods on COVID-19 daily new cases.

A. Datasets
For our analysis, we obtained the daily cases of COVID-19 in the US using the COVID tracker project [18]. This project compiles the daily statistics, including the number of positive/negative tests, hospitalization, available ventilators, and the number of deaths from each US state and territory. For this study, we considered the data of approximately three months starting April 09, 2020, to July 07, 2020, which contains 5040 samples for 56 states/territories. We used Google Trends to obtain the public interest in bars and restaurant categories with daily resolution. We used the most popular query for each category from April 09, 2020, to July 07, 2020, for 45 available regions in Google trends API. For restaurants and bars, we chose "dine-in restaurants that are open near me" and "bars near me", respectively. Google trend does not provide the number of queries per day. Instead, it provides a normalized number between 0 and 100, where 0 refers to "low volume of data for the query" while 100 refers to the "highest popularity for the term" [19]. To be consistent with Google trend values, we normalized the US daily new cases between 0 and 100 in our analysis.
Aggregating data from Google trends results and COVID-19 daily cases, and removing missing values, resulted in available data for 45 regions in the US. We categorized our analysis to two different groups: First, top-10 states/territories with the highest number of daily new cases as of July 7th, 2020 which consist of Texas (

B. Correlation and Causation
To analyze the linear correlation of two time-series, the Pearson correlation has been utilized. The value of such a correlation ranges from -1 to 1, which shows a negative and positive correlation, respectively. Our analysis measured the Pearson correlation between the trends of search queries (i.e., restaurants and bars) and the daily new cases of COVID-19 in each state.
In addition, we used Granger's causality [20] to model the influence of a time series' past values on the new values of another time series. Granger's causality tests whether the past values of a time series X cause the current values of another time series Y. Hence, in this study, the null hypothesis is that X's past values do not affect Y's current values. If the P-value is less than the marginal value (.05), we can reject the null hypothesis. In our analysis, we reported P-values for each aforementioned search query's influence on the daily new cases. One of the main assumptions of modeling the influence of time series on each other is their stationarity. To test such a characteristic, we used the Augmented Dickey-Fuller (ADF) test [21] as our unit root test. This test determines the effect of a trend in the creation of the time series. In other words, it determines how strongly a trend defines a time series. The alternative hypothesis in the ADF test is the stationarity of the time series.
In this study, since the time series were not stationary, we applied first differencing on search trends and second differencing on daily new cases to make all of the three series stationary. For statistical analysis, we used the Python Statsmodel package [22].

C. Vector Autoregression
In our study, we leveraged the fact that search trends might impact the daily new cases in the future; hence a Vector Autoregression (VAR) [23] model for each region was fitted to the data. A VAR model takes into account the influence of the past values of time series X and Y on current values of time series Y with a given lag order. Lag order with the lowest Akaike's Information Criterion (AIC) was picked in this study. Since symptoms may appear within 2-14 days after exposure to the COVID-19 virus [24], a maximum of 14 lags was used. The equation for the VAR model with two lags is summarized below: In this model, Y t represents the value of time series Y at time t, which consists of a combination of previous lag values from Y and X with different weights β, β and random white noise, t . We fitted a VAR model to perform Granger's causality test.

D. Long Short-Term Memory
Long Short-Term Memory (LSTM) [25] models are a type of recurrent neural network useful for time series prediction. These models capture the long term effect of time series as well as their most recent values. In this study, we utilized LSTMs to predict the daily new cases using two sets of features: 1) the historical values of the new cases time series and 2) using additional information from searching query time series. We used 70% of the data for training, and the rest were used for evaluation of the model. Root mean square error (RMSE) was selected as the performance metric. RMSE can be calculated as follows: In this equation, N is the number of samples,Ŷ is the predicted value, and Y is the actual value of the time series. The architecture of the used model is illustrated in Fig. 1. It consists of 3 LSTM layers along with dropout layers, and a fully connected layer at the end. Dropout layers were utilized to avoid overfitting, which is a typical problem in Machine Learning tasks. To train such a model, we used the TensorFlow package of Python.

III. RESULTS
Observation. Search trends and the daily new cases can be different in each state/territory. Hence, the significance level of influence of search-related time series on the current values of daily new cases is different in each region. For the sake of comparison, Fig. 2 illustrates the moving average trend of "bar and restaurant" searches as well as the daily new cases in California (CA) and Delaware (DE). Granger's Causality Test. Due to the inclusion of a large number of states/territories in our analysis (45), we picked top-10 regions in the US with the highest and lowest daily new cases as of July 7th, 2020. Table I summarizes the P-values for testing the null hypothesis (the coefficients corresponding to past values of the second time series are zero) for the first group. P-values below .05 represents the rejection of the null hypothesis, which shows the effect of searching queries on daily new cases for each region.
Based on Table I, California has small P-values, which shows the influence of the aforementioned search queries on daily new cases; hence, they can be used to predict daily new cases. Florida and North Carolina are two examples of states that the effect of restaurants is rejected with the Granger's Causality Test; however, Louisiana clearly is affected by restaurant searches. Fig. 3 illustrates the moving average of daily new cases and restaurant search trends for these three states. Based on Fig. 3, the high P-value for Florida is because of the first peak in the restaurant search, which did not change the daily new cases trends. North Carolina has an overall increasing trend, causing the effect of the search to be marginal. However, Louisiana is influenced by the sudden changes in restaurant search trends, which clearly have affected the daily new cases.
Similarly, Table II summarizes the P-values for Granger's causality test for the second group (i.e., top-10 regions with the lowest daily new cases).
Comparison between Tables I and II shows that regions with higher daily cases are more affected by restaurant and bar searches on average.  Pearson Correlation To show the linear relationship of time series, the Pearson correlation is utilized. Tables III and IV summarize these correlations with corresponding P-values for each group. Based on these two tables, the linear correlation between the search trends related to bars/restaurants and daily new cases in regions with a higher number of daily cases is more substantial, on average, compared to regions with lower daily cases.
New Cases Prediction. We used LSTM models to predict the value of daily new cases for a given region. We utilized the search trend time series as additional information to adjust the predicted values. The RMSE scores for test data for top-10 highest and lowest daily new cases are summarized in Tables  V and VI. These tables show the results for: 1) the baseline model which uses only the past values of new cases time series for the prediction, 2) the model that uses the past values of restaurant searches along with the past values of new cases time series, 3) and, finally, the model that combines the information from the daily cases and bar searches time series. Table V shows that regions with a significant causality effect, the RMSE improves on average. CA is an example of such an improvement. Table VI shows for some states, although there is no causality effect for the restaurants, the value of RMSE improves. On the other hand, for states like Montana (MT), which Granger's Causality Test shows a significant effect, the RMSE has been increased. By investigating the time series for these two states (Figs. 4 and 5), we can interpret such inconsistencies for two reasons. First, for states such as Kansas (KS), the improved value is because of the fluctuation in the new cases time series, making the prediction unreliable. Second, as Figs. 4 and 5 show, the impulses in restaurant searches for KS and MT are point impulses. These unit jumps cannot improve the prediction of the time series, although they appear in causality tests.

A. Principal Results
To the best of our knowledge, this study is the first analysis that considers the predictive ability of Google search trends, namely restaurants and bars, on daily new cases of COVID-19 in the US. This study uses statistical methods to validate such an effect on daily new cases.
Granger's causality test shows that in some regions, the effect of restaurants on daily new cases is significant. California is an example of such states. On May 18th, the governor of California announced the easing of criteria for counties to reopen faster than the state, and on May 25th he announced plans for the re-opening of in-store shopping [26]. Consequently, there was an increase in restaurant searches, and the peak of the searches happened on June 7th. The daily new cases drastically increased within two weeks of the escalation in dine-in restaurant searches.
Similarly, such a trend in bar searches happened in California (Fig. 2). Regardless of the seasonal effect of time series, which shows a higher number of searches for bars during weekends, the average trend in bar searches increased. However, North Carolina is not influenced by restaurant searches (Table I). The reason is that this state has an increasing average trend regardless of the other time series (Fig. 3). Therefore the P-value for Granger's causality is high (.53).
This study suggests that the effect of restaurant and bar searches is higher in the regions with higher daily new cases compared to the regions that have a lower number of positive cases reporting every day. On average, in the regions with a higher number of daily new cases, more significant Granger's casualties and higher values of Pearson correlation support this fact.
We used artificial intelligence models to improve the prediction results of new cases using additional information, namely Google trends. These Google trends for restaurants and bars can be useful depending on the time series structure. Prediction in time series uses the information of previous values (lags) to estimate the current values.

B. Limitations
There are several limitations to this study. We only used the most popular search queries suggested by Google for each  category. People use different search terms to find the information they are looking for. Moreover, we only considered the effect of restaurants and bars on daily cases. The other limitation of our study can be the limited number of samples for each region (88 samples on average). This limitation affects the prediction results to a certain degree.

V. CONCLUSION
In conclusion, we investigated the causality effect of search queries related to restaurants and bars on daily new cases in the US regions with high and low daily cases. We showed that for most of the regions with a high number of daily new cases, the effect of search queries on bars and restaurants is higher; hence, they can be used as additional information for prediction tasks.