This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.
The SARS-CoV-2 virus and its variants pose extraordinary challenges for public health worldwide. Timely and accurate forecasting of the COVID-19 epidemic is key to sustaining interventions and policies and to allocating resources efficiently. Internet-based data sources have shown great potential to supplement traditional infectious disease surveillance, and combining different Internet-based data sources has been shown to improve epidemic forecasting accuracy more than using a single Internet-based data source. However, existing methods that incorporate multiple Internet-based data sources used only real-time data from these sources as exogenous inputs and did not take the historical data into account. Moreover, the predictive power of different Internet-based data sources in providing early warning for COVID-19 outbreaks has not been fully explored.
The main aim of our study is to explore whether combining real-time and historical data from multiple Internet-based sources could improve the COVID-19 forecasting accuracy over the existing baseline models. A secondary aim is to explore the COVID-19 forecasting timeliness based on different Internet-based data sources.
We first used core terms and symptom-related keyword-based methods to extract COVID-19–related Internet-based data from December 21, 2019, to February 29, 2020. The Internet-based data we explored included 90,493,912 online news articles, 37,401,900 microblogs, and all the Baidu search query data during that period. We then proposed an autoregressive model with exogenous inputs, incorporating real-time and historical data from multiple Internet-based sources. Our proposed model was compared with baseline models, and all the models were tested during the first wave of COVID-19 epidemics in Hubei province and the rest of mainland China separately. We also used lagged Pearson correlations for COVID-19 forecasting timeliness analysis.
Our proposed model achieved the highest accuracy in all 5 accuracy measures, compared with all the baseline models of both Hubei province and the rest of mainland China. In mainland China, except for Hubei, the COVID-19 epidemic forecasting accuracy differences between our proposed model (model i) and all the other baseline models were statistically significant (model 1, t198=–8.722,
Our approach incorporating real-time and historical data from multiple Internet-based sources could improve forecasting accuracy for epidemics of COVID-19 and its variants, which may help improve public health agencies' interventions and resource allocation in mitigating and controlling new waves of COVID-19 or other relevant epidemics.
COVID-19 poses extraordinary challenges for public health systems worldwide. As of November 26, 2021, COVID-19 had affected 222 countries and territories [
Internet-based data sources, such as social media data (like microblogs), online news article data, and search query data, accumulate huge amounts of data all the time and have been proven to be an effective supplement to traditional infectious disease surveillance systems [
As COVID-19 has been and continues to be the most consequential infectious disease worldwide in this century, many researchers have used various Internet-based data sources to supplement COVID-19 surveillance [
As for improving COVID-19 forecasting timeliness, Yuan et al [
Our study explored whether combining real-time and historical data from multiple Internet-based sources could improve COVID-19 forecasting accuracy over the existing baseline models. We also compared COVID-19 forecasting timeliness based on different Internet-based data sources.
We focused on the first wave of the COVID-19 epidemic in mainland China and compiled data on daily new confirmed COVID-19 case counts, online news articles, microblogs, and search queries from various sources. Following a previous study [
Daily new confirmed COVID-19 case counts were collected from the Chinese Center for Disease Control and Prevention (China CDC) website [
We first described the Internet-based data we retrieved and the COVID-19–related data we extracted. We then summarized all the COVID-19 forecasting-related data in 1 figure, including the fraction of online news articles and microblogs, search query counts, and lab-confirmed new case counts in mainland China, except Hubei, and Hubei province. All the data were normalized into an interval of 0 to 100 for better comparison. The figures aimed to show the Internet-based data sources’ potential to provide warnings for COVID-19 epidemics.
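The rescaling to the 0 to 100 interval mentioned above can be illustrated with a short sketch. It assumes plain min-max scaling, which the text does not state explicitly; the function name `scale_to_0_100` and the sample values are hypothetical.

```python
def scale_to_0_100(series):
    """Linearly rescale a sequence so its minimum maps to 0 and its maximum to 100.

    Assumption: min-max scaling; the article only says the data were
    normalized into an interval of 0 to 100.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series has no spread; map every point to 0 by convention.
        return [0.0 for _ in series]
    return [100.0 * (x - lo) / (hi - lo) for x in series]


# Made-up daily values, for illustration only.
daily_cases = [12, 40, 95, 180, 150]
print(scale_to_0_100(daily_cases))
```

Scaling each series separately makes the shapes and turning points comparable on one axis, but it discards absolute magnitudes, so the scaled curves should not be read as counts.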
We also conducted lagged Pearson correlation analyses to evaluate the strength of relationships between different Internet-based data sources and daily new confirmed COVID-19 case counts. The max time lag explored was 20 days [
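A lagged Pearson correlation analysis of this kind might be sketched as follows. This is an illustrative reading rather than the authors' code: the Internet-based signal on day t is paired with case counts on day t + lag, for lags from 0 up to a maximum (20 days in the text), and the lag with the highest coefficient is reported. All function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a constant series
    return sxy / math.sqrt(sxx * syy)


def lagged_correlations(signal, cases, max_lag=20):
    """Correlate signal[t] with cases[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = signal[: len(signal) - lag]
        y = cases[lag:]
        if len(x) < 3:
            break  # too few overlapping days to correlate
        result[lag] = pearson(x, y)
    return result


def best_lag(signal, cases, max_lag=20):
    """Return (lag, r) with the highest correlation, ignoring undefined values."""
    valid = {k: v for k, v in lagged_correlations(signal, cases, max_lag).items()
             if not math.isnan(v)}
    return max(valid.items(), key=lambda kv: kv[1])


# Toy example: the case curve is the signal delayed by 3 days.
signal = [0, 1, 4, 9, 16, 25, 36, 30, 20, 10, 5, 2]
cases = [0, 0, 0] + signal[:-3]
print(best_lag(signal, cases, max_lag=5))
```

A positive best lag means the Internet-based signal moves ahead of the confirmed case counts by that many days, which is how the "days earlier" values in the correlation table can be read.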
Following previous infectious disease surveillance research [
We proposed our autoregressive model with exogenous inputs, denoted as
Incorporating the real-time and historical data from online news articles, microblogs, and search query volume:
Where
We considered 5 baseline models, including (1) AR(
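To make the structure of an autoregressive model with exogenous inputs concrete, the sketch below fits one by ordinary least squares. It is a minimal illustration under stated assumptions, not the authors' implementation: the number in parentheses after each source (for example, Mblog(10)) is read here as the count of most recent daily values of that source used as regressors, the current-day value is included, and all function names and data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design: regressors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def design_row(y, exog, t, p, q):
    """Regressors for day t: intercept, p lags of y, then the q[name] most
    recent values (days t, t-1, ...) of each exogenous series (assumption)."""
    row = [1.0] + [y[t - i] for i in range(1, p + 1)]
    for name, lags in q.items():
        row += [exog[name][t - j] for j in range(lags)]
    return row


def fit_arx(y, exog, p, q):
    """Least-squares coefficients via the normal equations X'X beta = X'y."""
    start = max([p] + [lags - 1 for lags in q.values()])
    rows = [design_row(y, exog, t, p, q) for t in range(start, len(y))]
    targets = [y[t] for t in range(start, len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, targets)) for i in range(k)]
    return solve_linear(xtx, xty)


def predict(beta, row):
    """Estimate for one day from its regressor row."""
    return sum(b * v for b, v in zip(beta, row))


# Synthetic data generated from y[t] = 1 + 0.5*y[t-1] + 2*news[t].
news = [float((t * 7) % 5) for t in range(30)]
y = [0.0]
for t in range(1, 30):
    y.append(1.0 + 0.5 * y[t - 1] + 2.0 * news[t])
beta = fit_arx(y, {"news": news}, p=1, q={"news": 1})
print([round(b, 6) for b in beta])
```

Under this reading, a baseline such as AR(7)+News(1) uses only the current value of the news series, whereas a larger lag count adds historical values of the same source to the regressors.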
Retrospective estimates of the daily proportion of confirmed COVID-19 counts were produced with the proposed model and the baseline models. The estimation period for mainland China, except Hubei, was January 19, 2020, to February 29, 2020. For Hubei province, official laboratory-confirmed COVID-19 case counts are available from January 10, 2020, onward, but laboratory testing capacity was severely limited at the beginning of this unexpected epidemic. Specifically, thousands of suspected COVID-19 cases could not be confirmed before January 27, 2020, because of insufficient testing capacity, and the daily testing capacity in Hubei had to be expanded 10-fold on January 27, 2020, to address this issue [
We used the variance inflation factor (VIF) to measure multicollinearity in the independent variables. A VIF over 4 indicates a moderate level of multicollinearity, and a VIF exceeding 10 shows severe multicollinearity [
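As an illustration of the VIF check, the sketch below computes variance inflation factors as the diagonal of the inverse correlation matrix of the predictors, which is equivalent to 1/(1 - R²) from regressing each predictor on the others. The function names and toy data are hypothetical, and the code assumes that no predictor is constant.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(k)] for i in range(k)]


def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Variance inflation factor of each predictor column."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Two toy predictors with correlation 0.8, so each VIF is 1 / (1 - 0.64).
a = [1, 2, 3, 4, 5]
b = [2, 1, 4, 3, 5]
print(vif([a, b]))
```

With the thresholds quoted above, values near 1 indicate little shared variance among the sources, values above 4 suggest moderate multicollinearity, and values above 10 suggest severe multicollinearity.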
Overall, we extracted 608,335 (out of 75,431,068) and 123,955 (out of 15,062,844) COVID-19–related online news articles for mainland China, except Hubei, and Hubei province, respectively. Unofficial online news articles accounted for about 92.8% (83,966,946/90,493,912) of all the news articles traced. We also identified 476,932 (out of 32,475,162) and 191,296 (out of 4,926,738) COVID-19–related microblogs posted in mainland China, except Hubei, and Hubei province, respectively. For the COVID-19–related search queries, we retrieved 24,165,139 queries in mainland China, except Hubei, and 988,402 related queries in Hubei province. The daily new confirmed COVID-19 case counts, the fraction of COVID-19–related online news articles, the fraction of COVID-19–related microblogs, and COVID-19–related search query counts are displayed in
Lagged Pearson correlation analyses between different Internet-based data sources and daily new confirmed COVID-19 case counts were also conducted to illustrate the predictive power. The highest correlations for different sources with different time lags are summarized in
Daily time series of new confirmed COVID-19 case counts (NC), the fraction of COVID-19–related microblogs (Mblog), the fraction of COVID-19–related online news articles (News), and numbers of COVID-19–related search queries with the keywords “fever,” “dry cough,” “chest distress,” “pneumonia,” or “coronavirus” in mainland China, except Hubei province, from December 21, 2019, to February 29, 2020.
Daily time series of new confirmed COVID-19 case counts (NC), the fraction of COVID-19–related microblogs (Mblog), the fraction of COVID-19–related online news articles (News), and numbers of COVID-19–related search queries with the keywords “fever,” “dry cough,” “chest distress,” “pneumonia,” or “coronavirus” in Hubei province from December 21, 2019, to February 29, 2020.
Strongest correlation coefficients,
Source | Outside Hubei: highest correlation | Outside Hubei: P value | Outside Hubei: days earlier | Hubei: highest correlation | Hubei: P value | Hubei: days earlier
News articles | 0.619 | <.001 | 2 | 0.667 | <.001 | 14
Microblogs | 0.613 | <.001 | 2 | 0.632 | <.001 | 7
Search for “fever” | 0.949 | <.001 | 4 | 0.826 | <.001 | 12
Search for “dry cough” | 0.831 | <.001 | 6 | 0.775 | <.001 | 12
Search for “chest distress” | 0.867 | <.001 | 3 | 0.806 | <.001 | 10
Search for “pneumonia” | 0.854 | <.001 | 5 | 0.750 | <.001 | 11
Search for “coronavirus” | 0.831 | <.001 | 6 | 0.765 | <.001 | 12
The forecasting results for our proposed model and baseline models are presented in
COVID-19 epidemic forecasting model comparison for mainland China, except Hubei, between January 19, 2020, and February 29, 2020.
Model (lag) | Model number | RMSEa | MAEb | MAPEc | Correlation | Incremental correlation | t198 | P value |
AR(7)+News(1)+ Mblog(10)+Query(1) | model i | 87.461 | 47.780 | 0.154 | 0.960 | 0.435 | N/Ad | N/A |
AR(7) | model 1 | 152.182 | 97.852 | 0.579 | 0.852 | 0.006 | –8.722 | <.001 |
AR(7)+News(1) | model 2 | 117.223 | 68.158 | 0.374 | 0.911 | 0.066 | –5.000 | <.001 |
AR(7)+Mblog(10) | model 3 | 93.754 | 51.375 | 0.185 | 0.948 | 0.403 | –1.882 | .06 |
AR(7)+Query(1) | model 4 | 138.724 | 85.024 | 0.421 | 0.905 | 0.168 | –4.644 | <.001 |
AR(7)+News(1)+ Mblog(1)+Query(1) | model 5 | 90.494 | 53.332 | 0.306 | 0.954 | 0.167 | –4.488 | <.001 |
aRMSE: root-mean-square error.
bMAE: mean absolute error.
cMAPE: mean absolute percentage error.
dN/A: not applicable.
COVID-19 epidemic forecasting model comparison for Hubei province, China, between January 27, 2020, and February 29, 2020.
Model (lag) | Model number | RMSEa | MAEb | MAPEc | Correlation | Incremental correlation | t198 | P value |
AR(1)+News(3)+ Mblog(1)+Query(3) | model i | 325.216 | 225.620 | 0.168 | 0.990 | 0.984 | N/Ad | N/A |
AR(1) | model 1 | 658.238 | 403.665 | 0.267 | 0.963 | 0.958 | –1.732 | .09 |
AR(1)+News(2) | model 2 | 488.974 | 325.731 | 0.226 | 0.978 | 0.976 | –1.196 | .24 |
AR(1)+Mblog(1) | model 3 | 431.457 | 311.196 | 0.228 | 0.983 | 0.977 | –0.252 | .80 |
AR(1)+Query(3) | model 4 | 437.368 | 286.900 | 0.201 | 0.983 | 0.976 | –0.364 | .72 |
AR(1)+News(1)+ Mblog(1)+Query(1) | model 5 | 360.725 | 272.602 | 0.206 | 0.988 | 0.981 | –0.965 | .34 |
aRMSE: root-mean-square error.
bMAE: mean absolute error.
cMAPE: mean absolute percentage error.
dN/A: not applicable.
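The three error measures reported in the tables above can be computed with the standard definitions sketched below. The sketch assumes that MAPE is expressed as a fraction rather than a percentage, which is consistent with the magnitudes in the tables; the function names and sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / n


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; actual values must be nonzero."""
    n = len(actual)
    return sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / n


# Made-up daily counts and estimates, for illustration only.
actual = [100, 200, 400]
predicted = [110, 180, 400]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

RMSE penalizes large daily misses more heavily than MAE, while MAPE expresses the error relative to the size of the daily count, which is why the three measures can rank models differently.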
The results from the 5 accuracy measures were interpreted. The results in
We then assessed the statistical significance of the forecasting accuracy improvement between different models based on paired
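A paired comparison of two models' errors might look like the following sketch. It assumes that the paired quantities are the per-day absolute errors of the proposed model and of one baseline; the text does not spell this out, so treat it as one plausible reading. The function name and sample errors are hypothetical.

```python
import math


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic and degrees of freedom for two matched error series.

    A negative statistic means the first model's errors are smaller on average.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(var / n), n - 1


# Made-up absolute errors for a proposed model and a baseline on the same days.
proposed = [1, 2, 3, 4]
baseline = [2, 2, 5, 5]
t, df = paired_t_statistic(proposed, baseline)
print(t, df)
# A P value can then be read from a t distribution with df degrees of freedom,
# for example with scipy.stats.ttest_rel(proposed, baseline) if SciPy is available.
```

The sign convention here matches the tables, where negative t values accompany baselines with larger errors than the proposed model.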
We also evaluated the practical significance of the forecasting models from the perspective of MAPE. For provinces outside Hubei of mainland China in
The collinearity diagnostics revealed that real-time social media data, online news articles, and search queries are independent of each other in supplementing COVID-19 surveillance. More detailed results and discussions are presented in
(A) Forecasting results for mainland China, except Hubei, between January 19, 2020 and February 29, 2020, during which the daily estimations of our proposed model and baseline models were compared against the daily new confirmed COVID-19 case counts (NC), and (B) the estimation error, defined as the estimated value minus the daily new confirmed COVID-19 case counts.
(A) Forecasting results for Hubei province between January 27, 2020 and February 29, 2020, during which the daily estimations of our proposed model and baseline models were compared against the daily new confirmed COVID-19 case counts (NC), and (B) the estimation error, defined as the estimated value minus the daily new confirmed COVID-19 case counts.
The SARS-CoV-2 virus and its variants pose extraordinary challenges for public health systems worldwide. More accurate forecasting of COVID-19 epidemics is key to improving the efficiency of resource allocation and the implementation of intervention policies [
This study also explored COVID-19 forecasting timeliness using different Internet-based data sources. Unlike previous studies that mainly focused on official online news articles, our study also took into account unofficial online news articles, which accounted for about 92.8% of all online news articles. The results show that COVID-19–related online news articles could provide a warning for the COVID-19 epidemic about 2 days earlier in mainland China, except Hubei, and about 12 days to 14 days earlier in Hubei. A similar early warning ability was also shown for microblogs and search queries. The early warning lead time differed markedly between mainland China, except Hubei, and Hubei province, for which there are 2 possible reasons. First, Hubei experienced an extreme shortage of testing capacity in the beginning [
Our study innovatively proposes core terms and symptom-related keyword-based approaches to extract COVID-19–related Internet-based data sources. The keyword-based approaches allow us to constantly and conveniently update the core terms and symptoms to keep up with the mutation of the COVID-19 virus. For example, people infected with the Delta variant are more likely to have a “runny nose,” “headache,” or “sore throat” and less likely to experience “loss of smell” [
Another interesting finding of our study is that daily new confirmed case counts peaked in Hubei on February 4, 2020, whereas they peaked in the rest of mainland China on January 30, 2020 (5 days earlier than in Hubei province). This finding is counterintuitive because Hubei was the epicenter of the initial outbreak and the rest of mainland China was affected later. One possible reason for the delayed COVID-19 epidemic peak in Hubei is the extreme shortage of medical resources at the beginning of the epidemic, including testing capacity and hospital beds [
Overall, the results show that incorporating both real-time and historical data from multiple Internet-based sources into the COVID-19 forecasting model could significantly improve the forecasting accuracy, compared with other baseline models. Internet-based data sources, including online news articles, microblogs, and search queries, could provide early warning for COVID-19 outbreaks. These findings have broad public health implications. Internet-based data are timely, low-cost, and rich in information, making them critical in the surveillance of COVID-19 outbreaks. This application is even more important in rural areas, where the health infrastructure does not allow for widespread screening. COVID-19 surveillance using Internet-based data could provide much-needed information to help the government trace the outbreak and more effectively allocate resources, including testing capacity, oxygen cylinders, and hospital beds. Internet-based platforms allow users to capture detailed real-time snapshots of COVID-19–related events that happen to them or near them. As the COVID-19 virus continues to mutate, Internet-based sources with richer information have the potential to identify novel COVID-19 variants through deeper information analysis.
This study has several limitations, which also suggest directions for future work. First, our study only used retrospective data from mainland China and did not test the proposed model in countries that are currently experiencing an epidemic of COVID-19 and its variants. This is mainly because of data accessibility. We could not find available databases or online platforms that allowed us to access a large volume of real-time and historical microblogs and unofficial online news articles in other countries. We encourage future work to use the proposed method in different countries to test its generalizability and robustness.
Second, our study did not incorporate machine learning methods in the data filtering process. In this study, we explored the full database of Internet-based sources in mainland China from the SNOSS and Baidu Search Index, where the raw data are not available for downloading and further analysis. Future research could apply advanced machine learning methods to the raw data of various Internet-based sources to achieve more accurate epidemic-related data extraction and deeper information analyses. For example, future research can use the support vector machine to help extract COVID-19–related online data [
Finally, our study mainly used symptom- and core term–related keywords to extract COVID-19–related Internet-based data, which has been proven to provide the most accurate predictions compared with other types of keywords [
COVID-19 and its variants have been and continue to be a major public health threat worldwide. COVID-19 core term– and symptom-related Internet-based data could provide invaluable warning signals to the public and supplement existing COVID-19 surveillance systems. This study showed that our proposed COVID-19 forecasting method, incorporating both real-time and historical data from multiple Internet-based sources, could significantly improve the forecasting accuracy compared with other baseline models. Our results also show that Internet-based sources, including online news articles, could provide a warning 2 days to 6 days earlier for COVID-19 outbreaks.
Detailed descriptions of the Internet-based data extraction and filtering methods.
Supplementary tables.
Descriptions and formulations of baseline models.
Accuracy indexes.
Collinearity diagnostics.
CDC: Center for Disease Control and Prevention
MAE: mean absolute error
MAPE: mean absolute percentage error
RMSE: root-mean-square error
SNOSS: Sina Network Opinion Surveillance System
VIF: variance inflation factor
JL would like to acknowledge the partial grant support for the research (71731009, 72061127002, 92146005). WH would also like to acknowledge the partial grant support (2018WZDXM020, 71722014, 71732006, 91546119). CLS would also like to acknowledge the partial grant support (Hong Kong’s RGC-GRF grant 9042571 and CityU 11504417). This research was also partially supported by Shenzhen Key Research Base in Arts & Social Sciences and the National Laboratory of Mechanical Manufacture Systems Engineering, Xi’an Jiaotong University.
None declared.