This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.
Harnessing health-related data posted on social media in real time can offer insights into how the pandemic impacts the mental health and general well-being of individuals and populations over time.
This study aimed to obtain information on symptoms and medical conditions self-reported by users of social media platforms other than Twitter during the COVID-19 pandemic, to determine how discussion of these symptoms and medical conditions changed over time, and to identify correlations between the frequency of posts on the 5 most commonly mentioned symptoms and daily COVID-19 statistics (new cases, new deaths, new active cases, and new recovered cases) in the United States.
We used natural language processing (NLP) algorithms to identify symptom- and medical condition–related topics being discussed on social media between June 14 and December 13, 2020. The sample posts were geotagged by NetBase, a third-party data provider. We calculated the positive predictive value and sensitivity to validate the classification of posts. We also assessed the frequency of health-related discussions on social media over time during the study period, and used Pearson correlation coefficients to identify statistically significant correlations between the frequency of the 5 most commonly mentioned symptoms and fluctuation of daily US COVID-19 statistics.
Within a total of 9,807,813 posts (nearly 70% were sourced from the United States), we identified discussion of 120 symptom-related topics and 1542 medical condition–related topics. Our classification of the health-related posts had a positive predictive value of over 80% and an average sensitivity of 92%. The 5 most commonly mentioned symptoms on social media during the study period were anxiety (in 201,303 posts or 12.2% of the total posts mentioning symptoms), generalized pain (189,673, 11.5%), weight loss (95,793, 5.8%), fatigue (91,252, 5.5%), and coughing (86,235, 5.2%). The 5 most discussed medical conditions were COVID-19 (in 5,420,276 posts or 66.4% of the total posts mentioning medical conditions), unspecified infectious disease (469,356, 5.8%), influenza (270,166, 3.3%), unspecified disorders of the central nervous system (253,407, 3.1%), and depression (151,752, 1.9%). Changes in the frequency of posts on anxiety, generalized pain, and weight loss were significantly but negatively correlated with daily new COVID-19 cases in the United States (r=-0.49, r=-0.46, and r=-0.39, respectively;
COVID-19 and symptoms of anxiety were the 2 most commonly discussed health-related topics on social media from June 14 to December 13, 2020. Real-time monitoring of social media posts on symptoms and medical conditions may help assess the population’s mental health status and enhance public health surveillance for infectious disease.
The COVID-19 pandemic continues to spread worldwide, with more than 229 million confirmed cases and more than 4.7 million deaths in 188 countries as of September 21, 2021 [
Although prior studies have demonstrated that social media discussions can influence health-related beliefs and behaviors, more studies are needed to understand how social media plays a role during the pandemic [
As such, we created a dashboard to extract and monitor posts mentioning symptoms and medical conditions from social media sites other than Twitter over the course of the COVID-19 pandemic. In this study, we sought to answer the following questions: (1) what symptoms and medical conditions were people discussing on social media platforms other than Twitter during the COVID-19 pandemic? (2) How have discussions of symptoms and medical conditions on social media changed over a 6-month period during the pandemic? (3) Were daily fluctuations in health-related social media conversations associated with daily changes in COVID-19 statistics (new cases, new deaths, new active cases, and new recovered cases) in the United States?
We included English-language social networks and forums worldwide, such as Facebook public pages, Reddit, 4Chan, and the comments sections of news sites such as ABC News [
Furthermore, even though the affordances of our sources overlap with those of Twitter, Reddit users have more anonymity as they do not need to register an account to access the majority of the content, thus allowing for greater participation [
We partnered with Signals Analytics, an advanced analytics company, to obtain access to target data sources from a third-party data vendor (NetBase) and to conduct the analysis [
We also gathered data on COVID-19 cases from the COVID-19 dashboard developed by the Center for Systems Science and Engineering at Johns Hopkins University, which provides the most comprehensive and up-to-date information on COVID-19 trends [
In this study, all personally identifying information, such as usernames, emails, and IP addresses, was removed before analysis. The study was exempt from institutional review board review at Yale University as it used publicly available, anonymized data.
For the analysis of data on symptoms and medical conditions being discussed on social media platforms between June 14 (when many countries began to lift major COVID-19 restrictions) and December 13, 2020 (when the first shipment of the COVID-19 vaccine arrived in the United States), we began by applying NLP algorithms to process social media posts collected from data sources during the study period, and then classified these posts in accordance with symptoms and medical conditions being mentioned.
To accomplish this, NetBase ran a daily scheduled data extraction query that we designed for the study on over 300 million web-based data sources (
To evaluate the performance of the NLP algorithms and taxonomy classifications of symptoms and medical conditions, we applied the taxonomy to 4 sets of independent 100-post samples and calculated the positive predictive value and sensitivity of the classification (
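The two validation metrics named above can be sketched as follows. This is a minimal illustration, not the study's validation code: the function names and the counts for the single 100-post sample are hypothetical, chosen only to show how positive predictive value and sensitivity are derived from true positives, false positives, and false negatives.

```python
def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Share of posts tagged by the classifier that were truly on topic."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly on-topic posts that the classifier managed to tag."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts from manually reviewing one 100-post sample
# (illustrative values only, not data from the study).
tp, fp, fn = 46, 9, 4

ppv = positive_predictive_value(tp, fp)
sens = sensitivity(tp, fn)
print(f"PPV = {ppv:.2f}, sensitivity = {sens:.2f}")
```

With several independent samples, the per-sample values could then be averaged to give a summary figure.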
Our taxonomy was organized into three levels: categories, subcategories, and topics. Symptoms and medical conditions were the 2 main categories in the taxonomy (
We also created content filters to retain posts mentioning COVID-19 for further analysis. We applied 2 filters, COVID-19 disease status and COVID-19 diagnostic methods, to identify discussions on COVID-19 disease status (tested positive or negative, symptomatic or asymptomatic, recovered, and exposed to a confirmed patient) and diagnostic methods (COVID-19 testing, self-diagnosis, and remote diagnosis). These more restrictive searches were conducted by activating the 2 additional filters using the NLP algorithm, and the resulting posts from that search may not indicate the author’s COVID-19 status.
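To make the structure of a three-level taxonomy with an additional content filter more concrete, the sketch below uses plain keyword matching. It is a deliberately simplified stand-in and not the NLP algorithms used in the study: the keyword lists, the placement of COVID-19 under the infectious disease subcategory, and the function names `classify` and `disease_status` are all assumptions made for illustration.

```python
# Hypothetical three-level taxonomy: category -> subcategory -> topic -> keywords.
# Keyword lists are invented for illustration only.
TAXONOMY = {
    "Symptoms": {
        "Neuropsychological symptoms": {"Anxiety": ["anxiety", "anxious"]},
        "COVID-19-related symptoms": {"Coughing": ["cough"]},
    },
    "Medical conditions": {
        "Infectious disease": {"COVID-19": ["covid", "coronavirus"]},
    },
}

# Hypothetical stand-in for the COVID-19 disease status filter.
STATUS_FILTER = {
    "tested positive": ["tested positive"],
    "asymptomatic": ["asymptomatic"],
}


def classify(post: str) -> list[tuple[str, str, str]]:
    """Return every (category, subcategory, topic) whose keywords appear in the post."""
    text = post.lower()
    return [
        (category, subcategory, topic)
        for category, subcategories in TAXONOMY.items()
        for subcategory, topics in subcategories.items()
        for topic, keywords in topics.items()
        if any(keyword in text for keyword in keywords)
    ]


def disease_status(post: str) -> list[str]:
    """Return the disease status labels whose keywords appear in the post."""
    text = post.lower()
    return [
        label
        for label, keywords in STATUS_FILTER.items()
        if any(keyword in text for keyword in keywords)
    ]


example = "I tested positive for COVID and feel anxious"
print(classify(example))
print(disease_status(example))
```

As in the study, a match from such a filter only shows that a status was mentioned; it does not establish that the author of the post has that status.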
To explore how the discussion of symptoms and medical conditions on social media changed from June 14 to December 13, 2020, we determined the number of posts that included a discussion of each symptom and medical condition over time using NLP classification (
Additionally, we compared the trends of the 5 most frequently mentioned symptoms and medical conditions from June 14 to August 31, 2020 (when the United States crossed the 6 million COVID-19 cases mark), to the trends observed from September 1 to December 13, 2020, by measuring the percent change between the 2 time periods in the number of posts including a discussion of each topic. We compared the 2 time periods to reveal changes in health-related conversations on social media at different stages of the pandemic, as prior literature focused primarily on the early stage of the pandemic (before June 2020). Our approach was also designed to contribute to a better understanding of the impact of COVID-19 on the public’s perceptions and attitudes toward different symptoms, medical conditions, and health care–seeking behaviors.
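The percent change between the 2 time periods can be computed as in the short sketch below. The function name `percent_change` is our own; the post counts for coughing and difficulty breathing are the per-period counts reported in the results of this paper, used here only as a worked example.

```python
def percent_change(first_period: int, second_period: int) -> float:
    """Percent change in post counts from the first period to the second."""
    return (second_period - first_period) / first_period * 100


# Post counts reported for June 14-August 31 and September 1-December 13, 2020.
coughing = percent_change(41_163, 45_059)
difficulty_breathing = percent_change(16_917, 16_672)

print(f"Coughing: {coughing:.2f}%")
print(f"Difficulty breathing: {difficulty_breathing:.2f}%")
```

A negative value indicates that fewer posts mentioned the topic in the second period than in the first.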
After social media posts were collected from sources, preprocessed, and classified in accordance with the taxonomy by NLP algorithms, our final sample included a total of 9,807,813 posts between June 14 and December 13, 2020, which mentioned at least 1 of the 120 symptoms or 1542 medical condition topics in our taxonomy (
Irrespective of subcategory classification, the 5 most commonly mentioned symptom topics were anxiety (201,303, 12.2% of the total posts mentioning symptoms), generalized pain (189,673, 11.5%), weight loss (95,793, 5.8%), fatigue (91,252, 5.5%), and coughing (86,235, 5.2%), accounting for 40.2% of all symptom posts combined (
Number of posts on symptoms and medical conditions mentioned on social media platforms by taxonomy topic (June 14 to December 13, 2020; N=9,807,813).
Relevant taxonomy categories and subcategories (number of topics) | Number of posts with symptoms or medical conditions | Percentage of all posts on symptoms or all medical conditions (%) |
Symptoms | | |
Neuropsychological symptoms (17) | 568,662 | 34.47 |
COVID-19–related symptomsa (22) | 501,178 | 30.38 |
Respiratory symptoms (7) | 128,134 | 7.77 |
Gastrointestinal symptoms (13) | 120,621 | 7.31 |
Dermal symptoms (16) | 99,453 | 6.03 |
Cardiovascular disease symptoms (4) | 34,014 | 2.06 |
Musculoskeletal symptoms (7) | 33,604 | 2.04 |
Other symptoms (34) | 163,881 | 9.93 |
Medical conditions | | |
Infectious disease (80) | 6,052,068 | 74.18 |
Psychiatric or mental health disorders (21) | 484,505 | 5.94 |
Neurovascular and cardiovascular diseases (63) | 465,675 | 5.71 |
Respiratory disorders (17) | 165,404 | 2.03 |
Hematological and oncological disorders (127) | 164,159 | 2.01 |
Other disorders (1234) | 828,786 | 10.13 |
aCOVID-19–related symptoms were based on symptoms of COVID-19 (n=22) updated by the Centers for Disease Control and Prevention on December 22, 2020, which were as follows: runny nose, change in sense of taste, change in sense of smell, chills, bluish lips/face, inability to stay awake, fatigue, headache, sore throat, abdominal pain, vomiting, muscle pain/spasms, drowsiness, nausea, body aches, chest pain, itching/swelling, fever, confusional state, diarrhea, coughing, and difficulty breathing.
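As a worked check of the percentage column for symptom subcategories, the sketch below divides each subcategory's post count by the total across all symptom subcategories. Treating that sum as the denominator is our assumption; it reproduces the percentages reported in the table. The variable names are illustrative.

```python
# Symptom subcategory post counts as reported in the table above.
symptom_posts = {
    "Neuropsychological symptoms": 568_662,
    "COVID-19-related symptoms": 501_178,
    "Respiratory symptoms": 128_134,
    "Gastrointestinal symptoms": 120_621,
    "Dermal symptoms": 99_453,
    "Cardiovascular disease symptoms": 34_014,
    "Musculoskeletal symptoms": 33_604,
    "Other symptoms": 163_881,
}

# Assumed denominator: the sum over all symptom subcategories.
total_symptom_posts = sum(symptom_posts.values())

# Percentage share of each subcategory, rounded to 2 decimals as in the table.
shares = {
    name: round(count / total_symptom_posts * 100, 2)
    for name, count in symptom_posts.items()
}

for name, share in shares.items():
    print(f"{name}: {share}%")
```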
Frequency of the top 5 most discussed symptoms and medical conditions on social media by taxonomy topic (June 14 to December 13, 2020; N=9,807,813).
Relevant taxonomy categories and topics | Number of posts with topics related to symptoms or medical conditions | Percentage of posts on all topics related to symptoms or all medical conditions (%) |
Symptoms | | |
Anxiety | 201,303 | 12.20 |
Generalized pain | 189,673 | 11.49 |
Weight loss | 95,793 | 5.81 |
Fatigue | 91,252 | 5.53 |
Coughing | 86,235 | 5.23 |
Medical conditions | | |
COVID-19 | 5,420,276 | 66.44 |
Unspecified infectious disease | 469,356 | 5.75 |
Influenza | 270,166 | 3.31 |
Unspecified CNSa disorders | 253,407 | 3.11 |
Depression | 151,752 | 1.86 |
aCNS: central nervous system.
Within the COVID-19–related symptoms subcategory, fatigue (91,208, 32.9%) and coughing (86,222, 31.1%) were the most discussed COVID-19–related symptom topics (
After applying the COVID-19 disease status filter to all posts mentioning the top 5 most frequently mentioned symptoms and medical conditions, we noticed that within the posts classified with the medical condition of COVID-19, 62.9% had also discussed testing positive, and 9.1% of the discussions were related to asymptomatic COVID-19 (Table S2,
The patterns of change in posts on the 5 most commonly mentioned medical conditions and symptoms, together with the fluctuation of daily new COVID-19 cases in the United States, are displayed in
Correlations between the frequency of 4 of the 5 most commonly discussed symptoms and daily recovered cases were significant, with Pearson correlation coefficients of –0.43 for anxiety, –0.44 for generalized pain, –0.55 for weight loss, and –0.51 for coughing, indicating moderate negative correlations (
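For readers unfamiliar with the statistic, a Pearson correlation coefficient between a daily post count series and a daily COVID-19 series can be computed as sketched below. This is a plain-Python illustration rather than the analysis code used in the study; the function name `pearson_r` and both daily series are hypothetical.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length and at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily series (illustrative values only, not study data).
daily_symptom_posts = [520, 480, 450, 400, 390]
daily_recovered_cases = [30_000, 34_000, 41_000, 47_000, 52_000]

r = pearson_r(daily_symptom_posts, daily_recovered_cases)
print(f"r = {r:.2f}")
```

A coefficient below 0 means that days with more recovered cases tended to have fewer posts on the symptom; testing whether the coefficient differs significantly from 0 is a separate step not shown here.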
When examining changes in the frequency of the top 5 most commonly mentioned symptom topic discussions over the 6-month study period, we noted a 24% increase in symptom posts mentioning anxiety, generalized pain, and fatigue during September 1-December 13, 2020 (vs June 14-August 31, 2020) (
Comparing changes in the number of posts on COVID-19 symptoms between June 14 and August 31, 2020, with those in September 1 to December 13, 2020 (N=277,401).
COVID-19–related symptoms per the Centers for Disease Control and Prevention’s definitiona | Posts mentioning this COVID-19 symptom, n (%) | Posts during June 14-August 31, 2020, n | Posts during September 1-December 13, 2020, n | Change in the number of posts, % |
Fatigue | 91,208 (32.88) | 36,876 | 54,332 | 47.33 |
Coughing | 86,222 (31.08) | 41,163 | 45,059 | 9.46 |
Fever | 59,906 (21.59) | 27,729 | 32,177 | 16.04 |
Headache | 41,693 (15.02) | 18,052 | 23,641 | 30.96 |
Vomiting | 39,103 (14.09) | 17,364 | 21,739 | 25.19 |
Difficulty breathing | 33,589 (12.11) | 16,917 | 16,672 | –1.45 |
Nausea | 29,103 (10.49) | 13,039 | 16,064 | 23.19 |
Itching/swelling | 28,337 (10.22) | 12,953 | 15,384 | 18.77 |
Sore throat | 14,694 (5.29) | 6424 | 8270 | 28.74 |
Diarrhea | 14,140 (5.09) | 6716 | 7424 | 10.54 |
Chest pain | 9412 (3.39) | 4255 | 5157 | 21.19 |
Abdominal pain | 9238 (3.33) | 4080 | 5158 | 26.42 |
Runny nose | 8283 (2.98) | 3029 | 5254 | 73.46 |
Body aches | 7871 (2.84) | 3540 | 4331 | 22.34 |
Change in sense of taste | 6510 (2.35) | 2447 | 4063 | 66.04 |
Muscle pain/spasms | 6321 (2.28) | 2816 | 3505 | 24.47 |
Change in sense of smell | 6192 (2.23) | 2340 | 3852 | 64.62 |
Confusional state | 3716 (1.34) | 1737 | 1979 | 13.93 |
Chills | 2879 (1.04) | 1141 | 1738 | 52.32 |
Drowsiness | 1256 (0.45) | 560 | 696 | 24.29 |
Bluish lips/face | 1019 (0.37) | 404 | 615 | 52.23 |
Inability to stay awake | 486 (0.18) | 195 | 291 | 49.23 |
aThe list of COVID-19 symptoms was updated on December 22, 2020, in accordance with the Centers for Disease Control and Prevention’s update. Our algorithms captured all posts mentioning any of these symptoms in the COVID-19 symptom subcategory; consequently, the posts may not necessarily represent patients discussing their own COVID-19 symptoms.
Associations between changes in new daily COVID-19 cases in the United States and the number of medical condition–related posts (June 13-December 13, 2020). (Note: the gray shaded area indicates daily active COVID-19 cases in the United States, while the colored curves show fluctuations in posts mentioning different medical disorders during the study period). CNS: central nervous system.
Associations between changes in new daily COVID-19 cases in the United States and the number of symptom posts (June 13-December 13, 2020). (Note: the gray shaded area indicates daily active COVID-19 cases in the United States, while the colored curves show fluctuations in posts mentioning different symptoms during the study period).
In this study, we collected and analyzed web-based posts from forums and comments on news sites between June 14 and December 13, 2020. We found that a wide variety of symptom and medical condition topics were discussed on non-Twitter social media. While the vast majority of discussions were about COVID-19 infection and COVID-19–related symptoms (as defined by the CDC), neuropsychological symptoms (eg, anxiety) and other medical conditions (eg, infectious diseases and psychiatric disorders) were also frequently mentioned. Additionally, we noticed that changes in the frequency of posts on anxiety, generalized pain, and weight loss were significantly but negatively correlated with daily new COVID-19 cases in the United States, and that the frequency of posts on anxiety, generalized pain, weight loss, and fatigue, as well as the changes in fatigue posts, correlated positively and significantly with daily changes in both new deaths and new active cases in the United States. As COVID-19 cases continued to rise globally, the cumulative volume of posts mentioning anxiety, generalized pain, fatigue, influenza, unspecified CNS disorders, and depression increased from September 1 to December 13, 2020 (compared to June 14 to August 31, 2020).
Our findings expand on previous observations regarding the mental health effects of the COVID-19 pandemic among social media users by presenting a more complete picture of health-related topics discussed on social media [
Further, as access to the internet becomes more widely available and with the anonymity of social media, people who face barriers to accessing health care and those who have mental health symptoms may use social media to speak openly about their health experiences and seek help [
As the pandemic progresses, obtaining information on the symptom profile of COVID-19 could help to better diagnose and treat the disease. There has been increasing recognition of the importance of extracting social media information to explore symptom experience and disease progression among patients with COVID-19 [
We also noticed that approximately 15% of these discussions were related to asymptomatic COVID-19. While an in-depth exploration of these posts using qualitative analysis or sentiment analysis is necessary to help verify the users’ COVID-19 disease status, our preliminary data indicate the potential for extracting information from social media to understand the full spectrum of symptoms experienced by patients with COVID-19. Interestingly, we noticed an increase of over 60% in the volume of posts mentioning less common COVID-19 symptoms such as changes in the senses of taste and smell during the second stage of our study period (September 1 to December 13, 2020). This surge may be partly due to improvements in knowledge and awareness of COVID-19 symptoms in the general population as the 2 symptoms were recently added to the COVID-19 symptom lists of the CDC and the World Health Organization (late April 2020 and early May 2020, respectively).
While there have been fluctuations in the volume of social media posts on a day-to-day basis, there appeared to be seasonal variation in the volume of discussion of symptoms and medical conditions. We noticed that the volume of most health-related discussions increased more from September 1 to December 13, 2020, than from June 14 to August 31, 2020. These changes may have been due to a combination of colder weather in the northern hemisphere and social distancing and limitations on daily life during the pandemic as well as the second wave of COVID-19, resulting in more social media users and more people being restricted indoors [
Our study has several limitations. First, information on geolocation, demographics, and COVID-19 disease status was not available for all social media users in the study, owing to various legal limitations (such as the General Data Protection Regulation of the European Union). This might have introduced a sampling bias if there were significant differences between the characteristics of social media users in our project and those in the real world. However, by collaborating with social media analytics companies, we maximized our ability to access thousands of social media data sources worldwide, thus reducing the possibility of sampling bias. Additionally, the majority of social media users in our study were from the United States. The findings, therefore, may not be generalizable to users located in other countries. Further, we did not conduct formal statistical analyses beyond comparing differences in trends between the frequency of health-related posts and new COVID-19 cases; hence, further testing is needed to confirm the associations between patterns of changes in symptom/medical condition posts and the fluctuations of COVID-19 statistics over time. Finally, we did not perform sentiment analysis or qualitative analysis in the study and did not verify whether authors who discussed COVID-19–related topics had COVID-19 themselves. We hope to accomplish and report this analysis in a future study. We also hope that other studies on social media’s role in public health will replicate and validate our exploratory findings in non-Twitter social media platforms.
In this study, we classified web-based posts collected from June 14 to December 13, 2020, in accordance with discussions of symptoms and medical conditions. Neuropsychological symptoms such as anxiety were the most frequently mentioned symptom subcategory. Furthermore, COVID-19 infection was the most commonly mentioned medical condition. Our analysis also showed that the frequency of posts on anxiety and other general health symptoms, including generalized pain, weight loss, and fatigue, was significantly correlated with daily COVID-19 statistics in the United States. Additionally, health-related discussions were more frequent from September 1 to December 13, 2020, than from June 14 to August 31, 2020, aligning with the increase in COVID-19 cases in the United States during the winter months. These preliminary findings show promise for real-time monitoring of social media posts to measure the mental health status of a population during a global public health crisis and to assess the public’s main health needs that have not been captured or met by the existing health system. Future research may incorporate information from social media into predictive models for the detection of emerging infectious diseases.
Supplementary methods, figures, and tables.
API: application programming interface
CDC: Centers for Disease Control and Prevention
CNS: central nervous system
EVALI: e-cigarette or vaping use-associated lung injury
NLP: natural language processing
WHO: World Health Organization
The authors thank the Center for Outcomes Research and Evaluation, Yale New Haven Hospital, for their coordination of the project. AC and Pini Matzner from Signals Analytics had full access to the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Signals, the analytics company, was acquired and is now part of Skai. QD takes full responsibility for the data interpretation and writing. All authors contributed to the editing and approval of the final version of the paper for publication. This work was supported by the project Insights about the COVID Pandemic Using Public Data IRES PD: 20-005872 with funding from the Foundation for a Smoke-Free World.
YL is supported by the National Heart, Lung, and Blood Institute (K12HL138037) and the Yale Center for Implementation Science. RD is supported by an American Heart Association Transformational Project Award (#19TPA34830013) and a Canadian Institutes of Health Research Project Grant (RN356054–401229). In the past 3 years, HMK received expenses and personal fees from UnitedHealth, IBM Watson Health, Element Science, Aetna, Facebook, the Siegfried and Jensen Law Firm, Arnold and Porter Law Firm, Martin/Baughman Law Firm, F-Prime, and the National Center for Cardiovascular Diseases in Beijing. He is an owner of Refactor Health and HugoHealth, and had grants and contracts from the Centers for Medicare & Medicaid Services, Medtronic, the US Food and Drug Administration, Johnson & Johnson, and the Shenzhen Center for Health Information. The remaining authors have no disclosures to report.