This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
The COVID-19 pandemic is severely affecting people worldwide. Currently, an important approach to understand this phenomenon and its impact on the lives of people consists of monitoring social networks and news on the internet.
The purpose of this study is to present a methodology to capture the main subjects and themes under discussion in news media and social media and to apply this methodology to analyze the impact of the COVID-19 pandemic in Brazil.
This work proposes a methodology based on topic modeling, named entity recognition, and sentiment analysis of texts to compare Twitter posts and news, followed by visualization of the evolution and impact of the COVID-19 pandemic. We focused our analysis on Brazil, an important epicenter of the pandemic; therefore, we faced the challenge of addressing Brazilian Portuguese texts.
In this work, we collected and analyzed 18,413 articles from news media and 1,597,934 tweets posted by 1,299,084 users in Brazil. The results show that the proposed methodology improved the topic sentiment analysis over time, enabling better monitoring of internet media. Additionally, with this tool, we extracted some interesting insights about the evolution of the COVID-19 pandemic in Brazil. For instance, we found that Twitter presented similar topic coverage to news media; the main entities were similar, but they differed in theme distribution and entity diversity. Moreover, some aspects represented negative sentiment toward political themes in both media, and a high incidence of mentions of a specific drug denoted high political polarization during the pandemic.
This study identified the main themes under discussion in both news and social media and how their sentiments evolved over time. It is possible to understand the major concerns of the public during the pandemic, and all the obtained information is thus useful for decision-making by authorities.
In December 2019, the outbreak of COVID-19 in China was reported [
In past pandemic outbreaks, information exchange was relatively slow. However, with the popularization of the internet, 3.7 billion people worldwide (approximately 49.7% of the world’s population) commonly use web-based information [
News media web sites are used to report crisis situations worldwide. The articles on these sites are written by journalists and subject matter experts; therefore, people trust these sources of data. However, these channels failed to keep pace with the spread of the outbreak of COVID-19 [
On the other hand, social media is a well-known channel for news and information in the timely media environment, with one in three people worldwide engaging in social media and two-thirds of people using the internet [
Currently, almost 70% of Brazilians use the internet, 90% of them access the web on a daily basis, and Brazil is the country in the western hemisphere whose residents spend most time on social media per day [
Traditional news media focus substantial interest on health issues, especially when a new disease emerges. A number of researchers have exploited the importance of understanding the depiction of health issues in the news media. For instance, Washer [
These related studies focused on how traditional news media react to health events and the characterization of their reports. Our work differs by focusing on the analysis of social media and comparing it with traditional news media, as we are interested in showing the impact of the COVID-19 pandemic on people’s lives.
The research community is also interested in correlating pandemic events with information shared by people on social networks, especially Twitter. Several examples show how useful information can be extracted from social media to help understand pandemic behavior but also to enable organizations to act to improve people’s quality of life. For instance, Chew and Eysenbach [
Although several previous studies have separately assessed news coverage and social media in pandemic events, only a few of them have compared news coverage with social media (in contrast to other disasters [
Our work follows a similar approach to that of [
In this study, we describe a methodological approach to analyze the content of two main sources of web-based data to better understand the focus of each channel in disseminating information on COVID-19. Recent work in the literature (eg, [
The three main research questions that we are addressing in this study are:
RQ1: Does social media cover similar categories and types of topics to traditional news media about the COVID-19 pandemic?
RQ2: Do news web sites and social media mention the same types of entities?
RQ3: Are there differences in the sentiments of Twitter posts and news articles? Does the degree of sentiments change over time?
To answer these questions, we collected and analyzed data from the main news media web site from Brazil, namely Universo Online (UOL), and Twitter. Twitter is a very popular social media platform worldwide, and UOL is a very popular portal for news in Brazil. We proposed the generation of topic models for each data collection, their grouping in themes for sentiment analysis, the observation of theme-sentiment evolution on a time scale, and the extraction of named entities. One challenging aspect of this research is the adaptation of the proposed methods to the Brazilian Portuguese language; therefore, we adopted some tools and developed specific trained models. By comparing all the features extracted from news and social media data sets, we present some perceptions on how the COVID-19 pandemic is affecting Brazil.
We collected news articles and tweets related to COVID-19 in the Portuguese language from January to May 2020. To collect the tweets, we used the TwitterScraper Python library [
Regarding news collection, we gathered all the articles published in the COVID-19 section from the UOL portal. We chose UOL because this portal is responsible for publishing the
To better understand the collected data, we evaluated the statistics of the number of tokens published in each data source over time, where a token is an individual occurrence of a linguistic unit in speech or writing. The monthly distributions of the total number and percentage of tokens from both data sets are described in
Monthly statistics of tokens in news articles and tweets.
Tokens | January, n (%) | February, n (%) | March, n (%) | April, n (%) | May, n (%) |
News articles | | | | | |
Unique tokens (n=134,845) | 4149 (3.07) | 10,093 (7.48) | 38,300 (28.40) | 43,327 (32.13) | 38,976 (28.90) |
Total tokens (n=2,616,002) | 14,550 (0.55) | 62,008 (2.37) | 792,175 (30.28) | 953,441 (36.44) | 793,828 (30.34) |
Tweets | | | | | |
Unique tokens (n=407,406) | 16,100 (3.95) | 29,511 (7.24) | 84,713 (20.79) | 122,169 (29.98) | 154,913 (38.02) |
Total tokens (n=14,155,346) | 106,619 (0.75) | 284,012 (2.00) | 2,191,226 (15.48) | 4,420,658 (31.23) | 7,152,831 (50.53) |
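The monthly token statistics can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the function name `monthly_token_stats`, the whitespace tokenization, and the sample sentences are assumptions. Percentages are taken relative to the sum over all months, which is consistent with how the n values in the table add up.

```python
from collections import defaultdict

def monthly_token_stats(docs):
    """docs: iterable of (month, text) pairs. Returns per-month token statistics."""
    total = defaultdict(int)   # month -> number of token occurrences
    vocab = defaultdict(set)   # month -> set of distinct tokens
    for month, text in docs:
        tokens = text.lower().split()  # assumption: simple whitespace tokenization
        total[month] += len(tokens)
        vocab[month].update(tokens)
    sum_total = sum(total.values())
    sum_unique = sum(len(v) for v in vocab.values())
    return {
        month: {
            "unique": len(vocab[month]),
            "unique_pct": round(100 * len(vocab[month]) / sum_unique, 2),
            "total": total[month],
            "total_pct": round(100 * total[month] / sum_total, 2),
        }
        for month in total
    }

# Hypothetical sample documents, for illustration only.
docs = [
    ("January", "novo virus na china"),
    ("March", "casos de covid no brasil"),
    ("March", "novos casos de covid"),
]
stats = monthly_token_stats(docs)
```

Each entry of `stats` then holds the count and percentage pair that one table cell reports for a given month.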
Distribution of tweets and news reports by day.
We also noted that the variation in the density of news and tweets over time (
The 24-hour temporal distributions of posted tweets and news articles.
Although the distribution trend is relatively consistent on each day of the week, the activity was significantly different between work periods and holidays. As shown in
The 24-hour temporal distribution of posts on Twitter and Universo Online.
The collected data contained a large amount of noise that needed to be filtered out before further analysis. First, we tokenized the text, and then we adopted the following steps to normalize the texts:
Lowercase: All tokens were converted to lowercase. By doing this, identical tokens were merged and the dimensionality of the text was reduced.
URL removal: People post URLs with text to provide supporting information about the text. These URL links became noisy data during the analysis. All URL links in the texts were replaced by a space.
Username: Some Twitter usernames in texts start with the symbol @ and are used to tag other users. In our investigation, we were focusing on COVID-19 and not on any targeted person; therefore, we replaced all usernames with white spaces. This step was applied only to tweets.
Punctuation: We removed all the punctuation symbols from the collected data because they did not contribute to our evaluation.
Stop words: Stop words refer to the most common words used in text. We eliminated the Portuguese stop words that contributed less to our evaluation. We used a list of Portuguese stop words provided by the Natural Language Toolkit framework.
White spaces: We removed all the extra white spaces between tokens or at the end of lines or paragraphs of the text.
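The normalization steps above can be sketched as a single function. This is a simplified sketch under stated assumptions: it operates on raw strings rather than on pre-tokenized text, the `STOP_WORDS` set is a tiny hypothetical subset standing in for the Natural Language Toolkit Portuguese list, and only ASCII punctuation is handled.

```python
import re
import string

# Assumption: illustrative subset only; the study used the Portuguese
# stop-word list provided by the Natural Language Toolkit.
STOP_WORDS = {"a", "o", "e", "de", "do", "da", "em", "que", "para", "com"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
USER_RE = re.compile(r"@\w+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})

def normalize(text, is_tweet=False):
    """Apply the normalization steps in the order listed in the text."""
    text = text.lower()                    # lowercase
    text = URL_RE.sub(" ", text)           # URL removal
    if is_tweet:
        text = USER_RE.sub(" ", text)      # usernames (tweets only)
    text = text.translate(PUNCT_TABLE)     # punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop words
    return " ".join(tokens)                # collapses extra white space

cleaned = normalize("@joao Fique em casa! Veja https://t.co/abc #COVID19", is_tweet=True)
```

Replacing punctuation with a space, rather than deleting it, avoids gluing adjacent words together; the final split and join step then removes the surplus white space.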
In addition to the above steps, we used lemmatization and stemming in the preprocessing of the text. However, the results were not satisfactory because there are few tools with these functions in the Portuguese language, and these tools present results with low accuracy.
Topic models are particularly useful because they enable the inference of structure from a large data collection without the need for extensive manual interventions [
LDA requires the user to specify the number of topics, where this parameter provides control over the granularity of the discovered topics. A larger number of topics will produce more detailed topics (finer-grained), while a smaller number of topics will produce more general topics (coarser-grained). Therefore, there is no single value of the number of topics that is appropriate in all domains and types of problems. To discover the most appropriate number of topics, we performed several different LDA experiments, varying the number of topics from 1 to 30 for both data sources. As illustrated in
Coherence scores for the latent Dirichlet allocation. UOL: Universo Online.
After topic discovery, we manually categorized the topics in themes based on the first 10 words, as these terms are ranked by their probability of appearance. The topics were categorized in the following themes: Confirmed Cases, Economic Influences, Entertainment, Medical Supplies, Medical Treatment and Research, Political, Prevention and Control, and Stories.
Descriptions of the considered themes in this work.
Theme | Description |
Confirmed Cases | Mentions of confirmed cases of COVID-19, such as updated numbers of cases and mortalities |
Economic Influences | The influence of COVID-19 on the economy and society, such as the large number of unemployed people |
Entertainment | Cultural events, sports, or food, such as the interruption of soccer championships |
Medical Supplies | The medical supply situation in Brazil, such as the lack of respirators and use of masks |
Medical Treatment and Research | Mentions of medical treatment and research combating COVID-19, such as the use of hydroxychloroquine |
Political | Mentions of politicians and public officials and their responsibility |
Prevention and Control | Different aspects of prevention and control procedures, such as social isolation and lockdowns in cities |
Stories | Stories from people in Brazil who became ill or about the impact of COVID-19 on people’s lives |
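Once we obtained the topics and themes, we assessed their similarity to understand whether Twitter and traditional news media cover similar categories and types of topics related to COVID-19. To achieve this, we adopted the popular cosine similarity, which is the cosine of the angle between the vector representations of two topics, as a measure to report the similarity among topics:
$$\mathrm{sim}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
where A and B are the vector representations of the two topics being compared and θ is the angle between them.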
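As an illustration, cosine similarity between two topics can be computed as in the sketch below. It assumes that each topic is represented as a mapping from words to probabilities; the source does not specify the exact representation, and the example topics and their weights are hypothetical.

```python
import math

def cosine_similarity(topic_a, topic_b):
    """Cosine similarity between two sparse word -> weight mappings."""
    shared = set(topic_a) & set(topic_b)
    dot = sum(topic_a[w] * topic_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in topic_a.values()))
    norm_b = math.sqrt(sum(v * v for v in topic_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an empty topic is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical topics, for illustration only.
uol_topic = {"casos": 0.5, "mortes": 0.3, "saude": 0.2}
twitter_topic = {"casos": 0.4, "mortes": 0.4, "isolamento": 0.2}
sim = cosine_similarity(uol_topic, twitter_topic)
```

Because topic weights are nonnegative, the score lies between 0 (no shared words) and 1 (identical direction), which makes pairwise results easy to display as a heat map.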
For the sentiment analysis, we identified the polarity of the opinion or emotion expressed in the texts. One challenge we faced was the lack of robust language resources to support sentiment analysis for the Portuguese language [
After the translation process, we used the VADER (Valence Aware Dictionary and Sentiment Reasoner) tool [
Examples of text translation and sentiment analysis.
Portuguese text | Translated text | VADER^a sentiment |
 | why are sick people insisting on splashing on others? haven’t you heard of social distance when you’re sick? | -0.8 |
 | Do what you want, your body, your rules. End. | 0.0 |
 | Minister Paulo Guedes says that we will recover and in “V” !!! | 0.8 |
^a VADER: Valence Aware Dictionary and Sentiment Reasoner.
Named entity recognition (NER) is particularly useful for identifying which terms in a text are mentions of entities in the real world and classifying them according to a set of categories. Although NER is not a new research field, it is not an easy task. The reasons for this are multifold. First, there is much work targeting English text, but studies focused on Portuguese text are still scarce [
Considering that the current state-of-the-art NER systems are based on neural architectures, we decided to use the spaCy2 library, which is based on the hierarchical attention network proposed in [
We trained a new blank spaCy Portuguese language model; the initial model had no trained entities. An important issue in generating NER models is the effort involved in obtaining training data. To address this issue, we adopted a semisupervised approach to create training data, which is explained below. After generating the training data, we shuffled and looped over them. For each instance, the model was updated by calling the update function, which steps through all the words of each sentence. At each word, the update function makes a prediction and then consults the gold-standard annotations to determine whether the prediction is right. If it is wrong, the update function adjusts the model weights so that the correct action will score higher next time. Our model was trained for 100 iterations with a dropout rate of 0.2. Once trained, our NER model was saved, and it can be used to recognize named entities in previously unseen tweets and news.
Sample outcome of the trained NER model.
Regarding the training data set, our strategy consisted of using a list of keywords for each entity regarding COVID-19 as a set of seeds. The algorithm in
The algorithm iterates through the set of sentences
The training set generated by Algorithm 1 involves only a small degree of supervision, such as a set of keywords for each target entity, to start the learning process. To represent each type of text, we generated distinct training sets for news media and Twitter.
Algorithm 1.
Input: A set P = {⟨e, k⟩ | e ∈ E and k ∈ K}
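A keyword-seeded generation of training data can be sketched as follows. This is one possible reading of the approach and not a reproduction of Algorithm 1: the function name `build_training_data`, the whole-word matching rule, the example seeds, and the output layout (a text paired with character-offset entity spans, a layout commonly used for spaCy NER training) are all assumptions.

```python
import re

def build_training_data(sentences, seeds):
    """seeds: iterable of (entity_label, keyword) pairs, ie, the set P."""
    training = []
    for sentence in sentences:
        spans = []
        for label, keyword in seeds:
            # Assumption: case-insensitive, whole-word keyword matching.
            pattern = r"\b" + re.escape(keyword) + r"\b"
            for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
                start, end = match.start(), match.end()
                # Keep a span only if it does not overlap an earlier one.
                if all(end <= s or start >= e for s, e, _ in spans):
                    spans.append((start, end, label))
        if spans:  # sentences without any seed keyword are skipped
            training.append((sentence, {"entities": sorted(spans)}))
    return training

# Hypothetical seeds and sentences, for illustration only.
seeds = [("DRUG", "cloroquina"), ("SYMP", "febre"), ("DIS", "covid-19")]
sentences = [
    "Paciente com febre recebeu cloroquina.",
    "O jogo foi adiado.",
]
data = build_training_data(sentences, seeds)
```

Only the keyword lists require manual effort in this scheme, which matches the small degree of supervision described in the text.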
Topics were analyzed for UOL and Twitter data sets according to the methods described in the previous section. Afterward, we organized the topics in themes as described in Table II. Topics and themes for UOL and Twitter are shown in
Topics and themes for Universo Online.
ID | Topic | Theme |
1 | people ( | Prevention and Control |
2 | people ( | Stories |
3 | president ( | Political |
4 | cases ( | Confirmed Cases |
5 | tests ( | Medical Treatment and Research |
6 | people ( | Prevention and Control |
7 | president ( | Political |
8 | measures ( | Prevention and Control |
9 | can ( | Stories |
10 | disease ( | Medical Treatment and Research |
11 | state ( | Political |
12 | soccer ( | Entertainment |
13 | countries ( | Political |
14 | economy ( | Economic Influences |
15 | pandemic ( | Stories |
16 | health ( | Medical Supplies |
17 | decision ( | Political |
18 | workers ( | Economic Influences |
19 | social nets ( | Stories |
20 | masks ( | Prevention and Control |
Topics and themes for Twitter.
ID | Topic | Theme |
1 | president ( | Political |
2 | instagram ( | Medical Supplies |
3 | deaths ( | Confirmed Cases |
4 | cases ( | Confirmed Cases |
5 | health ( | Medical Supplies |
6 | twitter ( | Stories |
7 | people ( | Stories |
8 | pandemic ( | Stories |
9 | pandemic ( | Economic Influences |
10 | true ( | Political |
11 | people ( | Confirmed Cases |
12 | social isolation ( | Prevention and Control |
13 | government ( | Political |
14 | death ( | Confirmed Cases |
15 | do ( | Economic Influences |
16 | corona ( | Stories |
17 | coronavirus ( | Medical Treatment and Research |
18 | situation ( | Stories |
19 | pandemic ( | Entertainment |
20 | chloroquine ( | Medical Treatment and Research |
Similarity among Universo Online topics.
Similarity among Twitter topics.
The theme distributions between the UOL and Twitter collections are compared in
Distribution of themes. UOL: Universo Online.
According to the NER analysis method described in the last section, we compared the main mentions for each entity using word clouds, as this popular text analysis tool provides a visualization of word frequency in a source text while giving more prominence to words that occur more often. To facilitate the understanding of the most representative words by entity, we decided to show the 20 most frequently mentioned words in each entity. Words that were incorrectly extracted as belonging to an entity were manually removed. We assessed that our NER analysis method had an average accuracy of approximately 85% among the 20 most frequently mentioned terms. In
Word clouds showing the most frequent entity mentions in the Persons category: (a) Universo Online; (b) Twitter.
Word clouds showing the most frequent entity mentions in the Organizations category: (a) Universo Online; (b) Twitter.
Word clouds showing the most frequent entity mentions in the Disease category: (a) Universo Online; (b) Twitter.
Word clouds showing the most frequent entity mentions in the Symptoms category: (a) Universo Online; (b) Twitter.
Word clouds showing the most frequent entity mentions in the Drugs category: (a) Universo Online; (b) Twitter.
From the word clouds for all these entities, it is important to note that the terms found are highly consistent with their respective entity categories. This fact reinforces that the adopted NER method is valid for the Portuguese language and that this study reflects the Brazilian perception of the COVID-19 pandemic. By comparing both formal and social media, it can be noted that there is no substantial difference regarding the main terms. However, people’s discussions on Twitter have much sparser terms than those on UOL, while the terms in the latter seem to be more diverse. Another important difference between the collections can be seen in the entity distribution graph in
Entity distributions by data set. DIS: Disease; DRUG: Drugs; ORG: Organization; PER: Person; SYMP: Symptoms.
Once topics were obtained for all posts in a collection, we classified every document by its topic with highest probability and applied the previously described sentiment analysis. We then grouped all posts by weekly intervals of time, summing the number of documents in each theme and calculating the sentiment averages.
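The grouping step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the field names (`date`, `topic_probs`, `sentiment`), the use of ISO calendar weeks as the weekly interval, and the sample documents are hypothetical; the topic-to-theme mapping follows the Twitter topic table.

```python
from collections import defaultdict
from datetime import date

def weekly_theme_sentiment(docs, topic_to_theme):
    """Group documents by (ISO week, theme); count them and average sentiment."""
    buckets = defaultdict(list)
    for doc in docs:
        probs = doc["topic_probs"]
        top_topic = max(probs, key=probs.get)   # topic with highest probability
        theme = topic_to_theme[top_topic]
        iso = doc["date"].isocalendar()
        week = (iso[0], iso[1])                 # (ISO year, ISO week number)
        buckets[(week, theme)].append(doc["sentiment"])
    return {
        key: {"count": len(scores), "mean_sentiment": sum(scores) / len(scores)}
        for key, scores in buckets.items()
    }

# Hypothetical documents, for illustration only.
topic_to_theme = {1: "Political", 3: "Confirmed Cases"}
docs = [
    {"date": date(2020, 4, 6), "topic_probs": {1: 0.7, 3: 0.3}, "sentiment": -0.8},
    {"date": date(2020, 4, 8), "topic_probs": {1: 0.6, 3: 0.4}, "sentiment": 0.0},
    {"date": date(2020, 4, 14), "topic_probs": {1: 0.2, 3: 0.8}, "sentiment": 0.5},
]
summary = weekly_theme_sentiment(docs, topic_to_theme)
```

Plotting `count` and `mean_sentiment` per theme against the week index gives a timeline of the kind shown in the sentiment figures.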
Universo Online sentiment analysis over time.
Twitter sentiment analysis over time.
From a general point of view, we can observe that UOL articles and Twitter posts were concerned about the same main COVID-19 topics and themes. For instance, the topics and themes were very similar for both types of media, and this was reflected in the most common entity mentions. This study suggests that formal news media and social media influence each other; we found a representative cross-reference in the Organization entity graph.
The main differences found between UOL and Twitter pertain to the distribution of the main themes, diversity of entities, and overall sentiment about subjects related to COVID-19. Formal media naturally refers more to official entities and their recommendations. This can be seen in its top themes (Political and Prevention and Control), top entity groups (Organization and Disease), and diversity of entity mentions. Twitter, in contrast, is very focused on personal opinions and cases, as demonstrated by its top theme (Stories) and entity groups (Disease and Drugs). Additionally, social media tended to have a more negative polarity for all themes, while formal media seemed to present almost neutral polarity on average. Together with the very high number of collected tweets during the period, which shows that discussion about the disease was very active, we can observe the severity of the pandemic in Brazil and people’s concerns about it.
It is remarkable how the subject of COVID-19 became a target of political polarization in Brazil. This theme was frequently discussed on both formal and social media, with increasingly negative sentiment over time. Drugs was the second most common entity in social media discussions, and it was very focused on the use of chloroquine to treat patients with COVID-19. A suggested hypothesis to explain this finding is that Brazil's government stated many times that this drug could help treat COVID-19 while minimizing the severity of the disease. In fact, in all the periods examined in this research, the government and formal media positioned themselves on opposite sides of this discussion, which is reflected in the high number of mentions of political organization entities and in the disproportionate reference to this specific drug.
Finally, by applying the proposed methodology, it was possible to observe the main information being conveyed and how people were reacting to it. This provides a way to monitor the evolution of a pandemic and its effects. Moreover, we believe this information can be useful for researchers and authorities to identify potentially controversial aspects, address possible misinformation, and establish better public policies for action and communication with the population.
We discuss some limitations that can be attributed to this study as follows.
We retrieved data using a set of keywords; therefore, our data may have excluded tweets from users who wrote about the COVID-19 pandemic using different target keywords. A further limitation is that Twitter and UOL do not publish data about the profiles of their users, such as age, gender, or social class. Therefore, it was not possible to perform a stratified analysis of the users, and the results thus may not reflect the entire Brazilian population. A possible hypothesis is that different media reach different segments of society (eg, news media sites are accessed more frequently by more educated people); therefore, these differences may be reflected in the discovered topic distributions and sentiments. Thus, our findings may not be generalizable to other social media platforms or other communication media, such as television or radio. Moreover, the presented results for the selected vehicles may present some bias. For instance, a specific news media source may present a political leaning that can affect the sentiment about some themes. Therefore, while it is not our focus to explore possible bias and its impact on the results, caution is advised before assuming their generalization.
People rely on data published on the web to better understand recent global crises, and this is also occurring during the COVID-19 pandemic. News media web sites and social media are two distinguished channels of timely information. In this paper, we have proposed a methodological approach to analyze this type of media and to answer some questions regarding the COVID-19 pandemic in Brazil. The results presented and discussed in this study are particularly important because they make it possible to understand the difference between two data sources in how they cover global crises. In addition, this paper provides a method that uses several computational techniques to process textual social media in a language other than English. As the main contribution, this method resulted in observations that can aid understanding of the COVID-19 pandemic, with a better and more meaningful sentiment timeline.
In future work, we intend to extend this study to include data from longer periods of time, even after the pandemic ends. The idea is to understand how existing media platforms and people will react when they return to a normal situation and whether some trauma will remain. Additionally, we think that the proposed methodology is useful for studying other events of interest, such as other catastrophes and elections. Therefore, we intend to improve it by implementing a tool and applying it to new case studies.
LDA: latent Dirichlet allocation
MALLET: Machine Learning for Language Toolkit
NER: named entity recognition
SARS: severe acute respiratory syndrome
UOL: Universo Online
VADER: Valence Aware Dictionary and Sentiment Reasoner
This work was funded by Samsung Ocean Center, a research and development project at the State University of Amazonas.
None declared.