This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
At the time of this writing, the coronavirus disease (COVID-19) pandemic has already put tremendous strain on citizens, resources, and economies around the world. Social distancing measures, travel bans, self-quarantines, and business closures are changing the very fabric of societies worldwide. With people forced out of public spaces, much of the conversation about these phenomena now occurs online on social media platforms like Twitter.
In this paper, we describe a multilingual COVID-19 Twitter data set that we are making available to the research community via our COVID-19-TweetIDs GitHub repository.
We started this ongoing data collection on January 28, 2020, leveraging Twitter’s streaming application programming interface (API) and Tweepy to follow certain keywords and accounts that were trending at the time data collection began. We used Twitter’s search API to query for past tweets, resulting in the earliest tweets in our collection dating back to January 21, 2020.
Since the inception of our collection, we have actively maintained and updated our GitHub repository on a weekly basis. We have published over 123 million tweets, with over 60% of the tweets in English. This paper also presents basic statistics that show that Twitter activity responds and reacts to COVID-19-related events.
It is our hope that our contribution will enable the study of online conversation dynamics in the context of a planetary-scale epidemic outbreak of unprecedented proportions and implications. This data set could also help track COVID-19-related misinformation and unverified rumors or enable the understanding of fear and panic—and undoubtedly more.
The first cases of coronavirus disease (officially named COVID-19 by the World Health Organization [WHO] on February 11, 2020) were reported in Wuhan, China, in late December 2019; the first fatalities were reported in early 2020 [
Preventative measures implemented by national, state, and local governments now affect the daily routines of millions of people worldwide [
We describe a Twitter data set about COVID-19-related online conversations that we are sharing with the research community. People all over the world take to Twitter to express opinions and engage in dialogue in a public forum, and, with its open application programming interface (API), Twitter has proven to be an invaluable resource for studying a wide range of topics. Twitter has long been used by the research community as a means to understand dynamics observable in online social networks, from information dissemination [
We began collecting data from Twitter in real time by tracking COVID-19-related keywords and accounts; the earliest tweets in the collection date to January 21, 2020. Here, we describe the data collection methods, document initial data statistics, and provide information about how to obtain and use the data.
We have been actively collecting tweets since January 28, 2020, leveraging Twitter's streaming API [
Our collection relies upon publicly available data and is hence registered as IRB (institutional review board) exempt by the University of Southern California IRB (approved protocol UP-17-00610). We release the data set with the stipulation that those who use it must comply with Twitter’s Terms and Conditions [
By continuously monitoring Twitter's trending topics, keywords, and sources associated with COVID-19, we did our best to capture conversations related to the outbreak.
Twitter's streaming API returns any tweet containing the keyword(s) in the text of the tweet, as well as in its metadata; therefore, it is not always necessary to have each permutation of a specific keyword in the tracking list. For example, the keyword “Covid” will return tweets that contain both “Covid19” and “Covid-19.” We list a subset of the keywords and accounts that we are following in
A sample of the keywords that we are actively tracking in our Twitter collection; see the GitHub repository for a full list of all tracked keywords (v1.8—May 8, 2020) [
Tracked since | Keyword |
1/21/2020 | Coronavirus; Corona; CDC; Ncov; Wuhan; Outbreak; China |
1/22/2020 | Koronavirus; Wuhancoronavirus; Wuhanlockdown; N95; Kungflu; Epidemic; Sinophobia |
2/16/2020 | Covid-19 |
3/2/2020 | Corona virus |
3/6/2020 | Covid19; Sars-cov-2 |
3/8/2020 | COVID–19 |
3/12/2020 | COVD; Pandemic |
3/13/2020 | Coronapocalypse; CancelEverything; Coronials; SocialDistancing |
3/14/2020 | Panic buying; DuringMy14DayQuarantine; Panic shopping; InMyQuarantineSurvivalKit |
3/16/2020 | chinese virus; stayhomechallenge; DontBeASpreader; lockdown |
3/18/2020 | shelteringinplace; staysafestayhome; trumppandemic; flatten the curve |
3/19/2020 | PPEshortage; saferathome; stayathome |
3/21/2020 | GetMePPE |
3/26/2020 | covidiot |
3/28/2020 | epitwitter |
3/31/2020 | Pandemie |
Account names that we are actively tracking in our Twitter collection (v1.8—May 8, 2020).
Tracked since | Account name |
1/22/2020 | PneumoniaWuhan; CoronaVirusInfo; V2019N; CDCemergency; CDCgov; WHO; HHSGov; NIAIDNews |
3/15/2020 | DrTedros |
Our data collection will continue uninterrupted for the foreseeable future. As the pandemic continues to run its course, we anticipate that the amount of data will grow significantly. The data set is available on GitHub [
There are a few known gaps in the data, which are listed in
All of the Tweet ID files are stored in folders that indicate the year and month the tweet was posted (YEAR-MONTH). The individual Tweet ID files each contain a collection of Tweet IDs, with the file names all beginning with the prefix “coronavirus-tweet-id-” followed by the year, month, date, and hour the tweet was posted (YEAR-MONTH-DATE-HOUR).
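The naming scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the repository: the `.txt` extension, the zero-padded date fields, and the use of UTC for the hour are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

PREFIX = "coronavirus-tweet-id-"
EXTENSION = ".txt"  # assumption: the extension is not stated in the text


def tweet_id_file_path(posted_at: datetime) -> PurePosixPath:
    """Build the relative path YEAR-MONTH/coronavirus-tweet-id-YEAR-MONTH-DATE-HOUR."""
    folder = posted_at.strftime("%Y-%m")
    name = PREFIX + posted_at.strftime("%Y-%m-%d-%H") + EXTENSION
    return PurePosixPath(folder) / name


def parse_tweet_id_file_name(name: str) -> datetime:
    """Recover the posting hour encoded in a Tweet ID file name (assumed UTC)."""
    stem = name[: -len(EXTENSION)] if name.endswith(EXTENSION) else name
    if not stem.startswith(PREFIX):
        raise ValueError(f"unexpected file name: {name!r}")
    stamp = datetime.strptime(stem[len(PREFIX):], "%Y-%m-%d-%H")
    return stamp.replace(tzinfo=timezone.utc)
```

For example, `tweet_id_file_path(datetime(2020, 1, 21, 22, tzinfo=timezone.utc))` builds the path for the first hour of the collection under these assumptions.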
We note that if a tweet has been removed from the platform, researchers will not be able to obtain the original Tweet.
List of all releases and their statistics.
Release version | Release date | Data collection period | Tweets, n |
v1.0 | 3/17/2020 | 3/05/2020 - 3/12/2020 | 8,919,411 |
v1.1 | 3/23/2020 | 1/21/2020 - 3/12/2020 | 63,616,072 |
v1.2 | 3/31/2020 | 1/21/2020 - 3/21/2020 | 72,403,796 |
v1.3 | 4/11/2020 | 1/21/2020 - 4/03/2020 | 87,209,465 |
v1.4 | 4/13/2020 | 1/21/2020 - 4/10/2020 | 94,671,486 |
v1.5 | 4/20/2020 | 1/21/2020 - 4/17/2020 | 101,771,227 |
v1.6 | 4/26/2020 | 1/21/2020 - 4/24/2020 | 109,013,655 |
v1.7 | 5/04/2020 | 1/21/2020 - 5/01/2020 | 115,929,358 |
v1.8 | 5/11/2020 | 1/21/2020 - 5/08/2020 | 123,113,914 |
Known gaps in the data set in UTC (v1.8—May 8, 2020).
Date | Time |
2/1/2020 | 4:00 - 9:00 UTC |
2/8/2020 | 6:00 - 7:00 UTC |
2/22/2020 | 21:00 - 24:00 UTC |
2/23/2020 | 0:00 - 24:00 UTC |
2/24/2020 | 0:00 - 4:00 UTC |
2/25/2020 | 0:00 - 3:00 UTC |
3/2/2020 | Intermittent internet connectivity issues |
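When analyzing activity over time, it can help to flag hours that fall inside the known gaps so that a drop in volume is not mistaken for a drop in discussion. The sketch below encodes the rows of the table above. Treating each window as half-open (start included, end excluded) is an assumption, the function names are hypothetical, and the March 2 entry is left out because the table gives no time window for it.

```python
from datetime import datetime, timedelta, timezone

# (date, start hour, end hour) in UTC, copied from the known-gaps table (v1.8).
# The 3/2/2020 row is omitted: it lists no specific hours.
KNOWN_GAPS = [
    ("2020-02-01", 4, 9),
    ("2020-02-08", 6, 7),
    ("2020-02-22", 21, 24),
    ("2020-02-23", 0, 24),
    ("2020-02-24", 0, 4),
    ("2020-02-25", 0, 3),
]


def _gap_bounds(day: str, start_hour: int, end_hour: int):
    """Convert one table row into a pair of timezone-aware UTC datetimes."""
    midnight = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return midnight + timedelta(hours=start_hour), midnight + timedelta(hours=end_hour)


def in_known_gap(moment: datetime) -> bool:
    """Return True if a timezone-aware moment falls inside a listed gap."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    moment = moment.astimezone(timezone.utc)
    # Assumption: windows are half-open, so 4:00 on 2/24 is outside the gap.
    return any(
        start <= moment < end
        for start, end in (_gap_bounds(*row) for row in KNOWN_GAPS)
    )
```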
Our ninth release (v1.8) contains 123,113,914 tweets posted from January 21, 2020 (22:00 UTC), through May 8, 2020 (21:00 UTC). The language breakdown of the tweets can be found in
Breakdown of the most popular languages and the number of associated tweets (v1.8—May 8, 2020).
Language | ISOa | Tweets (N=123,113,914), n (%) |
English | en | 80,698,556 (65.55) |
Spanish | es | 13,848,449 (11.25) |
Indonesian | in | 4,196,591 (3.41) |
French | fr | 3,762,601 (3.06) |
Portuguese | pt | 3,451,196 (2.80) |
Japanese | ja | 2,897,046 (2.35) |
Thai | th | 2,754,627 (2.24) |
(undefined) | und | 2,711,649 (2.20) |
Italian | it | 1,615,916 (1.31) |
Turkish | tr | 1,308,989 (1.06) |
aISO: International Organization for Standardization.
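The percentages in the table above follow directly from the per-language counts and the release total. As a quick check, the short sketch below recomputes them; the counts are copied from the table, and the variable and function names are illustrative only.

```python
TOTAL_TWEETS = 123_113_914  # v1.8 release total

# Per-language tweet counts from the table, keyed by the listed language code.
LANGUAGE_COUNTS = {
    "en": 80_698_556,
    "es": 13_848_449,
    "in": 4_196_591,
    "fr": 3_762_601,
    "pt": 3_451_196,
    "ja": 2_897_046,
    "th": 2_754_627,
    "und": 2_711_649,
    "it": 1_615_916,
    "tr": 1_308_989,
}


def share(count: int, total: int = TOTAL_TWEETS) -> float:
    """Percentage of the release total, rounded to two decimals as in the table."""
    return round(100 * count / total, 2)


if __name__ == "__main__":
    for code, count in LANGUAGE_COUNTS.items():
        print(f"{code}: {share(count)}%")
```

Subtracting the English share from 100 gives the non-English share of this release, which is where the roughly one-third figure for non-English tweets comes from.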
In order to use any Twitter-facing libraries, including hydration software, users must first apply for a Twitter developer account and obtain the necessary authentication tokens [
The GitHub community has also generously contributed scripts to enable researchers to hydrate the Tweet IDs using
We present an initial analysis of our collected data set that verifies that Twitter discourse statistics reflect major events at the time, and leverage Business Insider [
We tracked the frequency of COVID-19-related hashtags, specifically those that contain the substrings “wuhan,” “coronavirus,” and “covid” throughout our collection period (
Usage of hashtags containing the substrings “wuhan,” “covid,” and “coronavirus” over time. COVID-19: coronavirus disease; WHO: World Health Organization.
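The substring-based hashtag tracking can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used for the figure: matching is assumed to be case-insensitive, a hashtag containing more than one substring is counted once for each, the input is assumed to be already grouped by day, and the function name is hypothetical.

```python
from collections import Counter

SUBSTRINGS = ("wuhan", "coronavirus", "covid")


def count_hashtag_substrings(hashtags_by_day):
    """Count, per day, how many hashtags contain each tracked substring.

    hashtags_by_day maps a day label (for example "2020-02-12") to an
    iterable of hashtag strings without the leading "#".
    """
    counts = {}
    for day, hashtags in hashtags_by_day.items():
        day_counts = Counter()
        for tag in hashtags:
            lowered = tag.lower()  # assumption: matching ignores case
            for substring in SUBSTRINGS:
                if substring in lowered:
                    day_counts[substring] += 1
        counts[day] = day_counts
    return counts
```

Under this scheme a hashtag such as "WuhanCoronavirus" adds one to both the "wuhan" and the "coronavirus" series for its day.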
We then examined the percentage of total tweets posted in different languages (
There was also a significant spike in tweets from Italy when the first case related to COVID-19 was reported in Lodi, Italy, and the first death was seen in Veneto [
Tweets in Spanish, Italian, and Japanese over time (our multilingual database began data collection after January 28, 2020).
Verified users on Twitter have been identified by Twitter as accounts of public interest and are verified to be authentic accounts [
Number of tweets from verified users over time. COVID-19: coronavirus disease; WHO: World Health Organization.
There are several limitations to our data set. We collect our data using Twitter’s free streaming API, which returns only 1% of the total Twitter volume, and the volume of tweets we collect remains dependent on our filter endpoint and network connection [
Although our data set is multilingual, containing tweets in over 67 languages, the keywords and accounts we have tracked and continue to track are mostly English. Thus, there is a significant bias in favor of English tweets in our data set over tweets in other languages.
Despite these limitations, our data collection gathers over 1 million tweets a day from the 1% of tweets available to us through Twitter’s API, and our data set contains on average 35% non-English tweets. Our collection begins in late January, capturing tweets during many major developments, and we plan to continue collecting tweets for the foreseeable future.
API: application programming interface
COVID-19: coronavirus disease
IRB: institutional review board
WHO: World Health Organization
The authors gratefully acknowledge support from the Defense Advanced Research Projects Agency (DARPA); contract #W911NF-17-C-0094.
EC was responsible for data curation. All authors contributed to the writing of this manuscript.
None declared.