Published on 25.08.20 in Vol 6, No 3 (2020): Jul-Sep
Preprints (earlier versions) of this paper are available at http://preprints.jmir.org/preprint/20794, first published May 28, 2020.
Big Data, Natural Language Processing, and Deep Learning to Detect and Characterize Illicit COVID-19 Product Sales: Infoveillance Study on Twitter and Instagram
Background: The coronavirus disease (COVID-19) pandemic is perhaps the greatest global health challenge of the last century. Accompanying this pandemic is a parallel “infodemic,” including the online marketing and sale of unapproved, illegal, and counterfeit COVID-19 health products including testing kits, treatments, and other questionable “cures.” Enabling the proliferation of this content is the growing ubiquity of internet-based technologies, including popular social media platforms that now have billions of global users.
Objective: This study aims to collect, analyze, identify, and enable reporting of suspected fake, counterfeit, and unapproved COVID-19–related health care products from Twitter and Instagram.
Methods: This study was conducted in two phases, beginning with the collection of COVID-19–related Twitter and Instagram posts using a combination of web scraping on Instagram and filtering the public streaming Twitter application programming interface for keywords associated with suspect marketing and sale of COVID-19 products. The second phase involved data analysis using natural language processing (NLP) and deep learning to identify potential sellers that were then manually annotated for characteristics of interest. We also visualized illegal selling posts on a customized data dashboard to enable public health intelligence.
Results: We collected a total of 6,029,323 tweets and 204,597 Instagram posts filtered for terms associated with suspect marketing and sale of COVID-19 health products from March to April for Twitter and February to May for Instagram. After applying our NLP and deep learning approaches, we identified 1271 tweets and 596 Instagram posts associated with questionable sales of COVID-19–related products. Generally, product introduction came in two waves, with the first consisting of questionable immunity-boosting treatments and a second involving suspect testing kits. We also detected a low volume of pharmaceuticals that have not been approved for COVID-19 treatment. Other major themes detected included products offered in different languages, various claims of product credibility, completely unsubstantiated products, unapproved testing modalities, and different payment and seller contact methods.
Conclusions: Results from this study provide initial insight into one front of the “infodemic” fight against COVID-19 by characterizing what types of health products, selling claims, and types of sellers were active on two popular social media platforms at earlier stages of the pandemic. This cybercrime challenge is likely to continue as the pandemic progresses and more people seek access to COVID-19 testing and treatment. This data intelligence can help public health agencies, regulatory authorities, legitimate manufacturers, and technology platforms better remove and prevent this content from harming the public.
JMIR Public Health Surveill 2020;6(3):e20794
The novel coronavirus (2019-nCoV; also known as severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) and its associated diagnosis, the coronavirus disease (COVID-19), have created a global crisis. An effect this broad has not been seen since the 1918 influenza pandemic, which impacted 200-700 million people (1/3 of the world’s population at the time) and resulted in global mortality of 50-100 million. Impacting virtually every corner of the world after initially appearing in Wuhan, China, COVID-19’s threat to humanity is broad. Measures to fight the threat, including social distancing, quarantine, and limited commercial activity, are now the global norm, along with travel restrictions and other measures put into place in an effort to contain the pandemic.
With the advent of social media, an information-sharing culture, and technological dispersion throughout the world to access these platforms (ie, mobile, broadband access), the more than 2.9 billion global social media users now have more information resources to help them understand and protect themselves against the coronavirus. Indeed, social media platforms represent one of the most accessible sources of health information and are now being used by agencies such as the World Health Organization (WHO), US Centers for Disease Control and Prevention, US Food and Drug Administration (FDA), and others. Social media conversations are also important for understanding public sentiment, user behavior, and disease transmission dynamics during outbreaks. For example, Twitter has been used extensively for “infoveillance” approaches to assess past outbreaks such as H1N1, Zika virus, and the Ebola outbreak.
Yet accompanying the strong utility of internet technologies and social media to positively impact outbreak response and communication is a nefarious underpinning: a criminal element that now operates across and within social media, seeking to capitalize on confusion, fears, and the acute needs of the public. Labeled by the WHO as an “infodemic,” where there is an overabundance of information, some of which includes misinformation and enables COVID-19–related cybercrime, this parallel information epidemic is now a serious challenge to ensuring the success of public health objectives of mitigating the spread of COVID-19. Beyond misinformation about the etiology and basic facts of COVID-19, which the WHO is trying to counter with its COVID-19 “Myth Busters” website, other forms of COVID-19–related cybercrime are now widespread.
Documented COVID-19–related cybercrimes include fake coronavirus applications that are actually malware, phishing scams using email, text message campaigns and robocalls, economic scams regarding government assistance or relief, and a host of suspect and counterfeit COVID-19 products now sold online. Numerous news outlets have reported the use of online platforms including popular social media sites as a source for suspect COVID-19–related health products. For example, COVID-19 “cures” have appeared across major electronic commerce (e-commerce) sites including Amazon.com (which reported removing a million fake COVID-19 product listings), Shopify store vendors, and other reselling and auction platforms such as eBay. Unapproved COVID-19 test kits, both serologic as well as reverse transcriptase polymerase chain reaction tests, are being sold by multiple sources including Twitter, Instagram, and Reddit. Finally, the dark web has been identified as a source for counterfeit COVID-19 therapeutics, including biological products such as blood plasma.
Hence, there is significant need to assess the characteristics of illegal online COVID-19 product marketing and sale at different stages of the pandemic. In response, this study used big data, natural language processing (NLP), and machine learning to identify the marketing and sale of suspect and unapproved COVID-19 cures, testing kits, and other questionable treatments at earlier stages of the pandemic on two popular social media platforms: Twitter and Instagram. We also describe an approach to visualize findings in a customized data dashboard to enable public health intelligence and reporting to authorities.
This retrospective big data study was conducted in two phases: (1) data collection using the public streaming Twitter application programming interface (API) and web scraping on Instagram to collect social media posts filtered for COVID-19–related keywords and (2) data analysis using NLP to isolate topic clusters related to COVID-19 product sales, combined with a deep learning algorithm to classify a larger volume of social media posts as “signal” posts (ie, posts confirmed as associated with COVID-19 product marketing and selling; see for summary). Data storage and analysis were conducted on an on-premises deep learning workstation in combination with a series of virtual machines deployed on Amazon Web Services cloud computing. Additional details of the data collection, processing, and analysis are available in .
This study first applied a systematic approach to conduct data mining on Twitter by filtering the public streaming API for keywords associated with COVID-19 to collect a large corpus of general COVID-19–related conversations from March 3 to April 11, 2020. The same set of keywords was used to collect data from Instagram using a web scraper built in the programming language Python. We identified general COVID-19–related keywords based on manual searches on each of the platforms, which included different iterations of “COVID-19” (eg, “covid19,” “corona,” “coronavirus,” “coronavid19”), with these keywords converted into hashtags to conduct searches on Instagram. The text of tweets and Instagram posts was captured, as well as retweets and other metadata including likes; favorites; comments; replies; use of similar hashtags; and associated media, hyperlinks, and metadata of posts (eg, time stamp, geolocation, and account information). This metadata was primarily used to identify any potential temporal trends associated with selling posts, account characteristics of sellers, interaction of posts with other users, and geospatial information, and to characterize hyperlinks to external websites that were embedded in selling posts.
After collecting an initial corpus of tweets and Instagram posts using general COVID-19 keywords, we then filtered the corpus for additional terms we believed to be associated with the marketing and sale of illegal, suspect, counterfeit, and otherwise misleading COVID-19–related products and treatments as first identified in manual searches. A full list of all filtered terms used in this study is available in.
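The two-stage keyword filter described above can be sketched in Python as follows. This is a minimal illustration, not the study's actual code: the general terms come from the text, the sales terms are a small placeholder subset taken from terms quoted elsewhere in this paper, and the function names and the case-insensitive substring matching rule are assumptions.

```python
# General COVID-19 keywords quoted in the text.
GENERAL_TERMS = ["covid19", "corona", "coronavirus", "coronavid19"]

# Placeholder subset only; the study's full list of filtered terms is not reproduced here.
SALES_TERMS = ["rapidtest", "detectionkit", "antiviral", "immunity boosting"]


def to_hashtags(terms):
    """Convert keywords to hashtag form, as described for the Instagram searches."""
    return ["#" + term.replace(" ", "") for term in terms]


def contains_any(text, terms):
    """Case-insensitive substring match (an assumed matching rule)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def filter_posts(posts):
    """Keep posts that match a general COVID-19 term and at least one sales term."""
    return [
        post for post in posts
        if contains_any(post, GENERAL_TERMS) and contains_any(post, SALES_TERMS)
    ]


# Hypothetical example posts, written only for this illustration.
sample_posts = [
    "Coronavirus rapidtest kits available, DM to order",
    "Coronavirus cases rise in Europe",
    "Antiviral tea for sale",
]
filtered = filter_posts(sample_posts)
```

Only the first sample post survives both stages: the second has no sales term and the third has no general COVID-19 term.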
Data Analysis Using Unsupervised and Supervised Machine Learning Approaches
After collecting Twitter and Instagram posts, and then filtering for illegal marketing and sales terms, we processed the data by removing hashtags and stop words prior to textual analysis. To our knowledge, there is no existing training set related to detecting suspect COVID-19 products in the context of the current pandemic. This necessitated using a combination of unsupervised and supervised machine learning approaches to detect an initial training set of “signal” posts from each platform that were then used to train a supervised machine learning classifier using a deep learning model.
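The preprocessing step described above (removing hashtags and stop words) could look roughly like the following Python sketch. The stop-word list is a tiny illustrative stand-in, the tokenization rule is an assumption, and "removing hashtags" is interpreted here as dropping the whole hashtag token; the study's exact choices are not stated.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "and", "or", "for", "to", "of", "in", "on", "is", "are", "at", "with"}


def preprocess(text):
    """Lowercase a post, drop hashtag tokens, tokenize, and remove stop words."""
    text = text.lower()
    text = re.sub(r"#\w+", " ", text)          # assumed: remove the entire hashtag token
    tokens = re.findall(r"[a-z0-9/]+", text)   # assumed tokenizer; keeps terms like igm/igg
    return [token for token in tokens if token not in STOP_WORDS]


# Hypothetical example post.
tokens = preprocess("Rapid test kits for the IgM/IgG antibody #covid19 #testkit")
```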
We first used an “unsupervised” NLP approach to group and summarize all the content of filtered social media data stratified by different product groups of filtered terms. This was accomplished by assessing the entire corpus of COVID-19 filtered data using the biterm topic model (BTM) both to identify initial signal posts in the absence of labelled data and to curate an initial labelled training set for supervised machine learning purposes (see for additional details). We have used BTM in prior published studies to detect social media conversations related to substance use behavior, illicit drug diversion, online wildlife trafficking, and corruption-related activities.
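For readers unfamiliar with BTM, its basic unit is the biterm: an unordered pair of words that co-occur in the same short text. The Python sketch below shows only that extraction step on hypothetical tokenized posts. It does not perform topic inference and is not the implementation used in this study; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return all unordered word pairs (biterms) from one tokenized short post."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def count_biterms(docs):
    """Aggregate biterm counts over a corpus of tokenized posts."""
    counts = Counter()
    for tokens in docs:
        counts.update(biterms(tokens))
    return counts


# Hypothetical tokenized posts.
docs = [["rapid", "test", "kit"], ["test", "kit", "sale"]]
counts = count_biterms(docs)
```

A topic model fitted on such corpus-level pair counts, rather than on per-document word counts, is what makes BTM suited to short texts such as tweets.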
Signal posts detected in our BTM phase were then used as our training set for a deep learning classifier designed to conduct supervised classification on the entire corpus of filtered social media posts. For this study, we adopted an existing deep learning model used to detect online controlled substance and illicit drug sales, as previously published by the authors. Although the original deep learning model was trained on social media posts labelled for illegal online drug sales, the signal texts of these two data sets contained very similar features (eg, specific “seller information” and “product information” features). Hence, the pretrained model helped us detect these specific “selling” features for COVID-19 sellers and products. This transfer was possible because our corpus of social media posts had already been purposely filtered for COVID-19 keywords (ie, not illicit drug–related terms), so the selling features detected by the model pertained to COVID-19 products.
Hence, this combination of unsupervised and supervised machine learning approaches enabled us to quickly develop a data collection and analysis approach for an emerging infoveillance challenge given the rapidity and large volume of COVID-19–related data and the evolving nature of the pandemic itself.
After classification by our deep learning algorithm, posts that were output by the model and classified as possible “signal” were then manually annotated to confirm if they were associated with illegal marketing and sales of COVID-19 health-related products (see for coding scheme details). First, coders independently used a binary coding approach (ie, signal vs nonsignal) to verify if posts included the sale of a COVID-19 health product and if a contact or purchase method was made available. The purpose of this binary coding scheme was to eliminate “noise” in the data set, including COVID-19 news, regulatory product announcements, user discussions about treatments and testing, and legitimate warnings from public health, law enforcement, and other sources about COVID-19 fraud and cybercrime that were not related to product marketing or sales.
Second, we classified signal posts based on what specific COVID-19 product was being offered individually or concurrently (eg, testing kits, protective equipment, masks, and pharmaceuticals). We also conducted content analysis to characterize strategies used to market and sell products using an open inductive coding scheme based on previous work characterizing online drug sellers. These characteristics included the method of contacting seller, method of payment (if reported), purported modality of order or purchase, and availability of hyperlinks to other internet sources enabling sale.
Coders individually selected parent topic classifications, removed duplicate topics, and evaluated thematic concurrence by independently coding the entire sample of output posts from our machine learning phase. The third, fourth, fifth, and sixth authors coded posts independently and achieved high intercoder reliability (κ=0.92). In case of inconsistent results, authors reviewed and conferred on the correct classification with the first and last authors who have previously published on the subject.
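As a reminder of how a kappa statistic is computed, the sketch below implements Cohen's kappa for two coders, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance. The text does not state which kappa variant was used for the four coders, so this two-rater version is only an illustration, and the label vectors are made up for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary codes (1 = signal, 0 = nonsignal) from two coders.
coder_a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
coder_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(coder_a, coder_b)
```

In this made-up example the coders agree on 9 of 10 items, chance agreement is 0.5, and kappa is therefore 0.8.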
Availability of Data and Materials
Data collected on social media platforms is available on request from authors, subject to appropriate deidentification.
Ethics Approval and Consent to Participate
Ethics approval and consent to participate were not required for this study. All information collected for this study was from the public domain, and the study did not involve any interaction with users. Identifiable user information was removed from the study results.
Data was collected from March 3 to April 11, 2020, via the Twitter public API stream and from February 5 to May 7 via the web scraper built for Instagram. During this period, we collected a total of 6,029,323 tweets and 204,597 Instagram posts that included a COVID-19 general term and that were also filtered for terms associated with suspect marketing and sale of COVID-19 products. After using our deep learning algorithm to classify all posts filtered for marketing and sales terms, we manually annotated and confirmed 1271 tweets, of which 1042 were unique (see for Twitter examples), and 596 Instagram posts (see for Instagram examples) associated with questionable sales of COVID-19–related products.
Based on the periods of data collection and terms used, we generally observed that there was a first spike or “wave” of social media posts related to fake cures and unproven treatments including home remedies, traditional medicines, supplements, essential oils, and other unproven products. This was followed by a second and much larger wave of posts, including offers for sale of suspect COVID-19 testing, screening, and diagnostic products (see for timeline). Hence, we observed that the volume of suspect COVID-19 products on Twitter and Instagram appeared to materialize in two distinct infodemic waves during this relatively early period of the pandemic, with the volume of topics changing over time as news, misinformation, and rumors regarding potential COVID-19 treatments, supplies, testing availability, and other conversations evolved (see ).
|Infodemic wave and COVID-19 (a)–related product|Twitter posts (n=1271) (b), n|Instagram posts (n=596) (b), n|
|Testing kits and PPE (c)|1028|571|
(a) COVID-19: coronavirus disease.
(b) Total does not add up to sum of posts in waves due to some posts having co-occurring COVID-19–related products.
(c) PPE: personal protective equipment.
The first infodemic wave involved posts related to a variety of unproven treatments (eg, posts with terms such as “antiviral,” “antibiotic,” and products claiming “immunity boosting” benefits) along with products that were subject to regulatory warnings by the FDA (eg, colloidal silver and chlorine). During this time period of observed fake cures and unproven treatments, news events including claims by InfoWars founder Alex Jones and televangelist Jim Bakker that colloidal silver could treat COVID-19 were followed by regulatory warnings by the FDA, likely leading to increased interest and selling activity on social media for similar products. Other similar rumors regarding preventative measures and COVID-19 treatments were also circulating on the internet and social media at the time.
The second wave included terms and posts primarily selling COVID-19 testing kits (eg, terms included “IgM/IgG,” “rapidtest,” and “detectionkit”) in combination with other supplies (eg, masks, personal protective equipment, gloves, and miscellaneous protective gear). During this second wave we observed two distinct spikes of increased volume of posts on or around March 5-10 and April 7-10, 2020. The first spike in March coincided with widespread news coverage about increasing numbers of COVID-19 cases; discussion from state governments about where to get access to testing; press releases from companies discussing development of testing services, such as Quest Diagnostics’ announcement on March 5, 2020, about new testing services it was developing; and news about testing products undergoing evaluation by the FDA (including under emergency use authorization). The second peak in April coincided with news about testing sites opening and expanding, concerns about a US nationwide shortage of testing capacity, and possible underreporting due to testing backlogs.
Finally, we analyzed the data set for terms associated with promising therapeutics that at the time were announced as possible off-label treatments or were undergoing testing and clinical validation. This included the drugs hydroxychloroquine, chloroquine, remdesivir (proprietary name Veklury, Gilead Sciences), favipiravir (proprietary names Avigan, Abigan, FabiFlu), and lopinavir/ritonavir (proprietary names Aluvia and Kaletra, AbbVie Inc), which collectively represent a mix of both proprietary and nonproprietary pharmaceutical treatments, including those that had already been approved by the FDA for non–COVID-19 indications (eg, hydroxychloroquine is approved by the FDA to treat malaria and lupus) and those that are experimental and unapproved drugs. Though we detected some posts in this category, the volume was low relative to waves 1 and 2.
COVID-19 Product Characteristics
In the first wave, which was detected in the earliest stages of the study period from March 3 to April 4, 2020, 242 tweets and 6 Instagram posts (248/1867, 13.28% of all signal posts) advertised the sale of or promoted the use of immune-boosting COVID-19 prevention and treatment products. Herbal products included three general categories: (1) premade herbal or nontraditional remedies; (2) instructions on how to create herbal concoctions and cocktails with purported immunoprotective benefits specific to COVID-19; and (3) other posts including dietary supplements and food products claiming to prevent COVID-19, such as colloidal silver. Other highly questionable products that did not fit into a specific category included a “portable hospital” device that claimed to use a negative ion current to treat COVID-19 and other viruses (see for screenshots).
Premade herbal remedies included products represented as traditional herbal Eastern medicines and compounds but also included consumer items such as lavender spray, pawpaw trees, xylitol, and cow dung with claims of immunoprotective benefits for COVID-19. Sellers of herbal remedies tended to market themselves as doctors or healers with specific reference to Ayurvedic, Eastern, or nontraditional medicine. The descriptive text in some of these posts had misleading claims that combinations of herbal remedies could cure the virus. Moreover, other posts claimed that consumption or proximity to garlic or lomatium could treat COVID-19. Some of the posts used misleading marketing claims such as “approved” or “authorized,” despite these products having no known formal approval for COVID-19 uses.
The second wave, detected from March 6 to April 10, 2020, included the majority of signal posts detected in this study (1028 tweets and 571 Instagram posts, 73.86%) and involved the marketing, sale, and distribution of unapproved COVID-19 testing kits (see). Most of these posts advertised their testing products as IgM/IgG tests, generally a type of test that detects fluctuating antibody concentrations to determine the presence or absence of SARS-CoV-2. These products were mainly advertised as “rapid test” kits or testing supplies containing colloidal gold. Though official commercial rapid lab-based tests to detect IgM/IgG antibodies exist, none are authorized to be sold direct-to-consumer in the United States.
An additional category of testing kit posts included products purportedly approved as at-home kits or “DIY.” However, it should be noted that as of April 21, 2020, only one home testing kit had been approved by the FDA, a home sample collection kit named Pixel by LabCorp (Laboratory Corporation of America). The Pixel kit is only for sample collection at home and the swab samples must be sent to LabCorp processing centers to process COVID-19 results. Other examples of questionable products included those that claimed they could detect COVID-19 by using a fingerstick test or through saliva and urine. Some of the rapid testing kit posts detected in this study also alleged COVID-19 results within minutes using at-home testing and even included questionable claims about the percent accuracy of their tests.
Overall, social media posts involving suspect COVID-19 testing products exhibited similar and identifiable patterns including a picture and description of the specific type of COVID-19 test, the contact information of how to purchase the test kits, and pricing information. Many posts included a claim and mark for a “CE marking,” which is a certification mark that a product conforms with applicable health, safety, and environmental protection standards for the European Economic Area but does not mean the product has been approved by regulatory authorities for COVID-19 screening or diagnosis. Some posts also included users claiming to sell FDA-approved COVID-19 testing equipment (with some products that included spurious FDA labelling in images). Pictures of specific COVID-19 testing kits included variations of the labeled box and materials of the testing kit itself, stock photos of a testing kit, or testing kit packaging. For some posts, the labeling on purported testing kits were written in different languages. Additionally, sellers advertised bundled packages that included COVID-19 tests offered concurrently with personal protective equipment (PPE).
Specific to PPE, we detected 535 out of 1867 (28.66%) posts that offered the sale of masks, gloves, and other protective gear in conjunction with tests. Posts mentioned sales available via individual purchases, wholesale, or in bundles with equipment such as temperature gauges, protective suits, hand sanitizers, and immunity boosting kits. Additionally, compounds such as silver hydrosol, colloidal silver, and antimicrobial copper were advertised as medical supplies that could confer immune boosting benefits and help with COVID-19 prevention in a variety of ways. PPE and supply posts often included the cost and approximate shipping time, with some linked to an external medical supply company e-commerce site.
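The shares of signal posts reported in this section can be reproduced from the counts given in the text. The short Python sketch below does this for the wave 1 and PPE figures; the helper name `share` is illustrative only.

```python
def share(count, total):
    """Percentage of total, rounded to two decimals as reported in the text."""
    return round(100 * count / total, 2)


TOTAL_SIGNAL_POSTS = 1271 + 596   # confirmed tweets plus Instagram posts = 1867

wave1_posts = 242 + 6             # wave 1 tweets plus Instagram posts = 248
ppe_posts = 535                   # posts offering PPE in conjunction with tests

wave1_share = share(wave1_posts, TOTAL_SIGNAL_POSTS)
ppe_share = share(ppe_posts, TOTAL_SIGNAL_POSTS)

print(wave1_share)  # 13.28
print(ppe_share)    # 28.66
```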
Finally, we detected a small volume of posts offering the sale of COVID-19–related therapeutics, none of which had, at the time, been approved for the treatment of COVID-19 (see); these posts were detected from March 5 to April 13, 2020. The majority of posts reviewed for these therapeutic-related filtered terms contained noise and were not engaged in the online sale of actual pharmaceuticals. On Instagram, we only detected 19 posts purportedly selling hydroxychloroquine and chloroquine, 2 posts selling remdesivir, and 1 post selling favipiravir. For Twitter, we detected 5 tweets selling hydroxychloroquine. All tweets selling hydroxychloroquine also concurrently sold PPE.
COVID-19 Seller Metadata Characteristics
Sellers used key selling arguments common in e-commerce marketing that included offers of home delivery, free shipping, or discount codes to lower the price of COVID-19 testing kits and other products identified. Marketing tactics also included key selling argument terms such as “great news,” “flash sale,” “reliable,” “rapid,” “bulk sale,” and “immediate response” to give prospective buyers a sense of urgency and promote availability of products that were generally in scarcity in the legitimate supply chain during the study period. Other keywords included product descriptions that users could easily understand and identify including “immunity spray,” “Corona Kit,” and “IgM/IgG.” Because hashtags provide a way for users to curate topics of common interest, many posts included hashtags of the specific product they were selling (eg, #hydroxychloroquine, #IgM/IgG #test, and #testkit) in combination with general COVID-19 tags (eg, #coronavirus, #COVID9, and #rapidtest).
Generally, profiles of sellers included metadata and images that made them appear to originate from individual users. However, upon closer inspection, some of these accounts appeared to be cloned accounts, with identical profile pictures and usernames that differed by only one or a few characters from those of another more established and likely legitimate social media account. Accounts that were not represented as individuals or had affiliations were generally represented as medical supply or pharmaceutical companies. Individual and organizational accounts claimed to carry inventory of various COVID-19 testing kits and PPE, with alternative medicine–related accounts also selling various herbal remedies. Most posts contained pictures that included the product package, contents of the package and additional text, or had a general illustration of the SARS-CoV-2 virus. Some posts also included hyperlinks to external sites selling COVID-19 products, including 124 Twitter posts (90 unique hyperlinks) and 41 Instagram posts (25 unique hyperlinks).
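The cloned-account pattern described above (usernames that differ from an established account by only a character or two) could in principle be screened automatically. The Python sketch below uses Levenshtein edit distance for this purpose. It is an illustration only and was not part of this study, in which accounts were inspected manually; the usernames, the function names, and the distance threshold are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def looks_cloned(candidate, established, max_distance=2):
    """Flag usernames that are near, but not identical to, an established one."""
    distance = edit_distance(candidate.lower(), established.lower())
    return 0 < distance <= max_distance


# Hypothetical usernames for illustration.
flag = looks_cloned("medsupply_c0", "medsupply_co")
```

A flag from such a check would only be a prompt for manual review, since similar usernames can also belong to unrelated legitimate accounts.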
Contact information to enter into a transaction generally included instructions and details for direct messaging, WhatsApp numbers, email addresses, WeChat, and Skype for direct contact with seller. Some posts for testing kits also included hyperlinks to external e-commerce sites for purchase. Still other posts had descriptive text that linked to the user profile for additional contact information. A number of different languages were identified in the descriptive text of selling posts including English, Chinese, Japanese, Spanish, German, Arabic, Hindi, Russian, Ukrainian, Thai, and some others. For posts in a non-English language, coders self-translated those in Chinese, Japanese, Spanish, and Hindi, as coauthors spoke and read these languages. For other languages, the study team relied on Google Translate to assess the content of posts and if they were signals.
We note that our deep learning classifier focuses on the detection of “selling” arguments (in the English language) and the presence of contact information from a seller. Hence, it is possible that not all non-English COVID-19 selling posts were detected, though non-English signal posts may contain the features subject to classification. The presence of non-English language posts and characters likely indicates that signal posts targeted non-US audiences and social media users, even though the majority of users on both of these platforms are located in the United States. However, determining more precise geolocation of users was difficult as only 87 tweets and 134 Instagram posts had geotagged information available.
Generally, the metadata associated with the majority of signal posts indicated that there were medium to low levels of interaction with other social media users based on the number of likes, favorites, or retweets (the general metric of how much sharing and dissemination a post is getting on social media). The majority of tweets or Instagram posts had few likes, retweets, and followers. No signal posts were retweeted more than 50 times; 9 (0.86%) were retweeted more than 10 times, and 1033 (99.13% of unique tweets) were retweeted fewer than 10 times. For Instagram posts, the average number of “likes” for a signal post was 12.5, with 87 (15.5%) having more than 10 likes and 473 (84.4%) posts having fewer than 10 likes. For the interaction that was observed, we noticed that there was more interaction between sellers and other users in the comments section on Instagram compared to replies on Twitter. There were exceptions, with one detected Twitter post from an account with over 97,000 Twitter followers and 1.5 million Instagram followers advertising the sale of a COVID-19 at-home finger stick IgG/IgM test on both Twitter and Instagram from what was characterized as a “LEGIT” supplier.
Although the majority of signal posts included contact information and instructions on purchasing the product, pricing information was included for fewer than 30 posts, primarily advertising sales of COVID-19 testing kits. The prices of testing kits ranged from US $4-$398 (all currencies converted to US dollars) for offers of individual kits as well as bulk orders. Individual kits were priced from US $4 to US $375, with a mean cost of US $64.63 (SD $92.96) and a median cost of US $20.61/kit. Bulk kits were priced in the range of US $30.76-$398 for 25-50 kits/box with a mean cost of US $168.70 (SD $175.88). A questionable product described as a “portable hospital” device that claimed to use a negative ion current to treat COVID-19 and other viruses was priced at US $6000 (see for screenshots). Posts advertising availability of large quantities of testing kits also mentioned kits could be purchased at a cheaper price if ordered in bulk (hundreds to thousands). A few posts also included links to major e-commerce platforms such as eBay or AliExpress. Fiat currency was not limited to US dollars but included Euros, Pound Sterling, Indian Rupee, Philippine Peso, and other currencies. Additionally, payment transactions could be effectuated through PayPal or cryptocurrency such as Bitcoin.
This study used big data and machine learning approaches to detect and characterize illegal offers of sale for COVID-19 products on Twitter and Instagram. Overall, the total volume of illegal selling posts detected was low relative to the total volume of COVID-19 conversations collected (our nonfiltered general COVID-19 data set over this time period had over 165 million tweets and more than 272,000 Instagram posts), though the combined number of signal tweets and Instagram posts was over 1000, representing a clear risk to patient safety. A possible reason for the small percentage of signal posts was that our data collection approach started with general COVID-19–related social media posts that were not specific to illegal sales and filtered for these terms only after data collection was complete. As the overall volume of COVID-19 social media posts was extremely high, a more purposeful sampling approach focused on COVID-19–related health products or testing kits may have yielded a corpus with more signal.
Despite these limitations, we nevertheless identified over 1000 suspect selling posts, with the majority related to unapproved COVID-19 testing kits, which were detected at a time when access to legitimate COVID-19 testing in countries like the United States was extremely limited [, ]. Based on the language, currency, and content of these posts, this infoveillance challenge also appears to be global, though the majority of posts detected were in the English language, reflecting the fact that most social media users on Twitter and Instagram are located in the United States. Far fewer posts were detected for therapeutic products, though separate research conducted by our own group and others have turned up various drugs (including hydroxychloroquine, chloroquine, and favipiravir), vaccines, and even blood plasma offered for sale via illegal online pharmacies, e-commerce sites, and on the dark web [ , ].
The lack of signal posts for COVID-19 therapeutics may indicate that product segmentation on different parts of the internet is occurring. Specifically, illegal online pharmacies and dark web marketplaces may have already been selling these products outside of the context of treating COVID-19, as many of these drugs are already approved for other indications, diminishing the opportunity or need for direct sales to consumers via social media. Consumer demand for drugs may have also been muted as the number of confirmed COVID-19 cases at the time was relatively low and there was limited evidence regarding the efficacy of these products or their given active pharmaceutical ingredient to treat COVID-19. Instead, a widespread lack of access to testing may have made getting a COVID-19 diagnosis a priority before seeking treatment, reflected in our high detection of suspect COVID-19 testing kits.
Importantly, the presence of this type of criminal activity and fraud on social media is not a new phenomenon, as cybercriminals are keen to take advantage of the anonymity, convenience, and accessibility to the public that these platforms offer to advance crimes of opportunity. In the case of COVID-19, we are arguably in the midst of a “cyber syndemic,” where the public health consequences of COVID-19 interact with the unique risks associated with the internet and social media, which can worsen the spread of the disease. Specifically, the posts detected in this study can bring both economic and health harm by introducing unproven, substandard, falsified, and counterfeit health products to those afflicted by COVID-19, leading to financial loss while also increasing the risk of disease spread by negatively influencing health behaviors.
Reflecting the real-world consequences of COVID-19–related crimes, the US Federal Trade Commission estimates that there have already been US $40 million in losses due to COVID-19 fraud. Law enforcement groups such as the US Customs and Border Protection have intercepted hundreds of fake COVID-19–related products at borders and have launched several initiatives such as Homeland Security Investigations’ “Operation Stolen Promise” and the S.T.O.P COVID-19 Fraud Campaign. The US Federal Bureau of Investigation reported a 300% increase in fraud and cybercrime scams since 2019-nCoV appeared. Operation Pangea, an Interpol-led takedown of illicit internet sites, focused its March 2020 activities on COVID-19 scams. It found extensive and growing fraud for coronavirus medical “treatments,” cures, and protective equipment as well as, more recently, sales of all forms of chloroquine. The European Anti-Fraud Office also announced that the European Union will dedicate resources to target fake coronavirus medical and protection products being sold online.
However, combatting COVID-19 cybercrime, and more specifically illegal online sales of COVID-19 health products, is a dynamic challenge. Existing difficulties of interdicting the global illegal online trafficking of counterfeit and falsified products are accelerated and accentuated during a pandemic, as information rapidly changes, misinformation proliferates, and platforms struggle to self-regulate large and diverse volumes of content on the pandemic. In contrast, black markets can be adaptive to these types of crises, with scammers seeking to take advantage of confusion and heightened concerns about safety and health risks to target vulnerable consumers with fake products and treatments.
To address these challenges, a data-driven public health intelligence approach is needed. Although the results of this study are informative about the characteristics of illegal COVID-19 online sales during the early stages of the outbreak, they are nevertheless static, reflecting the degree of risk to the public only at a single point in time. Instead, active surveillance of illegal COVID-19 digital marketplaces is needed, along with visualization tools that can provide the data intelligence required to understand the constantly changing dynamics of this infodemic threat. Recognizing this need, we have developed a prototype data dashboard on the open source platform Redash for use by public health officials, drug regulatory authorities, and law enforcement agencies. It visualizes our ongoing big data digital surveillance work to detect and classify illegal marketing, sales, and trafficking of COVID-19–related products on social media platforms and other parts of the internet. The public version of the dashboard is listed in the references (Covid19 Fake Social-media Surveillance, Redash). The dashboard reports and visualizes characteristics of illegal selling posts including location (if available), generates a list of top-related hashtags, captures images of suspect products, and analyzes other metadata about selling activity. We have shared a version of this dashboard with colleagues at the WHO and FDA in hopes that it can help improve and accelerate content removal, increase awareness of the risks to consumers, and lead to a safer online environment in the midst of this ongoing pandemic.
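One of the dashboard features mentioned above, a list of top-related hashtags, could be computed along the following lines. This is an illustrative sketch rather than the dashboard's actual implementation: the function name, the choice to count each hashtag once per post, and the sample posts are all assumptions.

```python
import re
from collections import Counter

# Matches a hash sign followed by word characters (letters, digits, underscore).
HASHTAG_RE = re.compile(r"#(\w+)")


def top_hashtags(posts, k=3):
    """Return the k most frequent hashtags across posts as (tag, count) pairs.

    Each hashtag is counted once per post and compared case-insensitively.
    """
    counts = Counter()
    for text in posts:
        counts.update({tag.lower() for tag in HASHTAG_RE.findall(text)})
    return counts.most_common(k)


# Hypothetical signal posts (not taken from the data set).
signal_posts = [
    "#covid19 #testkit available",
    "#COVID19 rapid #testkit",
    "#covid19 immunity booster #health",
]
ranking = top_hashtags(signal_posts, k=2)
print(ranking)
```

Counting per post rather than per occurrence keeps a single post that repeats a hashtag many times from dominating the ranking.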
Our study provides a snapshot of the characteristics of illegal online sales of COVID-19–related health products on two popular social media platforms: Twitter and Instagram. It also details an innovative methodology using a combination of unsupervised and supervised machine learning to detect illegal sales during a global pandemic. Unfortunately, illegal online sales of COVID-19 health products are likely to continue and possibly accelerate as this health emergency continues to progress. A “flattening of the curve” will not halt the progression of this parallel infodemic, as the public continues to desperately seek access to COVID-19 testing, therapeutics, and an eventual vaccine. As legitimate news about promising and new COVID-19 treatments and countermeasures becomes available, scammers and counterfeiters will inevitably seek to capitalize on desperation and high demand from global citizens who simply want to be safe and prepared against this historic disease. Future studies should continue to explore the dynamic nature of the COVID-19 cyber syndemic and build solutions to prevent its digital spread.
JL collected the data. All authors designed the study, conducted the data analyses, wrote the manuscript, and approved the final version of the manuscript.
Conflicts of Interest
TM, JL, MN, and MC are employees of the startup company S-3 Research LLC. S-3 Research is a startup funded and currently supported by the National Institutes of Health – National Institute on Drug Abuse through a Small Business Innovation and Research contract for opioid-related social media research and technology commercialization. Authors report no other conflict of interest associated with this manuscript.
Additional details regarding study methods. DOCX File, 116 KB
Coding scheme for social media posts. DOCX File, 17 KB
- Patterson KD, Pyle GF. The geography and mortality of the 1918 influenza pandemic. Bull Hist Med 1991;65(1):4-21.
- Guan W, Ni Z, Hu Y, Liang W, Ou C, He J, et al. Clinical characteristics of coronavirus disease 2019 in China. N Engl J Med 2020 Apr 30;382(18):1708-1720.
- Lau H, Khosrawipour V, Kocbach P, Mikolajczyk A, Schubert J, Bania J, et al. The positive impact of lockdown in Wuhan on containing the COVID-19 outbreak in China. J Travel Med 2020 May 18;27(3).
- Merchant RM. Evaluating the potential role of social media in preventive health care. JAMA 2020 Jan 10.
- Bujnowska-Fedak M, Waligóra J, Mastalerz-Migas A. The internet as a source of health information and services. Adv Exp Med Biol 2019;1211:1-16.
- Beck F, Richard J, Nguyen-Thanh V, Montagni I, Parizot I, Renahy E. Use of the internet as a health information resource among French young adults: results from a nationally representative survey. J Med Internet Res 2014 May 13;16(5):e128.
- Eysenbach G. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. J Med Internet Res 2009 Mar 27;11(1):e11.
- Chew C, Eysenbach G. Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak. PLoS One 2010 Nov 29;5(11):e14118.
- Fu K, Liang H, Saroha N, Tse ZTH, Ip P, Fung IC. How people react to Zika virus outbreaks on Twitter? A computational content analysis. Am J Infect Control 2016 Dec 01;44(12):1700-1702.
- Young R, Tully M, Dalrymple K. #Engagement: use of Twitter chats to construct nominal participatory spaces during health crises. Inf Commun Soc 2017 Mar 27;21(4):499-515.
- Zarocostas J. How to fight an infodemic. Lancet 2020 Feb 29;395(10225):676.
- Hao K, Basu T. The coronavirus is the first true social-media "infodemic". MIT Technology Review. 2020 Feb 12. URL: https://www.technologyreview.com/s/615184/the-coronavirus-is-the-first-true-social-media-infodemic/ [accessed 2020-03-16]
- COVID-19 cyberthreats. Interpol. URL: https://www.interpol.int/en/Crimes/Cybercrime/COVID-19-cyberthreats [accessed 2020-05-27]
- Saltzman M. Coronavirus pandemic generates new fraud strains: COVID-19 scams on computers, smartphones Internet. USA Today. 2020 Apr 04. URL: https://www.usatoday.com/story/tech/columnist/2020/04/04/coronavirus-scams-going-viral-attacking-computers-and-smartphones/2939240001/ [accessed 2020-05-27]
- Carrns A. Bogus vaccines. Fake testing sites. Virus frauds are flourishing. The New York Times. 2020 Apr 17. URL: https://www.nytimes.com/2020/04/17/your-money/coronavirus-fraud.html [accessed 2020-05-27]
- Heilweil R. Coronavirus scammers are flooding social media with fake cures and tests. Vox. 2020 Apr 17. URL: https://www.vox.com/recode/2020/4/17/21221692/digital-black-market-covid-19-coronavirus-instagram-twitter-ebay [accessed 2020-05-27]
- How many fake COVID-19 cures has Amazon.com had to remove? Government Technology. 2020 Mar 02. URL: https://www.govtech.com/question-of-the-day/Question-of-the-Day-for-03022020.html [accessed 2020-04-01]
- Frenkel S, Alba D, Zhong R. Surge of virus misinformation stumps Facebook and Twitter. The New York Times. 2020 Mar 08. URL: https://www.nytimes.com/2020/03/08/technology/coronavirus-misinformation-social-media.html [accessed 2020-04-01]
- Predit R. FDA warns Americans to beware of fake COVID-19 test kits. US News. 2020 Mar 23. URL: https://www.usnews.com/news/health-news/articles/2020-03-23/fda-warns-americans-to-beware-of-fake-covid-19-test-kits [accessed 2020-04-01]
- Whitney L. The dark web: where coronavirus fraud, profiteering, malware, and scams are discussed. TechRepublic. 2020 Mar 30. URL: https://www.techrepublic.com/article/the-dark-web-where-coronavirus-fraud-profiteering-malware-and-scams-are-discussed/ [accessed 2020-04-01]
- Kalyanam J, Katsuki T, R G Lanckriet G, Mackey TK. Exploring trends of nonmedical use of prescription drugs and polydrug abuse in the Twittersphere using unsupervised machine learning. Addict Behav 2017 Feb;65:289-295.
- Mackey TK, Kalyanam J, Katsuki T, Lanckriet G. Twitter-based detection of illegal online sale of prescription opioid. Am J Public Health 2017 Dec;107(12):1910-1915.
- Xu Q, Li J, Cai M, Mackey TK. Use of machine learning to detect wildlife product promotion and sales on Twitter. Front Big Data 2019 Aug 27;2.
- Li J, Xu Q, Shah N, Mackey TK. A machine learning approach for the detection and characterization of illicit drug dealers on Instagram: model evaluation study. J Med Internet Res 2019 Jun 15;21(6):e13803.
- Liang BA, Mackey TK. Online availability and safety of drugs in shortage: a descriptive study of internet vendor characteristics. J Med Internet Res 2012 Feb 09;14(1):e27.
- Kalyanam J, Mackey TK. A review of digital surveillance methods and approaches to combat prescription drug abuse. Curr Addict Rep 2017 Sep 18;4(4):397-409.
- Orizio G, Rubinelli S, Schulz PJ, Domenighini S, Bressanelli M, Caimi L, et al. "Save 30% if you buy today". Online pharmacies and the enhancement of peripheral thinking in consumers. Pharmacoepidemiol Drug Saf 2010 Sep;19(9):970-976.
- Lytvynenko J. Here's a running list of the latest hoaxes spreading about the coronavirus. BuzzFeed News. 2020 Mar 16. URL: https://www.buzzfeednews.com/article/janelytvynenko/coronavirus-fake-news-disinformation-rumors-hoaxes [accessed 2020-07-15]
- Kulshrestha J, Kooti F, Nikravesh A, Gummadi K. Geographic dissection of the Twitter network. 2012 Presented at: Sixth International AAAI Conference on Weblogs and Social Media; June 4–7, 2012; Dublin, Ireland.
- Bakerman J, Pazdernik K, Wilson A, Fairchild G, Bahran R. Twitter geolocation: a hybrid approach. ACM Trans Knowl Discov Data 2018 Apr 27;12(3):1-17.
- Ajao O, Hong J, Liu W. A survey of location inference techniques on Twitter. J Information Sci 2015 Nov 20;41(6):855-864.
- Sharfstein JM, Becker SJ, Mello MM. Diagnostic testing for the novel coronavirus. JAMA 2020 Mar 09.
- Weaver C, Ballhaus R. Coronavirus testing hampered by disarray, shortages, backlogs. Wall Street J 2020.
- Franklin S. Bogus cures and unapproved treatments for COVID-19 flood the internet. LegitScript. 2020 Mar 26. URL: https://www.legitscript.com/blog/2020/03/bogus-cures-and-unapproved-treatments-for-covid-19-flood-the-internet/ [accessed 2020-05-27]
- Broadhurst R, Ball M, Jiang C. Availability of COVID-19 related products on Tor darknet markets. AIC. 2020 Apr 30. URL: http://www.aic.gov.au/publications/sb/sb24 [accessed 2020-05-27]
- Hatmaker T. Twitter broadly bans any COVID-19 tweets that could help the virus spread. TechCrunch. 2020 Mar 18. URL: https://social.techcrunch.com/2020/03/18/twitter-coronavirus-covid-19-misinformation-policy/ [accessed 2020-05-27]
- Porter J. Facebook and Instagram to remove coronavirus misinformation. The Verge. 2020 Jan 31. URL: https://www.theverge.com/2020/1/31/21116500/facebook-instagram-coronavirus-misinformation-false-cures-prevention [accessed 2020-05-27]
- Hern A. Twitter to remove harmful fake news about coronavirus. The Guardian. 2020 Mar 19. URL: http://www.theguardian.com/world/2020/mar/19/twitter-to-remove-harmful-fake-news-about-coronavirus [accessed 2020-04-01]
- Statt N. Major tech platforms say they’re ‘jointly combating fraud and misinformation’ about COVID-19. The Verge. 2020 Mar 16. URL: https://www.theverge.com/2020/3/16/21182726/coronavirus-covid-19-facebook-google-twitter-youtube-joint-effort-misinformation-fraud [accessed 2020-03-16]
- Newton PN, Bond KC, 53 signatories from 20 countries. COVID-19 and risks to the supply and quality of tests, drugs, and vaccines. Lancet Glob Health 2020 Jun;8(6):e754-e755.
- FTC COVID-19 complaints. Federal Trade Commission. URL: https://www.ftc.gov/system/files/attachments/coronavirus-covid-19-consumer-complaint-data/covid-19-daily-public-complaints.pdf [accessed 2020-05-27]
- CBP officers seize fake COVID-19 test kits at LAX. US Customs and Border Protection. 2020 Mar 14. URL: https://www.cbp.gov/newsroom/national-media-release/cbp-officers-seize-fake-covid-19-test-kits-lax [accessed 2020-04-01]
- England R. FBI sees cybercrime reports increase fourfold during COVID-19 outbreak. Engadget. 2020 Apr 20. URL: https://www.engadget.com/fbi-cybercrime-complaints-increase-fourfold-covid-19-091946793.html [accessed 2020-05-27]
- Interpol. Global operation targets rise in counterfeit COVID-related medical items. React Weekly 2020 Apr 4;1798(1):2-2.
- Winder D. There’s no COVID-19 cure online: $14 million seized in fake pharma as 121 arrested. Forbes. 2020 Mar 24. URL: https://www.forbes.com/sites/daveywinder/2020/03/24/theres-no-covid-19-cure-online-14-million-seized-in-fake-pharma-as-121-arrested/ [accessed 2020-04-01]
- McDougal T, Mackey T. How black markets have adapted to - and shaped - the COVID-19 crisis. Vision of Humanity. 2020. URL: http://visionofhumanity.org/economists-on-peace/how-black-markets-have-adapted-to-and-shaped-the-covid-19-crisis/ [accessed 2020-07-16]
- Covid19 Fake Social-media Surveillance (public). Redash. URL: http://18.104.22.168/public/dashboards/tkgBY3iXox3AZnCa2d5wU5ziogWmOhF69newT2R6?org_slug=default&p_from=2007-04-01&p_to=d_now&p_platform=%5B%22instagram%22, %22twitter%22,%22tumblr%22,%22reddit%22,%22youtube%22,%22darkweb%22%5D
- Yang X, Luo J. Tracking illicit drug dealing and abuse on Instagram using multimodal analysis. ACM Trans Intell Syst Technol 2017 Jul 17;8(4):1-15.
- Santosh KC. AI-driven tools for coronavirus outbreak: need of active learning and cross-population train/test models on multitudinal/multimodal data. J Med Syst 2020 Mar 18;44(5):93.
- Schaumberg AJ, Juarez-Nicanor WC, Choudhury SJ, Pastrián LG, Pritt BS, Prieto Pozuelo M, et al. Interpretable multimodal deep learning for real-time pan-tissue pan-disease pathology search on social media. Mod Pathol 2020 May 28.
- Lazaridis M, Axenopoulos A, Rafailidis D, Daras P. Multimedia search and retrieval using multimodal annotation propagation and indexing techniques. Signal Processing Image Commun 2013 Apr;28(4):351-367.
API: application programming interface
BTM: biterm topic model
COVID-19: coronavirus disease
e-commerce: electronic commerce
FDA: Food and Drug Administration
NLP: natural language processing
PPE: personal protective equipment
SARS-CoV-2: severe acute respiratory syndrome coronavirus 2
WHO: World Health Organization
2019-nCoV: novel coronavirus
Edited by T Sanchez; submitted 28.05.20; peer-reviewed by H Zhao, A Fittler; comments to author 10.07.20; revised version received 17.07.20; accepted 03.08.20; published 25.08.20
©Tim Ken Mackey, Jiawei Li, Vidya Purushothaman, Matthew Nali, Neal Shah, Cortni Bardier, Mingxiang Cai, Bryan Liang. Originally published in JMIR Public Health and Surveillance (http://publichealth.jmir.org), 25.08.2020.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.