This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
To harness the full potential of social media for epidemiological surveillance of drug abuse trends, the field needs a greater level of automation in processing and analyzing social media content.
The objective of the study is to describe the development of supervised machine-learning techniques for the eDrugTrends platform to automatically classify tweets by type/source of communication (personal, official/media, retail) and sentiment (positive, negative, neutral) expressed in cannabis- and synthetic cannabinoid–related tweets.
Tweets were collected using the Twitter streaming application programming interface (API) and filtered through the eDrugTrends platform using keywords related to cannabis, marijuana edibles, marijuana concentrates, and synthetic cannabinoids. After creating coding rules and assessing intercoder reliability, a manually labeled data set (N=4000) was developed by coding several batches of randomly selected subsets of tweets extracted from the pool of 15,623,869 tweets collected by eDrugTrends (May-November 2015). Of the 4000 tweets, 25% (1000/4000) were used to build source classifiers and 75% (3000/4000) were used for sentiment classifiers. Logistic regression (LR), naive Bayes (NB), and support vector machines (SVM) were used to train the classifiers. Source classification (n=1000) tested Approach 1, which used short URLs, and Approach 2, in which URLs were expanded and included in the bag-of-words analysis. For sentiment classification, Approach 1 used all tweets, regardless of their source/type (n=3000), while Approach 2 applied sentiment classification to personal communication tweets only (2633/3000, 88%). Multiclass and binary classification tasks were examined, and machine-learning sentiment classifier performance was compared with Valence Aware Dictionary for sEntiment Reasoning (VADER), a lexicon and rule-based method. The performance of each classifier was assessed using 5-fold cross validation, with average F-scores calculated across folds. One-tailed t tests were used to assess the statistical significance of differences in classifier performance.
In multiclass source classification, the use of expanded URLs did not significantly improve classifier performance (F-scores of 0.7972 vs 0.8102 for SVM).
The study provides an example of the use of supervised machine learning methods to categorize cannabis- and synthetic cannabinoid–related tweets with fairly high accuracy. Use of these content analysis tools, along with the geographic identification capabilities developed by the eDrugTrends platform, will provide powerful methods for tracking regional changes in user opinions related to cannabis and synthetic cannabinoid use over time and across different regions.
To design effective prevention, intervention, and policy measures, public health professionals require timely and reliable information on new and emerging drug use practices and trends [
Twitter is a microblogging service provider and social network platform that was launched in 2006. Currently, Twitter reports 310 million monthly active users [
Because of the high volume of data generated by Twitter users and availability of geographic information, analysis of tweets can help identify geographic and temporal trends [
Several prior studies used manual coding to classify cannabis, alcohol, and other drug-related tweets by sentiment [
Although several prior studies reported on the development of automated approaches to analyze tobacco- and e-cigarette–related tweet content [
The study builds on an interdisciplinary collaboration that combines drug abuse and computer science research to develop eDrugTrends, a highly scalable infoveillance platform for real-time processing of social media data related to cannabis and synthetic cannabinoid use. Development of the eDrugTrends platform is based on previous research and infrastructure created by our research team, including Twitris (for analysis of Twitter data) [
The key goal of this study is to describe the development and performance of machine learning classifiers to automatically identify tweets by the source/type of communication (personal, official/media, retail) and sentiment (positive, negative, neutral) expressed in cannabis- and synthetic cannabinoid–related tweets. Because prior research identified distinct linguistic and sentiment patterns in personal communication tweets compared with tweets generated by organizational entities [
The eDrugTrends platform [
The Wright State University institutional review board reviewed the protocol and determined that the study meets the criteria for Human Subjects Research exemption 4 because it is limited to publicly available tweets. Tweets used as examples were modified slightly to ensure the anonymity of Twitter users who had posted them.
Manual coding was conducted to develop a labeled data set to be used as a “gold standard” for machine learning classifiers. First, 3 drug abuse researchers or “domain experts” (RD, FL, RC) conducted preliminary “open” coding [
Development of the manually labeled data set involved several phases of coding conducted by the first and third authors. To obtain a more balanced data set, less common categories (eg, negative or retail-related tweets) were purposefully oversampled (for more details, see
The sample of 4000 manually labeled tweets was split into 2 subsamples: 1000 were used to train source classifiers, and 3000 were allocated for sentiment classification. Information on the manually labeled tweet numbers by category for each subsample is provided in
Because the study aimed to integrate source and sentiment classification by focusing on sentiment in personal communication tweets only, source classification can be seen as a preprocessing step that is done before sentiment classification. First, 1000 tweets were used to train a source classifier (
Development of source classifiers focused only on tweets with URLs. Because all media- and retail-related tweets contained URLs, tweets without URLs could be automatically classified as belonging to the personal communication category. To select 1000 tweets with URLs for the source classifier, approximately equal numbers of tweets were randomly sampled from each category: 330 official/media-related, 340 retail-related, and 330 personal communication tweets containing URLs.
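The routing rule described above can be sketched as follows. This is an illustrative stand-in, not the eDrugTrends implementation: the regular expression and function name are assumptions, and the sketch assumes URLs appear in the tweet text (eg, as t.co links).

```python
import re

# Hypothetical sketch of the preprocessing rule described above: tweets
# without any URL are routed straight to the "personal" category, and only
# tweets containing URLs are passed on to the trained source classifier.
URL_PATTERN = re.compile(r"https?://\S+|t\.co/\S+")

def route_tweet(text):
    """Return 'personal' for URL-free tweets, else 'needs_classifier'."""
    if URL_PATTERN.search(text):
        return "needs_classifier"
    return "personal"
```

In a full pipeline, tweets routed to "needs_classifier" would then be labeled as personal, media-related, or retail-related by the trained source classifier.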
Summary information about the machine learning classification models used in the study is presented in
First, performance of source classifiers was assessed for multiclass classification (media, retail, personal). Next, the best performing machine learning algorithm in multiclass classification was selected to assess 3 binary classification tasks: (1) media versus the remaining tweets, (2) retail versus the remaining tweets, and (3) personal communication tweets versus the remaining tweets (
Sentiment classification tested 2 approaches: Approach 1 applied sentiment classification to all tweets, regardless of their source/type, using all 3000 manually labeled tweets (1292 positive, 921 negative, 787 neutral/unidentifiable), and Approach 2 applied sentiment classification to tweets identified as personal communications only, excluding retail and media-related tweets. For this approach, the sample of 3000 tweets was first processed using the best performing source classifier (developed for this study) to identify personal communication tweets, which resulted in a sample of 2633 tweets (
Performance of sentiment classifiers was examined for multiclass (positive, negative, neutral) and binary classification tasks. Binary classification focused on positive versus negative tweets to examine how well sentiment classifiers performed on reliable categories (as determined by reliability assessment), excluding the neutral/unidentifiable group, which reached a low level of agreement among human coders. To test Approach 1 (all tweets, regardless of source/type), binary classification used a data set of 2213 tweets obtained after removing 787 neutral tweets from the sample of 3000. To test Approach 2 (personal communication tweets only), binary classification used a data set of 2007 tweets obtained after removing 626 neutral/unidentifiable tweets from the sample of 2633 (
In addition, the study used a lexicon and rule-based method VADER that was developed for the analysis of social media texts [
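To make the lexicon and rule-based idea concrete, a minimal scorer in the spirit of VADER is sketched below. The lexicon entries, negation and intensifier rules, and decision thresholds are toy values chosen for this illustration and are far simpler than the actual VADER method.

```python
# Toy illustration of a lexicon and rule-based sentiment method in the
# spirit of VADER (not the actual VADER implementation): each word carries
# a valence score, simple rules handle negation and intensification, and
# the summed score is mapped onto positive/negative/neutral labels.
LEXICON = {"love": 3.0, "good": 1.9, "want": 0.8,
           "fake": -1.8, "shit": -2.6, "hate": -2.7}
NEGATIONS = {"not", "don't", "never", "no"}
INTENSIFIERS = {"really": 1.5, "so": 1.3}

def rule_based_sentiment(text):
    words = text.lower().split()
    score, boost, negate = 0.0, 1.0, False
    for w in words:
        if w in NEGATIONS:
            negate = True
        elif w in INTENSIFIERS:
            boost = INTENSIFIERS[w]
        elif w in LEXICON:
            v = LEXICON[w] * boost
            if negate:
                v = -v
            score += v
            boost, negate = 1.0, False  # rules apply to the next hit only
    if score > 0.05:
        return "positive"
    if score < -0.05:
        return "negative"
    return "neutral"
```

Unlike the trained classifiers, a method of this kind requires no labeled data, which is why it serves as a useful baseline for comparison.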
• Multiclass classification [logistic regression (LR), naive Bayes (NB), support vector machines (SVM)]:
° Personal versus media versus retail (n=1000)
• Binary classification (using classifier that showed the best results in multiclass classification):
° Personal versus the rest (n=1000)
° Retail versus the rest (n=1000)
° Media versus the rest (n=1000)
• Multiclass classification (LR, NB, SVM):
° Personal versus media versus retail (n=1000)
• Binary classification (using classifier that showed the best results in multiclass classification):
° Personal versus the rest (n=1000)
° Retail versus the rest (n=1000)
° Media versus the rest (n=1000)
• Multiclass classification (LR, NB, SVM):
° Positive versus negative versus neutral/unknown (n=3000)
• Binary classification (LR, NB, SVM):
° Positive versus negative (n=2213; neutral excluded)
• Multiclass classification (LR, NB, SVM):
° Positive versus negative versus neutral/unknown (n=2633)
• Binary classification (LR, NB, SVM):
° Positive versus negative (n=2007; neutral/unknown excluded)
To build classifiers, the tweets were tokenized and all words were processed to convert uppercase letters to lowercase. Because prior research suggests that stop words and complete forms of words can be useful sentiment indicators, particularly in brief texts such as tweets, stop words were retained, and no stemming was applied [
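A minimal sketch of this preprocessing is shown below, assuming simple whitespace tokenization (the study's actual tokenizer is not specified here): tokens are lowercased, stop words are kept, and no stemming is applied, so surface forms such as "don't" survive into the bag-of-words counts.

```python
from collections import Counter

def tokenize(tweet):
    """Lowercase and split on whitespace; no stop-word removal, no stemming."""
    return tweet.lower().split()

def bag_of_words(tweets):
    """Map each tweet to a Counter of its (unstemmed, lowercased) tokens."""
    return [Counter(tokenize(t)) for t in tweets]
```

In practice, a library vectorizer (eg, scikit-learn's CountVectorizer with `lowercase=True` and `stop_words=None`) accomplishes the same preprocessing while producing the sparse feature matrices the classifiers consume.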
The performance of each classifier was assessed by 5-fold cross validation, which is a commonly used method for the evaluation of classification algorithms that diminishes the bias in the estimation of classifier performance [
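The evaluation procedure can be sketched as follows; `train_fn` and `f_score_fn` are hypothetical stand-ins for the actual training and scoring routines, and the sketch omits the shuffling and stratification a production setup would likely use.

```python
# Hand-rolled sketch of k-fold cross validation as described above: the
# labeled set is split into k folds, each fold serves once as the held-out
# test set, and the per-fold F-scores are averaged.
def cross_validate(data, labels, train_fn, f_score_fn, k=5):
    n = len(data)
    fold_size = n // k
    scores = []
    for i in range(k):
        lo = i * fold_size
        hi = (i + 1) * fold_size if i < k - 1 else n  # last fold takes the remainder
        test_x, test_y = data[lo:hi], labels[lo:hi]
        train_x = data[:lo] + data[hi:]
        train_y = labels[:lo] + labels[hi:]
        model = train_fn(train_x, train_y)
        scores.append(f_score_fn(model, test_x, test_y))
    return sum(scores) / k
```

Averaging over the 5 held-out folds yields the F-scores reported in the tables below, and reduces the variance that a single train/test split would introduce.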
Source classification (Approach 1) that used short URLs demonstrated good performance (
Performance of multiclass source classifiers.
Algorithm | Precision | Recall | F-Score
Approach 1 (short URLs) | | |
LRa | 0.8007 | 0.7946 | 0.7938
NBb | 0.8023 | 0.7926 | 0.7936
SVMc | 0.8059 | 0.7976 | 0.7972
Approach 2 (expanded URLs) | | |
LR | 0.8062 | 0.8026 | 0.8013
NB | 0.8005 | 0.7972 | 0.7953
SVM | 0.8141 | 0.8119 | 0.8102
Approach 1 | SVM vs LR, |
Approach 2 | SVM vs LR, |
Approach 1 vs 2 | LR1 vs LR2, |
aLR: logistic regression.
bNB: naive Bayes.
cSVM: support vector machines.
Performance of SVM source classifiers on binary classification for each source category.
Type of classification | Precision | Recall | F-Score
Approach 1 (short URLs) | | |
Media | 0.8873 | 0.8278 | 0.8477
Retail | 0.8723 | 0.7913 | 0.8117
Personal | 0.8755 | 0.7976 | 0.8200
Approach 2 (expanded URLs) | | |
Media | 0.8958 | 0.8639 | 0.8769
Retail | 0.8881 | 0.8155 | 0.8357
Personal | 0.9020 | 0.8572 | 0.8736
Approach 1 | Personal vs Media, |
Approach 2 | Personal vs Media, |
Approach 1 vs 2 | Personal1 vs Personal2, |
aValues that show statistically significant differences.
Performance of both source classification approaches was also assessed on binary classification tasks. Because SVM showed slightly better performance in multiclass classification than NB or LR (although the difference was not statistically significant), it was selected for evaluation on 3 binary classification tasks using the 1000 tweets: (1) media-related tweets versus the rest, (2) retail-related tweets versus the rest, and (3) personal tweets versus the rest (
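The relabeling behind these one-versus-rest binary tasks can be sketched as follows (an illustrative helper, not part of the eDrugTrends code):

```python
# For a given target category, relabel tweets of that category as 1 and
# all remaining tweets ("the rest") as 0, yielding a binary task.
def one_vs_rest_labels(labels, target):
    return [1 if lab == target else 0 for lab in labels]
```

Running the same SVM on each relabeled version of the 1000-tweet set produces the three binary classifiers whose precision, recall, and F-scores are compared below.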
For the general sentiment classification approach that classified all 3000 tweets regardless of their source, SVM showed better precision (0.7147) than the other machine learning classifiers, but LR achieved better recall (0.6763) (
Before sentiment classification Approach 2 could be applied, the sample of 3000 tweets had to be processed to extract personal communication tweets. Because the SVM source classifier with unshortened URLs showed better performance than other classifiers (
Performance of multiclass sentiment classifiers (positive, negative, neutral).
Algorithm | Precision | Recall | F-Score
Approach 1 (all tweets, n=3000) | | |
LRa | 0.7047 | 0.6763 | 0.6703
NBb | 0.7101 | 0.6693 | 0.6683
SVMc | 0.7147 | 0.6691 | 0.6723
VADERd | 0.5213 | 0.5261 | 0.5116
Approach 2 (personal communication tweets only, n=2633) | | |
LR | 0.7145 | 0.6996 | 0.6931
NB | 0.7539 | 0.6914 | 0.6980
SVM | 0.7442 | 0.7021 | 0.7062
VADER | 0.5153 | 0.5211 | 0.5030
Approach 1 | SVM vs LR, |
Approach 2 | SVM vs LR, |
Approach 1 vs 2 | LR1 vs LR2, |
aLR: logistic regression.
bNB: naive Bayes.
cSVM: support vector machines.
dVADER: Valence Aware Dictionary for sEntiment Reasoning.
eValues that show statistically significant differences.
Performance of binary sentiment classifiers (positive vs negative).
Algorithm | Precision | Recall | F-Score
Approach 1 (all tweets, n=2213) | | |
LRa | 0.8700 | 0.8495 | 0.8516
NBb | 0.8797 | 0.8491 | 0.8540
SVMc | 0.8803 | 0.8513 | 0.8557
Approach 2 (personal communication tweets only, n=2007) | | |
LR | 0.8878 | 0.8728 | 0.8752
NB | 0.8892 | 0.8629 | 0.8666
SVM | 0.8964 | 0.8757 | 0.8800
Approach 1 | SVM vs LR, |
Approach 2 | SVM vs LR, |
Approach 1 vs 2 | LR1 vs LR2, |
aLR: logistic regression.
bNB: naive Bayes.
cSVM: support vector machines.
dValues that show statistically significant differences.
As shown in
The most discriminative unigram and bigram features that were identified by chi-square test reflect thematic groups as pertinent to sentiment categories: “want,” “love,” “need” for positive, in contrast to “don’t,” “shit,” “fake” for negative tweets (
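For illustration, the chi-square score for a single feature-class pair (eg, the unigram "want" versus the positive class) can be computed from a 2x2 contingency table of document counts. This standalone sketch uses the standard chi-square statistic with illustrative counts, not the study's actual values.

```python
# Chi-square score for one feature against one class, from a 2x2 table of
# document counts: rows = feature present/absent, columns = in class/other.
# Higher scores indicate more discriminative features.
def chi_square(n_feat_class, n_feat_other, n_nofeat_class, n_nofeat_other):
    table = [[n_feat_class, n_feat_other],
             [n_nofeat_class, n_nofeat_other]]
    total = sum(sum(row) for row in table)
    score = 0.0
    for i in range(2):
        for j in range(2):
            row_sum = sum(table[i])
            col_sum = table[0][j] + table[1][j]
            expected = row_sum * col_sum / total  # count under independence
            score += (table[i][j] - expected) ** 2 / expected
    return score
```

Ranking all unigrams and bigrams by this score (computed per class) is what surfaces class-typical features such as "want" for positive tweets and "fake" for negative ones.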
The results of this study provide an example of the use of supervised machine learning methods to categorize cannabis- and synthetic cannabinoid–related content on Twitter with fairly high accuracy. To classify tweets by source/type of communication, an SVM algorithm that used expanded URLs produced the best results, in particular as demonstrated by binary classification tasks. For sentiment classification, the SVM algorithm that focused on “personal communication” tweets, in particular classifying positive versus negative tweets only, performed better than a more general approach that included all tweets regardless of the source.
Integration of the 2 dimensions of content analysis tasks—identification of type of communication and sentiment—represents a novel approach. Identification of sentiment in user-generated tweets (personal communications) carries greater relevance for drug abuse epidemiology research than an approach that does not separate personal from media- and retail-related tweets. Use of these content analysis tools along with geographic identification features currently functional in the eDrugTrends platform [
Overall, our machine learning methods for sentiment classification demonstrated substantially better performance than the lexicon and rule-based method VADER [
Our study demonstrates that content analysis and manual coding of drug-related tweets is not an easy task, even for human coders with substantial experience in drug abuse research and qualitative content analysis. This is consistent with prior studies that have reported a high level of ambiguity and lack of context as complicating factors in content analysis of tweets [
One limitation of our study is that we did not develop machine learning classification methods to distinguish relevant from irrelevant tweets (eg, cases where “spice” may refer not to synthetic cannabinoids but to food seasoning). Relevance of extracted data was monitored using appropriate keyword combinations and blacklisted words [
Future research will assess the performance of these techniques in analyzing tweets mentioning other drugs of abuse and will extend them to automate extraction of more detailed thematic information from drug-related tweets. In addition, because many tweets use visual information to convey meaning, machine learning–based image classification would add a further dimension and improve the accuracy of overall tweet content classification. In the future, we will examine the feasibility of separating true neutral tweets from the unidentifiable group to improve sentiment analysis.
This is one of the first studies to report successful development of automated content classification tools to analyze recreational drug use–related tweets. These tools, as a part of eDrugTrends platform, will help advance the field’s technological and methodological capabilities to harness social media sources for drug abuse surveillance research. Our future deployment of the eDrugTrends platform will generate data on emerging regional and temporal trends and inform more timely interventions and policy responses to changes in cannabis and synthetic cannabinoid use practices.
Source classification: coding guidelines used to manually annotate tweets as personal, retail-, and media-related communications.
Sentiment classification: coding guidelines used to manually annotate tweets as expressing positive, negative, or neutral/unidentifiable sentiment.
Description of the development of manually labeled data set.
Information about the manually labeled tweets included in subsets to train source and sentiment classifiers.
Commonly occurring words in unshortened URLs by source/type category.
Top 10 most discriminative unigram and bigram features for source classification.
Top 10 most discriminative unigram and bigram features for sentiment classification.
logistic regression
naive Bayes
support vector machines
Valence Aware Dictionary for sEntiment Reasoning
This study was supported by the National Institute on Drug Abuse (NIDA), Grant No. R01 DA039454 (Daniulaityte, PI; Sheth, PI). The funding source had no further role in the study design, in the collection, analysis, and interpretation of the data, in the writing of the report, or in the decision to submit the paper for publication.
None declared.