This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
Despite concerns about their health risks, e‑cigarettes have gained popularity in recent years. Concurrent with the recent increase in e‑cigarette use, social media sites such as Twitter have become a common platform for sharing information about e-cigarettes and to promote marketing of e‑cigarettes. Monitoring the trends in e‑cigarette–related social media activity requires timely assessment of the content of posts and the types of users generating the content. However, little is known about the diversity of the types of users responsible for generating e‑cigarette–related content on Twitter.
The aim of this study was to demonstrate a novel methodology for automatically classifying Twitter users who tweet about e‑cigarette–related topics into distinct categories.
We collected approximately 11.5 million e‑cigarette–related tweets posted between November 2014 and October 2016 and obtained a random sample of Twitter users who tweeted about e‑cigarettes. Trained human coders examined the handles’ profiles and manually categorized each as one of the following user types: individual (n=2168), vaper enthusiast (n=334), informed agency (n=622), marketer (n=752), and spammer (n=1021). Next, the Twitter metadata as well as a sample of tweets for each labeled user were gathered, and features that reflect users’ metadata and tweeting behavior were analyzed. Finally, multiple machine learning algorithms were tested to identify a model with the best performance in classifying user types.
Using a classification model that included metadata and features associated with tweeting behavior, we were able to predict with relatively high accuracy five different types of Twitter users that tweet about e‑cigarettes (average
This study provides a method for classifying five different types of users who tweet about e‑cigarettes. Our model achieved high levels of classification performance for most groups, and examining the tweeting behavior was critical in improving the model performance. Results can help identify groups engaged in conversations about e‑cigarettes online to help inform public health surveillance, education, and regulatory efforts.
E‑cigarettes have gained popularity among adults and youth in recent years. Following sustained increases in the use of e-cigarettes by adults from 2010 to 2013 [
Concurrent with the rapid rise in e‑cigarette use, advertising and sharing of information about e‑cigarettes have proliferated in recent years. Although advertisements for tobacco products have been banned on television since 1971 in the United States, e‑cigarette advertising via television, magazines, outdoor, radio, and Web-based channels has increased dramatically between 2010 and 2013. Approximately 24 million adolescents were exposed to e‑cigarette advertising in 2014 [
Social media has become a particularly important platform for sharing information about e‑cigarettes. The majority of youth (81%) and adults (74%) in the United States use some form of social media [
Monitoring trends in e‑cigarette–related social media activity requires timely assessment of the content of posts and the types of users generating the content to inform regulatory and surveillance efforts. In 2016, the Food and Drug Administration (FDA) finalized a rule extending the agency’s authority to regulate e‑cigarettes, which includes federal provisions requiring companies that sell e‑cigarettes to include warning statements about nicotine on advertising and promotional materials, including content on digital/social media. To ensure that e‑cigarette companies are complying with these advertising and labeling restrictions, FDA will need to identify and monitor websites and social media accounts maintained by these companies. Furthermore, as public health researchers continue to use social media data to track and understand emerging issues concerning e‑cigarettes, they will need to be able to distinguish between the content from individuals who may be the target of Web-based e‑cigarette advertising (eg, young adults) and the content from e‑cigarette companies, marketers, or spammers who may be posting content for commercial purposes. Such information could also be useful in the development and targeting of social media campaigns to prevent e‑cigarette use.
The proliferation and variety of Web-based information sharing about e‑cigarettes presents challenges in differentiating content from different types of social media users. Previous studies have used a range of techniques to identify Twitter accounts that are purely automated (
This study demonstrates a novel methodology for automatically classifying Twitter users who tweet about e‑cigarette–related topics into five categories of users—individuals, vaper enthusiasts, informed agencies, marketers, and spammers. We used a supervised machine learning approach to predict different types of Twitter users based on their metadata and tweeting behavior. We tested different models, evaluated model performance, and discussed features that are most predictive of each user type. This study expands on previous research studying the content and the types of users who tweet about e‑cigarettes [
Using a supervised machine learning approach, we developed models to predict different types of Twitter users who tweet about e‑cigarettes. First, a random sample of Twitter handles that have tweeted about e‑cigarettes was obtained, and our trained human coders examined the handles’ profiles and manually labeled a specific user type for each handle. Next, Twitter metadata and a sample of tweets for each labeled user were gathered, and features that reflect users’ metadata and tweeting behavior were created. In the final steps, multiple machine learning algorithms to identify a model with the best classification performance were tested.
Using Twitter’s enterprise application programming interface (API) platform, Gnip, we collected e‑cigarette–related tweets posted between November 2014 and October 2016. A comprehensive search syntax was developed with 158 keywords, including terms such as
Six coders were trained using the protocol and practice data to classify the user types manually. For each user, the coders reviewed the user’s profile page on Twitter, which included a profile description and a sample of recent tweets on their timeline, which may have included e‑cigarette and non-e‑cigarette topics. Random samples of Twitter users were double coded until at least 300 labeled cases were obtained per user type. Coding discrepancies were resolved by an adjudicator. In total, 4897 users were manually classified according to the user type definitions (see
Approach to classifying Twitter users who tweet about e-cigarettes.
Manual classification of Twitter users who tweet about e-cigarettes: user type definitions and proportion of each type in manually labeled sample.
Type | Definition | Sample, N | |
Individual | The account of a real person whose Twitter profile information and tweets reflect their individual thoughts and interests. An individual is someone whose primary post content is not about vaping. | 2168 | |
Vaper enthusiast | The account of a person or organization whose primary content is related to promoting e‑cigarettes but is not primarily trying to sell e‑cigarettes or related products. | 334 | |
622 | |||
News media | The account of a newspaper, magazine, news channel, etc. News media does not include vaping-specific news sources. | ||
Health community | The account of a public health organization, coalition, agency, or credible individual affiliated with an organization. These may also be the accounts of organizations with authority on a topic that should be thought of as |
||
752 | |||
Marketer | An account marketing e‑cigarette or vaping products. These accounts can belong to a Web-based or brick-and-mortar retailer or an individual who is an affiliate marketer. | ||
Information aggregator | An account that primarily aggregates information about e‑cigarettes/vaping and where most or all tweets are news articles related to e‑cigarettes/vaping. This account could also aggregate vaping coupons or deals. | ||
An account that does not fall into one of the other coding categories. These accounts often post on a broad range of topics unrelated to this project, and their content can be nonsensical. Anecdotally, it was observed that many of these accounts exhibited |
1021 |
aDuring manual annotation of data, we initially categorized subtypes of informed agency (ie, news media and health community) and marketer (ie, marketer and information aggregator) user types, but we did not identify sufficient numbers of user handles for these subtypes to conduct meaningful analyses. Thus, during the feature selection and modeling phases, we collapsed across user subtypes to define five total user types.
Next, we built out the feature space for 4897 labeled users, extracting the metadata provided by the Twitter API and engineering our own features that were derived from the users’ tweet text (see
The Twitter API provides basic profile information about a user such as screen name, location, bio, number of friends, number of followers, and total number of tweets. The API also provides the actual tweet text and underlying metadata associated with tweet text that was used in this study to characterize the tweeting behavior (eg, retweet) of the users. These types of metadata features have a demonstrated utility in characterizing different types of users [
In addition to the metadata, the users’ tweet text data were also examined to capture their tweeting behavior. It was hypothesized that tweeting behaviors would vary across different user types (eg, individuals are likely to tweet about more diverse topics than marketers). Studies have shown that linguistic content of social media posts is particularly useful because it illustrates the topics of interest to a user and provides information about their lexical usage that may be predictive of certain user types [
To capture the users’ tweeting behavior, 58 features derived from the behavioral and linguistic content of the account profile and the tweet text were generated; summary statistics of sets of users’ tweets were also created. To generate these features, a variety of text mining techniques were used to capture the distribution characteristics of the users’ tweeting behavior and word usage. For example, the minimum, maximum, median, mean, and mode for how many times an e‑cigarette keyword was used per tweet was calculated. A term frequency-inverse document frequency matrix of each user’s corpus of tweets (up to 200 tweets) was also created, and the pair-wise cosine similarity between each tweet was calculated. For each user, the mean and standard deviation of the set of cosine similarity values, which provided a sense of the semantic diversity and consistency of a user’s vocabulary, was calculated. After generating the behavioral features, we dropped nine features in our dataset that had more than 10% missing data. Then, a mean imputation was performed on the derived features that had 10% or less missing data.
To determine the best model for classifying the user types, several different algorithms were built and compared using the features described in phase 2. Before modeling, the data were split into a training set (85%) and a test set (15%), using stratified sampling to preserve the relative ratio of classes across sets. To construct our models, a stratified 10-fold cross-validation on the training set was first run and eight different classifiers as well as a
On the basis of these results, the GBRT algorithm was used to classify the testing dataset. The GBRT approach builds an additive model in a forward stage-wise fashion [
To determine the best tuning values for the hyperparameters in our model, a fourfold grid-search cross-validation on the training dataset was run (see
The metadata-only model (72.7%) achieved lower
Classification of Twitter users who tweet about e-cigarettes: Gradient Boosting Regression Trees (GBRT) results comparing full model and metadata-only model.
User type | Full model (metadata + derived data) | Metadata-only model | ||||
Recall, % | Precision, % | Recall, % | Precision, % | |||
Individual | 91.1 | 92.3 | 89.8 | 83.6 | 86.2 | 81.2 |
Vaper enthusiast | 47.1 | 40.0 | 57.1 | 16.2 | 12.0 | 25.0 |
Informed agency | 84.4 | 78.5 | 91.3 | 70.0 | 67.7 | 72.4 |
Marketer | 81.2 | 85.9 | 77.0 | 65.6 | 72.6 | 59.9 |
Spammer | 79.5 | 81.1 | 78.0 | 74.8 | 71.9 | 78.0 |
Average | 83.3 | 83.7 | 83.3 | 72.7 | 73.7 | 72.3 |
To further examine variations in the predictive performance across user types, a confusion matrix illustrating predicted and actual user types was generated.
A two-dimensional (2D) plot of the feature space was also constructed to better understand the extent to which the user types fall into naturally separated clusters (see
Distributions of manually labeled versus model-predicted classification of Twitter users who tweet about e-cigarettes.
Two-dimensional t-SNE visualization of Twitter users who tweet about e-cigarettes.
Ten most important features in predicting Twitter users who tweet about e-cigarettes across all user types.
Featuresa | Proportion of feature importance among all variables, % |
Statuses count | 5.1 |
Followers count | 4.1 |
Original tweet raw keyword count | 3.7 |
Profile description keyword count | 3.3 |
Original tweet cosine similarity mean | 3.2 |
Retweet cosine similarity mean | 3.0 |
Friends count | 3.0 |
Retweet raw keyword count | 3.0 |
Listed count | 2.9 |
Original tweet URL count mean | 2.7 |
Favorites count | 2.7 |
aMost important feature among each user type—Individual: favorites count (4.9%); Vaper enthusiast: retweet raw keyword count (8.3%); Informed agency: followers count (6.5%); Marketer: original tweet raw keyword counts (8.9%); Spammer: statuses count (8.1%).
Partial dependence plots of top features by user type for users who tweet about e-cigarettes.
To better understand the contribution of each variable in our modeling outcome, each variable was evaluated using Gini Importance, which is commonly used in ensembles of decision trees as a measure of a variable’s impact in predicting a label that also takes into account estimated error in randomly labeling an observation according to the known label distributions [
Partial dependence plots (PDPs) illustrate the dependence between a target function (ie, user type) and a set of target features.
In summary, we developed algorithms with relatively high performance in predicting different types of Twitter users that tweet about e‑cigarettes. The rates of precision and recall for
We achieved the best performance in predicting individuals, informed agencies (news media and health agencies), and marketers. In contrast, vaper enthusiasts were challenging to predict and were commonly misclassified as marketers. There are several reasons why this may be the case. First, it is possible that there were not enough labeled cases of vaper enthusiasts for the machine learning models; there were only 334 labeled cases of vaper enthusiasts (6.8% of all labeled users) compared with 622 to 2168 cases for the other classes. Second, vaper enthusiasts are an evolving group of individuals, and their tweeting behavior may therefore vary more than other established user types such as informed agencies (eg, news media and health agencies). Third, our definition of vaper enthusiasts may not have been distinct enough from marketers; a vaper enthusiast was defined as a user whose primary objective is to
Given the overlap between vaper enthusiasts and marketers, a possible strategy to improve predictive performance might be to combine the two groups. In fact, in their study, Kavuluru and Sabbir [
In this study, the top features that were most predictive of each user type were also examined. Individuals like more tweets than nonindividuals; informed agencies have more followers than their counterparts; marketers use more e‑cigarette words in their original tweets than nonmarketers; vaper enthusiasts retweet e‑cigarette content more than nonvaper enthusiasts; and more frequent tweeting behavior is indicative of spammers. Given the infancy of this research, the findings of this study should be viewed as an initial inquiry into classifying different types of users who tweet about e‑cigarettes. Future studies should build on this work and examine other features that may be predictive of these classes of users. For example, other researchers have examined features such as sentiment of tweets [
Our study has several limitations. First, because of resource constraints, we only collected the 200 most recent tweets for the users in our dataset, and some users had less than 200 tweets in total. Previous studies examining Twitter metadata and linguistic features to predict sociodemographic characteristics of users (eg, gender and age) have extracted up to 3200 tweets per handle, but other researchers have also found that having more than 100 tweets per handle did not necessarily improve the model performance [
This is the first study we are aware of that has examined methods to predict a broad set of different types of users tweeting about e‑cigarettes. Previous studies have examined either the topic of e‑cigarette tweets [
In conclusion, this study provides a method for classifying five different types of users who tweet about e‑cigarettes. Our model achieved high levels of classification performance for most groups; examining tweeting behavior was critical in improving the model performance. The results of our approach can help identify groups engaged in conversations about e‑cigarettes online to help inform public health surveillance, education, and regulatory efforts. Future studies should examine approaches to improve the classification of certain user groups that were more challenging to predict (eg, vaper enthusiasts).
List of Twitter metadata features and derived behavioral features used in models to classify Twitter users who tweet about e-cigarettes.
Modeling results of different machine learning algorithms to classify Twitter users who tweet about e-cigarettes.
Overview of machine learning algorithms examined.
Importance of features in models to predict Twitter users who tweet about e-cigarettes.
application programming interface
Food and Drug Administration
Gradient Boosted Regression Trees
institutional review board
partial dependence plots
t-Distributed Stochastic Neighbor Embedding
two-dimensional
The authors thank Paul Ruddle II for his guidance on the study design and development of the data collection infrastructure. The authors also thank Geli Fei for his contributions in evaluating text-based features, Bing Liu for his feedback on the analytical approach, and Margaret Cress and Kelsey Campbell for leading the manual classification task. This work was supported by a grant from the National Cancer Institute (R01 CA192240). The funder had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
AK conceptualized the study, secured funding, directed study implementation, and led the writing of the manuscript and the revisions. TM led the analysis, implemented the machine learning methods, interpreted the results, produced figures, wrote the Methods and Results sections, and revised the manuscript. RC contributed to the study design, data collection, analysis, and manuscript review. ME assisted with the writing of the manuscript. JN provided feedback on the analysis and the manuscript.
None declared.