Identifying Adverse Effects of HIV Drug Treatment and Associated Sentiments Using Twitter

Background: Social media platforms are increasingly seen as a source of data on a wide range of health issues. Twitter is of particular interest for public health surveillance because of its public nature. However, that same public nature may act as a barrier to surveillance, as people may be reluctant to publicly disclose information about their health. This is of particular concern for diseases that carry a degree of stigma, such as HIV/AIDS.

Objective: The objective of the study is to assess whether adverse effects of HIV drug treatment and the associated sentiments can be determined using publicly available data from social media.

Methods: We describe a combined approach of machine learning and crowdsourced human assessment to identify adverse effects of HIV drug treatment based solely on individual reports posted publicly on Twitter. Starting from a large dataset of 40 million tweets collected over three years, we identify a very small subset (1642; 0.004%) of individual reports describing personal experiences with HIV drug treatment.

Results: Despite the small size of the extracted final dataset, the summary representation of adverse effects attributed to specific drugs, or drug combinations, accurately captures well-recognized toxicities. In addition, the data allowed us to discriminate across specific drug compounds, to identify preferred drugs over time, and to capture novel events such as the availability of preexposure prophylaxis.

Conclusions: The effect of limited data sharing due to the public nature of the data can be partially offset by the large number of people sharing data in the first place, an observation that may play a key role in digital epidemiology in general.


Supplementary
Most common tokens (left column) in the sample of 316,081 tweets, with the number of tweets in which each token appears (middle column) and its total number of appearances (right column).

Token    Tweets    Total

Feature extraction
Hereafter we give the definition of all features extracted from each tweet:

• modalcount: number of times the words "should", "shoulda", "can", "could", "may", "might", "must", "ought", "shall", "would", and "woulda" occur in the tweet;
• futurecount: number of times the words "going", "will", "gonna", "should", "shoulda", "ll", "d" occur in the tweet;
• personalcount: number of times the words "i", "me", "my", "mine", "ill", "im", "id", "myself" occur in the tweet;
• negative: number of times the words "not", "wont", "nt", "shouldnt", "couldnt" occur in the tweet;
• secondpron: number of times the words "you", "youll", "yours", "yourself" occur in the tweet;
• thirdpron: number of times the words "he", "she", "it", "his", "her", "its", "himself", "him", "herself", "itself", "they", "their", "them", "themselves" occur in the tweet;
• relatpron: number of times the words "that", "which", "who", "whose", "whichever", "whoever" occur in the tweet;
• dempron: number of times the words "this", "these", "that", "those" occur in the tweet;
• indpron: number of times the words "anybody", "anyone", "anything", "each", "either", "everyone", "everything", "neither", "nobody", "somebody", "something", "both", "few", "many", "several", "all", "any", "most", "none", "some" occur in the tweet;
• intpron: number of times the words "what", "who", "which", "whom", "whose" occur in the tweet;
• percent: number of % symbols in the tweet;
• posnoise: number of times the words "new", "pill", "state", "states", "stats", "drug", "people", "approved", "approve", "approves", "approval", "approach", "prevention", "prevent", "prevents", "prevented" occur in the tweet;
• pharmacy: number of times the words "cvs", "hospital", "pharmacy", "doctor", "walgreens", "target", "clinic", "meds", "medication", "medications" occur in the tweet;
• is_notenglish: number of times words from a list of non-English words, extracted from the annotated tweets, occur in the tweet;
• regularpast: number of words in the tweet ending in "ed";
• gerund: number of words in the tweet ending in "ing";
• nment: number of words in the tweet ending in "ment";
• nfull: number of words in the tweet ending in "full";
• tagadj: ratio of the number of adjectives tagged using NLTK [1] in the tweet to the total number of words in the tweet;
• tagverb: ratio of the number of verbs tagged using NLTK [1] in the tweet to the total number of words in the tweet;
• tagprep: ratio of the number of prepositions tagged using NLTK [1] in the tweet to the total number of words in the tweet;
• tagnoun: ratio of the number of nouns tagged using NLTK [1] in the tweet to the total number of words in the tweet;
• tagconj: ratio of the number of conjunctions tagged using NLTK [1] in the tweet to the total number of words in the tweet;
• tagadv: ratio of the number of adverbs tagged using NLTK [1] in the tweet to the total number of words in the tweet;
• tagto: ratio of the number of occurrences of "to" tagged using NLTK [1] in the tweet to the total number of words in the tweet;
• tagdeterm: ratio of the number of determiners tagged using NLTK [1] in the tweet to the total number of words in the tweet;
• sis_noise: ratio of the similarity of the tweet with a corpus of annotated noise tweets to its uncertainty. To compute the similarity, we first create a sparse matrix of the tokens in the annotated corpus, then count the number of times each token appears in the tweet and divide by the number of elements in the corpus. We use the scikit-learn [2] library in several parts of the definition of sis_noise;
• in_english: number of words in a corpus of English words [1] divided by the number of words in corpora of Spanish, Portuguese, French, German, Dutch, Italian, Russian, Swedish, and Danish [1]. We add one to both the numerator and the denominator to avoid dividing by zero;
• bigrams_noise: number of bigrams in the tweet that are contained in the list of bigrams from the annotated noise corpora, divided by the total number of bigrams from the annotated corpora;
• bigrams_signal: number of bigrams in the tweet that are contained in the list of bigrams from the annotated signal corpora, divided by the total number of bigrams from the annotated corpora;
• isolation: number of keywords contained in the tweet, minus one;
• common_noise: sum of the weights of each word contained in the most common 25% of words in the annotated noise tweets;
• common_signal: sum of the weights of each word contained in the most common 25% of words in the annotated signal tweets;
• wordscount: number of words in the tweet;
• tweetlength: number of characters in the tweet.
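As an illustration, a few of the count-based features above can be sketched in Python as follows. This is a minimal sketch only: the tokenization (lowercasing and splitting on non-alphanumeric characters) is an assumption, since the exact preprocessing is not specified, and only a subset of the word lists is reproduced.

```python
import re

# Word lists copied from the feature definitions above (modalcount and
# personalcount only, for brevity).
MODAL_WORDS = {"should", "shoulda", "can", "could", "may", "might",
               "must", "ought", "shall", "would", "woulda"}
PERSONAL_WORDS = {"i", "me", "my", "mine", "ill", "im", "id", "myself"}

def tokenize(tweet):
    """Lowercase and split on non-alphanumeric characters (an assumption;
    the original tokenization is not described)."""
    return [t for t in re.split(r"[^a-z0-9%]+", tweet.lower()) if t]

def extract_features(tweet):
    """Return a handful of the simple count-based features."""
    tokens = tokenize(tweet)
    return {
        "modalcount": sum(t in MODAL_WORDS for t in tokens),
        "personalcount": sum(t in PERSONAL_WORDS for t in tokens),
        "percent": tweet.count("%"),
        "wordscount": len(tokens),
        "tweetlength": len(tweet),
    }

print(extract_features("I should take my meds today"))
```

The remaining list-based counters (negative, secondpron, pharmacy, etc.) follow the same membership-count pattern with their respective word lists.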

Foreign Language Removal
We extracted features from each tweet so that a machine learning algorithm could separate noise from signal more efficiently. Nevertheless, even with the best semantic features, the classifier would have difficulty separating tweets that are not in English. In this section we propose a method to suppress almost all foreign-language tweets without losing much signal.
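The in_english ratio defined above underlies this filtering step and can be sketched as follows. The tiny word sets here are illustrative stand-ins: the study uses full NLTK word corpora [1], whose contents are not reproduced.

```python
# Illustrative vocabularies only; the study uses full NLTK corpora [1]
# for English and for the nine other languages listed above.
ENGLISH_WORDS = {"the", "and", "take", "pill", "doctor", "medication"}
FOREIGN_WORDS = {"el", "la", "und", "der", "le", "il", "het", "och"}

def in_english_ratio(tokens):
    """English-to-foreign word-count ratio; adding 1 to both numerator
    and denominator avoids division by zero, as in the definition."""
    eng = sum(t in ENGLISH_WORDS for t in tokens)
    foreign = sum(t in FOREIGN_WORDS for t in tokens)
    return (eng + 1) / (foreign + 1)

# A ratio well above 1 suggests English; well below 1 suggests a
# foreign language, so thresholding this score filters foreign tweets.
print(in_english_ratio("take the pill".split()))  # English-leaning
print(in_english_ratio("el la und der".split()))  # foreign-leaning
```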