This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
With the increasing popularity of Web 2.0 applications, social media has made it possible for individuals to post messages on adverse drug reactions. In such online conversations, patients discuss their symptoms, medical history, and diseases. These disorders may correspond to adverse drug reactions (ADRs) or any other medical condition. Therefore, methods must be developed to distinguish between false positives and true ADR declarations.
The aim of this study was to investigate a method for filtering out disorder terms that did not correspond to adverse events by using the distance (as number of words) between the drug term and the disorder or symptom term in the post. We hypothesized that the shorter the distance between the disorder name and the drug, the higher the probability that the disorder is an ADR.
We analyzed a corpus of 648 messages corresponding to a total of 1654 (drug and disorder) pairs from 5 French forums using Gaussian mixture models and an expectation-maximization (EM) algorithm.
The distribution of the distances between the drug term and the disorder term enabled the filtering of 50.03% (733/1465) of the disorders that were not ADRs. Our filtering strategy achieved a precision of 95.8% and a recall of 50.0%.
This study suggests that such distance between terms can be used for identifying false positives, thereby improving ADR detection in social media.
Adverse drug reactions (ADRs) cause millions of injuries worldwide each year and require billions of Euros in associated costs [
There are now sites for consumers that enable patients to report ADRs. Patients who experience ADRs want to contribute drug safety content, share their experience, and obtain information and support from other Internet users [
Three recently published review articles showed that the use of social media data for ADR monitoring was increasing. Sarker et al analyzed 22 studies that used social media data. They observed that publicly available annotated data remained scarce, thus making system performance comparisons difficult [
When the methods relied on the development of lexicons, these studies were generally limited in the number of drugs studied or the number of target ADRs. For example, Benton et al focused on 4 drugs [
Other authors focused on detecting user posts mentioning potential ADRs. Some of them combined social media with other knowledge sources such as Medline [
Of 13 studies using automatic processing based on data mining to analyze patient declarations, 7 aimed to identify the relationships between disease entities and drug names. Five of these studies used machine learning methods. Qualitative analyses of forums and mailing list posts show that they may be used to identify rare and serious ADRs (eg, [
Therefore, the main challenge lay in identifying a combination of methods that could reduce the overall number of misclassifications of potential ADRs from patients’ posts. In all such studies, the authors analyzed messages that contained references to both a drug and a disorder or symptom. ADRMine, a machine learning–based concept extraction system [
However, ADR messages from social media are not only factual descriptions about adverse events [
Before robust conclusions can be drawn from social media regarding ADRs, the biggest problem with automated or semiautomated methods is distinguishing between genuine ADRs and other types of cooccurrence (eg, treatments and context) between drugs and diseases in messages. To quote Golder [
These examples indicate that methods are required to eliminate such false positives. The Detec’t project developed by Kappa Santé [
The current technological challenges include the difficulty for text mining algorithms to interpret patient lay vocabulary [
After the review of multiple approaches, Sarker et al [
Semantic filtering relies on semantic information, for example, negation rules and vocabularies, to identify messages not corresponding to an ADR declaration. Liu et al [
Powell et al [
The second type of filtering was based on statistical approaches using the topic models method [
Wei et al [
We propose adding a filter based on Gaussian mixture models to reduce the burden of other entities, that is, disorders that are mentioned in the messages but are not ADRs. The objective was to optimize ADR detection by reducing the number of false positives. We hypothesized that the shorter the distance between the disorder name and the drug, the higher the probability that the disorder is an ADR. The approach was applied to the Detec’t corpus.
We used a version of the Detec’t database that contained 17,703,834 messages corresponding to 350 drugs. The messages were extracted from 20 general health forums, all in French, using a custom Web crawler to browse the selected forums and scrape messages. The scraped forums do not impose a character limit on messages. Detec’t contains the extracted messages and associated metadata, namely users’ aliases and dates.
The Detec’t database was created in 2012 by Kappa Santé [
The messages that constitute our dataset came from (1) doctissimo, (2) atoute.org, (3) e-santé, (4) santé médecine, and (5) aufeminin. These are popular general forums dedicated to health with an average of 89,987 unique visitors a day in 2016. Users must register to be able to post a message in these forums.
We randomly extracted 700 messages from the Detec’t database related to 3 drugs from 3 different therapeutic classes: Teriflunomide, Insulin Glargine, and Zolpidem.
Of these, 52 messages did not contain any disease entity and were removed from the list. The remaining 648 messages were both manually annotated and automatically processed. Processing was performed in 5 steps; the method is summarized in
Summary diagram.
Regarding disorders, we used the Medical Dictionary for Regulatory Activities (MedDRA), which is the international medical terminology developed under the auspices of the International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). MedDRA is organized by system organ class (SOC) and divided into high-level group terms (HLGT), high-level terms (HLT), preferred terms (PT), and lowest-level terms (LLT). Synonymous LLTs are grouped under a unique identifier labeled as a preferred term (PT).
We used a lexicon built in-house by Kappa Santé. The lexicon was derived from the French version of MedDRA 15.0 and was extended by adding more lay medical vocabulary. A fuzzy grouping technique was used to group commonly misspelled words or closely spelled words under one term. The grouping was performed at the MedDRA LLT level. The fuzzy grouping algorithm temporarily strips all vowels (except the first one), strips double or triple consonants from extracted words, and then compares them to see if they are the same. For example, “modeling” and “modelling” would be grouped together [
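The fuzzy grouping rule described above can be illustrated with a minimal Python sketch. It is not the in-house implementation: the function names `fuzzy_key` and `group_terms` are hypothetical, the vowel set is an assumption (accented French vowels are not handled), and the order of the two stripping steps simply follows the order in which the text lists them.

```python
import re
from collections import defaultdict

VOWELS = "aeiouy"  # assumption: "y" counts as a vowel; accented vowels are ignored here

def fuzzy_key(word):
    """Grouping key: keep the first vowel, drop later vowels, collapse repeated letters."""
    kept = []
    seen_vowel = False
    for ch in word.lower():
        if ch in VOWELS:
            if seen_vowel:
                continue  # strip every vowel after the first one
            seen_vowel = True
        kept.append(ch)
    # Collapse double or triple letters into a single one.
    return re.sub(r"(.)\1+", r"\1", "".join(kept))

def group_terms(words):
    """Group words that share the same fuzzy key."""
    groups = defaultdict(list)
    for w in words:
        groups[fuzzy_key(w)].append(w)
    return dict(groups)

groups = group_terms(["modeling", "modelling", "nausea"])
```

With this sketch, "modeling" and "modelling" both reduce to the key `modlng` and therefore land in the same group, as in the example given in the text.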
An ADR is a sign or symptom caused by taking a medication. ADRs may occur following a single dose or prolonged administration of a drug or result from the combination of 2 or more drugs.
A disorder concept corresponds to a sign or symptom, a disease, or any pathological finding. In the context of a message, a disorder may:
Either play the role of an adverse event (ADR), for example, “I took aspirin, it gave me a terrible headache.” These are considered “true ADRs.”
Or correspond to a condition that is not reported by the patient as an ADR, for example, “I had a headache so I took aspirin.”
With the objective of distinguishing between ADRs and disorders that were not ADRs, 2 experts manually annotated the messages to identify true ADRs.
The annotators labeled each disorder entity in the messages as (1) “ADR” if the patient reported the disorder as a possible ADR in his or her message, or (2) “other entity” if the disorder was not reported as an ADR in the message.
This annotated dataset was used as a gold standard.
The standardization of our approach required preprocessing the dataset to avoid some cases of poor data quality.
The character separation method involved inserting whitespaces around every punctuation character. Given the poor quality of the data, this separation was necessary to optimize disorder identification.
Because we used the R software (a language and environment for statistical computing provided by the R Core Team in Vienna) to process and analyze data, and given that R discriminates between lowercase and uppercase words, we used the “tm” package (a text mining framework for R) to convert the document text to lowercase and remove extra whitespaces [
The last step was the tokenization of messages. Word segmentation provides a list of words in each message and their positions in the post.
Data preprocessing steps.
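The three preprocessing steps (character separation, case and whitespace normalization, tokenization) can be sketched as follows. The study used R and the “tm” package; this Python version is only an illustrative equivalent, the function name `preprocess` is hypothetical, and only ASCII punctuation is separated (French guillemets, for instance, would need to be added).

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCT = re.compile("([" + re.escape(string.punctuation) + "])")

def preprocess(message):
    """Return the list of tokens of a message; a token's position is its list index."""
    # Step 1: character separation, a whitespace on each side of every punctuation mark.
    spaced = PUNCT.sub(r" \1 ", message)
    # Step 2: lowercase the text and collapse runs of whitespace.
    normalized = re.sub(r"\s+", " ", spaced.lower()).strip()
    # Step 3: tokenization on single whitespaces.
    return normalized.split(" ") if normalized else []

tokens = preprocess("I took Aspirin,it gave me a  headache.")
```

Because punctuation marks become standalone tokens, they count as words under the definition of distance used later; that is a consequence of this sketch and may differ from the original pipeline.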
The objective of the named entity recognition module was to identify 2 types of entities in a patient’s post: drug names and disorders.
As the extended lexicon for disorders that we developed contained colloquial terms as well as expressions with spelling and/or grammatical errors, lexicon matching was performed using exact match methods after stemming of both messages and expressions in the lexicon.
Drug names in the messages were automatically identified using fuzzy matching and stored in the Detec’t database as message metadata. To minimize the impact of misspelled words, each word was first stemmed using a Porter stemmer, an algorithm meant to remove inflection from a word [
All of the other terms in the messages were mapped to our extended version of MedDRA, which includes colloquialisms, abbreviations, and words with spelling errors. Lexicon matching was implemented as string matching using regular expressions. The granularity of the disorder concepts extracted from the messages corresponded to the LLT level of MedDRA.
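As an illustration of lexicon matching with regular expressions, the sketch below looks up a few expressions in an already preprocessed message. The mini-lexicon, its labels, and the function names are hypothetical placeholders rather than real MedDRA entries, and the stemming step mentioned above is deliberately left out.

```python
import re

# Hypothetical mini-lexicon: surface expression -> illustrative term label.
LEXICON = {
    "headache": "Headache",
    "head ache": "Headache",
    "nausea": "Nausea",
    "feeling sick": "Nausea",
}

def compile_lexicon(lexicon):
    """Build one alternation pattern, longest expressions first."""
    expressions = sorted(lexicon, key=len, reverse=True)
    body = "|".join(re.escape(e) for e in expressions)
    return re.compile(r"\b(" + body + r")\b")

def find_disorders(text, lexicon=LEXICON):
    """Return (character offset, matched expression, label) for each match."""
    regex = compile_lexicon(lexicon)
    return [(m.start(), m.group(1), lexicon[m.group(1)]) for m in regex.finditer(text)]

matches = find_disorders("i took aspirin and got a terrible head ache and nausea")
labels = [label for _, _, label in matches]
```

Mapping several surface forms to one label mimics, in a very reduced way, how colloquial variants are attached to a single term in the extended lexicon.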
After the preprocessing step, the position of each entity in the message was calculated. We defined a word as a continuous series of characters between 2 whitespaces. The distance between a drug “a” and a disorder “b” in a message was defined as the number of words separating the two entities:
Distance (a,b)=(position of b) − (position of a)
The following data were automatically collected:
The disorder name (corresponding to b) detected in the message and the corresponding LLT.
The MedDRA preferred term associated to the disorder term.
The overall position of the disorder term in the message.
The relative position of the detected disorder with respect to the product’s name (before or after).
The distance between the disorder term and the drug name.
The length of the message (expressed in number of words).
When the product name appears several times in a message, the algorithm evaluates the distances between a disorder and all drug name occurrences. The pairs identified are deduplicated. The only pair considered is the one that minimizes the distance to the drug name.
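The distance definition and the closest-occurrence rule can be sketched in Python as follows. The function names are hypothetical, entities are assumed to occupy a single token position, and ties between two equally close drug occurrences are resolved in favor of the first one listed, which is an assumption not stated in the text.

```python
def signed_distance(drug_pos, disorder_pos):
    """Distance(a, b) = position of b - position of a; negative if the disorder comes first."""
    return disorder_pos - drug_pos

def closest_drug_distance(drug_positions, disorder_pos):
    """Signed distance to the nearest drug occurrence, or None if there is no drug."""
    if not drug_positions:
        return None
    candidates = [signed_distance(p, disorder_pos) for p in drug_positions]
    # Keep the pair that minimizes the absolute distance to the drug name.
    return min(candidates, key=abs)

tokens = "i took aspirin , it gave me a terrible headache .".split()
drug_positions = [i for i, t in enumerate(tokens) if t == "aspirin"]
disorder_pos = tokens.index("headache")
distance = closest_drug_distance(drug_positions, disorder_pos)
```

A positive value means the disorder appears after the drug name and a negative value means it appears before, which matches the “before or after” field collected for each pair.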
We used Gaussian mixture models for the disorder clustering using “mclust” R package [
Observed density of distances between disorder terms and drug names.
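To make the clustering step more concrete, here is a minimal expectation-maximization sketch for a one-dimensional Gaussian mixture, written in plain Python. It is not the “mclust” procedure used in the study: the quantile-based initialization, the fixed number of iterations, and the variance floor are assumptions, and the data are synthetic, well-separated values chosen so that the result is easy to check (real distances are concentrated around zero).

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(data, k=3, n_iter=200):
    """Fit a k-component one-dimensional Gaussian mixture with plain EM."""
    n = len(data)
    ordered = sorted(data)
    grand_mean = sum(data) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in data) / n, 1e-6)
    # Assumed initialization: means at evenly spaced quantiles, shared variance, equal weights.
    means = [ordered[int((2 * j + 1) * n / (2 * k))] for j in range(k)]
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor to avoid a degenerate component
    return weights, means, variances

def predict(x, weights, means, variances):
    """Index of the component with the highest posterior probability for x."""
    dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return max(range(len(dens)), key=lambda j: dens[j])

# Synthetic "distances": three well-separated groups of 200 values each.
rng = random.Random(0)
data = ([rng.gauss(-100, 5) for _ in range(200)]
        + [rng.gauss(0, 5) for _ in range(200)]
        + [rng.gauss(100, 5) for _ in range(200)])
weights, means, variances = fit_gmm_1d(data)
```

Once the mixture is fitted, `predict` assigns any new distance to its most probable component, which is how each (drug, disorder) pair would receive a cluster label in this simplified setting.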
We processed a total of 648 messages from 5 French forums written from 2002 to 2013. The named entity recognition module automatically identified 320 unique disorders corresponding to 268 PTs (see
All 1654 of the identified disorders were manually annotated as true ADRs or not. Among them, the experts classified 11.43% (189/1654) as ADRs and 88.57% (1465/1654) as other entities.
As shown in
QQ-plot in
The clustering method groups the detected disorder concepts based on their distances (expressed as a number of words) to the product name in each message. To achieve this goal, we used Gaussian mixture models and the EM algorithm [
Documents under review by length (n words).
Disorders automatically identified (MedDRA system organ class [SOC] level).
Normal Q-Q plot.
Repartition of true adverse drug reactions (ADRs) and other entities by distances.
The distribution of distances also varies greatly, with a mean of 20.32 and a median of 11.0. The distances range from 1233 words before the drug name to 1510 words after.
We applied a supervised clustering method with three fixed clusters (
Cluster 1 corresponds to distances in the (−220, −57) union (+78, +211) interval, that is, between 220 and 57 words before the drug name or between 78 and 211 words after the drug name. Cluster 1 contains 441 disorders. Among them, 6.6% (29/441) are true ADRs.
Cluster 2 corresponds to distances in the (−56, −1) union (+2, +77) interval (ie, between 56 and 1 words before the drug name or between 2 and 77 words after the drug name). Cluster 2 contains 889 disorders. In cluster 2, 17.7% (157/889) of the disorders are true ADRs.
Supervised clustering results.
Supervised clustering contingency table.
Clusters | Other entities, n (%) | ADRs^a, n (%) | Total, n |
Cluster 1 | 412 (93.4) | 29 (6.6) | 441 |
Cluster 2 | 732 (82.3) | 157 (17.7) | 889 |
Cluster 3 | 321 (99.1) | 3 (0.9) | 324 |
Total | 1465 | 189 | 1654 |
^a ADRs: adverse drug reactions.
Cluster 3 corresponds to distances between 1233 and 222 words before the drug name or between 212 and 1510 words after. Cluster 3 contains 324 disorders. Among them, 0.9% (3/324) are ADRs and 321 are other entities.
We tested two filtering strategies. The objective was to filter out the entities that were not ADRs.
In the first filtering strategy, we merged clusters 1 and 3 (
Filtering by merging of clusters 1 and 3.
Clusters | Other entities, n (%) | ADRs^a, n (%) | Total, n |
Clusters 1 and 3 | 733 (95.8) | 32 (4.2) | 765 |
Cluster 2 | 732 (82.3) | 157 (17.7) | 889 |
Total | 1465 | 189 | 1654 |
^a ADRs: adverse drug reactions.
In the context of ADR detection, the use of this approach to remove disorders of clusters 1 and 3 induces a 50.03% reduction of potential false positives.
The detection of false ADRs achieved a precision of 95.8% and a recall of 50.0%. In other words, almost all (>95%) of the pairs that were filtered out were not true ADRs, but the system detected only 50.03% (733/1465) of the false positives.
A second filtering strategy involved merging clusters 1 and 2 (
Filtering by merging clusters 1 and 2.
Clusters | Other entities, n (%) | ADRs^a, n (%) | Total, n |
Cluster 3 | 321 (99.1) | 3 (0.9) | 324 |
Clusters 1 and 2 | 1144 (86.0) | 186 (14.0) | 1330 |
Total | 1465 | 189 | 1654 |
^a ADRs: adverse drug reactions.
The union of clusters 1 and 2 contains 98.4% (186/189) of the true ADRs present in the dataset. Given that cluster 3 contains only 1.6% (3/189) of the ADRs in our study, exclusion of cluster 3 leads to erroneously ignoring only three relevant adverse events.
Using this filtering strategy, our detection of disorders that are not ADRs achieved a precision of 99.07% and a recall of 21.9%. In other words, 99.07% (321/324) of the pairs that were filtered out were false positives, but the system detected only 21.91% (321/1465) of the non-ADRs.
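The precision and recall reported for the two filtering strategies can be recomputed from the contingency table of the three clusters. The short Python sketch below does so; the counts come from the table, while the names `CLUSTERS` and `filter_metrics` are hypothetical. Here the positive class is “other entity” (a non-ADR), so precision is the share of filtered-out pairs that are truly non-ADRs and recall is the share of all non-ADRs that are filtered out.

```python
# (other entities, ADRs) per cluster, taken from the contingency table.
CLUSTERS = {1: (412, 29), 2: (732, 157), 3: (321, 3)}

def filter_metrics(filtered_clusters, clusters=CLUSTERS):
    """Precision and recall of non-ADR detection when the given clusters are filtered out."""
    removed_other = sum(clusters[c][0] for c in filtered_clusters)
    removed_adr = sum(clusters[c][1] for c in filtered_clusters)
    total_other = sum(other for other, _ in clusters.values())
    precision = removed_other / (removed_other + removed_adr)
    recall = removed_other / total_other
    return precision, recall

# Strategy 1 removes clusters 1 and 3; strategy 2 removes cluster 3 only.
p1, r1 = filter_metrics([1, 3])
p2, r2 = filter_metrics([3])
```

Strategy 1 gives 733/765 and 733/1465, and strategy 2 gives 321/324 and 321/1465, which illustrates the trade-off: removing fewer clusters loses fewer true ADRs but leaves more false positives in place.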
We demonstrated that the meaning of a disorder term in a message varies considerably based on its distance to the drug name. Notably, before any filtering strategy, cluster 3 contained only three ADRs. The greater the distance between the disorder and the drug name, the lower the probability that the disorder is an ADR. Specifically, in cluster 3, 99.1% (321/324) of the disorder terms did not correspond to ADRs. Our approach based on distance measurement enabled us to filter out other (non-ADR) entities from the detected disorders. The first strategy enabled us to automatically filter out 50.03% (733/1465) of the disorders that were not ADRs. The second strategy filtered out 21.91% (321/1465) of the disorders that were not ADRs. Consequently, we obtained a substantial improvement in identifying non-ADRs (false positives) in messages. Such filtering can be used as a first step to optimize the screening of ADRs by reducing the false positive rate.
Patients’ discussions of adverse drug events in forums are more informal and colloquial than biomedical literature and clinical notes. When messages in social media are mined to detect ADR declarations, these informal chats lead to many noisy false positives. The use of filtering methods improves ADR detection in the huge data source that is social media [
However, distance (as a number of words) has not previously been used for ADR detection, and the usage of this type of information for ADR classification is novel. Sarker and Gonzalez [
One challenge is the comparison of the different filtering methods and their evaluation on equivalent datasets. We evaluated our method on a corpus that was not specific to an adverse event. We relied on MedDRA, which encompasses the complete spectrum of possible ADRs.
Some limitations regarding the effectiveness of our filtering method should be noted. The main limitation is that our classification process is less efficient when the disorder term is closer to the drug name in the message. Another limitation is that the distance approach has been developed and tested on a French corpus and must be adapted to different languages. Finally, this approach is based on the number of words between drug names and disorder entities in messages and is therefore not applicable to some forms of social media such as Twitter, because a tweet would not contain enough words to produce sufficient variation in the distances of the detected disorders. Without such variation, our filter could not effectively classify the disorders.
Many patients express sentiments when posting about drug associated events in social media, and (quoting Sarker and Gonzalez in [
We have demonstrated that the distance between the disorder and the drug in a message influences the probability that a disorder is a genuine ADR. The use of distance between entities in patient posts from social media enabled us to filter out false positives from the detected disorders, and thus, to optimize ADR screening.
Exhaustive list of disorders found at MedDRA preferred terms (PT) and system organ class (SOC) levels.
active semisupervised clustering based two-stage text classification
adverse drug reaction
adverse event
example adaption for text categorization
expectation-maximization
FDA’s Adverse Event Reporting System
high-level group terms
high-level terms
lowest-level terms
Maximum Entropy
Medical Dictionary for Regulatory Activities
medical subject headings
Naïve Bayes
positive examples and negative examples labeling heuristics
preferred terms
system organ class
support vector machine
None declared.