TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations

Background: Work on pharmacovigilance systems using texts from PubMed and Twitter typically targets different elements and uses different annotation guidelines, resulting in a scenario where there is no comparable set of documents from both Twitter and PubMed annotated in the same manner.
Objective: This study aimed to provide a comparable corpus of texts from PubMed and Twitter that can be used to study drug reports from these two sources of information, allowing researchers in the area of pharmacovigilance using natural language processing (NLP) to perform experiments to better understand the similarities and differences between drug reports in Twitter and PubMed.
Methods: We produced a corpus comprising 1000 tweets and 1000 PubMed sentences selected using the same strategy and annotated at entity level by the same experts (pharmacists) using the same set of guidelines.
Results: The resulting corpus, annotated by two pharmacists, comprises semantically correct annotations for a set of drugs, diseases, and symptoms. This corpus contains the annotations for 3144 entities, 2749 relations, and 5003 attributes.
Conclusions: We present a corpus that is unique in its characteristics, as this is the first corpus for pharmacovigilance curated from Twitter messages and PubMed sentences using the same data selection and annotation strategies. We believe this corpus will be of particular interest for researchers wishing to compare results from pharmacovigilance systems (eg, classifiers and named entity recognition systems) when using data from Twitter and from PubMed. We hope that, given the comprehensive set of drug names and the annotated entities and relations, this corpus becomes a standard resource to compare results from different pharmacovigilance studies in the area of NLP.


Introduction
The goal of this project is to create an annotated corpus of symptom, disease, and drug mentions in sentences taken from PubMed articles and from tweets. In the rest of the document we will refer to the symptoms, diseases, and drugs using their capitalized form (SYMPTOMS, DISEASES, and DRUGS) when we talk about the generic entities that are to be annotated. For this study we focus on a closed set of DRUGS (Table A1, in the appendix), although the study is not limited to them, and any mention of any DRUG in our closed list of DRUGS should be annotated. This document provides a definition of SYMPTOM, DRUG, and DISEASE, the relations between these entities, and the guidelines to be followed during the annotation. If a question is not covered in this document, please send an e-mail to nestoralvaro@nii.ac.jp. In these guidelines we describe each of the 3 types of entities (DRUGS, SYMPTOMS, and DISEASES), the attributes of the entities, and the 3 types of relations (reason to use, outcome-positive, outcome-negative) that are to be annotated.

Annotation of entities
An entity can be a single word such as "tiredness" as it appears in the sentence "the patient was experiencing tiredness", or a span of text such as "could not move from the couch" obtained from the sentence "I worked out so hard that when I got back home I could not move from the couch". Both entities refer to the concept "tiredness symptom" (MedDRA code 10043890). We provide a list of entities, the definition of each one, and an example in Table 1. The only DRUGS to be annotated are those appearing in Table A1. In the following examples DRUGS are highlighted in green, SYMPTOMS in blue, and DISEASES in red.

Entity: DRUG
Definition: Any of the marketed medicines that appears in the SIDER database (http://sideeffects.embl.de/) and is also listed in our closed set of drugs (see Table A1).
Example: The prescription included Lexapro.

Annotation of attributes
The entities (DRUGS, SYMPTOMS and DISEASES) have some attributes that will be annotated to clarify some concepts. We provide a list of attributes for the entities, the definition of each one, the values each attribute can take, and an example in Table 2. Some attributes have default values (in bold and highlighted in the table) which will be used when no attribute is chosen. In the following examples DRUGS are highlighted in green, SYMPTOMS in blue, and DISEASES in red.

Polarity
Indicates whether the entity is negated or not. The negation has to be a linguistic negation ("not", "don't"...).
• Positive: The entity is not negated. Default value. • Negative: The mention of the entity is negated.
"I took prozac and now I don't have a headache" Prozac: polarity=positive (left blank) Headache: polarity=negative

Person
Indicates whether the entity is affecting the "1st", "2nd", "3rd" person, or whether there is no information. This attribute is based on the original sender.
• 1st: The entity is described from a "first person" point of view. The entity is directly impacting the author of the text. Relates a first hand experience. • 2nd: The entity is described from a "second person" point of view. The entity is impacting another person whom the author knows. • 3rd: The entity is described from a "third person" point of view. The entity is impacting someone not directly related with the author of the text. • Not available: There is no clear reference to whom the entity is impacting. Default value.
"I took prozac and now I don't have a headache" Prozac: Person=1st The entity is described in first person.

Modality
Indicates whether the entity is stated in an "actual", "hedged", "hypothetical" or "generic" way.
• Actual: These mentions have already happened or are being scheduled (without hedging) to happen. Default value. • Hedged: These mentions include lexical ("seems", "likely", "suspicious", "possible", "consistent with"), or phrasal ("I suspect that...", "It would seem likely that") hedging. These entities are strongly implied, but, for safety, liability, or due to lack of comprehensive evidence, are not stated as a fact. • Hypothetical: Will often follow "if" statements ("If X happens, then we'll use Y to treat Z") or other sorts of conditionals ("Depending on the patient's response, we might treat A with B or with C"). • Generic: When the mention is done in a general sense. These usually occur when putting justifications of decisions, or rationales for changing care.
"The patient did not report nausea". Nausea: Modality=Actual "The patient may have undergone a mild stroke" Stroke: Modality=Hedged "We suspect either achalasia or pseudoachalasia here" Achalasia: Modality=Hypothetical Pseudoachalasia: Modality=Hypothetical "Adderall should not be taken"

Exemplification
Indicates whether the entity is presented using an example or a description. Only to be used when the entity is presented through an exemplification.
• Positive: When an exemplification is used to present the entity. • Negative: The entity is not presented through an example. Default value.
"I will not be able to get up unless I take my Adderall" I will not be able to get up: Exemplification=Positive (indicates "lack of energy", SNOMED ID: 248274002) Adderall: Exemplification=Negative (value left blank).

Duration
Indicates whether the entity's lasting span is "Intermittent", "Regular", "Irregular". If the duration is not indicated the attribute is left empty. In the case of DRUGS this attribute refers to the time span when the DRUG has been taken.
• Regular: The entity has a continued lasting span. • Intermittent: The lasting span of the entity has been recurring. • Irregular: It is indicated that there is no pattern in the lasting span of the entity. • Not available: When the duration is not indicated. Default value.
"I had a strong headache last night, so I took prozac." Prozac: Duration=not available (the value will be left empty) Headache: Duration="Irregular" "I have been on Prozac for 5 years now" Prozac: Duration="Regular"

Severity
Indicates whether the seriousness of an entity is "Mild", or "Severe". If the severity is not indicated the attribute is left empty. This attribute does not apply to DRUGS.
• Mild: There is gentle (not acute, nor serious) severity of the entity. • Severe: There is a grave or critical seriousness of the entity. • Not available: When the severity of the entity is not indicated. Default value.
"I had a strong headache last night, so I took prozac." Prozac: Severity=not available (the value will be left empty) Headache: Severity="Severe"

Status
Indicates whether the duration of the entity is "Complete", or "Continuing". If the duration is not indicated the attribute is left empty. In the case of it already wore off." Prozac: Status ="Complete"

Sentiment
Indicates whether the entity is perceived as "positive", "negative" or "neutral". If the entity is perceived as "neutral" this attribute is left empty.
• Positive: The entity is referenced as something good. • Negative: The entity is referenced as something bad. • Neutral: There is no clear point of view towards the referenced entity. Default value.
"I had a strong headache last night, so I took prozac." Prozac: Sentiment=neutral (the value will be left empty) Headache: Sentiment="Negative"

Entity identifier
Indicates the identifier for that entity.
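
The attribute scheme described above (a fixed set of values per attribute, with a default used when nothing is chosen) can be sketched as a small data structure. This is only an illustrative sketch, not the annotation tool's actual data model: the names `EntityAttributes`, `ALLOWED`, and `validate` are hypothetical, and treating an empty Status as "not available" is an assumption, since the guidelines only say that the attribute is left empty.

```python
from dataclasses import dataclass

# Allowed values per attribute, taken from the attribute descriptions.
# "not available" for status is an assumption standing in for "left empty".
ALLOWED = {
    "polarity": {"positive", "negative"},
    "person": {"1st", "2nd", "3rd", "not available"},
    "modality": {"actual", "hedged", "hypothetical", "generic"},
    "exemplification": {"positive", "negative"},
    "duration": {"regular", "intermittent", "irregular", "not available"},
    "severity": {"mild", "severe", "not available"},
    "status": {"complete", "continuing", "not available"},
    "sentiment": {"positive", "negative", "neutral"},
}


@dataclass
class EntityAttributes:
    """Attributes of one annotated entity; defaults follow the guidelines."""
    polarity: str = "positive"
    person: str = "not available"
    modality: str = "actual"
    exemplification: str = "negative"
    duration: str = "not available"
    severity: str = "not available"
    status: str = "not available"  # assumed default (attribute left empty)
    sentiment: str = "neutral"

    def validate(self, entity_type: str) -> None:
        """Raise ValueError if any value is outside the allowed set."""
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not an allowed value")
        # Severity does not apply to DRUGS, so it must stay at its default.
        if entity_type == "DRUG" and self.severity != "not available":
            raise ValueError("severity does not apply to DRUGS")


# "I took prozac and now I don't have a headache"
prozac = EntityAttributes(person="1st")
headache = EntityAttributes(polarity="negative", person="1st")
prozac.validate("DRUG")
headache.validate("SYMPTOM")
```

Only the attributes that differ from their defaults need to be set, which mirrors the guideline that default values are left blank during annotation.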

Annotation of relations
A relation represents the existing connection between two entities. In our annotations we allow 3 types of relations, each linking a DRUG with a SYMPTOM or a DISEASE. No relation is annotated between DISEASES and SYMPTOMS. We provide a list of relations, the definition of each one, and an example in Table 3. In the following examples DRUGS are highlighted in green, SYMPTOMS in blue, and DISEASES in red.

Relation: Reason to use
Definition: Represents the relation appearing when a SYMPTOM or DISEASE leads to the use of some DRUG.
Example: Prozac is indicated for patients with major depressive disorder.

Relation: Outcome-positive
Definition: Represents the relation between a DRUG, and an expected or unexpected SYMPTOM or DISEASE appearing after the DRUG consumption. The outcome has to be positive.
Example: I wish I was prescribed adderall, I'd lose weight.

Relation: Outcome-negative
Definition: Represents the relation between a DRUG, and an expected or unexpected SYMPTOM or DISEASE appearing after the DRUG consumption. The outcome has to be negative.
Example: The most common adverse events reported for fluoxetine were impulsivity and poor concentration.

It is important to notice that the annotation tool validates the origin-entity and the end-entity of each relation. This means that:
• "Reason to use" relation: Has to start on a "SYMPTOM" or a "DISEASE" and be directed towards a "DRUG".
• "Outcome-positive" relation: Has to start on a "DRUG", and be directed towards a "SYMPTOM" or "DISEASE".
• "Outcome-negative" relation: Has to start on a "DRUG", and be directed towards a "SYMPTOM" or "DISEASE".
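
The origin and end constraints listed above amount to a small lookup table. The following is a minimal sketch of such a check, assuming entity types are passed as plain strings; the names `ALLOWED_RELATIONS` and `is_valid_relation` are hypothetical and do not describe how the annotation tool itself implements its validation.

```python
# For each relation: (allowed origin entity types, allowed end entity types).
ALLOWED_RELATIONS = {
    "reason-to-use": ({"SYMPTOM", "DISEASE"}, {"DRUG"}),
    "outcome-positive": ({"DRUG"}, {"SYMPTOM", "DISEASE"}),
    "outcome-negative": ({"DRUG"}, {"SYMPTOM", "DISEASE"}),
}


def is_valid_relation(relation: str, origin_type: str, end_type: str) -> bool:
    """Return True if the relation may start on origin_type and end on end_type."""
    if relation not in ALLOWED_RELATIONS:
        return False
    origins, ends = ALLOWED_RELATIONS[relation]
    return origin_type in origins and end_type in ends


# "Prozac is indicated for patients with major depressive disorder."
ok = is_valid_relation("reason-to-use", "DISEASE", "DRUG")
# The same relation drawn in the opposite direction is rejected.
reversed_direction = is_valid_relation("reason-to-use", "DRUG", "DISEASE")
```

Because every allowed relation has a DRUG at one end, a link between a SYMPTOM and a DISEASE is rejected for all three relation types.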

Practical issues
In the following examples DRUGS are highlighted in green, SYMPTOMS in blue, and DISEASES in red.

What to annotate? Entities
• Each mention of an entity should be annotated exactly once. Each annotation should refer to exactly one mention of the entity. All the entities should be annotated each time they are mentioned.
• Annotate mentions with morphological variations such as adjectives.
• Hashtags, whenever present, will be included in the annotation span.
  ▪ In the sentence "I had a terrible #headache" the concept to be annotated is #headache (including the hashtag).
• Synonyms or descriptions for SYMPTOMS and DISEASES should be annotated.
  • Example: "I Took Adderall and now I'm gonna be up for hours"
    ▪ "up for hours" should be annotated as a synonym of "Sleeplessness" (notation "10041017" in MedDRA).
• Annotations should include only the entity mention. Annotate the most specific entity mention and select the best-matching concept ID from the SIDER database (for DRUGS) or the MedDRA ontology (for SYMPTOMS and DISEASES).
  • For instance, the complete phrase "partial seizures" (ID: 10061334) should be preferred over "seizures" (ID: 10039910) as it is more specific.
• If present, the mention span should include terms such as disease, syndrome, disorder, infection.
• Mentions of cancer, tumour, neoplasm, or infection, and other generic mentions of DISEASES/SYMPTOMS, can be annotated, although the identifier for that concept may not be contained in the list of concepts.
  • In this case the ID for the concept would be "-1".
• An entity could be an acronym.
  • A long form, short form pair should be annotated as two mentions. Example: "Attention deficit hyperactivity disorder (ADHD)". In this case "Attention deficit hyperactivity disorder" and "ADHD" should be annotated separately.
• This study focuses on a closed set of DRUGS (Table A1).
  • That list of DRUGS also includes the brand names for these DRUGS.
    ▪ Any mention of any of these DRUGS (including the brand names) must always be annotated.
  • Those drugs have different brand names and trade names. These variants have to be annotated too.
    ▪ For example, the table contains "Adderall" but not "Adderall XR"; "Adderall XR" should still be annotated (using the DRUG identifier for "Adderall", 3007).
• Lists and co-ordinations are phrases which mention multiple entities in a complex way. A simple illustrative example is "breast and ovarian cancer", which refers to the entities "breast cancer" and "ovarian cancer".
  • These constructs often overlap or do not explicitly mention some terms.
  • As the tool allows discontinuous annotations, each entity should be annotated once. One annotation would be "breast cancer" and the second annotation would be "ovarian cancer".
• A retweet is a re-posting of someone else's Tweet. In this case the tweet will be treated as if the user re-posting it were the author of the tweet. Retweets are indicated by the string "RT" at the beginning of the message.
  • Example: "RT I took prozac and now I don't have a headache"
    ▪ This example is a retweet of "I took prozac and now I don't have a headache", so it would be annotated as if it were "I took prozac and now I don't have a headache".
      • Prozac: Person=1st. The entity is described in first person.
      • Headache: Person=1st. The entity is described in first person.
• There are some cases when DRUGS/SYMPTOMS/DISEASES are used as an indicator of another entity. In those cases the entity used for the reference should be annotated.
  • Example: "The patient took ADHD prescription stimulants"
    ▪ ADHD should be annotated as a SYMPTOM.
    ▪ "ADHD prescription stimulants" should not be annotated as there is no drug in the list that could be found by looking for that concept.
  • Example: "The patient received fatigue treatment"
    ▪ "fatigue" should be annotated as a symptom.
    ▪ "fatigue treatment" should not be annotated as there is no drug in the list that could be found by looking for that concept.
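
The discontinuous annotation of co-ordinations such as "breast and ovarian cancer" can be illustrated with character offsets. This is a minimal sketch under the assumption that an annotation is stored as a list of (start, end) spans with an exclusive end; the `Span` alias and the `mention_text` helper are hypothetical and are not part of the annotation tool.

```python
from typing import List, Tuple

# (start, end) character offsets into the sentence; end is exclusive.
Span = Tuple[int, int]


def mention_text(sentence: str, spans: List[Span]) -> str:
    """Join the text fragments covered by a (possibly discontinuous) annotation."""
    return " ".join(sentence[start:end] for start, end in spans)


sentence = "breast and ovarian cancer"

# "breast cancer" needs two fragments because "and ovarian" sits in between.
breast_cancer: List[Span] = [(0, 6), (19, 25)]
# "ovarian cancer" is contiguous, so a single fragment is enough.
ovarian_cancer: List[Span] = [(11, 25)]

print(mention_text(sentence, breast_cancer))
print(mention_text(sentence, ovarian_cancer))
```

The shared word "cancer" is covered by both annotations, which is how each entity can still be annotated exactly once even though the mentions overlap.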