JMIR Publications

JMIR Public Health and Surveillance


Published on 24.02.17 in Vol 3, No 1 (2017): Jan-Mar

    Original Paper

    Analysis of Patient Narratives in Disease Blogs on the Internet: An Exploratory Study of Social Pharmacovigilance

    1Chugai Pharmaceutical Co Ltd, Drug Safety Data Management Department, Tokyo, Japan

    2Chugai Pharmaceutical Co Ltd, Pharmacovigilance Department, Tokyo, Japan

    3Chugai Pharmaceutical Co Ltd, Medical Information Department, Tokyo, Japan

    4Chugai Pharmaceutical Co Ltd, Clinical Science & Strategy Department, Tokyo, Japan

    Corresponding Author:

    Shinichi Matsuda, MSc

    Chugai Pharmaceutical Co Ltd

    Drug Safety Data Management Department

    2-1-1 Nihonbashi-Muromachi, Chuo-ku

    Tokyo, 1038324


    Phone: 81 3 3273 1226

    Fax: 81 3 3281 0815



    Background: Although several reports have suggested that patient-generated data from Internet sources could be used to improve drug safety and pharmacovigilance, few studies have identified such data sources in Japan. We introduce a unique Japanese data source: tōbyōki, which translates literally as “an account of a struggle with disease.”

    Objective: The objective of this study was to evaluate the basic characteristics of the TOBYO database, a collection of tōbyōki blogs on the Internet, and discuss potential applications for pharmacovigilance.

    Methods: We analyzed the overall gender and age distribution of the patient-generated TOBYO database and compared this with other external databases generated by health care professionals. For detailed analysis, we prepared separate datasets for blogs written by patients with depression and blogs written by patients with rheumatoid arthritis (RA), because these conditions were expected to entail subjective patient symptoms such as discomfort, insomnia, and pain. Frequently appearing medical terms were counted, and their variations were compared with those in an external adverse drug reaction (ADR) reporting database. Frequently appearing words regarding patients with depression and patients with RA were visualized using word clouds and word cooccurrence networks.

    Results: As of June 4, 2016, the TOBYO database comprised 54,010 blogs representing 1405 disorders. Overall, more entries were written by female bloggers (68.8%) than by male bloggers (30.8%). The most frequently observed disorders were breast cancer (4983 blogs), depression (3556), infertility (2430), RA (1118), and panic disorder (1090). Comparison of medical terms observed in tōbyōki blogs with those in an external ADR reporting database showed that subjective and symptomatic events and general terms tended to be frequently observed in tōbyōki blogs (eg, anxiety, headache, and pain), whereas events using more technical medical terms (eg, syndrome and abnormal laboratory test result) tended to be observed frequently in the ADR database. We also confirmed the feasibility of using visualization techniques to obtain insights from unstructured text-based tōbyōki blog data. Word clouds described the characteristics of each disorder, such as “sleeping” and “anxiety” in depression and “pain” and “painful” in RA.

    Conclusions: Pharmacovigilance should maintain a strong focus on patients’ actual experiences, concerns, and outcomes, and this approach can be expected to uncover hidden adverse event signals earlier and to help us understand adverse events in a patient-centered way. Patient-generated tōbyōki blogs in the TOBYO database showed unique characteristics that were different from the data in existing sources generated by health care professionals. Analysis of tōbyōki blogs would add value to the assessment of disorders with a high prevalence in women, psychiatric disorders in which subjective symptoms have important clinical meaning, refractory disorders, and other chronic disorders.

    JMIR Public Health Surveill 2017;3(1):e10




    Current Pharmacovigilance

    The World Health Organization defines pharmacovigilance (PV) as the science and activities related to the detection, assessment, understanding, and prevention of adverse effects or any other drug-related problems [1]. In this era of what Edwards calls “information explosion,” we must rethink PV [2] to effectively incorporate a variety of data sources while ensuring the timely decision-making that is crucial to avoiding unnecessary harm caused by adverse events (AEs) in real-world health care practice.

    Current PV activities depend heavily on voluntary, spontaneous AE reports obtained from health care professionals (HCPs). It is generally accepted that one advantage of spontaneous reporting is its speed at detecting AE signals as early as possible. However, it is also acknowledged that spontaneous reports by HCPs alone may not be enough to capture all AE signals in a timely fashion. Because some symptomatic AEs can be expected to be reported only by patients who have firsthand experience of drug treatment [3], incorporating patient-generated data into PV is one of the most important challenges [4]. Several studies have suggested that self-reporting by patients is useful for catching AE signals earlier, and many countries have implemented patient AE reporting schemes [5-8]. The Japanese regulatory authority started preliminary implementation of a self-reporting system for patients in March 2012 [9,10]; however, the system is still under development and will require more time to be used effectively in a routine PV system [11].

    Prior Research on Applying Internet Resources in Pharmacovigilance

    Analyzing information on the Internet would add significant knowledge about public health, as shown in Eysenbach’s study outlining the framework of infodemiology and infoveillance [12]. In PV, there has been recent growing interest in utilizing patient-generated Internet resources such as social media [13-17]. A survey conducted in 2001 and 2002 in the United States showed that the Internet is an important resource for the public; approximately 40% of respondents there obtained information on health-related topics through Internet sources [18]. In response to the increasing use of social media to share health care information, the US Food and Drug Administration announced in 2015 that they had started a collaboration with PatientsLikeMe [19], a patient networking website, to apply patient-generated data to risk management activities [20]. In Europe, the Medicines and Healthcare products Regulatory Agency in the United Kingdom started working with the WEB-RADR project in 2014 to develop a mobile phone app that helps HCPs and patients report AEs to national health care authorities [21]. The European Medicines Agency has also released guidelines on good pharmacovigilance practices, of which Module VI requires companies having the European Union marketing authorization to monitor the Internet or digital media under their management or responsibility for potential reports of suspected adverse reactions [22]. These ongoing efforts are expected to lead to important developments in PV. Like Americans and Europeans, approximately 39% of Japanese obtain health information via the Internet [23]. However, to our knowledge, no studies have explicitly identified such Japanese data sources for use in PV.

    Patient-Generated Data and Study Objectives

    Our motivation was to take the first step toward enhancing PV by considering the application of patient-generated data sources in Japan. In this study, we focused on the potential use of health-related disease blogs called tōbyōki. The term tōbyōki translates literally to “an account of a struggle with disease,” and this form of writing predates the Internet. Although it is difficult to pinpoint when patients started writing tōbyōki, a sociological study has reported that the number of tōbyōki has been increasing in Japan since the 1970s [24]. In these diary-like accounts, patients record observations about their lives and diseases in handwritten journals. Recently, some patients have started sharing their tōbyōki as blogs on the Internet.

    It has already been suggested that analyzing tōbyōki blogs is useful for understanding patients’ feelings when they receive a cancer diagnosis [25], although there was no discussion on their potential use in PV. In this study, we introduce a growing database called TOBYO, which is a collection of a broad range of tōbyōki blogs on the Internet [26]. The objective of this exploratory study was to address the following questions: (1) what kinds of data elements exist in the TOBYO database? (2) what are the differences in population distribution between the TOBYO database and other external databases generated by HCPs? (3) what kinds of analytic approaches are useful to obtain insights from the TOBYO database? and (4) can the TOBYO database be useful for PV?

    To achieve our objective, we conducted 2 analyses (Analysis A and Analysis B). In Analysis A, we used the whole TOBYO database to describe data elements and understand the overall characteristics of this database. In Analysis B, we used a data subset of selected disorders from the TOBYO database to explore the usefulness of the database in greater detail. Here, we focused on depressive disorders and rheumatoid arthritis (RA) because these conditions were expected to entail subjective patient symptoms such as discomfort, insomnia, and pain. Finally, we included a discussion of the potential of the TOBYO database and practical challenges from the PV perspective.


    Data Source

    In this study, we considered health-related tōbyōki blogs as a resource for patient-generated data. Some examples of excerpts from tōbyōki blogs are shown in Table 1. As shown in these examples, patients shared information about AEs, drug name, dosage, and AE-related distress.

    Table 1. Example of excerpts from tōbyōki blog.

    The TOBYO database consisted of a Web-based collection of tōbyōki blogs written in Japanese [26] and maintained by Initiative Inc (Tokyo, Japan). The overall flow of data in the TOBYO database is shown in Figure 1. Blogs written in Japanese were identified and extracted daily from the Internet using a proprietary crawling method. Before being registered in the TOBYO database, each tōbyōki blog was manually checked to judge whether it was a tōbyōki blog or noise, which was excluded. Each blog registered to the TOBYO database met all of the following selection criteria: (1) Language criteria: blogs written in plain Japanese language without extensive use of emoticons, symbols, or colloquial expressions were included; (2) Blogger criteria: blogs written by patients or their families were included. Blogs not written by patients or their families, such as those by manufacturers or HCPs who were providing medical care, were excluded (because such blogs generally described the HCP’s records and did not contain a patient perspective); and (3) Content criteria: blogs containing at least ten pages of tōbyōki entries on patients’ actual experiences were included. Blogs comprising excerpts from news media, books, health-related websites, or treatment guidelines were excluded. Blogs intended for marketing or promotion of commercial services or religious or political beliefs were also excluded.

    At the time of registration in the TOBYO database, information on gender, age at onset, and the primary disorder of each patient was determined by checking the profile or introduction page of each tōbyōki blog and stored as structured data for each patient. Text-based data in tōbyōki blogs were stored as unstructured data for each patient.

    Figure 1. Overall flow of data in the TOBYO database. (1) This study focuses on tōbyōki blogs that are publicly available on the Internet. Generally, there is a substantial volume of noise (white) unrelated to tōbyōki blogs (shaded). (2) Based on selection criteria described in Methods, filtering of tōbyōki blogs is performed manually, (3) and noise such as blogs written by companies is excluded. (4) Appropriate tōbyōki blogs are registered in the TOBYO database and stored for additional analysis.

    Analysis A: Using the Whole TOBYO Database

    Demographic Characteristics of the TOBYO Database

    To understand the demographic characteristics of the TOBYO database, structured data elements such as gender, age at onset, and frequently mentioned primary disorders were summarized in contingency tables. We also evaluated demographic characteristics by comparing population pyramids for the TOBYO database and 2 external databases generated by HCPs. The first HCP-generated database was the Japanese Adverse Drug Event Report (JADER) database maintained by the Pharmaceuticals and Medical Devices Agency. It comprised individual case safety reports (ICSRs) about the occurrences of serious adverse drug reactions (ADRs) for drugs approved in Japan. Similar to a previous report [27], we obtained the JADER dataset updated in September 2016 and extracted all ICSRs to create a population pyramid for the JADER database. The other HCP-generated database was the Japanese health insurance claims database maintained by Japan Medical Data Center, Ltd (Tokyo, Japan). It comprised medical claims information submitted from medical institutions to health insurance organizations for both corporate employees and their dependents [28]. Using this database, we created a population pyramid by determining the number of patients who had at least one record of drug prescription or disease from January 2011 to December 2015. As an additional comparison, we used national statistical surveillance data on all citizens living in Japan and publicly available through the Japanese government’s website [29].
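
    The tabulation behind a population pyramid can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the record layout (gender, age at onset), and the age bands are assumptions made for the example.

```python
from collections import Counter

# Illustrative age bands (assumption): (lower bound, upper bound, label).
AGE_BANDS = [
    (0, 19, "<20"),
    (20, 34, "20-34"),
    (35, 49, "35-49"),
    (50, 64, "50-64"),
    (65, 200, ">=65"),
]


def age_band(age):
    """Return the label of the band containing `age`; "unknown" if missing."""
    if age is None:
        return "unknown"
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "unknown"


def pyramid_counts(records):
    """Count records per (gender, age band); records are (gender, age) pairs."""
    return Counter((gender, age_band(age)) for gender, age in records)


def pyramid_shares(counts):
    """Express each cell as a percentage of all records, rounded to 2 places."""
    total = sum(counts.values())
    return {key: round(100 * n / total, 2) for key, n in counts.items()}
```

    Computing shares rather than raw counts makes pyramids from databases of very different sizes comparable on one scale.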

    Distribution of Disorders in the TOBYO Database

    To understand the distribution of primary disorders in the TOBYO database, frequently mentioned disorders were summarized. The name of each disorder was independently reviewed by 2 reviewers (ST and MS) and coded using Medical Dictionary for Regulatory Activities (MedDRA) version 19.1. MedDRA is a widely used, standardized medical terminology developed by the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use [30]. Both reviewers had at least two years of experience in processing and evaluating ICSRs.

    Additional Characteristics of the TOBYO Database

    We analyzed additional characteristics of tōbyōki blogs that might be useful to understand the data. Behavioral characteristics about writing tōbyōki blogs, such as the time and day of week for blog postings, were determined for all postings accompanied by relevant identifiable information. Continuity of tōbyōki blogs was calculated by counting the number of days from the first entry to the latest update for each patient.
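
    The behavioral measures described above can be sketched in a few lines. This is an illustrative example under stated assumptions: each blog is represented as a list of posting timestamps, continuity is counted in calendar days, and all function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def continuity_days(timestamps):
    """Calendar days from a blog's first entry to its latest update."""
    return (max(timestamps).date() - min(timestamps).date()).days


def posting_hour_histogram(timestamps):
    """Number of postings per hour of day (0-23)."""
    return Counter(ts.hour for ts in timestamps)


def share_continuing_longer_than(blogs, min_days):
    """Percentage of blogs whose continuity exceeds `min_days`."""
    longer = sum(1 for posts in blogs if continuity_days(posts) > min_days)
    return 100 * longer / len(blogs)
```

    For example, `share_continuing_longer_than(blogs, 3 * 365)` would approximate the share of blogs that continued for more than 3 years, ignoring leap days.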

    Analysis B: Using Subset of Selected Disorders in the TOBYO Database

    Mining Events Appearing in Tōbyōki Blogs

    As depicted in Figure 2, we applied natural language processing techniques to unstructured text-based data to prepare each dataset, which were then analyzed to answer specific questions (eg, what identifying words are frequently used by a particular population?). In this study, we extracted 2 different sets of tōbyōki blogs from the TOBYO database, 1 for patients with depression and 1 for patients with RA, and we prepared separate datasets containing all unstructured text written by patients with each disorder. We then analyzed the drugs and medical events mentioned in each dataset.

    To process the unstructured text, we first performed a morphological analysis using MeCab, an open-source Japanese segmentation tool [31], to break down each text into words. This preprocessing approach is commonly used for languages such as Japanese, in which words are not delimited by spaces [32]. Because tōbyōki blogs contained many entries unrelated to disease, such as those related to everyday life, the data were noisy. We therefore identified the 100 most frequently mentioned drugs in each dataset (depression and RA) and extracted every sentence containing at least one of those drugs; these extracted sentences were used for subsequent analysis. This approach enabled us to focus on drug-related contexts rather than on everyday diary-like content. The same 2 reviewers (ST and MS) independently reviewed summary tables containing the 300 most frequently mentioned words in each dataset (depression and RA) to identify medical events (eg, name of symptom, diagnosis, and disorder), which were coded using MedDRA. Because original descriptions written by patients tended to have some degree of ambiguity (eg, words such as suffering, feeling down, feeling unwell), discrepancies in coding sometimes occurred between the results of the 2 reviewers. The reviewers resolved any such discrepancies by discussion and determined a single appropriate Preferred Term in accordance with the standard guidance for MedDRA coding procedures (MedDRA Term Selection: Points to Consider [33]).
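
    The sentence filtering and word counting steps can be sketched as follows. This is a simplified illustration, not the study's pipeline: the function names are hypothetical, sentences are split on a few sentence-final characters only, drug mentions are detected by plain substring matching, and `tokenize` is any callable that turns a sentence into words. The study used MeCab for that step; the usage note substitutes a whitespace splitter purely as a stand-in.

```python
import re
from collections import Counter

# Sentence-final characters used for splitting (assumption for this sketch).
_SENTENCE_END = re.compile(r"[。！？!?\n]+")


def split_sentences(text):
    """Split text into non-empty, stripped sentences."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def drug_sentences(text, drug_names):
    """Keep only sentences that mention at least one of the given drug names."""
    return [s for s in split_sentences(text) if any(d in s for d in drug_names)]


def top_words(sentences, tokenize, n=300):
    """Return the n most frequent tokens across the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts.most_common(n)
```

    With a toy English text, `top_words(drug_sentences(text, {"paroxetine"}), str.split)` returns word counts restricted to drug-related sentences; for Japanese text, a morphological analyzer would replace `str.split`.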

    In addition to identifying medical events frequently observed in tōbyōki blogs, we examined differences in the types and frequencies of events between tōbyōki blogs and existing HCP-generated data sources. For this purpose, we compared medical terms frequently observed in tōbyōki blogs (as identified earlier) with those frequently observed in the JADER database. Using the JADER database, we first produced separate tables of the 30 most frequent ADRs reported for 4 biological drugs approved for RA (adalimumab, etanercept, infliximab, and tocilizumab, selected because they were the first 4 biologics approved in Japan around 2000 and were thus expected to contain enough data for comparison) and of the 30 most frequent ADRs reported for 4 selective serotonin reuptake inhibitors approved for depression (escitalopram, fluvoxamine, paroxetine, and sertraline, selected because they were widely prescribed and were also used in a previous study [14]). Then, by comparing these lists of events from tōbyōki blogs and the JADER database, we identified the words appearing in both databases and those appearing in only one of them. This focused comparison based on frequently appearing events enabled us to highlight the major characteristics of these databases. This process of review and comparison was carried out independently by the 2 aforementioned reviewers (ST and MS).
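
    The list comparison amounts to simple set operations, sketched below. The function name is hypothetical, and the terms in the usage note are illustrative only; they are not the contents of the study's tables.

```python
def compare_term_lists(blog_terms, adr_terms):
    """Partition coded terms into those seen in both sources and in only one."""
    blog, adr = set(blog_terms), set(adr_terms)
    return {
        "both": sorted(blog & adr),
        "blogs_only": sorted(blog - adr),
        "adr_only": sorted(adr - blog),
    }
```

    For instance, comparing `["pain", "anxiety", "interstitial lung disease"]` with `["interstitial lung disease", "pneumonia"]` places "interstitial lung disease" under `both`. The comparison is only meaningful if both lists are coded with the same terminology.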

    Figure 2. Work flow for morphological analysis and preparation of datasets.
    Visualization of Tōbyōki Blog Contents

    Because visualization approaches could be useful for PV, we used all sentences containing at least one of the above 100 drugs to calculate Jaccard coefficients to measure the similarity between term pairs. Jaccard coefficients index the degree of cooccurrence between term pairs by showing how much the terms overlap. For instance, Figure 3 shows the calculation of the Jaccard coefficient for drug A and verb X [34].
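
    For two terms, the Jaccard coefficient is the number of sentences containing both terms divided by the number of sentences containing at least one of them. A minimal sketch, assuming sentences have already been tokenized into word lists (function names are hypothetical):

```python
def sentence_ids_containing(term, tokenized_sentences):
    """Indices of the sentences in which `term` appears."""
    return {i for i, tokens in enumerate(tokenized_sentences) if term in tokens}


def jaccard(ids_a, ids_x):
    """Jaccard coefficient |A ∩ X| / |A ∪ X| of two sets of sentence indices."""
    a, x = set(ids_a), set(ids_x)
    union = a | x
    if not union:
        return 0.0  # neither term occurs anywhere
    return len(a & x) / len(union)
```

    If drug A appears in sentences 0 and 1 and verb X appears in sentences 0 and 2, the terms share 1 of 3 sentences, so the coefficient is 1/3.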

    Figure 3. Calculation of the Jaccard coefficient.

    Using these Jaccard coefficients, we visually represented the words associated with depression or RA in word clouds. In the word clouds, the size of each word reflected the frequency with which the word appeared in text (ie, the more frequently a word appeared, the larger the word was shown in the word cloud). The colors of each word were randomly assigned and did not have any meaning. Word clouds could be used in PV to achieve an initial, intuitive understanding of data. We also created a word cooccurrence network for patients with RA to evaluate the occurrence of words in conjunction with the names of 4 biological drugs approved for RA. Word cooccurrence network analysis could be used in PV to explore terms related to specific drugs.
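
    A word cooccurrence network around specific drugs can be sketched as an edge list linking each drug name to the words it cooccurs with. This is an illustration under assumptions, not the software used in the study: the function names and the `min_jaccard` threshold are hypothetical.

```python
from collections import defaultdict


def cooccurrence_edges(tokenized_sentences, focus_terms, min_jaccard=0.2):
    """Return (focus term, word, Jaccard) edges at or above the threshold."""
    # Map every word to the set of sentence indices in which it occurs.
    index = defaultdict(set)
    for i, tokens in enumerate(tokenized_sentences):
        for token in set(tokens):
            index[token].add(i)

    focus_set = set(focus_terms)
    edges = []
    for focus in focus_terms:
        focus_ids = index.get(focus, set())
        for word, ids in index.items():
            if word in focus_set:
                continue  # skip drug-to-drug links in this sketch
            score = len(focus_ids & ids) / len(focus_ids | ids)
            if score >= min_jaccard:
                edges.append((focus, word, round(score, 3)))
    return sorted(edges, key=lambda e: (-e[2], e[0], e[1]))


def shared_neighbors(edges, focus_terms):
    """Words linked to every focus term, ie, the center of the network."""
    linked = defaultdict(set)
    for focus, word, _ in edges:
        linked[word].add(focus)
    return {word for word, drugs in linked.items() if drugs == set(focus_terms)}
```

    Words returned by `shared_neighbors` correspond to terms at the center of the network, whereas words linked to only one or two drugs appear toward its margins.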

    Statistical software R, JMP software version 11.2.1 (SAS Institute), and Microsoft Excel were used for the analysis.

    Ethics Approval

    The study protocol was reviewed and approved by the nonprofit MINS Institutional Review Board [35]. The board waived informed consent because the data source did not contain personal information. In addition, we presented the data at the group level rather than at the individual level.


    Analysis A: Using the Whole TOBYO Database

    Demographic Characteristics of the TOBYO Database

    As of June 4, 2016, the tōbyōki blogs aggregated in the TOBYO database comprised 54,010 blogs representing 1405 disorders. The blogs were started from 1994 to 2016, but more than 90% of them were started from 2005 to 2015.

    As shown in Table 2, information on gender could be identified in most of the blogs (99.60%, 53,794/54,010). More blogs were written by female bloggers (68.80%, 37,161/54,010) than by male bloggers (30.80%, 16,633/54,010). Of approximately 40% of tōbyōki blogs in the TOBYO database with information on age at onset, more than half were written by people less than 50 years old. The peak age at onset was 20-34 years (24.44%, 13,201/54,010), followed by 35-49 years (16.35%, 8830/54,010) and less than 20 years (16.16%, 8730/54,010).

    Table 2. Distribution of gender and age at onset in TOBYO database.

    We found apparent differences in population distribution between the TOBYO database and existing data sources such as the Japanese health insurance claims database, JADER database, and national population statistics (Figure 4). Compared with national statistics as a standard, the population in the TOBYO database tended to be younger and contained relatively more females than males. In contrast, the population of the JADER database was older with no particular gender differences between ages. The health insurance claims database did not include people older than 75 years, but data for the young to middle-aged group seemed to be abundant with no particular gender differences between age groups.

    Figure 4. Comparison of population distribution between the TOBYO database and external databases.
    Distribution of Disorders in the TOBYO Database

    As shown in Table 3, the most frequently appearing disorders in the TOBYO database were breast cancer (9.23%, 4983/54,010), depression (6.58%, 3556/54,010), infertility (4.50%, 2430/54,010), RA (2.07%, 1118/54,010), and panic disorder (2.02%, 1090/54,010). These disorders were observed more frequently in females than in males in the TOBYO database. Categorization of disorders according to disease organ groups by a proprietary TOBYO classification system similar to MedDRA classification showed that the frequently appearing categories were neoplasms benign, malignant, and unspecified (31.20%, 16,851/54,010), psychiatric and behavior disorders (22.84%, 12,334/54,010), kidney, urological, or genital disorders (8.34%, 4507/54,010), and muscular, bone, or articular disorders (8.28%, 4471/54,010; Table 4).

    Table 3. Primary disorders frequently described in the TOBYO database.
    Table 4. Category for primary disorders frequently described in the TOBYO database.
    Additional Characteristics of the TOBYO Database

    We also highlighted unique data elements by analyzing behavioral characteristics of writing tōbyōki blogs and found that most writers updated their blogs between 9 PM and midnight (Figure 5). No particular patterns were observed according to which days of the week blog entries were posted. More than one-third of the blogs in the TOBYO database (36.81%, 19,879/54,010) had continued for more than 3 years.

    Figure 5. Additional characteristics of the TOBYO database.

    Analysis B: Using Subset of Selected Disorders in the TOBYO Database

    Mining Events Appearing in Tōbyōki Blogs

    Comparison of depression (Table 5) and RA (Table 6) events in tōbyōki blogs and the JADER database showed apparent differences in the types and frequencies of events observed. Subjective, symptomatic terms and general terms for patients tended to be frequently observed in tōbyōki blogs (eg, anxiety, headache, and pain), whereas more technical, medical terms (eg, syndrome and abnormal laboratory test result) tended to be observed frequently in the JADER database. As an exception, “interstitial lung disease” in patients with RA was observed frequently in both tōbyōki blogs and the JADER database, suggesting relatively high attention to this event.

    Table 5. Comparison of events in patients with depression in the TOBYO database and JADER database.
    Table 6. Comparison of events in patients with rheumatoid arthritis in the TOBYO database and JADER database.
    Visualization of Contents in Tōbyōki Blogs

    As depicted in Figures 6 and 7, “take” (as in “take medicine”) was the most frequent word in the datasets for depression and RA, suggesting that extraction of tōbyōki blog content containing the 100 most frequently mentioned drugs helped focus the data. Among patients with depression (Figure 6), sleep-related terms such as “lie down,” “sleep (noun),” “sleep (verb),” “sleepiness,” “awakening,” and “awaken” were observed, indicating that patients shared information about their disease conditions. We also found therapy-specific words such as “adverse effects,” “antidepressant agent,” “depression drug,” and “withdrawal symptoms.” Among patients with RA (Figure 7), pain-related terms such as “pain,” “painful,” “swelling,” and “stiffness” were frequently noted, indicating that these were important words for characterizing RA.

    As depicted in Figure 8, the words “rheumatism,” “give relief,” “pain,” and “painful” were located at the center of the word cooccurrence networks of the 4 biological drugs considered in this study, meaning that these words were frequently used in association with all 4 drugs. The characteristics of each drug were also observed in the margins of the word cooccurrence networks. For example, adalimumab and etanercept, administered as subcutaneous injections, were associated with the word “self-injection,” and infliximab and tocilizumab, administered as intravenous infusions, were associated with the word “infusion.”

    Figure 6. Word cloud: visualization of words frequently observed in tōbyōki blogs of patients with depression. (a) English version translated from the original Japanese and (b) the original Japanese version.
    Figure 7. Word cloud: visualization of words frequently observed in tōbyōki blogs of patients with rheumatoid arthritis. (a) English version translated from the original Japanese and (b) the original Japanese version.
    Figure 8. Word cooccurrence network: visualization of words occurring with biological drugs in tōbyōki blogs of patients with rheumatoid arthritis. Because the original language is Japanese, English translations are shown together.


    Principal Findings

    Patient-generated data is likely to play a key role in improving PV [36]. In Japan, however, a system of self-reporting by patients is still being considered [10] and no patient-generated data resources have been explicitly identified. As one option for such a resource, this study evaluated the TOBYO database from the PV perspective.

    In the whole TOBYO database, more blogs were written by female bloggers, and fewer blogs were written by people older than 50 years (Table 2). These findings were consistent with the results of a general survey of Internet usage in Japan [23]. Reflecting the fact that a higher percentage of tōbyōki blogs were written by women, the most frequently appearing disorders in the TOBYO database tended to have a higher prevalence in women: breast cancer, cervical cancer [37], RA [38], and panic disorder [39] (Table 3). Additional analysis of tōbyōki blogs would be more realistic for these disorders with a high prevalence in women. Our findings also suggested the relevance of tōbyōki blogs to other frequently appearing disorders, such as psychiatric disorders in which subjective symptoms have important clinical meaning, refractory disorders, autoimmune disorders, and other chronic disorders.

    As shown in Tables 5 and 6, tōbyōki blogs written by patients with depression or RA contained symptomatic, subjective terms rather than medical diagnoses or other medical terms. This revealed a difference between tōbyōki blogs and the JADER database generated by HCPs and implied that the TOBYO database might have the advantage of enabling the analysis of patient-level outcomes that could not be captured in existing data sources. Indeed, previous research has shown that psychiatric events are difficult to identify in health care administrative databases because physicians have difficulty detecting them and patients avoid reporting the symptoms to their physicians [40]. Another interesting possibility is that even if a patient reporting system is implemented, patients may not voluntarily report events that they do not consider to be AEs, as suggested by previous research conducted on patients with Parkinson’s disease [41]. In such a situation, in which patients themselves do not consider the possibility of AEs, the TOBYO database can be useful for capturing initial symptoms as AE signals.

    We confirmed the feasibility of analyzing patient narratives using text mining to draw insights from tōbyōki blogs. Word clouds suggested characteristic words associated with selected conditions, such as “sleeping” and “anxiety” with depression and “pain” and “painful” with RA. This suggested that tōbyōki blogs were a useful resource for understanding characteristic information for each disorder. We were also able to identify words commonly associated with the 4 biological drugs located at the center of the word cooccurrence networks (Figure 8). The common words revealed in this study were not particularly noteworthy, but further research using the same approach with different drugs or disease areas might be useful for exploring drug safety concerns such as unknown AEs. For example, a report analyzing tweets written by Japanese patients with cancer suggested that visualizing narratives with word cooccurrence networks could be a useful approach to obtain insights from social media [42].

    We noted several strengths of tōbyōki blogs as a resource for data analysis in this study. One was the ease of obtaining patient background information as summarized in Table 2. In contrast to other data sources such as Web-based discussion forums in which patient background information was inherently limited [43], tōbyōki blogs usually had a profile or introduction page from which a substantial level of information could be collected. Another strength was that most tōbyōki bloggers wrote their blogs voluntarily to record and share their experiences with others, resulting in primarily subjective descriptions of patient experiences. This first-hand, observational quality, free from obligations or interventions, might enable researchers to better understand patients’ actual concerns. A third strength was that compared with common blogs or social media (even those written by patients), tōbyōki blogs might be more likely to contain analyzable information on health-related or life-related topics because serious disease and other health crises were typical motivations for starting tōbyōki blogs.


    This study had several limitations. First, because tōbyōki blogs were written by only a segment of the patient population, generalization of the findings required caution. For instance, the elderly population might be underrepresented in Internet sources [23]. In addition, as a patient’s condition became more severe, it might be more difficult for them to continue writing their tōbyōki blogs. These biases should be considered when interpreting the results. Second, the insights obtained from qualitative text-mining approaches were based on some degree of subjective interpretation by researchers. For example, in word clouds, the relative size of each word reflected its frequency. This was helpful for identifying frequent or important words that were mentioned by many bloggers. On the other hand, because the size of each word did not reflect its clinical significance, it was possible that some smaller words had greater clinical significance. Although word clouds have the potential to provide some insights from textual data, they should be interpreted with caution, keeping their pros and cons in mind. Third, some technical improvements would be necessary to extract more meaningful knowledge from the texts used in this study. For instance, we considered only individual words for analysis. By excluding phrases and other word combinations, we might have missed some important concepts or patient feelings. Additional techniques such as entity linking or named entity recognition should be considered in future studies to improve the results. Finally, because the language in social media tends to be highly informal and to contain a wide variety of expressions, identification of specific concepts such as AEs and medicinal drugs from the unstructured narratives is a challenge. Although we could identify frequently appearing medical events in the TOBYO database, as shown in Tables 5 and 6, it is apparent that not all these events were AEs because we did not consider whether they had occurred before or after drug administration. Additional work is necessary to identify AEs occurring after drug administration.
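    The word cloud limitation described above can be illustrated with a minimal Python sketch. It is not the analysis pipeline used in this study. It assumes hypothetical, already-tokenized toy data and a hypothetical function name, and it separates two counts that a word cloud can conflate: the total number of occurrences of a word and the number of distinct bloggers who used it.

```python
from collections import Counter


def term_and_blogger_frequency(tokens_by_blogger):
    """Count total occurrences and distinct bloggers for each word."""
    term_freq = Counter()
    blogger_freq = Counter()
    for tokens in tokens_by_blogger.values():
        term_freq.update(tokens)          # every occurrence counts
        blogger_freq.update(set(tokens))  # each blogger counts once per word
    return term_freq, blogger_freq


# Hypothetical toy data for illustration only (not from the TOBYO database).
toy_blogs = {
    "blogger_a": ["pain", "pain", "pain", "pain", "insomnia"],
    "blogger_b": ["insomnia", "fatigue"],
    "blogger_c": ["insomnia", "pain"],
}

term_freq, blogger_freq = term_and_blogger_frequency(toy_blogs)
for word in sorted(term_freq):
    print(word, term_freq[word], blogger_freq[word])
```

    In this toy example, "pain" has the most occurrences because one blogger repeats it, whereas "insomnia" is mentioned by more bloggers. Neither count says anything about clinical significance, which still requires expert review.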

    Future Challenges for Social Pharmacovigilance

    We also recognized future challenges for the effective use of social media data in PV. First, there is a need for official guidance or policy on whether informed consent must be obtained from patients and on how privacy should be protected. Although research interest in the use of social media is growing, there is currently no consensus or guideline [44]. We think there is no need for artificial constraints such as obtaining subsequent informed consent for the use of blog data because they are already publicly available on the Internet. Regarding patients’ decisions on whether to share data, a study showed that patients in the cancer community tended to think positively about sharing as long as the benefit of sharing data outweighed the risk [45], and the authors recommended that researchers should be careful to protect patient anonymity. In accordance with this recommendation, we prepared all analysis output as summarized data and not individual-level data in consideration of patients’ rights to protected privacy. Second, we acknowledge that issues exist with the reliability and reproducibility of social media data, particularly from the regulatory, good pharmacovigilance practice perspective. Concerns about the incorporation of false information have been noted previously [46]. For our study using tōbyōki blogs, we assume that the extent of this problem would not be very large because there is no conceivable incentive for maintaining a fake tōbyōki blog at this time. Selecting blogs with more than 10 pages in the screening process before registration in the TOBYO database would help to prevent the inclusion of fake blogs. Concerns about the reproducibility of analysis present a practical challenge. It is not realistic to keep a dynamic dataset that is updated every day and that may be updated retrospectively. To ensure the reproducibility of individual research, storing the final dataset as a snapshot is recommended. Finally, because the volume of data on the Internet is continuously growing, there may be a need to think about how to efficiently detect and process AE information on the Internet. One option is to improve the text-mining algorithm using dictionary-based methods by preparing an annotated corpus to recognize AEs and drugs. However, this process would be time-consuming and costly. Another option is the application of a machine learning approach by preparing a classifier algorithm that does not necessarily require the preparation of annotated corpora. There has been a report of the application of a deep-learning technique to detect potential AEs from social media texts [47]. In summary, we need to tackle several practical and technical issues to efficiently incorporate social media resources into PV.
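    As a rough illustration of the dictionary-based option mentioned above, the following Python sketch matches informal surface expressions against small lookup tables and maps them to normalized concepts. It is a sketch under stated assumptions, not a method used in this study: the dictionaries, the function name, and the English example post are all hypothetical, and Japanese text would first need morphological analysis (for example, with MeCab [31]) because it has no whitespace word boundaries.

```python
import re

# Hypothetical mini-dictionaries for illustration only. A real system would
# rely on a curated terminology and a maintained drug name list.
AE_TERMS = {
    "nausea": "Nausea",
    "can't sleep": "Insomnia",
    "insomnia": "Insomnia",
}
DRUG_TERMS = {
    "methotrexate": "methotrexate",
    "mtx": "methotrexate",
}


def find_mentions(text, dictionary):
    """Return sorted (start offset, matched text, normalized concept) tuples."""
    lowered = text.lower()
    hits = []
    for surface, concept in dictionary.items():
        # Whole-word match so that short entries do not fire inside longer words.
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, lowered):
            hits.append((match.start(), match.group(), concept))
    return sorted(hits)


# Hypothetical example post (English stand-in for an informal blog sentence).
post = "After starting MTX I had nausea and can't sleep."
ae_hits = find_mentions(post, AE_TERMS)
drug_hits = find_mentions(post, DRUG_TERMS)
print(ae_hits)
print(drug_hits)
```

    A lookup of this kind only flags co-mentioned events and drugs. It does not establish whether an event occurred after drug administration, and every informal variant has to be added to the dictionary by hand, which is why the approach is time-consuming and costly.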


    PV activities should have a strong focus on patients’ actual experiences, concerns, and outcomes, and this approach is expected to uncover hidden AE signals earlier and help us understand AEs in a patient-centered way. This study described the fundamental characteristics of tōbyōki blogs in the TOBYO database and provided insights into considering the use of such data for PV. Specific application possibilities for the TOBYO database include the analysis of disorders with a high prevalence in women, psychiatric disorders with important subjective symptoms, refractory disorders, and other chronic disorders. Further research would facilitate the enhancement of PV by incorporating patient-generated data from the Internet.


    The authors thank Hiraku Miyake and Masaru Okuyama for providing tōbyōki blog data from the TOBYO database. The authors also thank Sawako Satomi and Terumi Nakayama for their invaluable input on designing this study and Matthew Mckeehan for writing assistance.

    Conflicts of Interest

    All authors are employees of Chugai Pharmaceutical Co, Ltd. Chugai Pharmaceutical Co, Ltd provided support in the form of salaries for all authors, but did not have any additional role in the study design, data analysis, decision to publish, or preparation of the manuscript.

    Authors' Contributions

    SM took primary responsibility for conducting this study. All authors contributed to the conception and study design. Data analyses and interpretation were done by SM, ST, MS, HK, and RT. SM drafted the manuscript with support from ST and MS. All authors contributed revisions of the manuscript and approved the final version.


    1. Apps.WHO. The importance of pharmacovigilance: safety monitoring of medicinal products URL: [accessed 2016-12-08]
    2. Edwards IR. The future of pharmacovigilance: a personal view. Eur J Clin Pharmacol 2008 Feb;64(2):173-181.
    3. Basch E. The missing voice of patients in drug-safety reporting. N Engl J Med 2010 Mar 11;362(10):865-869.
    4. Härmark L, Raine J, Leufkens H, Edwards IR, Moretti U, Sarinic VM, et al. Patient-reported safety information: a renaissance of pharmacovigilance? Drug Saf 2016 Oct;39(10):883-890.
    5. Blenkinsopp A, Wilkie P, Wang M, Routledge PA. Patient reporting of suspected adverse drug reactions: a review of published literature and international experience. Br J Clin Pharmacol 2007 Feb;63(2):148-156.
    6. Inch J, Watson MC, Anakwe-Umeh S. Patient versus healthcare professional spontaneous adverse drug reaction reporting: a systematic review. Drug Saf 2012 Oct 01;35(10):807-818.
    7. Hazell L, Cornelius V, Hannaford P, Shakir S, Avery AJ, Yellow Card Study Collaboration. How do patients contribute to signal detection? A retrospective analysis of spontaneous reporting of adverse drug reactions in the UK's Yellow Card Scheme. Drug Saf 2013 Mar;36(3):199-206.
    8. Fernandopulle RB, Weerasuriya K. What can consumer adverse drug reaction reporting add to existing health professional-based systems? Focus on the developing world. Drug Saf 2003;26(4):219-225.
    9. PMDA. Starting the trial of patient adverse drug reaction reporting [in Japanese] URL: [accessed 2016-12-08]
    10. Kubota K, Okazaki M, Dobashi A, Yamamoto M, Hashiguchi M, Horie A, et al. Temporal relationship between multiple drugs and multiple events in patient reports on adverse drug reactions: findings in a pilot study in Japan. Pharmacoepidemiol Drug Saf 2013 Oct;22(10):1134-1137.
    11. Yamamoto M, Kubota K, Okazaki M, Dobashi A, Hashiguchi M, Doi H, et al. Patients views and experiences in online reporting adverse drug reactions: findings of a national pilot study in Japan. Patient Prefer Adherence 2015;9:173-184.
    12. Eysenbach G. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. J Med Internet Res 2009;11(1):e11.
    13. Powell GE, Seifert HA, Reblin T, Burstein PJ, Blowers J, Menius JA, et al. Social media listening for routine post-marketing safety surveillance. Drug Saf 2016 May;39(5):443-454.
    14. Alvaro N, Conway M, Doan S, Lofi C, Overington J, Collier N. Crowdsourcing Twitter annotations to identify first-hand experiences of prescription drug use. J Biomed Inform 2015 Dec;58:280-287.
    15. Carbonell P, Mayer MA, Bravo A. Exploring brand-name drug mentions on Twitter for pharmacovigilance. Stud Health Technol Inform 2015;210:55-59.
    16. Freifeld CC, Brownstein JS, Menone CM, Bao W, Filice R, Kass-Hout T, et al. Digital drug safety surveillance: monitoring pharmaceutical products in twitter. Drug Saf 2014 May;37(5):343-350.
    17. O'Connor K, Pimpalkhute P, Nikfarjam A, Ginn R, Smith KL, Gonzalez G. Pharmacovigilance on twitter? Mining tweets for adverse drug reactions. AMIA Annu Symp Proc 2014;2014:924-933.
    18. Baker L, Wagner TH, Singer S, Bundorf MK. Use of the Internet and e-mail for health care information: results from a national survey. J Am Med Assoc 2003 May 14;289(18):2400-2406.
    19. PatientsLikeMe. Live better, together! URL: [accessed 2016-12-08]
    20. Patientslikeme. News article: PatientsLikeMe and the FDA Sign Research Collaboration Agreement URL: http://blog.2015/06/15/patientslikeme-and-the-fda-sign-research-collaboration-agreement/ [accessed 2016-12-08]
    21. Web-radr. WEB-RADR project: About us URL: [accessed 2016-12-08]
    22. EMA.Europa. Guideline on good pharmacovigilance practices URL: [accessed 2017-01-20]
    23. Takahashi Y, Ohura T, Ishizaki T, Okamoto S, Miki K, Naito M, et al. Internet use for health-related information via personal computers and cell phones in Japan: a cross-sectional population-based survey. J Med Internet Res 2011;13(4):e110.
    24. Kadobayashi M. As a source of power to live: sociology of tobyoki for cancer [in Japanese]. Tokyo: Seikaisha Press; 2011:18-19.
    25. Sato A, Aramaki E, Shimamoto Y, Tanaka S, Kawakami K. Blog posting after lung cancer notification: content analysis of blogs written by patients or their families. JMIR Cancer 2015;1(1):e5.
    26. TOBYO website [in Japanese]. URL: [accessed 2016-12-08]
    27. Fujiwara M, Kawasaki Y, Yamada H. A pharmacovigilance approach for post-marketing in Japan using the Japanese adverse drug event report (JADER) database and association analysis. PLoS One 2016;11(4):e0154425.
    28. Kimura S, Sato T, Ikeda S, Noda M, Nakayama T. Development of a database of health insurance claims: standardization of disease classifications and anonymous record linkage. J Epidemiol 2010;20(5):413-419.
    29. e-Stat [in Japanese]. List of statistical tables URL: https://www.SG1/estat/GL08020103.do?_toGL08020103_&listID=000001137972&disp=Other&requestSender=estat [accessed 2016-12-08]
    30. Brown EG, Wood L, Wood S. The medical dictionary for regulatory activities (MedDRA). Drug Saf 1999 Feb;20(2):109-117.
    31. Taku910.github. MeCab: Yet Another Part-of-Speech and Morphological Analyzer [in Japanese] URL: [accessed 2016-12-08]
    32. Manning CD, Schutze H. Foundations of statistical natural language processing. London: The MIT Press; 2002.
    33. MedDRA. MedDRA Term Selection: Points to Consider URL: [accessed 2017-01-20]
    34. Niwattanakul S, Singthongchai J, Naenudorn E, Wanapu S. Using of Jaccard coefficient for keywords similarity. 2013 Presented at: Proceedings of the International MultiConference of Engineers and Computer Scientists; March 13-15, 2013; Hong Kong.
    35. NPO-MINS. MINS [in Japanese] URL: [accessed 2016-12-08]
    36. Pitts PJ, Louet HL, Moride Y, Conti RM. 21st century pharmacovigilance: efforts, roles, and responsibilities. Lancet Oncol 2016 Nov;17(11):e486-e492.
    37. Ginsburg O, Bray F, Coleman MP, Vanderpuye V, Eniu A, Kotha SR, et al. The global burden of women's cancers: a grand challenge in global health. Lancet 2016 Nov 01.
    38. McInnes IB, Schett G. The pathogenesis of rheumatoid arthritis. N Engl J Med 2011 Dec 08;365(23):2205-2219.
    39. Roy-Byrne PP, Craske MG, Stein MB. Panic disorder. Lancet 2006 Sep 16;368(9540):1023-1032.
    40. Spettell CM, Wall TC, Allison J, Calhoun J, Kobylinski R, Fargason R, et al. Identifying physician-recognized depression from administrative data: consequences for quality measurement. Health Serv Res 2003 Aug;38(4):1081-1102.
    41. Perez-Lloret S, Rey MV, Fabre N, Ory F, Spampinato U, Montastruc J, et al. Do Parkinson's disease patients disclose their adverse events spontaneously? Eur J Clin Pharmacol 2012 May;68(5):857-865.
    42. Tsuya A, Sugawara Y, Tanaka A, Narimatsu H. Do cancer patients tweet? Examining the twitter use of cancer patients in Japan. J Med Internet Res 2014;16(5):e137.
    43. Park J, Ryu YU. Online discourse on fibromyalgia: text-mining to identify clinical distinction and patient concerns. Med Sci Monit 2014 Oct 07;20:1858-1864.
    44. Dreyer NA, Blackburn S, Hliva V, Mt-Isa S, Richardson J, Jamry-Dziurla A, et al. Balancing the interests of patient data protection and medication safety monitoring in a public-private partnership. JMIR Med Inform 2015;3(2):e18.
    45. Frost J, Vermeulen IE, Beekers N. Anonymity versus privacy: selective information sharing in online cancer communities. J Med Internet Res 2014;16(5):e126.
    46. Shah SG, Robinson I. Patients' perspectives on self-testing of oral anticoagulation therapy: content analysis of patients' internet blogs. BMC Health Serv Res 2011 Feb 03;11:25.
    47. Nikfarjam A, Sarker A, O'Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc 2015 May;22(3):671-681.


    AE: adverse event
    ADR: adverse drug reaction
    HCP: health care professional
    ICSR: individual case safety report
    JADER: Japanese Adverse Drug Event Report
    MedDRA: Medical Dictionary for Regulatory Activities
    PV: pharmacovigilance
    RA: rheumatoid arthritis

    Edited by G Eysenbach; submitted 08.12.16; peer-reviewed by S Blackburn, E Aramaki; comments to author 03.01.17; revised version received 30.01.17; accepted 31.01.17; published 24.02.17

    ©Shinichi Matsuda, Kotonari Aoki, Shiho Tomizawa, Masayoshi Sone, Riwa Tanaka, Hiroshi Kuriki, Yoichiro Takahashi. Originally published in JMIR Public Health and Surveillance, 24.02.2017.

    This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication, as well as this copyright and license information must be included.