This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
As social media become increasingly popular online venues for engaging in communication about public health issues, it is important to understand how users promote knowledge and awareness about specific topics.
The aim of this study is to examine the frequency of discussion and differences by race and ethnicity of cancer-related topics among unique users via Twitter.
Tweets were collected from April 1, 2014 through January 21, 2015 using the Twitter public streaming Application Programming Interface (API) to collect 1% of public tweets. Twitter users were classified into racial and ethnic groups using a new text mining approach applied to English-only tweets. Each ethnic group was then analyzed for the frequency of cancer-related terms within user timelines, investigated for changes over time and across groups, and measured for statistical significance.
Observable usage patterns of the terms "cancer", "breast cancer", "prostate cancer", and "lung cancer" between Caucasian and African American groups were evident across the study period. We observed some variation in the frequency of term usage during months known to be labeled as cancer awareness months, particularly September, October, and November. Interestingly, we found that of the terms studied, "colorectal cancer" received the least Twitter attention.
The findings of the study provide evidence that social media can serve as a very powerful and important tool in implementing and disseminating critical prevention, screening, and treatment messages to the community in real-time. The study also introduced and tested a new methodology of identifying race and ethnicity among social media users. Study findings highlight the potential benefits of social media as a tool in reducing racial and ethnic disparities.
Cancer is a major public health problem, impacting more than 14 million men and women in the United States. As of January 2014, an estimated 1.6 million additional new cancer cases will be diagnosed among Americans in 2015 [
Knowledge and awareness about the four cancers with the highest incidence and mortality among adults in the United States, namely lung, breast, prostate, and colorectal cancer, have been shown to differ by race and ethnicity [
Today, social media outlets, including Twitter, Facebook, and Instagram, are popular online platforms for communicating about anything and everything, and many studies [
In this study, we aim to explore differences in cancer-related tweeting by race and ethnicity, basing our work on Rickford’s assertion of unique vernacular patterns amongst African-Americans [
Tweets were collected from April 1, 2014 through January 21, 2015 using the Twitter public streaming Application Programming Interface (API) to collect 1% of public tweets, yielding 281,276,343 tweets. For this study, we restricted our collection to English-only tweets. We placed no restriction on Global Positioning System (GPS) values for each tweet because GPS data were sparsely available; instead, we restricted tweet location to US-only accounts using an approach introduced later in this paper. Due to a technical issue with our collection system, tweets from May 13, 2014 through July 24, 2014 were not retained. During the data collection period, the Twitter-provided unique user identification (ID) number, tweet, date/time, profile-identified location, and GPS latitude and longitude values (when available) were collected. Following the collection of tweets, user timelines were reconstructed by grouping tweets using the unique user ID number. The distribution of character lengths for tweets in the collection is shown in
Histogram showing the distribution of Tweet character lengths.
Histogram of the log of the character length of user timelines. We present this graph in log-form due to the wider distribution of character lengths in timelines.
The preprocessing procedure for cleaning tweets followed a consistent approach across all collected timelines. Given that the focus was on the predictive power of text, tweets containing linking information outside of the self-contained tweet, predominantly non-language elements (ie, URLs, usernames, and re-tweet information), were systematically removed. For example, a tweet containing elements such as "www.t.co", "cnn.com", "@username", and "RT @username" would be removed from the collection. While re-tweeted text may provide information about the individuals and/or organizations a user interacts with via Twitter, at this scale we were unable to include all re-tweets using the provided Twitter API due to rate limitations (ie, restrictions imposed by Twitter limiting the number of searches we could conduct in a 15-minute period). User timelines (tweets aggregated by user) that contained little information were removed by systematically eliminating those shorter than 85 characters from the study. To select this character threshold, we randomly selected timelines of varying length and observed that timelines shorter than 85 characters generally contained fewer than fifteen words, which provided too little information to make accurate classifications. These preprocessing methods left us with a final tweet count of 19,818,236 belonging to 779,653 unique users’ timelines for analysis.
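The filtering steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' pipeline: the regular expressions, the function names `has_linking_elements` and `clean_timelines`, and the choice to join tweets with a single space are all assumptions. It also assumes the literal reading of the text, in which a tweet containing a URL, username, or re-tweet marker is dropped entirely rather than having the element stripped out.

```python
import re

# Illustrative patterns for the non-language elements named in the text.
# These are assumptions, not the rules actually used in the study.
URL_RE = re.compile(r"https?://\S+|www\.\S+|\b[\w-]+\.(?:com|org|net|co)\b", re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"^\s*RT\b")

MIN_TIMELINE_CHARS = 85  # threshold stated in the text


def has_linking_elements(tweet: str) -> bool:
    """Return True if the tweet contains a URL, a @username, or a re-tweet marker."""
    return bool(
        URL_RE.search(tweet) or MENTION_RE.search(tweet) or RETWEET_RE.search(tweet)
    )


def clean_timelines(tweets_by_user: dict) -> dict:
    """Drop tweets with linking elements, then drop timelines under 85 characters.

    tweets_by_user maps a user ID to that user's list of tweet texts.
    The result maps each retained user ID to a single timeline string.
    """
    timelines = {}
    for user_id, tweets in tweets_by_user.items():
        kept = [t for t in tweets if not has_linking_elements(t)]
        timeline = " ".join(kept)  # joining with a space is an assumption
        if len(timeline) >= MIN_TIMELINE_CHARS:
            timelines[user_id] = timeline
    return timelines
```

A user whose remaining tweets add up to fewer than 85 characters simply disappears from the output, which mirrors the exclusion of low-information timelines described in the text.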
The approach to classifying users’ ethnicities presented in this paper relies on a supervised learning classification approach [
Individual tweets are short, often uninformative messages providing little classification potential for identification of user profile information. This led us to examine users’ timelines, rather than individual tweets, to enhance the accuracy of our classification approach by extracting features consisting of deeper information around users’ activities. Users’ tweets were collapsed into timelines containing the chronological order of their submitted tweets for the 10-month data collection period. This provided a larger text source for identifying descriptive elements indicative of a given user’s ethnicity.
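As a rough illustration of how tweets might be collapsed into chronologically ordered timelines, consider the sketch below. The record layout `(user_id, created_at, text)` and the function name `build_timelines` are hypothetical; the study does not describe its data structures.

```python
from collections import defaultdict
from datetime import datetime


def build_timelines(tweets):
    """Group tweets by user ID and concatenate them in chronological order.

    tweets is an iterable of (user_id, created_at, text) tuples, where
    created_at is a datetime. This layout is an assumption for illustration.
    """
    by_user = defaultdict(list)
    for user_id, created_at, text in tweets:
        by_user[user_id].append((created_at, text))
    return {
        user_id: " ".join(text for _, text in sorted(items, key=lambda item: item[0]))
        for user_id, items in by_user.items()
    }


# Usage with made-up records
sample = [
    ("u1", datetime(2014, 4, 2), "second tweet"),
    ("u1", datetime(2014, 4, 1), "first tweet"),
    ("u2", datetime(2014, 4, 3), "another user"),
]
timelines = build_timelines(sample)
```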
Baseline classification models described in previous work [
When building the baseline classifier [
Latent Dirichlet allocation (LDA) [
We used ten-fold cross validation to test the accuracy of the models. The labeled dataset was divided into ten equally sized bins. Nine of the ten bins were used to train the model, while the remaining bin was used for testing. We iterated over the bins ten times, reserving a new bin for testing with each additional iteration. Due to the unbalanced nature of our dataset, we chose two evaluation metrics. First, for each ethnicity, we computed the balanced accuracy (Equation a,
Balanced and overall accuracy equations.
All statistical analyses for this study were carried out using the R statistical software package. To measure the statistical significance of the observed differences between groups,
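The ten-fold procedure described above can be sketched as follows. This is a generic illustration in plain Python, not the authors' R code; the function name `k_fold_indices`, the shuffling step, and the fixed seed are assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Items are shuffled once (an assumption), split into k near-equal bins,
    and each bin is reserved for testing exactly once.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Usage: with 100 labeled users, each iteration trains on 90 and tests on 10.
for train_idx, test_idx in k_fold_indices(100, k=10):
    pass  # fit the classifier on train_idx, evaluate it on test_idx
```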
To evaluate the success in the classification of race and ethnicity, we compared the accuracy of text classification with synonym expansion against the topic-based method (
Text classification with synonym expansion model classification and accuracy results.
Race and ethnicity | Accuracy, % |
Balanced accuracy | |
Caucasian | 88.87 |
African American | 81.26 |
Asian | 72.32 |
Hispanic | 69.07 |
Overall accuracy | |
All groups | 76.07 |
Caucasian and African Americans | 88.30 |
Confusion matrix.
Classification | Reference: Caucasian, n | Reference: African American, n | Reference: Asian, n | Reference: Hispanic, n |
Caucasian | 1067 | 117 | 49 | 71 |
African American | 890 | 1286 | 337 | 380 |
Asian | 26 | 10 | 39 | 35 |
Hispanic | 7 | 7 | 25 | 54 |
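For readers who want to see how per-group balanced accuracy and overall accuracy can be derived from a confusion matrix laid out like the one above (rows are classifications, columns are reference labels), a minimal sketch follows. The paper's own equations are not reproduced in this text, so the sketch assumes the common one-versus-rest definition, balanced accuracy = (sensitivity + specificity) / 2. The function names and the small 2x2 matrix in the usage are hypothetical.

```python
def per_class_balanced_accuracy(matrix, labels):
    """Compute one-vs-rest balanced accuracy for each class.

    matrix[i][j] is the number of items classified as labels[i] whose
    reference label is labels[j]. Balanced accuracy is assumed to be
    (sensitivity + specificity) / 2. Each class must appear at least once
    in the reference column and at least once outside it.
    """
    total = sum(sum(row) for row in matrix)
    result = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fp = sum(matrix[k]) - tp                  # classified as k, reference differs
        fn = sum(row[k] for row in matrix) - tp   # reference is k, classified otherwise
        tn = total - tp - fp - fn
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        result[label] = (sensitivity + specificity) / 2
    return result


def overall_accuracy(matrix):
    """Fraction of all items that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Usage with a made-up two-class matrix (not data from the study)
toy = [[8, 2],
       [1, 9]]
scores = per_class_balanced_accuracy(toy, ["A", "B"])
accuracy = overall_accuracy(toy)
```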
Given the higher overall accuracy, as well as the high accuracies among Caucasian and African-American users, we selected the synonym expansion approach for classifying the remaining unlabeled users within the collection. Additionally, we elected to exclude users classified as Asian and Hispanic from this study for multiple reasons. First, the populations of users who declared these ethnicities were markedly smaller than the populations of Caucasians and African-Americans. In addition, we believe we may have excluded some Asian and Hispanic users by limiting the collection to English-only tweets. The combination of these complications (small population sizes and the restriction to English-only tweets) is the likely reason for the reduction in accuracy among these groups and their subsequent exclusion from the study.
In this study, we have established and tested a systematic method for detecting ethnicities among Twitter users. Using the more accurate approach, text classification with synonym expansion, we detected and assigned ethnicities to all users within the collection consisting of 19,818,236 tweets posted by 779,653 unique users. Tweets were divided by posting date into nine months, accounting for the ten-month study period with portions of May and July and the entirety of June lost due to system failure. Various descriptive statistics were calculated to describe the health effects extracted from the dataset.
As shown in
Distribution of unique active Twitter users during each month of the study period by race and ethnicity.
Month | African American, n (%) | Caucasian, n (%) | Asian, n (%) | Hispanic, n (%) | Total |
April | 49,104 (9.72) | 452,924 (89.64) | 1289 (0.25) | 1935 (0.38) | 505,252 |
May^a | 40,956 (12.76) | 277,169 (86.36) | 1177 (0.37) | 1646 (0.51) | 320,948 |
July^a | 43,349 (9.58) | 405,185 (89.57) | 1661 (0.37) | 2191 (0.48) | 452,386 |
August | 54,740 (7.91) | 632,687 (91.47) | 1820 (0.26) | 2466 (0.36) | 691,713 |
September | 52,224 (10.16) | 457,300 (89.02) | 1789 (0.35) | 2417 (0.47) | 513,730 |
October | 50,120 (11.07) | 398,440 (88.02) | 1763 (0.39) | 2371 (0.52) | 452,694 |
November | 50,060 (10.80) | 409,125 (88.30) | 1762 (0.38) | 2370 (0.51) | 463,317 |
December | 48,247 (11.20) | 378,412 (87.86) | 1727 (0.40) | 2292 (0.53) | 430,678 |
January | 30,707 (15.62) | 162,682 (82.75) | 1435 (0.73) | 1780 (0.91) | 196,604 |
^a Tweets from May 13, 2014 through July 24, 2014 were not retained due to a system outage.
This study focused on the social media attention given to site-specific cancers and differences by race and ethnicity. Specifically, Twitter timelines were examined for the frequency of occurrence of the following terms: "cancer", "breast cancer", "prostate cancer", "colorectal cancer", and "lung cancer." These terms were detected using methods adopted in previous studies examining discussions about specific health topics on Twitter [
First, we examined user activity by ethnicity during each month of the study period to understand seasonal peaks in term usage on Twitter (
Finally, we examined the differences in term usage by race and ethnicity within each month of the study period using
Statistical significance of pairwise differences in cancer term usage between African Americans and Caucasians during each month of the study period^a.
Month | "Cancer", P value | "Breast cancer", P value | "Prostate cancer", P value | "Colorectal cancer", P value | "Lung cancer", P value |
April | 0.00003 | 0.053025 | 0.014894 | 0.025347 | 0.080356 |
May | 0.008194 | 0.584394 | 0.122251 | 0.095581 | 0.510364 |
July | 0.013599 | <0.0001 | 0.006656 | 0.157299 | 0.890133 |
August | <0.0001 | 0.001168 | 0.157209 | 0.312076 | 0.165111 |
September | <0.0001 | 0.00007 | 0.017132 | 0.157299 | 0.013196 |
October | <0.0001 | <0.0001 | 0.242175 | 0.974206 | 0.000162 |
November | <0.0001 | <0.0001 | 0.027708 | 0.014306 | 0.000631 |
December | 0.000266 | 0.000001 | 0.027575 | 0.317311 | 0.000067 |
January | 0.241671 | 0.00945 | 0.1573 | 0.083265 | 0.91944 |
^a Each user’s total term usage was calculated by summing the frequency with which cancer terms appeared in their timeline.
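The per-user term usage described in the table footnote can be sketched as a simple count of term occurrences in each timeline. This is a hedged illustration only: the function names `count_terms` and `group_term_usage`, the case-insensitive whole-word matching, and the decision to let the broad term "cancer" also count occurrences inside phrases such as "breast cancer" are all assumptions, since the exact matching rules are not spelled out in the text.

```python
import re

# The five terms examined in the study
CANCER_TERMS = [
    "cancer",
    "breast cancer",
    "prostate cancer",
    "colorectal cancer",
    "lung cancer",
]


def count_terms(timeline, terms=CANCER_TERMS):
    """Count case-insensitive whole-word occurrences of each term in a timeline.

    Note: "cancer" is also counted when it appears inside a longer phrase
    such as "lung cancer" (an assumption about the matching rule).
    """
    text = timeline.lower()
    return {
        term: len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        for term in terms
    }


def group_term_usage(timelines):
    """Sum per-user term counts over an iterable of timelines for one group."""
    totals = {term: 0 for term in CANCER_TERMS}
    for timeline in timelines:
        for term, count in count_terms(timeline).items():
            totals[term] += count
    return totals


# Usage with a made-up timeline
example = "Breast cancer awareness month! My aunt beat lung cancer. Cancer sucks."
counts = count_terms(example)
```

Summing such counts by month and by classified group yields frequencies that could then be compared between groups.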
Monthly frequency of cancer terms by race/ethnicity (African American, left axis; Caucasian, right axis), and all Twitter users (right axis). Cancer terms are "Cancer" (top left), "Breast Cancer" (top right), "Prostate Cancer" (bottom left), and "Lung Cancer" (bottom right). It is important to note the sharp decreases seen following cancer awareness months (Prostate Cancer Awareness Month [PCAM, September], Breast Cancer Awareness Month [BCAM, October], and Lung Cancer Awareness Month [LCAM, November]), particularly among African Americans. Both groups are seen returning to lower frequencies following awareness months; however, this observation is more pronounced among African Americans, specifically following BCAM.
In this study, we observed interesting patterns of media attention given to specific cancer terms among unique Twitter users during a 9-month period in 2014. With a focus on cancer in general, and breast, prostate, and lung cancers specifically, which are the leading cancers among men and women in the United States, we observed some variation in the frequency of term usage during and after specific months known to be cancer awareness months, specifically September (Prostate Cancer Awareness Month [PCAM]), October (Breast Cancer Awareness Month [BCAM]), and November (Lung Cancer Awareness Month [LCAM]). Interestingly, colorectal cancer, the third most common cancer in both men and women [
Overall, we found that the frequencies of mentions of "cancer" among Caucasian and African American users were similar in terms of seasonal increases or decreases, although it appeared that African Americans maintained a higher percentage of normalized Tweet frequency of this broad term compared to the Caucasian group. In terms of the frequencies of mentions of "breast cancer", Caucasian users consistently had a higher percentage of use during all months of the study period. As expected, the frequency of use of this term was highest during BCAM, with a dramatic decrease in the months following, ultimately returning to levels lower than observed leading up to BCAM. This was true among both Caucasians and African Americans; however, there was a steeper decline in the mentions of "breast cancer" on Twitter among African Americans following BCAM.
This may be an area that can be the focus of future interventions aimed at increasing breast cancer awareness throughout the year, which could contribute to increased knowledge, improved within-guidelines screening rates, and increased preventive activities among groups with a disproportionate disease burden. For example, weekly Twitter chats hosted by the #bcsm ("breast cancer social media") community have been shown to raise awareness and decrease medical anxiety in patients [
During PCAM, there was a substantially higher frequency of discussion of prostate cancer among Caucasians compared to African Americans. In July and January, among Caucasian users, we observed the lowest levels of prostate cancer discussion. Conversely, among African Americans, we observed a steady decrease in prostate cancer discussion from August through January. Following PCAM, we observed a decline in the frequency of use of the term "prostate cancer" among both groups; however, these declines were slower than those observed with other cancer awareness campaigns. For example, when examining the frequency of use of the term "lung cancer", we observed a peak in November (LCAM) and then a dramatic decrease to levels lower than observed in the months prior to LCAM.
The months following cancer awareness month campaigns also presented interesting findings. While awareness month campaigns (eg, PCAM, BCAM, LCAM) could be considered successful in promoting discussion around various cancer topics, our findings, based on mentions of cancer terms via Twitter, suggest that these campaigns did not appear to sustain long-term interest and discussion. This phenomenon was particularly evident when examining breast cancer discussion frequency, but was also present in both lung cancer and prostate cancer social media activity. In fact, our findings showed that racial and ethnic groups often returned to a state of lower participation following awareness campaigns when compared with preceding months. Notably, this reduction in discussion frequency appeared to be more pronounced among minority groups. For example, African Americans reduced their participation by 73% in the month following BCAM when compared with the months preceding the campaign, whereas among Caucasians we observed a smaller reduction of 47%. Similarly for LCAM, we observed a 50% drop among African Americans compared with a 25% drop in the Caucasian cohort. Finally, in terms of discussion of colorectal cancer, we saw poor participation throughout the months of the study. This could be an indication of poor marketing or the taboo nature of the topic among some populations, as well as the lack of collected tweets during Colorectal Cancer Awareness Month (CRCAM) due to a technical issue with our data collection system.
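To make the reported drops concrete, the sketch below shows one plausible way such a percentage reduction could be computed: the frequency in the month after a campaign is compared with the mean frequency of the preceding months. The text does not state how the baseline was formed, so averaging the preceding months is an assumption, and the frequencies in the usage are made-up numbers chosen only to reproduce a 73% figure.

```python
def percent_reduction(baseline, follow_up):
    """Percentage drop from a baseline frequency to a follow-up frequency."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - follow_up) / baseline


def post_campaign_reduction(preceding_months, following_month):
    """Compare the month after a campaign with the mean of the preceding months.

    Using the mean of the preceding months as the baseline is an assumption.
    """
    baseline = sum(preceding_months) / len(preceding_months)
    return percent_reduction(baseline, following_month)


# Hypothetical normalized frequencies (not values from the study)
drop = post_campaign_reduction([90.0, 100.0, 110.0], 27.0)
```

With these illustrative inputs the baseline is 100 and the follow-up is 27, which corresponds to a 73% reduction.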
These drops in participation are likely related to media exposure and framing, two media effects that are mediated by structural determinants of health (eg, SES, race, ethnicity) [
With the growing popularity of social media and the previously unavailable personal insights it offers, social media mining presents new opportunities and methods applicable to epidemiologic research. Existing studies have examined the health impacts of social media, as shown in previous work [
This study has limitations that should be considered. First, our findings provide only a glimpse of all tweets, focused on cancer-specific topics among users without private Twitter accounts during the study period. Thus, the frequency of cancer-focused discussion via Twitter may well be underestimated. Relatedly, it is possible that tweets of interest were missed due to our choice of keywords or the use of alternate terms and/or spellings among users; consequently, the true frequencies of cancer-related tweets may be higher than those examined in our analysis. Nevertheless, our large-scale systematic examination of 779,653 unique Twitter users and the tweets they contributed during a 9-month period still provides a meaningful glimpse into users’ social media activity related to general or specific cancer topics. Due to the scope and length limit of this manuscript, we chose to report several representative case studies using the cancer terms most popular among Twitter users. As these case studies demonstrate, the proposed approach has the promise to be generically applicable for detecting, tracking, and comparing user interests regarding other cancer or disease topics. In addition, due to technical issues with our collection system, we were unable to retain collected tweets from the middle of May through the end of July 2014, which could have contributed to the very low frequency of use of the term "colorectal cancer". March, which is CRCAM, was also not included in our collection period and could likewise contribute to the low frequency of this term. Another possibility is that not all public tweets were delivered by the Twitter public API, but there is no way to determine the likelihood of this possibility.
Because the collection period excluded winter and post-holiday months (late January to March), we may have missed important patterns that would emerge from an analysis of this time period.
Finally, because several regional, temporal, and country-specific factors may influence the information shared or communicated via Twitter, we went to considerable lengths to limit our dataset to US-based users. Ideally, we would have liked to filter our dataset by a Twitter-provided variable distinguishing US-based users from non-US-based users. However, because Twitter does not provide this information, we adopted an alternate method for extracting US users by looking at the "Location" portion of a user’s profile. This is a free-text area provided by Twitter where users can input information such as New York or San Francisco, California. We excluded users with non-US locations in their profile. This method was chosen for the following two reasons: (1) only a small fraction of users provide geo-tagged tweets, and (2) it cannot be assumed that geo-tagged tweets sent from abroad do not belong to a US national. Geo-tagging of tweets varies in location for a given user and, therefore, does not provide an accurate understanding of the location a user defines as home.
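A profile-location filter of the kind described above might look like the following sketch. The matching rules here are assumptions made for illustration: the study does not specify how free-text locations were resolved, the list `US_PLACE_NAMES` is deliberately tiny and hypothetical, and a heuristic like this will misclassify some profiles (for example, a trailing two-letter code that coincides with a state abbreviation).

```python
import re

# Two-letter postal abbreviations for the 50 states and DC
US_STATE_ABBREVIATIONS = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID",
    "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS",
    "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV",
    "WI", "WY", "DC",
}

# Hypothetical, intentionally incomplete list of place names for illustration
US_PLACE_NAMES = {"new york", "california", "texas", "chicago", "san francisco"}


def looks_like_us_location(location):
    """Heuristically decide whether a free-text profile location is in the US."""
    if not location:
        return False
    text = location.strip()
    lowered = text.lower()
    if re.search(r"\b(usa|united states)\b", lowered):
        return True
    if any(name in lowered for name in US_PLACE_NAMES):
        return True
    # Match a trailing "City, ST" pattern such as "Austin, TX"
    match = re.search(r",\s*([A-Za-z]{2})\.?$", text)
    return bool(match and match.group(1).upper() in US_STATE_ABBREVIATIONS)


# Usage with made-up profile locations
profiles = {"u1": "San Francisco, California", "u2": "London, UK", "u3": "Austin, TX"}
us_users = {uid for uid, loc in profiles.items() if looks_like_us_location(loc)}
```

A production filter would more likely rely on a full gazetteer or a geocoding service than on a hand-written list.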
This study demonstrated that social media can serve as a very powerful and important tool in implementation and dissemination of critical cancer education and awareness messages to the community in real-time. These findings could help improve future social media studies, identify trends within groups of users, and target group-specific health education literature by learning users’ characteristics through language differences. This study also introduced and tested a new methodology for identifying race and ethnicity among users of social media, which presents a unique opportunity to study risk profiles, risk factors and behaviors for several conditions by race and ethnicity and has significant implications in reducing disparities through targeted intervention and dissemination of evidence-based information tailored to specific racial and ethnic groups.
API: Application Programming Interface
BCAM: Breast Cancer Awareness Month
CRCAM: Colorectal Cancer Awareness Month
ID: identification
GPS: Global Positioning System
LCAM: Lung Cancer Awareness Month
LDA: Latent Dirichlet allocation
PCAM: Prostate Cancer Awareness Month
SES: Socioeconomic status
This study was supported by a grant from the National Cancer Institute at the National Institutes of Health (R01 CA170508). This work was also partially supported by the Cancer Center Support Grant (P30 CA072720) from the National Cancer Institute.
The contents of this manuscript are solely the responsibility of the authors and do not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.
None declared.