Associations of Topics of Discussion on Twitter With Survey Measures of Attitudes, Knowledge, and Behaviors Related to Zika: Probabilistic Study in the United States

Background: Recent outbreaks of Zika virus around the world led to increased discussion of the issue on social media platforms such as Twitter. These discussions may provide useful information about the attitudes, knowledge, and behaviors of the population regarding issues that are important for public policy.
Objective: We sought to identify associations between topics of discussion on Twitter and survey measures of Zika-related attitudes, knowledge, and behaviors, based not solely on the volume of such discussions but on an analysis of the content of conversations using probabilistic techniques.
Methods: Using probabilistic topic modeling with US county and week as the unit of analysis, we analyzed the content of Twitter communications to identify topics related to the reported attitudes, knowledge, and behaviors captured in a nationally representative survey (N=33,193) of the US adult population over 33 weeks.
Results: Our analyses revealed that topics related to “congress funding for Zika,” “microcephaly,” “Zika-related travel discussions,” “insect repellent,” “blood transfusion technology,” and “Zika in Miami” were associated with our survey measures of attitudes, knowledge, and behaviors observed over the period of the study.
Conclusions: Our results demonstrated that it is possible to uncover topics of discussion from Twitter communications that are associated with the Zika-related attitudes, knowledge, and behaviors of populations over time. Social media data can be used as a complementary source of information alongside traditional data sources to gauge the patterns of attitudes, knowledge, and behaviors in a population.

1. Coverage: To ensure representativeness, all working cell phone and landline telephone exchanges in the fifty states and the District of Columbia are covered in SSRS's overlapping, dual-frame (cell phone and landline) design. According to the most recent National Health Interview Survey (NHIS), nearly 97% of U.S. adults are reachable by either a cell phone or a landline 1 .
2. Covering the cell-phone-only population: Currently, nearly half of U.S. adults live in households without a landline connection. To ensure adequate representation of the cell-phone-only (CPO) population, the following two steps are taken: (1) The majority of interviews are completed with respondents reached through their cell phones. On the SSRS Omnibus, the current share of respondents interviewed on their cell phones is 60%. Of these, about 60% are CPO, so about 36% of all respondents are CPO (0.60 * 0.60 = 0.36). On custom studies, the share of respondents reached via cell phone is typically 70% (approximately 40% CPO). (2) Weighting by phone usage: Phone status (CPO, landline-only, or dual-user) is included in the post-stratification weighting adjustments, based on the most recent NHIS estimates. Currently, this means that in a weighted national sample, about 50% of the sample is CPO.
3. Spanish interviewing: The Hispanic population is the most rapidly growing ethnic group in the U.S. According to the Census, about one third of the Hispanic population is estimated to speak English less than very well, including some defined as linguistically isolated. To ensure that non-English-speaking Hispanics are represented in the sample, about 3%-3.5% of interviews conducted in national surveys are completed in Spanish.
4. Probability-based sampling: To ensure unbiased sampling, both the landline and cell phone samples are generated randomly, so that phone numbers have an equal and known probability of selection (EPSEM). Furthermore, telephone exchanges are stratified by geography, to improve geographic representativeness, and pulled in replicates of 100, to reduce sample variance. When a household is reached by dialing a landline number, a single respondent is selected through the following process: First, interviewers ask to speak with the youngest adult male/female at home. The term "male" appears first for a random half of the cases and "female" for the other randomly selected half. If there are no men/women at home during that time, interviewers ask to speak with the youngest female/male at home.
5. Adjustment for probability of selection: As part of the weighting process, each case is assigned a sample weight (or base weight) equal to the inverse of the respondent's probability of selection. Following Buskirk and Best (2012) 2 , the probability of selection is based on respondents' probability of being selected into the landline sample and their probability of being selected into the cell phone sample:

P_select = P_cell + P_LL - P_cell * P_LL
where P_select is the probability of selection, P_cell is the probability of selection into the cell phone frame, and P_LL is the probability of selection into the landline frame. These, in turn, are equal to:
P_cell = F_cell * N_cell
P_LL = F_LL * N_LL / Adults_HH
where F_cell is the number of cell phone numbers selected into the study's sample divided by the total possible cell phone numbers available for sampling, N_cell is the number of cell phones by which a respondent could personally be reached, F_LL is the number of landline phone numbers selected into the study's sample divided by the total possible landline numbers available for sampling, N_LL is the number of landlines by which a respondent's household could be reached, and Adults_HH is the number of adults living in the respondent's household who could be selected to be interviewed.
In addition, the data are weighted to reflect the distribution of the population along quintiles of population density. All counties in the U.S. are ranked from least dense to most dense and assigned to ranked quintiles of about equal size, based on the most recent Decennial Census. Weighting the sample to population density improves representativeness of the weighted sample by urban, suburban and rural status.
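The selection-probability formula and the resulting base weight can be illustrated with a minimal Python sketch. The function names and all numeric inputs below are hypothetical and chosen only for illustration; they are not taken from the survey. The sketch also assumes that a respondent with no landline has a landline selection probability of zero.

```python
def selection_probability(f_cell, n_cell, f_ll, n_ll, adults_hh):
    """Probability of selection under an overlapping dual-frame design.

    f_cell, f_ll: sampling fractions of the cell and landline frames
    n_cell: number of cell phones reaching the respondent personally
    n_ll: number of landlines reaching the respondent's household
    adults_hh: number of eligible adults in the household
    """
    p_cell = f_cell * n_cell
    # Assumption: no landline means no chance of landline selection.
    p_ll = f_ll * n_ll / adults_hh if n_ll > 0 else 0.0
    # Union of the two selection events (inclusion-exclusion).
    return p_cell + p_ll - p_cell * p_ll


def base_weight(f_cell, n_cell, f_ll, n_ll, adults_hh):
    """Base weight: inverse of the probability of selection."""
    return 1.0 / selection_probability(f_cell, n_cell, f_ll, n_ll, adults_hh)


# Hypothetical dual-user respondent: one cell phone, one landline, two adults.
p_dual = selection_probability(0.0001, 1, 0.0002, 1, 2)
# Hypothetical cell-phone-only respondent: one cell phone, no landline.
p_cpo = selection_probability(0.0001, 1, 0.0002, 0, 1)
print(p_dual, p_cpo)
print(base_weight(0.0001, 1, 0.0002, 0, 1))
```

Because dual users can be reached through either frame, they have a higher selection probability than otherwise similar cell-phone-only respondents and therefore receive a smaller base weight.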
Post-stratification also includes the phone status variable, mentioned above, based on the most recent NHIS estimate.
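Weighting a sample to several margins at once, such as phone status and population density, is commonly done by iterative proportional fitting (raking). The following is a minimal Python sketch of that technique, not the vendor's implementation. The function names, the six respondents, their base weights, and the target shares are all hypothetical, and two density groups stand in for the five quintiles to keep the example short.

```python
def weighted_shares(weights, labels):
    """Weighted share of each category under the given weights."""
    total = sum(weights)
    shares = {}
    for w, lab in zip(weights, labels):
        shares[lab] = shares.get(lab, 0.0) + w / total
    return shares


def rake(base_weights, margins, targets, max_iter=100, tol=1e-9):
    """Adjust weights until every margin matches its target shares.

    margins: variable name -> list of category labels, one per respondent
    targets: variable name -> {category: population share}
    Assumes every target category is present in the sample.
    """
    w = list(base_weights)
    for _ in range(max_iter):
        max_gap = 0.0
        for var, labels in margins.items():
            shares = weighted_shares(w, labels)
            for lab, share in shares.items():
                max_gap = max(max_gap, abs(share - targets[var][lab]))
            # Scale each respondent by target share / current share.
            w = [wi * targets[var][lab] / shares[lab]
                 for wi, lab in zip(w, labels)]
        if max_gap < tol:  # all margins already matched on this pass
            break
    return w


# Hypothetical sample and targets (not survey values).
margins = {
    "phone": ["cpo", "cpo", "dual", "dual", "ll", "ll"],
    "density": ["low", "high", "low", "high", "low", "high"],
}
targets = {
    "phone": {"cpo": 0.5, "dual": 0.4, "ll": 0.1},
    "density": {"low": 0.4, "high": 0.6},
}
raked = rake([1.0, 2.0, 1.0, 1.5, 1.0, 0.5], margins, targets)
print(weighted_shares(raked, margins["phone"]))
print(weighted_shares(raked, margins["density"]))
```

Each pass rescales the weights so that one margin matches its target, which slightly disturbs the others; repeating the passes drives all margins toward their targets together.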
Weighting is done by iterative proportional fitting, or 'raking', a method in which the data are repeatedly weighted to the parameters until the variance between the weighted sample and the population parameters is zero, or near-zero. 3
7. Response rate calculation: Response rate is calculated using AAPOR's response rate 3 (RR3) 4 . RR3 is calculated as the number of completed interviews (I) divided by the estimated number of eligible respondents (E). The estimated number of eligible respondents is calculated as:
E = (I + P) + (R + NC + O) + e(UHH + UO)
where P is partial interviews, R is eligible refusals, NC is eligible non-contacts (where a respondent was identified but no interview completed), O is other eligible cases not completed, UHH is cases where a household was reached but the eligibility of respondents was not ascertained, and UO is other unknown cases where it is unclear whether the number is attached to a household (or cell phone respondent) and whether that respondent is eligible or not. e is an estimator for the percent of unknown cases estimated to be eligible. In dual-frame studies, two different e estimators are used for landline and cell phone numbers: e1 = Estimated Percentage of Screener Eligibility (i.e., the proportion of households known to be eligible at the household level that are estimated to have an eligible respondent residing there); and e2 = Estimated Percentage of Household Eligibility (i.e., the proportion of cases that are of unknown eligibility at the household level and it is unknown if an eligible respondent resides there).
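The RR3 arithmetic can be sketched in a few lines of Python. The function names and the disposition counts below are hypothetical, chosen only to make the arithmetic easy to follow, and the sketch simplifies by applying a single e to all unknown cases rather than the separate e1 and e2 estimators used in dual-frame studies.

```python
def estimated_eligible(i, p, r, nc, o, uhh, uo, e):
    """E = (I + P) + (R + NC + O) + e(UHH + UO)."""
    return (i + p) + (r + nc + o) + e * (uhh + uo)


def rr3(i, p, r, nc, o, uhh, uo, e):
    """RR3: completed interviews divided by estimated eligible cases."""
    return i / estimated_eligible(i, p, r, nc, o, uhh, uo, e)


# Hypothetical final dispositions (not survey values).
counts = dict(i=1000, p=50, r=2000, nc=1500, o=100, uhh=3000, uo=2000, e=0.4)
eligible = estimated_eligible(**counts)
rate = rr3(**counts)
print(f"E = {eligible:.0f}, RR3 = {rate:.1%}")
```

With these inputs, E = 1050 + 3600 + 0.4 * 5000 = 6650. A smaller e shrinks the denominator and so raises the reported rate, which is why the choice of e estimator matters.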