This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.
Selection bias and unmeasured confounding are fundamental problems in epidemiology that threaten the internal and external validity of studies. These phenomena are particularly dangerous in internet-based public health surveillance, where traditional mitigation and adjustment methods are inapplicable, unavailable, or out of date. Recent theoretical advances in causal modeling can mitigate these threats, but these innovations have not been widely deployed in the epidemiological community.
The purpose of our paper is to demonstrate the practical utility of causal modeling to both detect unmeasured confounding and selection bias and guide model selection to minimize bias. We implemented this approach in an applied epidemiological study of the COVID-19 cumulative infection rate in the New York City (NYC) spring 2020 epidemic.
We collected primary data from Qualtrics surveys of Amazon Mechanical Turk (MTurk) crowd workers residing in New Jersey and New York State across 2 sampling periods: April 11-14 and May 8-11, 2020. The surveys queried the subjects on household health status and demographic characteristics. We constructed a set of possible causal models of household infection and survey selection mechanisms and ranked them by compatibility with the collected survey data. The most compatible causal model was then used to estimate the cumulative infection rate in each survey period.
There were 527 and 513 responses collected for the 2 periods, respectively. Response demographics were highly skewed toward a younger age in both survey periods. Despite the extremely strong relationship between age and COVID-19 symptoms, we recovered minimally biased estimates of the cumulative infection rate using only primary data and the most compatible causal model, with a relative bias of +3.8% and –1.9% from the reported cumulative infection rate for the first and second survey periods, respectively.
We successfully recovered accurate estimates of the cumulative infection rate from an internet-based crowdsourced sample despite considerable selection bias and unmeasured confounding in the primary data. This implementation demonstrates how simple applications of structural causal modeling can be effectively used to determine falsifiable model conditions, detect selection bias and confounding factors, and minimize estimate bias through model selection in a novel epidemiological context. As the disease and social dynamics of COVID-19 continue to evolve, public health surveillance protocols must continue to adapt; the emergence of Omicron variants and the shift to at-home testing are recent challenges. Rigorous and transparent methods to develop, deploy, and diagnose adapted surveillance protocols will be critical to their success.
Accurate estimation of disease parameters is a fundamental problem in epidemiology. The internal and external validity of epidemiological studies is threatened by unmeasured confounding and selection bias [
This gap is particularly acute in internet-based public health and surveillance. Internet-based sampling in general suffers from unknown selection mechanisms on largely unobservable and dynamic populations, making traditional adjustment methods that require external data about the target population vulnerable to model violation. Previous studies that augmented traditional surveillance mechanisms with internet-based data have proved highly successful at imputing missing or time-delayed information [
In this work, we seek to address this gap between recent theoretical developments and the current practice of internet-based public health surveillance. We present structural causal modeling as a guide to epidemiological judgement through encoding epidemiological knowledge into models that can be tested using sample data, and we describe a general graphical method for deriving falsifiable model conditions. Importantly, this approach can be deployed using only the sampled data, whereas traditional methods for detecting confounding and selection bias require some information about the unsampled or missing data from units with partial data or external data, such as census or health care system medical records [
Structural causal models permit the formal encoding of causal mechanisms and have been extended to formally analyze studies in the presence of selection bias and unmeasured confounding. The mathematical tool necessary for this work is d-separation on directed acyclic graphs (DAGs). Here, we briefly review d-separation notation and concepts. We can represent a probability distribution as a DAG where nodes represent variables X, Y, and Z and edges represent functional dependencies between variables. The formalism of d-separation is a mapping between the DAG of a probability distribution and the conditional independencies of that distribution; this is stated formally in the conditional independence statements of
The d-separation path element rules determine whether a path between X and Y is blocked. A path between X and Y can be blocked in 2 different ways by conditioning on a set of variables W. If the path contains a fork or a chain element, then it is blocked if the middle variable (Z in
Conditional independence statements and d-separation rules.
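The fork, chain, and collider rules can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it tests d-separation with the standard moralized ancestral graph criterion, and the function names and the toy graphs (including the hypothetical selection graph A → S ← C) are assumptions made for illustration.

```python
from collections import deque


def ancestral_set(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(edges, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG defined by `edges`.

    `edges` is an iterable of (tail, head) pairs. The test moralizes the
    ancestral graph of {x, y} and `given`, removes the conditioning nodes,
    and checks whether x can still reach y.
    """
    parents = {}
    for tail, head in edges:
        parents.setdefault(head, set()).add(tail)

    keep = ancestral_set(parents, {x, y, *given})
    adjacent = {node: set() for node in keep}
    for child in keep:
        pars = sorted(parents.get(child, ()))
        for i, p in enumerate(pars):
            adjacent[p].add(child)          # drop edge direction
            adjacent[child].add(p)
            for q in pars[i + 1:]:          # "marry" parents of a common child
                adjacent[p].add(q)
                adjacent[q].add(p)

    blocked, seen, queue = set(given), {x}, deque([x])
    while queue:
        node = queue.popleft()
        if node == y:
            return False
        for nxt in adjacent[node] - blocked - seen:
            seen.add(nxt)
            queue.append(nxt)
    return True


chain = [("X", "Z"), ("Z", "Y")]
fork = [("Z", "X"), ("Z", "Y")]
collider = [("X", "Z"), ("Y", "Z")]
# Hypothetical selection graph: age A and infection C both influence sampling S.
selection = [("A", "S"), ("C", "S")]

print(d_separated(chain, "X", "Y", ["Z"]))       # chain blocked by conditioning on Z
print(d_separated(collider, "X", "Y", ["Z"]))    # collider opened by conditioning on Z
print(d_separated(selection, "A", "C", ["S"]))   # conditioning on S opens the path
```

In the hypothetical selection graph, A and C are independent in the population but become dependent once the analysis is restricted to sampled units, which is the pattern a conditional independence test on sample data can detect.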
The only additional conceptual step necessary for analyzing selection bias is to add the sampling mechanism to the initial causal graph G to create the augmented causal graph G_{s}. The encoded sampling mechanism determines the value of the sampling indicator variable S, where S=1 if the unit was sampled and S=0 otherwise. Additionally, any mechanism that filters data after primary collection induces an additional selection bias and must also be encoded in G_{s}. The augmented graph G_{s} obeys d-separation rules, but for clarity, the sampling indicator S node is depicted in G_{s} with a double ring to emphasize that S=1 for all samples by definition; all d-separation statements in G_{s} must be evaluated conditional on S=1.
For any graph G_{s}, the s-recoverability condition states that for any variables Y and X in G_{s}, the distribution of the sample P(Y|X,S=1) is identical to the distribution of the target population P(Y|X) if and only if Y and S are d-separated by X [
We demonstrate this principle in
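A small numerical illustration of the s-recoverability condition may help. The sketch below uses exact enumeration over binary variables rather than simulation. All probabilities are hypothetical values chosen for illustration, and the two scenarios correspond to the assumed graphs X → Y, X → S (Y and S d-separated by X) and X → Y → S (not d-separated).

```python
def p_y1_given_x(p_x, p_y1_by_x, p_s1_by_xy, selected_only):
    """Return {x: P(Y=1 | X=x)} or, if selected_only, {x: P(Y=1 | X=x, S=1)}."""
    result = {}
    for x in (0, 1):
        numerator = denominator = 0.0
        for y in (0, 1):
            p_y = p_y1_by_x[x] if y == 1 else 1.0 - p_y1_by_x[x]
            weight = p_x[x] * p_y
            if selected_only:
                weight *= p_s1_by_xy[(x, y)]   # keep only the sampled units
            denominator += weight
            numerator += weight * y
        result[x] = numerator / denominator
    return result


# Hypothetical population: X is an age group, Y is household infection status.
p_x = {0: 0.5, 1: 0.5}
p_y1_by_x = {0: 0.10, 1: 0.30}

# Scenario 1: sampling depends on X only, so Y and S are d-separated by X.
s_depends_on_x = {(x, y): (0.8 if x == 0 else 0.2) for x in (0, 1) for y in (0, 1)}
# Scenario 2: sampling depends on Y, so Y and S are not d-separated by X.
s_depends_on_y = {(x, y): (0.9 if y == 1 else 0.3) for x in (0, 1) for y in (0, 1)}

population = p_y1_given_x(p_x, p_y1_by_x, s_depends_on_x, selected_only=False)
recoverable = p_y1_given_x(p_x, p_y1_by_x, s_depends_on_x, selected_only=True)
biased = p_y1_given_x(p_x, p_y1_by_x, s_depends_on_y, selected_only=True)

print(population, recoverable, biased)
```

In the first scenario, the sample is heavily skewed toward X=0, yet the conditional distribution of Y given X is unchanged. In the second scenario, the conditional distribution in the sample differs from the population.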
In this work, we focus on graphical modeling as a formalism to aid epidemiological judgment. Epidemiological knowledge tightly constrains the set of possible explanatory scenarios for a given context; the difficulty is choosing which of these scenarios is most plausible. Statistically testing the independencies implied by the causal graph encoding is a direct method to select between scenarios. We now demonstrate this approach in an applied problem of estimating the cumulative infection rate CI_{P} of SARS-CoV-2 in the COVID-19 New York City (NYC) spring 2020 epidemic through a prospectively collected crowdsourced internet survey.
Example causal graphs with selection bias and unmeasured confounding.
Initial crowdsourced epidemiology efforts in the COVID-19 pandemic focused on surveys collected from a variety of internet sources and target populations. Instead of recruiting via major internet platforms such as Facebook and Google, we recruited our survey participants from the Amazon Mechanical Turk (MTurk) crowdworking platform. MTurk is an internet-based labor market where a research group or business (
We chose the MTurk population for 2 reasons. First, MTurk has been successfully used by many academic groups, including our own, across a broad array of disciplines [
This research was determined not to be human subject research, as the survey did not collect any personally identifying information or set of information that could be reidentifying, in compliance with MTurk’s policy prohibiting any transmission of workers’ personally identifiable information to requesters and Stanford University research policy GUIH12. Research was carried out in a way that followed ethical guidelines set by the Declaration of Helsinki. All MTurk tasks are carefully reviewed before being posted, and MTurk workers are able to accept but then refuse to complete any task or any part of a task at any point in time. Furthermore, the survey task included an introduction page that informed the respondents of the purpose and content of this survey and for what purposes their response data would be used.
We collected primary data from the MTurk population listed as currently residing in New Jersey or New York. Data for surveys s_{1} and s_{2} were collected in 2 successive periods: April 11-14 and May 8-11, 2020. During this period, both New York and New Jersey were under a statewide stay-at-home order that greatly restricted travel and prohibited public gatherings [
Before answering any questions, the survey asked each participant (respondent) to privately list their 5 closest peer relationships (relations) with whom they typically socialize in person. There was large variation in the number of contacts for each person during the mandatory stay-at-home orders. Instead of asking respondents about their total number of contacts, we asked about their closest peer relationships because these are the set of persons whose current health status and household characteristics would most likely be known to the respondents. Furthermore, we only asked about 5 relations to minimize the time to complete the survey. The survey queried each respondent about the demographic, employment characteristics, and possible COVID-19 symptoms of both themselves and their relations. The survey also queried each respondent about both their household and their relations’ households, including household size and whether any member had a confirmed SARS-CoV-2 infection since March 15, 2020. These questions were chosen to permit comparison of respondents and relations to known census demographic data and to estimate the cumulative number of infected households and individuals within a specified geographic area. The survey material is included in
We defined a household-based cumulative infection rate estimator
where C_{p} is an indicator variable for the confirmed SARS-CoV-2 infection status of person p in a population P of size N_{P}. We defined the household secondary attack rate (SAR_{h}) as the ratio of secondary household cases to the total population of exposed household members. We can write SAR_{h} formally as:
where H is the set of unique households in population P, the indicator variable C_{h}=1 if there is at least 1 SARS-CoV-2 infection in household h, and N_{h} is the size of household h in H. Let the total population be defined as the sum of the household members, N_{P}=Σ_{h∈H} N_{h}. We can then rewrite CI_{P} in terms of households as:
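One set of expressions consistent with the verbal definitions in this section is given below. This is a reconstruction under stated assumptions, not necessarily the authors' exact notation: it assumes exactly 1 primary case in each infected household, a pooled SAR_{h} across infected households, and an estimator that applies the same household form to the sampled households H_{S}.

```latex
\begin{align}
CI_P &= \frac{1}{N_P} \sum_{p \in P} C_p \\
SAR_h &= \frac{\sum_{h \in H} C_h \left( \sum_{p \in h} C_p - 1 \right)}
              {\sum_{h \in H} C_h \left( N_h - 1 \right)} \\
CI_P &= \frac{\sum_{h \in H} C_h \left[ 1 + SAR_h \left( N_h - 1 \right) \right]}
             {\sum_{h \in H} N_h} \\
\widehat{CI}_P &= \frac{\sum_{h \in H_S} C_h \left[ 1 + SAR_h \left( N_h - 1 \right) \right]}
                       {\sum_{h \in H_S} N_h}
\end{align}
```

Under these assumptions, the third line follows from the first two: each infected household contributes its 1 primary case plus SAR_{h}(N_{h} − 1) expected secondary cases, and the denominator is N_{P} written as a sum over households.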
We then defined the estimator
with unique households H_{S}. The estimator
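To make the household-based calculation concrete, the sketch below computes a cumulative infection rate from household infection indicators, household sizes, and an assumed household secondary attack rate. It assumes 1 primary case per infected household plus SAR_{h} times the remaining household members; the function name and the toy households are hypothetical and are not taken from the survey data.

```python
def household_ci_estimate(households, sar):
    """Estimate the cumulative infection rate from household-level data.

    households: iterable of (c_h, n_h) pairs, where c_h is 1 if the household
        has at least 1 confirmed infection and n_h is the household size.
    sar: assumed household secondary attack rate, between 0 and 1.
    """
    if not 0.0 <= sar <= 1.0:
        raise ValueError("sar must be between 0 and 1")
    expected_infected = 0.0
    population = 0
    for c_h, n_h in households:
        if n_h < 1:
            raise ValueError("household size must be at least 1")
        population += n_h
        if c_h:
            # 1 primary case plus expected secondary cases among the others.
            expected_infected += 1 + sar * (n_h - 1)
    if population == 0:
        raise ValueError("no households supplied")
    return expected_infected / population


# Toy sample of 4 households (hypothetical values).
sample = [(1, 4), (0, 2), (1, 1), (0, 3)]
for sar in (0.0, 0.25, 0.5, 1.0):
    print(sar, household_ci_estimate(sample, sar))
```

Sweeping the assumed secondary attack rate, as in the loop above, shows how sensitive the estimate is to that parameter when it cannot be measured directly.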
The survey data were modeled in a causal graph encoding the variables and assumptions, as depicted in
We defined 4 possible causal models depicted in
The first 3 causal models G_{Res,s}, G_{Rel,s}, and G_{All,s} all shared the same causal graph, as represented in the first graph in
Alternative causal graphs.
The second graph in
These 4 models can be compared and ranked empirically by statistical tests of the conditional independences implied by their d-separation conditions. Each causal model in
We evaluated the causal models by statistically testing the implied independence of A and C_{h} in models G_{Res,s}, G_{Rel,s}, and G_{All,s} and of A_{Rel} and C_{h,Rel} in model G_{Rel*,s} using the Fisher exact test for independence. For each model, we filtered the data as specified in the model, median-split the age variable, and performed Fisher exact tests on the 2×2 table of the age group (A_{0}, A_{1}) by household infection status (C_{h,0}, C_{h,1}), with point test statistics shown in
Model selection by conditional independence tests.
Model and survey  Odds ratio (95% CI)  P value  Sample size, N
Model G_{Res,s}
s_{1}  0.802 (0.765-0.922)  .77  527
s_{2}  0.271 (0.202-0.438)  .01  513
Model G_{Rel,s}
s_{1}  0.955 (0.885-1.026)  .90  2635
s_{2}  0.634 (0.581-0.694)  .01  2565
Model G_{All,s}
s_{1}  0.919 (0.824-1.007)  .73  3162
s_{2}  0.572 (0.525-0.614)  <.001  3078
Model G_{Rel*,s}
s_{1}  0.977 (0.855-1.130)  .99  1340
s_{2}  1.472 (1.216-1.823)  .28  1104
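The independence test described above (median split of age, then a Fisher exact test on the resulting 2×2 table) can be sketched in plain Python. This is an illustrative implementation, not the authors' code: the two-sided P value sums the probabilities of all tables that are no more likely than the observed one, and the ages and infection indicators are hypothetical toy data.

```python
from math import comb
from statistics import median


def median_split_table(ages, infected):
    """Build a 2x2 table: rows = age at or below / above the median,
    columns = household not infected / infected."""
    cut = median(ages)
    table = [[0, 0], [0, 0]]
    for age, c_h in zip(ages, infected):
        table[int(age > cut)][int(bool(c_h))] += 1
    return table


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact P value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k.
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in map(prob, support) if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical toy data: 8 persons with age and household infection indicator.
ages = [22, 25, 31, 38, 45, 52, 60, 67]
infected = [0, 0, 0, 1, 1, 0, 1, 1]

table = median_split_table(ages, infected)
print(table, fisher_exact_two_sided(table))
```

A model whose graph implies independence of age and household infection status is rejected when this P value falls below the chosen significance level.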
In the general case, there will be no ground truth to compare the model against. However, in this study, we assessed the model performance directly. Due to the particular conditions of the NYC epidemic, the performance of the
To demonstrate the practical epidemiological utility of this type of internet-based sampling, we calculated
In total, 527 and 513 responses met the inclusion criteria from surveys s_{1} and s_{2}, respectively. Demographic information is summarized as frequencies for each collection period, with Pearson chi-squared tests performed to compare raw counts to demographic distributions in the 2018 ACS update of the US Census Bureau (
Significant age skews were apparent across all survey periods, with both respondents and relations skewing significantly younger than the known population distribution, while sex distributions were not significantly different than the ACS estimate for New York and New Jersey. This large age skew made the sample highly unrepresentative of the target population, but with a correctly specified causal model, it was possible to obtain an unbiased estimate of the cumulative infection rate CI_{P}, as we next demonstrate through model diagnosis, selection, and evaluation.
Demographic characteristics of survey samples.
Characteristics  Respondents s_{1} (N=527), n (%)  Respondents s_{2} (N=513), n (%)  Relations s_{1} (N=2635), n (%)  Relations s_{2} (N=2565), n (%)  Combined s_{1} (N=3162), n (%)  Combined s_{2} (N=3078), n (%)  ACS^{a} (%)
Age (years)
<19  2 (0.4)  5 (1.0)  264 (10.0)  356 (13.9)  266 (8.4)  361 (11.7)  22.7
19-29  184 (34.9)  192 (37.4)  620 (23.5)  575 (22.4)  804 (25.4)  767 (24.9)  15.2
30-39  170 (32.3)  163 (31.8)  531 (20.2)  525 (20.5)  701 (22.2)  688 (22.4)  13.2
40-49  89 (16.9)  77 (15)  400 (15.2)  344 (13.4)  489 (15.5)  421 (13.7)  12.9
50-59  56 (10.6)  46 (9.0)  367 (13.9)  346 (13.5)  423 (13.4)  392 (12.7)  14.1
60-69  19 (3.6)  24 (4.7)  284 (10.8)  294 (11.5)  303 (9.6)  318 (10.3)  11.2
≥70  7 (1.3)  6 (1.2)  169 (6.4)  125 (4.9)  176 (5.6)  131 (4.3)  10.6
Chi-square  475  480  457  358  793  672  N/A^{b}
P value  <.001  <.001  <.001  <.001  <.001  <.001  N/A
Sex^{c}
N/A  1 (0.2)  3 (0.6)  20 (0.8)  77 (3.0)  21 (0.7)  80 (2.6)  N/A
Female  267 (50.7)  286 (55.8)  1353 (51.3)  1303 (50.8)  1620 (51.2)  1589 (51.6)  51.4
Male  259 (49.1)  224 (43.7)  1262 (47.9)  1185 (46.2)  1521 (48.1)  1409 (45.8)  48.6
Chi-square  0.08  4.50  0.13  0.97  0.046  3.14  N/A
P value  .77  .03  .71  .32  .83  .07  N/A
Employment characteristics
Essential worker  153 (30.0)  140 (27.9)  N/A  N/A  N/A  N/A  N/A
Food service  31 (5.9)  27 (5.3)  N/A  N/A  N/A  N/A  N/A
Health care  66 (12.5)  69 (13.5)  N/A  N/A  N/A  N/A  N/A
Work from home  152 (28.8)  183 (35.7)  N/A  N/A  N/A  N/A  N/A
Not working  71 (13.5)  86 (16.8)  N/A  N/A  N/A  N/A  N/A
Other  231 (43.8)  173 (33.7)  N/A  N/A  N/A  N/A  N/A
^{a}ACS: American Community Survey of the US Census Bureau.
^{b}N/A: not applicable.
^{c}Sex inferred from the reported gender identity for comparison with the ACS.
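The comparison of sample age counts with the ACS distribution can be reproduced approximately from the tabulated values. The sketch below is an assumption about how such a goodness-of-fit statistic may be computed, not the authors' code. It uses the survey s_{1} respondent age counts and ACS percentages from the table, rescales the ACS percentages to sum to 100%, and assumes 6 degrees of freedom (7 age categories minus 1). The closed-form survival function used here is valid only for even degrees of freedom.

```python
from math import exp, factorial


def chi_square_gof(observed, expected_pct):
    """Pearson chi-square statistic of observed counts vs. reference percentages."""
    total_obs = sum(observed)
    total_pct = sum(expected_pct)   # rescale in case rounding leaves a gap from 100
    expected = [total_obs * pct / total_pct for pct in expected_pct]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_even_df(stat, df):
    """Upper-tail probability of a chi-square variable with even df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = stat / 2
    return exp(-half) * sum(half ** i / factorial(i) for i in range(df // 2))


# Respondent age counts for survey s_1 and ACS age percentages (from the table).
respondents_s1 = [2, 184, 170, 89, 56, 19, 7]
acs_pct = [22.7, 15.2, 13.2, 12.9, 14.1, 11.2, 10.6]

stat = chi_square_gof(respondents_s1, acs_pct)
p_value = chi2_sf_even_df(stat, df=6)
print(round(stat, 1), p_value)
```

The resulting statistic is close to the tabulated value for this column; small differences are expected because the inputs are rounded percentages.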
In the first survey period, no model could be rejected at nominal α=0.05, but in the second survey period, only model G_{Rel*,s} could not be rejected. The most likely explanation for why the 4 models were more distinguished in the second period is that the cumulative infection rate increased through the course of the epidemic, giving greater power to detect a statistical dependence in a model, even though the total sample size was similar between periods for each model.
The model G_{Rel*,s} recovered accurate cumulative infection rate
Relative bias of cumulative infection estimates by geographic area, model, and survey period. NJ/NY: New Jersey and New York; NYC: New York City; NYC CBSA: New York City Combined Statistical Area contained within New Jersey and New York.
The model G_{Rel*,s} was used to calculate the cumulative infection rate estimator
Estimated cumulative infection rate by geographic area, household secondary attack rate (SAR_{h}), and survey period. Dashed lines are the reported CI_{P} for the survey period, color-matched to the geographic area. NJ/NY: New Jersey and New York; NYC: New York City; NYC CBSA: New York City Combined Statistical Area contained within New Jersey and New York.
The number of households with at least 1 confirmed SARS-CoV-2 infection increased approximately 2-fold, and the number of households with at least 1 member recently hospitalized for influenza-like illness (hospital ILI) increased approximately 1.5-fold for both respondents and relations across the 2 survey periods, as shown in
The correlation between health status indicators and symptoms remained largely the same across both periods (
Respondent/relation household health status and symptoms by survey period.
Survey and person  Household status: SARS-CoV-2^{a}, n (%)  Household status: hospital ILI^{b,c}, n (%)  Symptom: fever, n (%)  Symptom: aches, n (%)  Symptom: anosmia, n (%)  Symptom: allergy, n (%)
Survey s_{1}
Respondents (N=1040)  25 (2.4)  24 (2.3)  32 (3.1)  112 (10.8)  45 (4.3)  89 (8.6)
Relations (N=5200)  155 (3.0)  109 (2.1)  211 (4.1)  325 (6.3)  167 (3.2)  226 (4.4)
Survey s_{2}
Respondents (N=1040)  53 (5.2)  33 (3.2)  23 (2.3)  145 (14.3)  46 (4.5)  104 (10.2)
Relations (N=5200)  295 (5.8)  169 (3.3)  154 (3.0)  337 (6.6)  180 (3.6)  192 (3.8)
^{a}SARS-CoV-2: at least 1 household member had tested positive for SARS-CoV-2 infection by real-time reverse transcription polymerase chain reaction (rRT-PCR).
^{b}ILI: influenza-like illness.
^{c}Hospital ILI: at least 1 household member was recently hospitalized for an ILI.
Reported symptom correlations by survey period. SARS-CoV-2: at least 1 household member tested positive for SARS-CoV-2 infection by rRT-PCR. Hospital ILI: at least 1 household member was recently hospitalized for an ILI. ILI: influenza-like illness; rRT-PCR: real-time reverse transcription polymerase chain reaction.
Using no external data and standard statistical independence tests, we were able to rank and reject all alternative models except the model G_{Rel*,s} that yielded the lowest bias for the cumulative infection rate estimator
Although we primarily intend this work to demonstrate the broad utility of graphical models as an aid to epidemiologists, it is worth noting how useful internet-based epidemiology could prove in future epidemics by inspecting the estimates of
The limitations of this approach are encoded directly in the set of causal graphs and entail explicit conditions where statistical tests will fail to reject incorrect models. For example, if age is a poor instrumental variable for response status, then with finite data, none of the models may be rejected by statistical tests. In contrast, if age is strongly related to the outcome variable household status C_{h}, but not response status, then all models could be rejected, even if there was no selection bias on C_{h}. More generally, if there is any relationship between a set of variables, there will be a statistically significant correlation, given sufficient data; therefore, any causal model regardless of its utility will be rejected if statistical tests are applied naively.
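The last point, that any nonzero association becomes statistically significant with enough data, can be illustrated numerically. The sketch below substitutes a Pearson chi-square test (1 degree of freedom, no continuity correction) for the Fisher exact test used in the study, because it has a simple closed form. The cell proportions are hypothetical and correspond to a weak association (odds ratio of roughly 1.17).

```python
from math import erfc, sqrt


def pearson_chi2_p_value(a, b, c, d):
    """P value of the Pearson chi-square test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# Same weak association (cell proportions 26%, 24%, 24%, 26%) at growing sample sizes.
base_cells = (26, 24, 24, 26)
p_by_sample_size = {}
for scale in (1, 10, 100, 1000):
    cells = [count * scale for count in base_cells]
    p_by_sample_size[sum(cells)] = pearson_chi2_p_value(*cells)

for n, p in p_by_sample_size.items():
    print(n, p)
```

The association is identical in every row; only the sample size changes. This is why a rejection at very large sample sizes needs to be weighed against the size of the dependence rather than read as evidence that a model is useless.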
These inherent limitations are why we emphasize structural causal models as an aid and not a substitute for epidemiological judgment. The utility of causal modeling is the formal comparison and communication of alternative explanations of the sampled data. For example, in this study, we chose to not model information bias, instead focusing on detecting selection bias. The choice to ignore information bias is explicit in the presented causal graphs; none of them have a subgraph that models an information bias mechanism, such as rRT-PCR test constraints or inaccurate self-reporting. These causal models were constructed with these assumptions for the context and objectives of this study, and similar assumptions may not be acceptable for a different context or objective. The key point is that all these assumptions are made apparent on inspection of the causal graphs.
The COVID-19 pandemic is an unprecedented event, pushing the limits of the health care system worldwide. Reducing transmission via nonpharmacological interventions has been effective but requires near-real-time and accurate information across all segments of society, information that has been difficult to reliably ascertain. Given the vast divergence of cumulative infection rate estimates across early studies [
Survey materials for survey period 1.
Survey materials for survey period 2.
ACS: American Community Survey
DAG: directed acyclic graph
HIT: human intelligence task
ILI: influenza-like illness
MTurk: Mechanical Turk
NJ/NY: New Jersey and New York
NYC: New York City
NYC CBSA: New York City Combined Statistical Area contained within New Jersey and New York
rRT-PCR: real-time reverse transcription polymerase chain reaction
Design, implementation, analysis, and preparation of the manuscript were performed by NS. We gratefully acknowledge conceptual input and constructive feedback from PW, BC, KP, JYJ, and DPW.
The work was supported in part by funds to DPW from the National Institutes of Health (1R01EB02502501, 1R01LM01336401, 1R21HD09150001, 1R01LM013083); the National Science Foundation (Award 2014232), the Hartwell Foundation, the Bill and Melinda Gates Foundation, the Coulter Foundation, the Lucile Packard Foundation, Auxiliaries Endowment, the Islamic Development Bank (ISDB) Transform Fund, and the Weston Havens Foundation; program grants from Stanford University’s Human-Centered Artificial Intelligence Program, the Precision Health and Integrated Diagnostics Center, the Beckman Center, Bio-X Center, the Predictives and Diagnostics Accelerator, Spectrum, the Spark Program in Translational Research, and MediaX; and program grants from the Wu Tsai Neurosciences Institute's Neuroscience:Translate Program. We also acknowledge generous support from David Orr, Imma Calvo, Bobby Dekesyer, and Peter Sullivan. PW would like to acknowledge support from Mr. Schroeder and the Stanford Interdisciplinary Graduate Fellowship (SIGF) as the Schroeder Family Goldman Sachs Graduate Fellow.
The data underlying this paper will be shared upon reasonable request to the corresponding author.
None declared.