This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.
Although it is well-known that older individuals with certain comorbidities are at the highest risk for complications related to COVID-19 including hospitalization and death, we lack tools to identify communities at the highest risk with fine-grained spatial resolution. Information collected at a county level obscures local risk and complex interactions between clinical comorbidities, the built environment, population factors, and other social determinants of health.
This study aims to develop a COVID-19 community risk score that summarizes complex disease prevalence together with age and sex, and to compare the score with different social determinants of health indicators and built environment measures derived from satellite images using deep learning.
We developed a robust COVID-19 community risk score (COVID-19 risk score) that uses unsupervised learning to summarize the complex co-occurrence of diseases (using data for 2019) in individual census tracts; the diseases were selected on the basis of their association with risk for COVID-19 complications such as death. We mapped the COVID-19 risk score to corresponding zip codes in New York City and associated the score with COVID-19–related death. We further modeled the variance of the COVID-19 risk score using satellite imagery and social determinants of health.
Using 2019 chronic disease data, the COVID-19 risk score described 85% of the variation in the co-occurrence of 15 diseases and health behaviors that are risk factors for COVID-19 complications among ~28,000 census tract neighborhoods (median population size of tracts 4091). The COVID-19 risk score was associated with a 40% greater risk for COVID-19–related death across New York City (April and September 2020) for a 1 SD change in the score (risk ratio for 1 SD change in COVID-19 risk score 1.4;
The COVID-19 risk score localizes risk at the census tract level and was able to predict COVID-19–related mortality in New York City. The built environment explained significant variations in the score, suggesting risk models could be enhanced with satellite imagery.
The COVID-19 pandemic has disrupted major world economies and overwhelmed hospital intensive care units worldwide [
At the time of writing, New York emerged as a location with several COVID-19–related deaths spread across the 2141 census tracts in the city. Even within city hot spots like New York City, common chronic diseases and their risk factors for COVID-19 are geographically heterogeneous and vary per unit of geography, including within and across states, counties, and even cities. It is unclear how the heterogeneity of community-based risk or prevalence of diseases at a census tract level (median population sizes of ~3000-5000 individuals) is related to COVID-19 risk. Furthermore, analyses on coarser spatial resolutions will attenuate predictions and associations [
In this investigation, we sought to (1) create a clinically focused risk score that could be used to predict COVID-19 cases and deaths within cities, identify hot spots at the subcounty (census tract) level, and identify potentially vulnerable communities; (2) determine how the social determinants of health and the built environment may explain the variance of this clinically focused risk score; and (3) test whether the built environment explains a statistically significant amount of score variance even after accounting for the social determinants of health. To do this, we developed the COVID-19 community risk score (COVID-19 risk score) that summarizes the complex comorbidity and demographic patterns of small communities at the census tract level. Additionally, we examined how the social determinants of health (including the built environment, measured using satellite imagery methods [
We obtained geocoded disease prevalence data at the census tract level from the US Centers for Disease Control and Prevention (CDC) 500 Cities Project (the December 2019 release, which is based on data from 2016 to 2017 [
From the 500 Cities data, we chose 13 population-level health indicators that correspond to individual-level chronic disease risk factors associated with COVID-19–related hospitalization and death based on reports from China, Italy, and the United States (eg, [
We further obtained 5-year 2013-2017 American Community Survey (ACS) Census data [
Overview of study. (A) CDC 500 Cities; (B) satellite imagery of 500 cities from OpenMapTiles; (C) ACS Census summary statistics for each census tract; (D) estimates of prevalence and coprevalence of disease and health indicators for risk of COVID-19 complications; (E) use of principal components analysis to reduce dimensionality of diseases and health indicators; (F) construction of COVID-19 score from principal components; (G) “XYDL” deep learning pipeline that inputs satellite imagery, social determinants of health indicators from ACS Census data to predict COVID-19 community risk score; (H) social determinants of health from ACS Census data; (I) visualization of the COVID-19 community risk score; (J) association of the COVID-19 risk score with mortality in New York City; (K) creation of a dashboard; (L) mapping highest and lowest risk cities and tracts as a function of the risk score. ACS: American Community Survey; CDC: Centers for Disease Control and Prevention.
Given the complex interplay between the social determinants of health, chronic disease, and the built environment, we sought to first examine how clinical comorbidities could be used to predict COVID-19 rates by developing a clinically focused risk score and then examine how these comorbidities relate to the built environment and social determinants of health. Understanding if the built environment and social determinants of health can explain the variance of a clinically focused risk score would show that more complex risk models could be built using this data in the future. To do this, we used the statistical programming language R (version 4.0.5; R Foundation for Statistical Computing) [
Next, we examined the relationship between the ACS-derived sociodemographic indicators with the COVID-19 risk score. This was done by calculating multivariate linear and random forest regressions to test the linear and nonlinear contribution of the sociodemographic indicators in the COVID-19 score (
To correlate the COVID-19 risk score from satellite imagery (
Many census tracts are large enough to contain multiple satellite images. The median number of images per tract was 94 (IQR 43-182), and the number of images per census tract ranged from 1 to 162,811 (the geographically largest tract, in Anchorage, Alaska). The area covered per census tract ranged from 0.022 km² to 5679.52 km², with a median of 1.92 km² (IQR 0.93-3.89 km²).
First, using the Python 3.7.7 programming language [
We downloaded case and death count data at the zip code tabulation area (ZCTA) level for New York City, a hot spot of the US COVID-19 epidemic as of May 20, 2020, and then again on September 20, 2020 (
Finally, the COVID-19 risk score was made publicly available through an application programming interface and online web dashboard (see
Ethics approval was not required for this investigation as the study did not involve any human participants, and all of the data used were obtained from publicly available data sets.
We present summary statistics of the prevalence of the 15 COVID-19 comorbidities and risk factors for 27,648 census tracts across the United States using the 2019 release of the CDC 500 Cities data (derived from data obtained in 2017) and ACS data collected between 2013 and 2017 (
Atlanta had the greatest IQR for obesity (22%-40%), high blood pressure (20%-44%), and COPD (4%-9%), while Gainesville had the highest variation in prevalence of high cholesterol (18%-34%) and blood pressure medication (51%-74%).
Per census tract prevalence for health indicators (y-axis). BP: blood pressure; COPD: chronic obstructive pulmonary disease.
Median prevalence within a city versus the IQR of the prevalence of health indicators (top 3 cities with the largest IQR are highlighted in red). BP: blood pressure; COPD: chronic obstructive pulmonary disease.
The Pearson correlations between the 15 different prevalence values were calculated using census tract–level data (
The first two principal components of the 15 COVID-19 health indicators and risk factors described 85% of the total variation (61% and 24% for component 1 and 2, respectively, see
Pearson correlation of health indicators across 27,648 census tracts (legend value corresponds to Pearson correlation value). BP: blood pressure; COPD: chronic obstructive pulmonary disease.
Scatterplot showing the relationship between the first and second principal components from principal component analysis, with each point indicating a city or census tract in the United States (top 10 cities/tracts by principal component 1 or 2 are highlighted in red).
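The proportions of variance attributed to the first and second principal components can be illustrated with a small, self-contained sketch. It is not the authors' R pipeline: the toy prevalence values, the function names, and the use of power iteration on the correlation matrix are all assumptions made for illustration, and the sketch stops at the explained-variance step rather than guessing how the components were combined into the score.

```python
import math
import random


def standardize(rows):
    """Return column-wise z-scores (mean 0, sample SD 1)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(p)
    ]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def correlation_matrix(z):
    """Pearson correlation matrix of already-standardized rows."""
    n, p = len(z), len(z[0])
    return [
        [sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def explained_variance(corr, k, iters=1000, seed=0):
    """Share of total variance carried by each of the top-k components.

    Uses power iteration with deflation on the correlation matrix.
    """
    p = len(corr)
    c = [row[:] for row in corr]
    rng = random.Random(seed)
    shares = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(p)]
        for _ in range(iters):
            w = [sum(c[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(c[i][j] * v[j] for j in range(p)) for i in range(p))
        shares.append(lam / p)  # the trace of a correlation matrix equals p
        # Deflate: remove the component just found before the next pass.
        c = [[c[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return shares


# Made-up prevalences (%) for six hypothetical tracts and three indicators.
toy_tracts = [
    [30.0, 12.0, 25.0],
    [35.0, 14.0, 22.0],
    [28.0, 10.0, 30.0],
    [40.0, 17.0, 21.0],
    [33.0, 13.0, 27.0],
    [25.0, 9.0, 24.0],
]
shares = explained_variance(correlation_matrix(standardize(toy_tracts)), k=2)
print([round(s, 3) for s in shares])
```

Standardizing first means each indicator contributes equally to the total variance, so the shares sum to at most 1 and can be read directly as percentages.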
The COVID-19 risk score was calculated using the 15 disease and health indicators for 27,648 included census tracts. The average score was 33.7 (SD 8.6); the median was 33.32 (IQR 28-38).
Cities with the largest variation of COVID-19 risk score.
City, State | Median | Min | Max | 25th percentile | 75th percentile | SD | IQR |
Athens, GA | 32.2 | 6.3 | 42.6 | 20.7 | 35.7 | 10.0 | 15.0 |
Atlanta, GA | 33.7 | 4.7 | 53.7 | 23.8 | 41.4 | 11.0 | 17.6 |
Boynton Beach, FL | 41.7 | 21.6 | 81.3 | 35.5 | 52.1 | 16.2 | 16.6 |
Champaign, IL | 29.8 | 3.4 | 45.2 | 17.9 | 34.1 | 12.8 | 16.2 |
Gainesville, FL | 27.1 | 2.2 | 75.2 | 16.6 | 38.5 | 15.0 | 21.9 |
Hemet, CA | 41.2 | 30.1 | 67.9 | 36.4 | 53.4 | 11.1 | 17.0 |
Mesa, AZ | 32.2 | 7.7 | 84.6 | 29.0 | 43.2 | 14.7 | 14.2 |
Montgomery, AL | 41.0 | 15.0 | 61.3 | 33.5 | 47.6 | 9.8 | 14.1 |
St. Louis, MO | 36.9 | 22.0 | 57.3 | 31.7 | 46.3 | 8.6 | 14.6 |
Surprise, AZ | 30.2 | 24.1 | 77.9 | 26.1 | 58.7 | 20.6 | 32.6 |
Birmingham, AL | 43.9 | 18.8 | 57.9 | 36.4 | 49.3 | 9.6 | 12.9 |
Cape Coral, FL | 43.4 | 30.3 | 63.3 | 37.3 | 49.3 | 8.2 | 12.0 |
Clearwater, FL | 42.9 | 28.7 | 66.4 | 39.6 | 51.5 | 8.0 | 12.0 |
Cleveland, OH | 42.4 | 18.3 | 78.3 | 38.0 | 49.6 | 9.1 | 11.6 |
Dayton, OH | 43.1 | 6.0 | 78.1 | 38.6 | 49.8 | 11.8 | 11.2 |
Huntsville, AL | 42.7 | 22.1 | 56.3 | 32.9 | 45.6 | 8.4 | 12.7 |
Lake Charles, LA | 41.5 | 27.4 | 54.0 | 36.4 | 46.5 | 7.5 | 10.1 |
Lakeland, FL | 43.9 | 18.0 | 65.8 | 38.3 | 49.2 | 10.9 | 10.9 |
Largo, FL | 45.0 | 26.1 | 75.3 | 41.1 | 53.2 | 11.6 | 12.2 |
Palm Coast, FL | 46.8 | 33.9 | 58.1 | 43.7 | 54.8 | 7.5 | 11.0 |
Pompano Beach, FL | 43.6 | 27.1 | 64.3 | 37.0 | 48.5 | 9.7 | 11.5 |
Shreveport, LA | 42.8 | 21.7 | 64.0 | 37.7 | 49.7 | 8.6 | 12.1 |
Gary, IN | 50.8 | 42.5 | 61.8 | 47.1 | 54.6 | 5.2 | 7.6 |
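The per-city columns in the table above (median, minimum, maximum, quartiles, SD, and IQR) are ordinary summaries of tract-level scores grouped by city. The following is a minimal sketch of that grouping using only the Python standard library; the city names and scores are made up, `city_summaries` is a hypothetical helper, and the quartile rule (`method="inclusive"`) is an assumption, since the text does not state which quantile definition was used.

```python
import statistics
from collections import defaultdict


def city_summaries(tract_scores):
    """Summarize tract-level scores per city.

    tract_scores: iterable of (city, score) pairs, one per census tract.
    Returns a list of dicts sorted by IQR, widest first.
    """
    by_city = defaultdict(list)
    for city, score in tract_scores:
        by_city[city].append(score)

    rows = []
    for city, scores in by_city.items():
        if len(scores) < 2:
            continue  # quartiles and SD need at least two tracts
        q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        rows.append({
            "city": city,
            "median": q2,
            "min": min(scores),
            "max": max(scores),
            "p25": q1,
            "p75": q3,
            "sd": statistics.stdev(scores),
            "iqr": q3 - q1,
        })
    return sorted(rows, key=lambda r: r["iqr"], reverse=True)


# Made-up scores for two hypothetical cities.
toy = (
    [("City A", s) for s in (20.0, 30.0, 40.0, 50.0, 60.0)]
    + [("City B", s) for s in (38.0, 40.0, 42.0, 44.0, 46.0)]
)
summaries = city_summaries(toy)
for row in summaries:
    print(row["city"], row["median"], row["p25"], row["p75"], row["iqr"])
```

Sorting by IQR rather than SD makes the ranking less sensitive to a single extreme tract, which matters for cities with few tracts.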
The social determinants of health measures (excluding built environment) and demographic characteristics of a community (
Concerning important features, all 13 sociodemographic variables correlated with the COVID-19 risk score (linear regression
When assessing the explained variance using nonlinear regression (random forest) methods, the
Multivariate coefficients and CIs for linear regression and random forest regression of the COVID-19 risk score.
Variable | Linear coefficient | P value | Low (95% CI) | High (95% CI) | MSEa | Node purityb | VIFc,d |
Median income | –1.34 | <.001 | –1.53 | –1.16 | 42 | 59,736 | 3.68 |
Median home value | –0.13 | .07 | –0.27 | 0.01 | 39 | 33,163 | 2.21 |
At or below poverty (%) | –3.24 | <.001 | –3.42 | –3.07 | 61 | 78,890 | 3.04 |
Unemployment (%) | 0.73 | <.001 | 0.60 | 0.86 | 87 | 68,364 | 1.69 |
Nonemployed (%) | 5.38 | <.001 | 5.26 | 5.50 | 285 | 316,903 | 1.42 |
Less than high school (%) | 2.12 | <.001 | 1.90 | 2.33 | 71 | 63,048 | 4.71 |
No health insurance (%) | 0.69 | <.001 | 0.55 | 0.83 | 50 | 34,818 | 2.18 |
More than 1 occupant (%) | –0.89 | <.001 | –1.04 | –0.73 | 59 | 41,387 | 2.46 |
African American (%) | 0.73 | <.001 | 0.59 | 0.87 | 68 | 84,497 | 2.09 |
Hispanic (%) | –2.30 | <.001 | –2.49 | –2.10 | 78 | 63,847 | 4.12 |
Asian (%) | –1.14 | <.001 | –1.25 | –1.02 | 91 | 93,675 | 1.42 |
Other race (%) | –0.51 | <.001 | –0.67 | –0.36 | 69 | 45,301 | 2.45 |
aMSE: mean squared error.
bNode impurity: residual sum of squares for the random forest model.
cVIF: variance inflation factor.
dFor the linear regression model.
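The variance inflation factor for predictor j is 1 / (1 − R_j²), where R_j² comes from regressing that predictor on all the others. As a reading aid, the largest VIF in the table above (4.71 for "Less than high school (%)") corresponds to an R² of about 0.79. The sketch below computes VIFs from scratch on made-up data; it is a generic illustration with hypothetical helper names (`solve`, `r_squared`, `vif`), not the authors' R code.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(y, xs):
    """R^2 from an ordinary least squares fit of y on columns xs plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in xs] for i in range(n)]
    k = len(design[0])
    xtx = [
        [sum(design[i][a] * design[i][b] for i in range(n)) for b in range(k)]
        for a in range(k)
    ]
    xty = [sum(design[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Variance inflation factor for each predictor: 1 / (1 - R_j^2)."""
    out = []
    for j, col in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        out.append(1.0 / (1.0 - r_squared(col, others)))
    return out


# Made-up predictors: x1 and x2 are correlated; x3 is built to be
# uncorrelated with both, so its VIF should be 1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x3 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]
vifs = vif([x1, x2, x3])
print([round(v, 2) for v in vifs])
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign that coefficient estimates are being destabilized by collinearity; thresholds vary by field.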
A 1 SD increase in the COVID-19 risk score was associated with a 40% increase in the rate of COVID-19–related death (incidence rate ratio [IRR] 1.40 per 1 SD increase;
COVID-19 deaths as a function of the COVID-19 risk score in New York City for each zip code (middle panel). The zip codes with the highest and lowest death rates are annotated. Blue points denote data on the epidemic death counts in September 2020. Red points denote epidemic death counts in May 2020.
Multivariate incidence rate ratios (for 1 SD change in the variable) for zip code–level deaths in New York City in May and September 2020.
Variable (per 1 SD unit) | May IRRa (95% CI) | May P value | May VIFb | September IRR (95% CI) | September P value | September VIF |
COVID-19 risk score | 1.40 (1.27-1.55) | <.001 | 2.20 | 1.40 (1.27-1.53) | <.001 | 2.20 |
Median income | 1.02 (0.84-1.22) | .80 | 9.06 | 0.99 (0.82-1.18) | .90 | 9.12 |
Less than high school | 0.81 (0.008-1.81) | .10 | 19.80 | 0.81 (0.62-1.06) | .10 | 19.64 |
College educated | 0.93 (0.26-1.92) | .50 | 10.83 | 0.93 (0.76-1.14) | .50 | 10.77 |
African American | 1.14 (1.03-2.78) | .03 | 3.91 | 1.16 (1.03-1.30) | .01 | 3.95 |
Mexican | 0.9 (0.87-1.08) | .60 | 3.72 | 0.97 (0.87-1.07) | .50 | 3.73 |
Hispanic | 1.27 (1.19-1.46) | <.001 | 5.60 | 1.29 (1.12-1.47) | <.001 | 5.60 |
Asian | 1.12 (1.00-1.26) | .05 | 4.34 | 1.15 (1.02-1.28) | .02 | 4.34 |
At or below poverty | 1.04 (0.87-1.25) | .60 | 8.94 | 0.99 (0.83-1.17) | .90 | 8.92 |
More than 1 occupant per room | 1.12 (1.00-1.27) | .06 | 4.83 | 1.10 (0.98-1.23) | .10 | 4.71 |
No health insurance | 1.02 (0.91-1.16) | .70 | 4.66 | 1.03 (0.91-1.16) | .70 | 4.68 |
Unemployment | 1.01 (0.91-1.13) | .80 | 3.36 | 1.02 (0.91-1.14) | .70 | 3.36 |
COVID-19 case count | 1.08 (0.97-1.21) | .10 | 2.94 | 1.09 (0.98-1.21) | .10 | 2.88 |
aIRR: incidence rate ratio.
bVIF: variance inflation factor.
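An incidence rate ratio per 1 SD is the exponential of a coefficient from a log-linear rate model, so it compounds multiplicatively over larger changes. The short sketch below works through the May row for the COVID-19 risk score, 1.40 (1.27-1.55). It assumes a Wald-type interval that is symmetric on the log scale; the model family and interval method are not stated in the table, and the function names are hypothetical.

```python
import math


def log_rate_summary(irr, ci_low, ci_high, z=1.959964):
    """Recover the log-scale coefficient and its standard error.

    Assumes the 95% CI was formed as exp(beta +/- z * se).
    """
    beta = math.log(irr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


def irr_for_change(irr_per_sd, n_sd):
    """Rate ratio implied by an n_sd change under a log-linear model."""
    return irr_per_sd ** n_sd


# May row for the COVID-19 risk score, copied from the table above.
beta, se = log_rate_summary(1.40, 1.27, 1.55)
print(round(beta, 3), round(se, 3))

# Rate ratio implied for a tract area scoring 2 SD higher.
print(round(irr_for_change(1.40, 2), 2))
```

Under these assumptions, a zip code scoring 2 SD higher would have roughly 1.40 × 1.40 ≈ 1.96 times the death rate, holding the other covariates fixed.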
In this multi-scale analysis integrating and comparing spatial disease information from gold standard disease prevalence sources such as the US CDC, social determinants of health information from the US census, and satellite imagery data, we demonstrate an approach to identify characteristics of communities at risk for COVID-19 complications. We used the tools of unsupervised learning to develop a COVID-19 risk score that provides a single interpretable number that summarizes a community’s (census tract’s) aggregate risk. The constituents of the COVID-19 risk score included census tract–level chronic disease risk factors that corresponded to previously identified individual-level risk factors for COVID-19, such as age, obesity, diabetes, and heart disease.
Others have deployed similar risk scores to identify communities at risk for COVID-19 [
Although it could be argued that the deep learning analysis of satellite imagery is simply a measurement of population density, this approach also measures several other factors that may contribute to COVID-19 infection and death rates independent of population density, such as built environment features that contribute to the development of COVID-19 risk factors and features that may put individuals at risk of contracting COVID-19. Examples of features that may put individuals at risk for developing risk factors include walkability (which contributes to obesity [
We believe that the COVID-19 risk score can be one tool in the growing armamentarium of public health agencies and health care companies, enabling communities to prepare for the potential onslaught of cases in the coming winter months and ultimately helping to “flatten the curve” [
As a byproduct of developing a risk score for communities, we observed that there is substantial variation in chronic disease prevalence within and across cities in the United States. With the exception of New York City and a few other places in the United States, public health agencies mostly collect COVID-19 case and death records at the county level across the country. However, the findings of our study imply that smaller populations are at risk and that counties are heterogeneous.
We demonstrated how COVID-19 rates can be modeled using the COVID-19 risk score and how social determinants of health and the built environment can explain most of the score variance. Through simulations of the coprevalences of each of the 27,648 census tracts, we found that the point estimates for the community risk scores were robust to simulated sampling error. Many cities in the southwest and southeast demonstrated wide ranges in the COVID-19 risk score values. For example, Surprise, Arizona had a COVID-19 risk score with an IQR of 26 to 59. Atlanta, Georgia had an IQR of 24 to 41 (Figure S1 in
This large proportion of COVID-19–associated risk variance explained by the social determinants of health and built environment may be partly due to how discrimination affects where people live, their built environment, and access to health care [
The following are limitations of this study. First, we relied on disease and health indicator prevalence from the 500 largest cities in the United States but missed out on less urban areas whose populations are at risk for COVID-19 complications. In the future, we aim to task satellite imaging technology to locations that cannot be covered by resource-limited public surveillance programs. Second, although the CDC 500 Cities data are reflective of the diversity of individuals who live in a census tract, they are updated every 2 years and are dated to the latest collection (the 2019 data release reflects disease prevalence in 2017). Relatedly, neither individual-level disease nor the COVID-19 status of individuals from these communities is measured. Third, satellite image data are captured at a resolution of approximately 20 m per pixel. It is not clear from our study if higher resolution images (that can theoretically capture more human-visible details of the built environment) would lead to better predictions of the COVID-19 risk score. Finally, interpretation of the New York–related data is limited because the data are aggregated to the zip code level. It is clear that COVID-19 is a disease of disparity; however, we cannot make a causal claim between instruments such as the COVID-19 risk score, satellite imagery, and census tract–level sociodemographic factors, and eventual individual-level COVID-19–related complications.
Although it is clear that individual-level comorbidities are associated with risk for COVID-19, here we show that communities’ clinical coprevalence structure is predictive of risk quantified by the COVID-19 risk score, and the variance of that score can be explained using the social determinants of health and the built environment measured from satellite imagery. We provide all our tools to monitor COVID-19 risk and related data in an interactive web-based dashboard.
Supplementary methods.
Supplementary information containing figures, tables, and application programming interface specification for the COVID-19 community risk score.
Table S2.
ACS: American Community Survey
CDC: Centers for Disease Control and Prevention
COPD: chronic obstructive pulmonary disease
IRR: incidence rate ratio
MSE: mean squared error
VIF: variance inflation factor
ZCTA: zip code tabulation area
We thank Emmanuel Coloma and Sumeet Parekh for their help in crafting the figures. This study is funded by XY Health, Inc, a company that develops machine learning approaches for prediction of health outcomes using satellite and land sensor data.
This study was funded by XY Health, Inc, and AD, GL, CL, and WDB completed this study while employed at XY Health Inc.