This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.

Obesity is a global epidemic causing at least 2.8 million deaths per year. This complex disease is associated with significant socioeconomic burden, reduced work productivity, unemployment, and other social determinants of health (SDOH) disparities.

The objective of this study was to investigate the effects of SDOH on obesity prevalence among adults in Shelby County, Tennessee, United States, using a geospatial machine learning approach.

Obesity prevalence was obtained from the publicly available 500 Cities database of Centers for Disease Control and Prevention, and SDOH indicators were extracted from the US census and the US Department of Agriculture. We examined the geographic distributions of obesity prevalence patterns, using Getis-Ord Gi* statistics and calibrated multiple models to study the association between SDOH and adult obesity. Unsupervised machine learning was used to conduct grouping analysis to investigate the distribution of obesity prevalence and associated SDOH indicators.

Results depicted a high percentage of neighborhoods experiencing high adult obesity prevalence within Shelby County. At the census tract level, the median household income, as well as the percentage of individuals who were Black, home renters, living below the poverty level, 55 years or older, unmarried, and uninsured, had a significant association with adult obesity prevalence. The grouping analysis revealed disparities in obesity prevalence among disadvantaged neighborhoods.

More research is needed to examine links between geographical location, SDOH, and chronic diseases. The findings of this study, which depict a significantly higher prevalence of obesity within disadvantaged neighborhoods, and other geospatial information can be leveraged to offer valuable insights, informing health decision-making and interventions that mitigate risk factors of increasing obesity prevalence.

Obesity is a global epidemic with increasing prevalence from 3% to 11% among men and 6% to 15% among women within the past 40 years [^{2} or higher [

Although genetic and behavioral factors increase susceptibility, studies have shown that social determinants of health (SDOH) risk factors adversely affect health outcomes and are major contributing factors to the increasing occurrence of obesity and other NCDs [

There is a dearth of studies that have leveraged geospatial intelligence to examine SDOH indicators associated with obesity. In this study, we examined the geographical variations and prevalence patterns of obesity in Shelby County in the United States, using Getis-Ord Gi* statistics and calibrated multiple models to study the association between SDOH and adult obesity. We also adopted unsupervised machine learning to conduct grouping analysis and investigate the distribution of obesity prevalence and the associated SDOH indicators. In addition to facilitating the surveillance of obesity and other NCDs within Shelby County, our findings could inform innovative health strategies to tackle SDOH disparities and other adverse influences on health outcomes.

In this study, data from well-known, publicly available multidimensional sources were merged at the census tract level. We used CDC 500 Cities data (2019) [

Summary statistics for obesity and related risk factors in census tracts of Shelby County, Tennessee.

Variables | Operationalization | Source | Values, mean (SD) |

Obesity | Model-based estimate for the crude prevalence of obesity among adults aged ≥18 years, 2018 | CDC^{a} | 35.77 (7.84) |

Low access to supermarket | Count of housing units without a vehicle and greater than half a mile from supermarket in the census tract | USDA^{b} | 102.54 (108.37) |

Black population | Percentage of the Black or African American population living in the census tract | US census | 58.02 (17.31) |

Poverty | Percentage of the population living below the federal poverty line in the census tract | USDA | 24.89 (17.35) |

Unemployment | Percentage of the unemployed population living in the census tract | US census | 4.32 (3.04) |

High school diploma | Percentage of the population aged ≥25 years without a high school diploma in the census tract | US census | 9.33 (6.59) |

Renters | Percentage of the population renting their homes | US census | 18.87 (11.85) |

Average household size | Average household size in a census tract | US census | 2.57 (0.52) |

Median household income | Median household income in a census tract (US $) | US census | 53,746 (29,335) |

Female head of the household | Percentage of the households with a female head in a census tract | US census | 7.75 (4.23) |

Uninsured | Model-based estimate for the crude prevalence of uninsured adults aged ≥18 years, 2018 | CDC | 18.84 (7.16) |

Lack of physical activity | Model-based estimate for the crude prevalence of lack of physical activity among adults aged ≥18 years, 2018 | CDC | 32.88 (10.52) |

Aged 55 years and older | Percentage of the population aged ≥55 years in a census tract | US census | 21.89 (7.81) |

Single | Percentage of the population who are single in a census tract | US census | 13.70 (8.62) |

^{a}CDC: Centers for Disease Control and Prevention.

^{b}USDA: The United States Department of Agriculture.

We explored the geospatial clustering and hot spots of adult obesity prevalence in Shelby County. We conducted this analysis by using Getis-Ord Gi* statistics with first-order queen contiguity and applied the false discovery rate correction to account for multiple testing and spatial dependence.
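The hot spot statistic described above can be illustrated with a minimal sketch. It assumes binary queen contiguity weights on a regular grid (real census tracts are irregular polygons) and uses made-up prevalence values; the function names `queen_neighbors` and `getis_ord_gi_star` are hypothetical, and the false discovery rate correction step is not shown.

```python
import math

def queen_neighbors(rows, cols):
    """Map each grid cell index to the cells sharing an edge or a corner with it."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            nbrs[r * cols + c] = [
                rr * cols + cc
                for rr in range(r - 1, r + 2)
                for cc in range(c - 1, c + 2)
                if (rr, cc) != (r, c) and 0 <= rr < rows and 0 <= cc < cols
            ]
    return nbrs

def getis_ord_gi_star(values, nbrs):
    """Return Gi* z-scores with binary weights; each cell is included in its own window."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        window = nbrs[i] + [i]   # Gi* includes the focal cell itself
        w_sum = len(window)      # sum of binary weights (equals sum of squared weights)
        local_sum = sum(values[j] for j in window)
        denom = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append((local_sum - mean * w_sum) / denom)
    return scores

# Hypothetical 4x4 grid of obesity prevalence (%), with high values in the top-left corner.
grid = [
    45, 44, 30, 29,
    46, 43, 31, 28,
    32, 30, 27, 26,
    29, 28, 26, 25,
]
z = getis_ord_gi_star(grid, queen_neighbors(4, 4))
# Cells whose z-score exceeds 1.96 would be flagged as hot spots before any correction.
hot = [i for i, score in enumerate(z) if score > 1.96]
```

Large positive z-scores mark clusters of high values (hot spots) and large negative z-scores mark clusters of low values (cold spots).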

To prepare the data set for predictive modeling, we scaled our features such that columns had a mean of 0 and a SD of 1 [
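As a minimal sketch of this scaling step, the snippet below standardizes one column of made-up poverty percentages. The use of the sample SD (n−1 denominator) is an assumption, and `standardize` is a hypothetical helper name.

```python
from statistics import mean, stdev

def standardize(column):
    """Rescale a column to mean 0 and SD 1 (sample SD; an assumption)."""
    m = mean(column)
    sd = stdev(column)
    return [(x - m) / sd for x in column]

# Hypothetical poverty percentages for four census tracts.
poverty = [10.0, 20.0, 30.0, 40.0]
scaled = standardize(poverty)
```

Because every predictor ends up on the same unitless scale, the fitted coefficients can be compared directly with one another.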

The predictor variables that were considered were the 13 census tract-level risk factor variables, and the outcome variable was the adult obesity prevalence in the census tract (

In this study, we applied multiple modeling techniques. Ordinary least squares (OLS) regression modeling was among these techniques, represented by the following equation:

$$Y = X\beta + \varepsilon \qquad (1)$$

Equation 1 shows the regression model in matrix notation, where Y is an n×1 vector of n observations on the dependent variable; X is an n×q design matrix of n observations on q explanatory variables (the first column of X is a vector of n ones for the intercept); β is a q×1 vector of regression coefficients; and ε represents an n×1 vector of random error terms (independently and identically distributed). To assess and compare the performance of the models, we used adjusted R^{2} and AIC. To assess the heteroskedasticity of random error terms, we used the Koenker-Bassett test. To assess the normality of the error distribution, the Jarque-Bera test was applied. We assessed the multicollinearity of the entire model using the condition number. To examine the independence of the error terms, the Robust Lagrange Multiplier (error) and Robust Lagrange Multiplier (lag) tests were applied. First-order queen contiguity weights were constructed for spatial testing. Queen contiguity was chosen because it treats areas that share any boundary or vertex as neighbors, which yields more neighbors per area than rook contiguity. If dependence was found among the error terms, we incorporated terms that accounted for spatial autocorrelation in the model. Thus, we applied spatial autoregressive models: the spatial lag model or the spatial error model (SEM) [
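The difference between queen and rook contiguity can be sketched on a regular grid. This is an illustration only: census tracts are irregular polygons, and the `neighbors` helper is hypothetical.

```python
def neighbors(rows, cols, r, c, queen=True):
    """List the cells contiguous to cell (r, c) on a rows x cols grid."""
    result = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) == (0, 0):
                continue  # a cell is not its own neighbor
            if not queen and dr != 0 and dc != 0:
                continue  # rook contiguity ignores corner-only contact
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                result.append((rr, cc))
    return result

# Center cell of a 3x3 grid under both definitions.
queen_count = len(neighbors(3, 3, 1, 1, queen=True))
rook_count = len(neighbors(3, 3, 1, 1, queen=False))
```

For the center cell, queen contiguity counts the four edge-sharing cells plus the four corner-sharing cells, whereas rook contiguity counts only the four edge-sharing cells.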

The spatial lag model includes a spatially lagged dependent variable and is represented by equation 2:

$$Y = \rho WY + X\beta + \varepsilon \qquad (2)$$

In equation 2, Y is an n×1 vector of n observations on the dependent variable; ρ is a scalar spatial lag parameter; WY is the spatially lagged dependent variable for an n×n weights matrix W; X is an n×q design matrix of n observations on q explanatory variables; β is a q×1 vector of regression coefficients; and ε represents an n×1 vector of error terms.
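To make the spatially lagged term WY concrete, the following sketch computes it for four made-up tracts arranged in a row. Row-standardized weights (each area's weights sum to 1) are an assumption here, and the helper names are hypothetical.

```python
def row_standardize(neighbor_lists):
    """Turn neighbor lists into weights that sum to 1 for each area."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbor_lists.items()}

def spatial_lag(weights, y):
    """Compute WY: for each area, the weighted average of its neighbors' values."""
    return [sum(w * y[j] for j, w in weights[i].items()) for i in range(len(y))]

# Hypothetical layout: four tracts in a row, each touching only the adjacent ones.
neighbor_lists = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
y = [30.0, 36.0, 40.0, 44.0]  # made-up obesity prevalence (%)
wy = spatial_lag(row_standardize(neighbor_lists), y)
```

With row-standardized weights, each element of WY is simply the mean outcome among an area's neighbors, so ρ measures how strongly an area's outcome tracks that neighborhood average.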

The spatial error model includes a spatial autoregressive error term and is represented by equation 3:

$$Y = X\beta + u, \qquad u = \lambda Wu + \varepsilon \qquad (3)$$

In equation 3, Y is an n×1 vector of n observations on the dependent variable; X is an n×q design matrix of n observations on q explanatory variables; β is a q×1 vector of regression coefficients; λ is a scalar spatial error parameter; W represents the n×n spatial weights matrix; u represents an n×1 vector of error terms; Wu denotes a spatially lagged error term; and ε represents an n×1 vector of error terms. The OLS regression and spatial autoregressive models were assessed and compared to identify the best-performing model.

In order to understand the dependent variable and significantly associated SDOH across the region, we used the hierarchical clustering unsupervised machine learning algorithm [
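As a rough illustration of hierarchical (agglomerative) clustering, the sketch below repeatedly merges the two closest clusters until a target number of groups remains. Average linkage, Euclidean distance, the two-feature toy data, and the `agglomerate` helper are all assumptions made for illustration; the linkage and settings used in the study are not stated here.

```python
import math

def agglomerate(points, k):
    """Merge the two closest clusters (average linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average Euclidean distance over all cross-cluster pairs of points.
        dists = [math.dist(points[i], points[j]) for i in a for j in b]
        return sum(dists) / len(dists)

    while len(clusters) > k:
        pairs = [
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical standardized (obesity, uninsured) values for six tracts.
tracts = [(-1.2, -1.0), (-1.0, -1.1), (0.1, 0.0), (0.0, 0.2), (1.3, 1.1), (1.1, 1.4)]
groups = agglomerate(tracts, 3)
```

Each resulting group can then be described by how far its feature means sit above or below the overall average.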

We explored the geographical distribution of lack of physical activity, obesity, and the top four features significantly associated with obesity in Shelby County.

ArcGIS Pro software (version 2.9.0; Esri) was used to produce spatial distributions to investigate patterns (ie, spatial clustering). R Studio (version 4.0.3; RStudio, PBC) and GeoDa software (version 1.16.0.12; Luc Anselin) were used for statistical analyses.

(A) represents geospatial distribution of adult obesity prevalence in Shelby County; (B) represents significant hot and cold spots of adult obesity prevalence in Shelby County.

After conducting the analytical modeling techniques in the “Regression Modeling” section, the percentage of population that lacks physical activity was removed during the VIF assessment (VIF=46.7), and the percentage of population with a female head of the household and the percentage of the population aged 25 years and older without high school education were removed during the AIC process (they were also found to be nonsignificant after conducting further experimental analysis). In addition, the average household size and households with low access to supermarkets were not significantly associated with obesity. However, there were 8 variables from

The final OLS regression model results are shown in the “Ordinary least squares regression results” table. The adjusted R^{2} was 0.963, indicating that 96% of the variation in the outcome variable was explained by the predictors, and the AIC was –88.34. The multicollinearity condition number was 6.99, which is less than 20; thus, multicollinearity was not suspected. The Jarque-Bera test had a

Ordinary least squares regression results.

Variable | Coefficient |

Constant | –0.000 |

Median household income | –0.046^{a} |

Poverty | 0.184^{b} |

Renters | –0.134^{b} |

Aged 55 years and older | 0.043^{a} |

Single | 0.091^{c} |

Uninsured | 0.445^{b} |

Unemployment | 0.042^{a} |

Black population | 0.438^{b} |

^{a}

^{b}

^{c}

However, Robust Lagrange Multiplier (error) had a test value of 10.72 (

Since our variables are measured on the same scale, we were able to compare the strength of the effect of each predictor variable on obesity prevalence. We found that the percentage of uninsured population, the percentage of the Black population, the percentage of the population below poverty level, and the percentage of home renters were the most important variables when predicting obesity prevalence in Shelby County.

Spatial Error Model results.

Variable | Coefficient |

Constant | –0.001 |

Lambda | 0.488^{a} |

Median household income | –0.056^{a} |

Renters | –0.106^{a} |

Poverty | 0.146^{a} |

Aged 55 years or older | 0.051^{b} |

Single | 0.066^{c} |

Uninsured | 0.466^{a} |

Unemployment | 0.027 |

Black population | 0.423^{a} |

^{a}

^{b}

^{c}

After calibrating both models, we found that the SEM outperformed the OLS model. The adjusted R^{2} value improved from 0.963 to 0.968 after incorporating the spatial error term in the model, and the AIC improved from –88.34 to –108.09, indicating a better model fit.

Model performance.

Model | Adjusted R^{2} | Akaike’s information criterion |

Ordinary least squares | 0.963 | –88.34 |

Spatial error model | 0.968 | –108.09 |

Our grouping analysis divided the study area into 5 distinct groups across the Shelby region, based on the top four features that were significantly associated with obesity (

Grouping analyses results.

Group 1 spans the fourth largest area of the region (47 census tracts) and was quantified as being below average in obesity prevalence, percentage of the Black population, percentage of the population with an income below the poverty level, and percentage of the uninsured population; however, this group is around average in the percentage of renters.

Group 2 is the largest area in the region, comprising 62 census tracts. This region is far above average in obesity prevalence, percentage of renters, percentage of the Black population, percentage of the population with an income below the poverty level, and percentage of the uninsured population.

Group 3 comprises 52 census tracts. This region is above average in obesity prevalence, percentage of renters and percentage of the uninsured population, and it is far above average in percentage of the Black population; however, this group is around average in percentage of the population with an income below the poverty level and below average in percentage of renters.

Group 4 comprises 52 census tracts and is quantified as being far below average in obesity prevalence, percentage of the Black population, percentage of the population with an income below the poverty level, percentage of renters, and percentage of the uninsured population.

Group 5 spans the smallest area of the region (6 census tracts) and is characterized as being average in obesity prevalence and percentage of the uninsured population; however, this group is far above average in percentage of the Black population, percentage of the population with an income below the poverty level, and percentage of renters.

Even though lack of physical activity was removed during the “model selection” process due to multicollinearity, we examined the Spearman rank correlation coefficient (

In addition,

Spearman rank coefficients to assess the relationship between lack of physical activity and obesity and the top four obesity-associated features in Shelby County census tracts.

Variables | Spearman rank coefficient |

Obesity | 0.96^{a} |

Uninsured population | 0.95^{a} |

Black population | 0.76^{a} |

Renters | 0.43^{a} |

Poverty | 0.86^{a} |

^{a}
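The Spearman rank coefficient reported above is the Pearson correlation computed on ranks. The following sketch shows the calculation on made-up values for five tracts; the `ranks` and `spearman` helpers are hypothetical, and ties receive the average of their rank positions.

```python
def ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman rank coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical tract-level prevalence values (%), not taken from the study data.
inactivity = [22.0, 28.0, 31.0, 35.0, 41.0]
obesity = [27.0, 30.0, 29.0, 38.0, 44.0]
rho = spearman(inactivity, obesity)
```

Because only the ordering of values matters, the coefficient captures any monotonic relationship, not just a linear one.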

Assessment of lack of physical activity and the top four features associated with obesity.

Obesity is a serious health condition that is associated with several comorbidities (eg, heart diseases, cancers, and diabetes) that are leading causes of death in the United States. SDOH factors such as the community, home, school, and workplace setting can impact physical activity and access to affordable healthy food. Some communities are more impacted, as evidenced by the disproportionality of adult obesity rates, compared to other populations [

Unlike multiple studies [

Finally, results from this study will be incorporated into the analytics layer of our Urban Public Health Observatory knowledge-based surveillance platform [

Previous studies have found associations between sociogeographical determinants and health outcomes [

AIC: Akaike’s information criterion

CDC: Centers for Disease Control and Prevention

NCD: noncommunicable disease

OLS: ordinary least squares

SDOH: social determinants of health

SEM: spatial error model

VIF: variance inflation factor

None declared.