This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.
COVID-19 is a major public health concern. Given the extent of the pandemic, it is urgent to identify risk factors associated with disease severity. More accurate prediction of those at risk of developing severe infections is of high clinical importance.
Based on the UK Biobank (UKBB), we aimed to build machine learning models to predict the risk of developing severe or fatal infections, and uncover major risk factors involved.
We first restricted the analysis to infected individuals (n=7846), then performed analysis at a population level, considering those with no known infection as controls (n=465,728 controls). Hospitalization was used as a proxy for severity. A total of 97 clinical variables (collected prior to the COVID-19 outbreak) covering demographic variables, comorbidities, blood measurements (eg, hematological/liver/renal function/metabolic parameters), anthropometric measures, and other risk factors (eg, smoking/drinking) were included as predictors. We also constructed a simplified (lite) prediction model using 27 covariates that can be more easily obtained (demographic and comorbidity data). XGboost (gradient-boosted trees) was used for prediction and predictive performance was assessed by cross-validation. Variable importance was quantified by Shapley values (ShapVal), permutation importance (PermImp), and accuracy gain. Shapley dependency and interaction plots were used to evaluate the pattern of relationships between risk factors and outcomes.
A total of 2386 severe and 477 fatal cases were identified. For analyses within infected individuals (n=7846), our prediction model achieved an area under the receiver operating characteristic curve (AUC–ROC) of 0.723 (95% CI 0.711-0.736) and 0.814 (95% CI 0.791-0.838) for severe and fatal infections, respectively. The top 5 contributing factors (sorted by ShapVal) for severity were age, number of drugs taken (cnt_tx), cystatin C (reflecting renal function), waist-to-hip ratio (WHR), and Townsend deprivation index (TDI). For mortality, the top features were age, testosterone, cnt_tx, waist circumference (WC), and red cell distribution width. For analyses involving the whole UKBB population, AUCs for severity and fatality were 0.696 (95% CI 0.684-0.708) and 0.825 (95% CI 0.802-0.848), respectively. The same top 5 risk factors were identified for both outcomes, namely, age, cnt_tx, WC, WHR, and TDI. Apart from the above, age, cystatin C, TDI, and cnt_tx were among the top 10 across all 4 analyses. Other diseases ranked highly by ShapVal or PermImp included type 2 diabetes mellitus (T2DM), coronary artery disease, atrial fibrillation, and dementia. For the “lite” models, predictive performances were broadly similar, with estimated AUCs of 0.716, 0.818, 0.696, and 0.830, respectively. The top-ranked variables were similar to those above, including age, cnt_tx, WC, sex (male), and T2DM.
We identified numerous baseline clinical risk factors for severe/fatal infection by XGboost. For example, age, central obesity, impaired renal function, multiple comorbidities, and cardiometabolic abnormalities may predispose to poorer outcomes. The prediction models may be useful at a population level to identify those susceptible to developing severe/fatal infections, facilitating targeted prevention strategies. A risk-prediction tool is also available online. Further replications in independent cohorts are required to verify our findings.
COVID-19 has resulted in a pandemic affecting more than a hundred countries worldwide [
Machine learning (ML) approaches are powerful tools to predict disease outcomes and have been increasingly applied in biomedical research. In this study we employed boosted trees (with XGboost) to predict disease outcomes and identify risk factors. This ML approach can capture complex and nonlinear interactions between variables, hence leading to better predictive power in many circumstances. In view of the COVID-19 pandemic, many ML models have been developed for diagnostic or prognostic purposes. For instance, Bayat et al [
Here we made use of the UK Biobank (UKBB) data to build ML models to predict severity and fatality from COVID-19, and evaluated the contributing risk factors. We built prediction models not only for patients infected but also at a general population level. While predictive performance is the main concern in most previous studies, we argue that ML models can also provide important insights into individual contributing factors and the pattern of complex relationships between risk factors and the outcome. While many have studied risk factors of COVID-19 susceptibility or severity in the UKBB [
We note that in the UKBB clinical data were collected years before the outbreak of infection in 2020, which may be a limitation. Ideally, the predictors should be measured at the time when the model is intended to be applied (eg, at admission). However, we believe that building ML models with previously collected clinical data is useful for reasons detailed below. First, using previously collected clinical features may facilitate the identification of potential causal risk factors. As the predictors are collected prior to the outbreak, there is no concern about reverse causality. In practice, infection itself will lead to changes in many clinical parameters (eg, glucose, inflammatory markers, liver/renal functions); hence, it is often difficult to tell the direction of effect in cross-sectional studies. We hypothesize that this study will identify general or “baseline” risk factors or laboratory measurements that may be (causally) predictive of outcome. Second, the UKBB is a huge population-based sample (N=~500,000), and the rich clinical data collected previously enable ML models to be developed at the general population level. Importantly, there is a relative lack of such population-level ML prediction models to identify who may be at risk of developing severe COVID-19 infections. We hope this study will fill the gap, as this may have implications for prioritizing individuals for specific prevention strategies (eg, vaccination) and diagnostic testing under limited resources.
In this study we performed 4 sets of analysis. In the first 2 sets, we built ML models to predict the severity and mortality of COVID-19 among those who tested positive for the virus. In this setting, predictive performance is of secondary concern (as predictors were not assessed at or during admission), but it can shed light on to what extent
The UKBB is a large-scale prospective cohort comprising nearly 500,000 individuals aged 40-69 when they were recruited in 2006-2010. Given that the first case of COVID-19 in the UK was recorded on January 31, 2020, individuals with recorded mortality before January 31, 2020 (28,931 out of 502,524 individuals) were excluded. We also excluded from subsequent analyses a very small number of individuals (n=19) whose cause of mortality was COVID-19 (ICD code U07.1) but with negative test result(s) within 1 week. The current age of individuals included in our analyses ranged from 50 to 87 years, with 50.77% (255,170/502,524) being older than 70. This analysis was conducted under the project number 28732. For details of the UKBB data, please also refer to Sudlow et al [
COVID-19 outcome data were downloaded from the data portal provided by the UKBB. Details of the data release are provided in [
In general, we required both test result and origin to be 1 (indicating positive test and inpatient origin, respectively) to qualify as an “inpatient” case. For a small number of individuals with inpatient origin=0 and result=1, but changed to origin=1 with result=0 within 2 weeks’ time (based on the fact that median duration of viral persistence is nearly 2 weeks [
Data on mortality and cause of mortality were also extracted (with latest update on December 14, 2020). Individuals with recorded cause of mortality as “U07.1” were considered as having a fatal infection with laboratory-confirmed COVID-19 (please also refer to [
Four sets of analysis were performed. The first 2 sets were restricted to test-positive cases (n=7846). “Severe COVID-19” (n=2386) and death due to COVID-19 (n=477) were treated as outcomes. Because only prediagnostic clinical data were available, the main objective of this analysis was to identify baseline risk factors for severe/fatal illness among the infected. We then performed another 2 sets of analysis with the same outcomes, but with the “unaffected” group composed of the general population (n=465,728) who did not have a diagnosis of COVID-19 or who tested negative. The 4 sets of analysis are also referred to as cohorts A-D, as shown in
The four sets of analysis performed and predictive performances (full model and lite model).
Cohort | Group 1 | Group 2 | n (group 1) | n (group 2) | AUCa, full model (%) | AUCa, lite model (%) | 95% CI, full model (%) | 95% CI, lite model (%) |
A | Hospitalized or fatal cases | Nonhospitalized cases | 2386 | 5460 | 72.3 | 71.6 | 71.1-73.6 | 70.3-72.9 |
B | Fatal cases | All other COVID-19 cases | 477 | 7369 | 81.4 | 81.8 | 79.1-83.8 | 79.4-84.2 |
C | Hospitalized or fatal cases | UK Biobank patients without a COVID-19 diagnosis or tested negative | 2386 | 465,728 | 69.6 | 69.6 | 68.4-70.8 | 68.4-70.7 |
D | Fatal cases | UK Biobank patients without a COVID-19 diagnosis or tested negative | 477 | 465,728 | 82.5 | 83.0 | 80.2-84.8 | 80.8-85.3 |
aAUC was taken from the average of 5 folds of cross-validation.
We extracted a total of 97 clinical variables of potential relevance based on the literature. For details, please refer to Table S1d in
The full list of variables is shown in Table S1b in
Missing values of remaining features were imputed with the R package missRanger (R Foundation). The program is based on missForest [
We also attempted multiple imputation by chained equations (MICE). For our data set of nearly 500,000 individuals, MICE stopped after running for 6 hours because of a memory overflow error (>64 GB), whereas missRanger completed the imputation within 3 hours. We considered the computational burden of MICE too high and therefore employed missRanger in our analyses.
Several studies have compared missForest with MICE and reported several advantages of missForest. For categorical variables, the imputation accuracy of missForest is likely to be higher than that of MICE [
XGboost with gradient-boosted trees was employed for building prediction models. Analysis was performed by the R package “xgboost.” We employed a fivefold nested cross-validation strategy to develop and test the model. To avoid overoptimistic results due to choosing the best set of hyperparameters based on test performance, the test sets were
In each iteration, we divided the data into 5 folds, of which one (one-fifth of the data) was reserved for testing only. Of the remaining four-fifths of the data, we further sampled four-fifths for training and one-fifth for hyperparameter tuning. The best prediction model was then applied to the test set. The process was repeated 5 times. A grid-search procedure was used to find the best combination of hyperparameters (eg, tree depth, learning rate, regularization parameters for L1/L2 penalty). The full range of hyperparameters chosen for grid search is given in Table S6 in
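The splitting scheme described above can be sketched as follows. This is a minimal illustration in Python, not the code used in the study (which was run in R); the function name `nested_cv_splits`, the random seed, and the toy sample size of 100 are assumptions made for the example.

```python
import random

def nested_cv_splits(n_samples, n_outer=5, tune_fraction=0.2, seed=0):
    """Yield (train, tune, test) index lists, one triple per outer fold.

    Hypothetical helper: each outer fold is held out once for testing;
    the remaining data are split into training and tuning parts.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Outer folds: every sample appears in exactly one test fold.
    folds = [indices[i::n_outer] for i in range(n_outer)]
    for k in range(n_outer):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        rng.shuffle(rest)
        # One-fifth of the non-test data is reserved for hyperparameter tuning.
        n_tune = int(round(len(rest) * tune_fraction))
        tune, train = rest[:n_tune], rest[n_tune:]
        yield train, tune, test

splits = list(nested_cv_splits(100))
for train, tune, test in splits:
    print(len(train), len(tune), len(test))
```

Hyperparameters would be chosen on the tuning part only, so the test fold never influences model selection.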
The “full” model described above covers a wide range of predictors, but some features (such as blood biochemistries) may not be readily accessible. For easier implementation in practice, we also built a simplified prediction model (also referred to as the “lite” model) based on a reduced set of 27 predictors. The reduced set of variables was chosen based on ease of assessment or measurement, and included comorbidities (see above), anthropometric measures (BMI, weight, WC), demographic variables (eg, age, sex, ethnic group), and general indicators of health (number of medications taken, number of illnesses).
To evaluate the predictive performance of the prediction models, we computed the area under the receiver operating characteristic curve (AUC–ROC), which is widely used in clinical prediction studies. We also calculated other measures including the area under the precision–recall curve (AUC–PRC), F1 score, accuracy, and Matthews correlation coefficient (MCC). The cutoff of predicted probability for calculating the latter 3 measures was determined by optimizing the geometric mean of sensitivity and specificity.
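As an illustration of the cutoff selection described above, the following Python sketch picks the probability cutoff that maximizes the geometric mean of sensitivity and specificity, then computes accuracy, F1 score, and MCC at that cutoff. The function names and the toy data are hypothetical, and the candidate cutoffs are assumed to be the observed predicted probabilities; the study's own implementation may differ.

```python
import math

def confusion(y_true, y_prob, cutoff):
    """Return (tp, fp, tn, fn) when predicting 1 for probabilities >= cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= cutoff else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def gmean_cutoff(y_true, y_prob):
    """Cutoff maximizing sqrt(sensitivity * specificity)."""
    best_cutoff, best_g = None, -1.0
    for c in sorted(set(y_prob)):
        tp, fp, tn, fn = confusion(y_true, y_prob, c)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        g = math.sqrt(sens * spec)
        if g > best_g:
            best_cutoff, best_g = c, g
    return best_cutoff, best_g

def summary_metrics(y_true, y_prob, cutoff):
    """Accuracy, F1 score, and Matthews correlation coefficient at a cutoff."""
    tp, fp, tn, fn = confusion(y_true, y_prob, cutoff)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f1, mcc

# Toy data (hypothetical): 1 = case, 0 = noncase.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
cutoff, g = gmean_cutoff(y_true, y_prob)
accuracy, f1, mcc = summary_metrics(y_true, y_prob, cutoff)
print(cutoff, accuracy, f1, mcc)
```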
In addition to good ability to discriminate cases from noncases, it is also important that the predicted event probabilities match with the observed probabilities (also known as calibration of a model). We assessed calibration by several measures, including the Hosmer–Lemeshow test, expected calibration error (ECE), and maximum calibration error (MCE) [
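For readers unfamiliar with ECE and MCE, the sketch below shows one common way to compute them: predictions are grouped into equal-width probability bins, and the gap between the observed event rate and the mean predicted probability is summarized across bins (weighted average for ECE, maximum for MCE). The binning scheme, the number of bins, and the function name are assumptions for illustration; the exact definition used in the study is not stated in this section.

```python
def calibration_errors(y_true, y_prob, n_bins=10):
    """Return (ECE, MCE) using equal-width probability bins (assumed scheme)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        gap = abs(observed - predicted)
        ece += len(members) / n * gap  # weighted by bin size
        mce = max(mce, gap)            # worst bin
    return ece, mce

# Toy example (hypothetical data): two bins, each off by 0.1.
ece, mce = calibration_errors([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9], n_bins=2)
print(ece, mce)
```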
In this work we primarily employed Shapley value (ShapVal) [
Intuitively, the ShapVal of the
A related index is the Shapley
An advantage of Shapley value is that it is calculated for each individual, so how each risk factor affects a specific person’s risk of infection/severity can be estimated as well. To illustrate this concept, we also produced decision plots for individuals at the highest, median, and lowest risk of each cohort.
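To make the per-individual nature of Shapley values concrete, the sketch below computes exact Shapley values by brute force for a toy 2-feature model on the log-odds scale. It is illustrative only: the toy model, its coefficients, and the baseline values are hypothetical, features outside a coalition are simply set to a baseline value, and tree-based models are in practice handled with specialized algorithms rather than enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one individual by enumerating all coalitions."""
    n = len(x)

    def value(subset):
        # Features in the coalition take the individual's values; others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_log_odds(z):
    """Hypothetical risk model with an age-by-waist interaction (not from the study)."""
    age, waist = z
    return -10 + 0.04 * age + 0.02 * waist + 0.001 * age * waist

x = [70, 100]        # one individual: age 70, waist circumference 100 cm
baseline = [60, 90]  # reference values
phi = shapley_values(toy_log_odds, x, baseline)
print(phi)
```

The contributions sum to the difference between the individual's predicted log-odds and the baseline prediction, which is what allows a decision plot to decompose one person's risk feature by feature.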
We also performed cluster analysis based on ShapVal to identify subgroups of patients who share similar clinical risk factors with respect to severity of infection. As introduced in [
Here we performed k-means sparse clustering to uncover underlying patient subgroups based on ShapVal of risk factors. As the number of features included is large but not all may contribute to the underlying subgroups, we employed
An overview of the sample sizes in each set of analysis is presented in
Simulation results for the validity of permutation
We performed 5-fold cross-validation; the average AUC–ROC is given in
As for the “lite” models which included a reduced set of predictors, the predictive performances in terms of AUC are broadly similar, with estimated AUC–ROC for cohorts A-D of 0.716, 0.818, 0.696, and 0.830, respectively.
The results of other predictive indices are listed in Table S2b in
We also conducted sex-stratified analysis (Table S2a in
We also computed the proportion of cases explained by individuals at the highest
These results showed in general a strong enrichment of cases among those predicted to have high risks, indicating good model discriminatory ability.
Relative risk (RR) comparing subjects in the top and bottom k% of predicted risks and proportion of cases explained by those at top k% of predicted risk.
k (%) | Full model: risk in top k%a,b | Full model: risk in bottom k% | Full model: RR | Full model: proportion of cases explained by top k% | Lite model: risk in top k%a,b | Lite model: risk in bottom k% | Lite model: RR | Lite model: proportion of cases explained by top k% |
Cohort A |
5 | 0.676 | 0.148 | 4.56 | 0.112 | 0.691 | 0.158 | 4.37 | 0.113 |
10 | 0.654 | 0.138 | 4.74 | 0.216 | 0.644 | 0.157 | 4.10 | 0.211 |
20 | 0.579 | 0.145 | 4.00 | 0.382 | 0.581 | 0.153 | 3.79 | 0.381 |
30 | 0.540 | 0.148 | 3.65 | 0.533 | 0.533 | 0.152 | 3.50 | 0.526 |
40 | 0.489 | 0.152 | 3.20 | 0.644 | 0.479 | 0.158 | 3.03 | 0.630 |
50 | 0.443 | 0.166 | 2.67 | 0.730 | 0.439 | 0.170 | 2.59 | 0.720 |
Cohort B |
5 | 0.214 | 0.000 | Infinity | 0.174 | 0.212 | 0.003 | 84.27 | 0.174 |
10 | 0.200 | 0.001 | 158.20 | 0.327 | 0.216 | 0.008 | 28.38 | 0.352 |
20 | 0.171 | 0.008 | 22.42 | 0.562 | 0.188 | 0.008 | 24.59 | 0.618 |
30 | 0.148 | 0.009 | 16.57 | 0.727 | 0.155 | 0.008 | 19.21 | 0.763 |
40 | 0.127 | 0.009 | 14.21 | 0.830 | 0.131 | 0.009 | 14.23 | 0.866 |
50 | 0.111 | 0.010 | 10.94 | 0.916 | 0.111 | 0.011 | 10.37 | 0.912 |
Cohort C |
5 | 0.0201 | 0.0017 | 11.76 | 0.197 | 0.0210 | 0.0013 | 15.88 | 0.207 |
10 | 0.0149 | 0.0021 | 6.98 | 0.293 | 0.0158 | 0.0012 | 12.95 | 0.310 |
20 | 0.0109 | 0.0023 | 4.67 | 0.427 | 0.0118 | 0.0021 | 5.71 | 0.462 |
30 | 0.0090 | 0.0030 | 2.99 | 0.528 | 0.0097 | 0.0027 | 3.57 | 0.573 |
40 | 0.0075 | 0.0033 | 2.27 | 0.590 | 0.0084 | 0.0026 | 3.20 | 0.656 |
50 | 0.0069 | 0.0033 | 2.09 | 0.678 | 0.0074 | 0.0028 | 2.63 | 0.725 |
Cohort D |
5 | 0.0067 | 0.00000 | Infinity | 0.325 | 0.0068 | 0.00000 | Infinity | 0.333 |
10 | 0.0047 | 0.00002 | 218.02 | 0.457 | 0.0047 | 0.00006 | 73.67 | 0.463 |
20 | 0.0033 | 0.00011 | 30.30 | 0.635 | 0.0032 | 0.00009 | 36.75 | 0.616 |
30 | 0.0026 | 0.00014 | 18.74 | 0.746 | 0.0027 | 0.00011 | 23.38 | 0.784 |
40 | 0.0021 | 0.00016 | 13.17 | 0.828 | 0.0022 | 0.00013 | 16.68 | 0.874 |
50 | 0.0018 | 0.00022 | 8.35 | 0.893 | 0.0019 | 0.00015 | 13.03 | 0.929 |
a‘Top k%’ refers to top k% of
b‘Risk in top k%’ refers to the actual probability of the outcome (severe disease or fatality) within the patients belonging to the highest k% of predicted risks.
We also computed the relative risk (RR) of infection or severe disease by comparing individuals at the highest and lowest
We observed large RRs for cohorts B and D, suggesting that the prediction models were able to discriminate individuals at the highest and lowest risks of fatality very well. RRs for cohorts B and D were much larger than those for cohorts A and C, indicating that the model predicted fatality better than severe disease.
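The quantities reported in the table above (risk in the top and bottom k%, RR, and proportion of cases explained by the top k%) can be computed along the following lines. This is a hedged Python sketch with hypothetical toy data; the function name, the rounding of group sizes, and the handling of ties are assumptions rather than the study's actual procedure.

```python
def top_bottom_summary(y_true, y_prob, k_percent):
    """Compare observed risk in the top vs bottom k% of predicted risk."""
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i], reverse=True)
    n_group = max(1, round(len(order) * k_percent / 100))
    top, bottom = order[:n_group], order[-n_group:]
    cases_top = sum(y_true[i] for i in top)
    risk_top = cases_top / n_group
    risk_bottom = sum(y_true[i] for i in bottom) / n_group
    # With no cases in the bottom group the ratio is undefined; report infinity.
    rr = risk_top / risk_bottom if risk_bottom > 0 else float("inf")
    total_cases = sum(y_true)
    explained = cases_top / total_cases if total_cases else 0.0
    return {"risk_top": risk_top, "risk_bottom": risk_bottom,
            "rr": rr, "explained": explained}

# Toy data (hypothetical): 10 individuals sorted by decreasing predicted risk.
y_prob = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
summary_20 = top_bottom_summary(y_true, y_prob, 20)
summary_10 = top_bottom_summary(y_true, y_prob, 10)
print(summary_20)
print(summary_10)
```

An RR of infinity, as in some rows of the table, arises when no cases fall in the bottom k% group.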
As for calibration, please refer to Figures S6 and S7 in
Considering cohort B cases (fatal infections), the optimal solution comprised 3 clusters. Interestingly, the first and third clusters seemed to be markedly different with respect to their risk factor profiles. Mean ShapVal for age was largely negative for the first cluster but highly positive for the other 2 clusters. By contrast, mean ShapVal for WC was markedly higher and positive for the first cluster. The third cluster was characterized by the highest mean ShapVal for age, and higher (positive) ShapVal for mainly cnt_tx, HbA1c, and T2DM. The results suggest that there may exist pathophysiologically distinct subgroups of patients with fatal infection. The first cluster represents a subgroup with younger age but with higher proportion of obesity. The third cluster represents another subgroup with advanced age, more comorbidities, and higher proportion of glucose abnormalities or T2DM. The second cluster is in between.
Results of sparse k-means clustering based on Shapley values (ShapVal) in cohorts A (hospitalized cases) and B (fatal cases). The y-axis indicates the ShapVal and only those risk factors with significant differences (
Here we primarily report the results of the full model as a more complete set of predictors is included. The Shapley dependence plots (ranked by mean absolute ShapVal) of the top 15 features (full model) are shown in
Full ShapVal analysis results on all variables are given in Tables S3a-c in
Shapley value dependence plots of the top 15 risk factors ranked by mean absolute Shapley value (full model) for cohorts A, B, C, and D, respectively. Shapley value (y-axis) is computed on a log-odds scale. Each unit increase in ShapVal corresponds to an odds ratio (OR) of exp(1)=2.72 compared with the baseline. A positive ShapVal indicates an increase in the odds of the outcome and vice versa. CAD: coronary artery disease; COPD: chronic obstructive pulmonary disease; HDL: high-density lipoprotein; RBC: red blood cell; T2DM: type 2 diabetes mellitus.
ShapVal dependence plots of the top 6 risk factors ranked by mean absolute Shapley value (lite model) for cohorts A, B, C, and D, respectively. T2DM: type 2 diabetes mellitus.
Top 10 risk factors ranked by mean absolute Shapley value for cohorts A, B, C, and D (full model).
Risk factor | ShapVal | Permutation P value |
Cohort A |
Age | 0.442 | .002 |
Treatments taken count | 0.093 | .002 |
Cystatin C | 0.088 | .002 |
Waist-to-hip ratio | 0.085 | .002 |
Townsend deprivation index | 0.059 | .004 |
HbA1ca | 0.056 | .002 |
Pulse rate | 0.048 | .002 |
Hypertension | 0.048 | .002 |
Apolipoprotein A | 0.027 | .016 |
HDLb cholesterol | 0.026 | .016 |
Cohort B |
Age | 0.708 | .002 |
Testosterone | 0.069 | .002 |
Treatments taken count | 0.048 | .002 |
Waist circumference | 0.035 | .002 |
RBCc distribution width | 0.027 | .002 |
Cystatin C | 0.024 | .002 |
Townsend deprivation index | 0.023 | .002 |
Pulse rate | 0.019 | .004 |
Systolic blood pressure | 0.016 | .002 |
Lymphocyte percentage | 0.015 | .004 |
Cohort C |
Waist-to-hip ratio | 0.113 | .002 |
Townsend deprivation index | 0.096 | .002 |
Age | 0.088 | .002 |
Treatments taken count | 0.063 | .002 |
Waist circumference | 0.044 | .002 |
Self-report: noncancer count | 0.043 | .002 |
Hypertension | 0.036 | .002 |
Cystatin C | 0.030 | .024 |
T2DMd | 0.030 | .002 |
Apolipoprotein A | 0.024 | .052 |
Cohort D |
Age | 0.519 | .002 |
Townsend deprivation index | 0.136 | .002 |
Waist-to-hip ratio | 0.131 | .002 |
Treatments taken count | 0.115 | .002 |
Waist circumference | 0.110 | .002 |
Cystatin C | 0.096 | .002 |
Testosterone | 0.086 | .002 |
Hypertension | 0.061 | .002 |
RBC distribution width | 0.046 | .002 |
Pulse rate | 0.036 | .006 |
aHbA1c: hemoglobin A1c.
bHDL: high-density lipoprotein.
cRBC: red blood cell.
dT2DM: type 2 diabetes mellitus.
Top 10 risk factors ranked by P value, listing only factors not already among the top 10 by ShapVal, for cohorts A, B, C, and D (full model).
Risk factor | Permutation P value | ShapVal |
Cohort A |
T2DMa | .004 | 0.010 |
Self-report: noncancer | .008 | 0.018 |
Depression | .008 | 0.004 |
CADb | .016 | 0.002 |
Cancer diagnosed by doctor | .026 | 0.000 |
Alcohol intake (occasions) | .028 | 0.002 |
AFc | .028 | 0.000 |
Smoking (current) | .036 | 0.000 |
γ-glutamyltransferase | .046 | 0.021 |
WBCd count | .046 | 0.014 |
Cohort B |
BMI | .002 | 0.015 |
Glucose | .002 | 0.015 |
HbA1ce | .002 | 0.014 |
Weight | .002 | 0.010 |
Mean platelet volume | .002 | 0.009 |
T2DM | .002 | 0.007 |
Sleep duration | .002 | 0.006 |
T1DMf | .002 | 0.003 |
Cognitive impairment | .002 | 0.003 |
CAD | .002 | 0.003 |
Cohort C |
COPDg | .002 | 0.015 |
Depression | .002 | 0.009 |
Cognitive impairment | .002 | 0.007 |
CAD | .004 | 0.017 |
Ethnic (Asian/Asian British) | .004 | 0.007 |
Heart failure | .004 | 0.007 |
AF | .004 | 0.006 |
Smoking (previous) | .006 | 0.015 |
Stroke | .012 | 0.001 |
Ethnic (Black/Black British) | .020 | 0.001 |
Cohort D |
T2DM | .002 | 0.026 |
Cognitive impairment | .002 | 0.024 |
COPD | .002 | 0.021 |
AF | .002 | 0.016 |
Heart failure | .002 | 0.007 |
CAD | .002 | 0.008 |
Ethnic (Black/Black British) | .004 | 0.004 |
Stroke | .004 | 0.002 |
Alcohol drinker (current) | .004 | 0.001 |
Smoking (previous) | .006 | 0.003 |
aT2DM: type 2 diabetes mellitus.
bCAD: coronary artery disease.
cAF: atrial fibrillation.
dWBC: white blood cell.
eHbA1c: hemoglobin A1c.
fT1DM: type 1 diabetes mellitus.
gCOPD: chronic obstructive pulmonary disease.
Top 5 risk factors ranked by mean absolute Shapley value for cohorts A, B, C, and D (lite model).
Risk factor | ShapVal | Permutation P value |
Cohort A |
Age | 0.496 | .002 |
Treatments taken count | 0.121 | .002 |
Waist circumference | 0.085 | .002 |
Male | 0.058 | .002 |
Self-report: noncancer count | 0.054 | .004 |
Cohort B |
Age | 0.721 | .002 |
Treatments taken count | 0.079 | .014 |
Waist circumference | 0.071 | .040 |
Male | 0.048 | .010 |
BMI | 0.034 | .242 |
Cohort C |
Waist circumference | 0.153 | .002 |
Age | 0.120 | .002 |
Treatments taken count | 0.102 | .002 |
Self-report: noncancer count | 0.064 | .002 |
T2DMa | 0.050 | .002 |
Cohort D |
Age | 0.056 | .002 |
Waist circumference | 0.248 | .002 |
Treatments taken count | 0.154 | .002 |
Male | 0.098 | .002 |
BMI | 0.043 | .036 |
aT2DM: type 2 diabetes mellitus.
Top 5 risk factors ranked by P value for cohorts A, B, C, and D (lite model).
Risk factor | Permutation P value | ShapVal |
Cohort A |
T2DMa | .002 | 0.047 |
Smoking (current) | .004 | 0.026 |
Depression | .016 | 0.015 |
Alcohol drinker (current) | .020 | 0.013 |
CADb | .022 | 0.010 |
Cohort B |
T2DM | .006 | 0.027 |
Cognitive impairment | .006 | 0.015 |
T1DMc | .020 | 0.009 |
Bipolar | .024 | 0.006 |
AFd | .036 | 0.011 |
Cohort C |
COPDe | .002 | 0.024 |
Ethnic (Asian/British Asian) | .002 | 0.016 |
Cognitive impairment | .002 | 0.008 |
Male | .004 | 0.049 |
CAD | .004 | 0.023 |
Cohort D |
T2DM | .002 | 0.043 |
COPD | .002 | 0.039 |
Cognitive impairment | .002 | 0.029 |
AF | .002 | 0.024 |
Ethnic (Black/Black British) | .002 | 0.016 |
aT2DM: type 2 diabetes mellitus.
bCAD: coronary artery disease.
cT1DM: type 1 diabetes mellitus.
dAF: atrial fibrillation.
eCOPD: chronic obstructive pulmonary disease.
As for interaction analyses, top results are presented in
Note that ShapVal is measured on the log-odds scale. Every unit increase of ShapVal corresponds to an odds ratio of exp(1)=2.72. Positive ShapVal indicates increase in the odds of outcome and vice versa.
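The log-odds interpretation above amounts to a short piece of arithmetic, sketched here in Python. The helper names and the example baseline probability are hypothetical; only the exp(1)=2.72 conversion comes from the text.

```python
import math

def shap_to_odds_ratio(shap_value):
    """Odds ratio implied by a ShapVal measured on the log-odds scale."""
    return math.exp(shap_value)

def shift_probability(baseline_prob, shap_value):
    """Probability after multiplying the baseline odds by exp(ShapVal)."""
    odds = baseline_prob / (1 - baseline_prob) * math.exp(shap_value)
    return odds / (1 + odds)

print(round(shap_to_odds_ratio(1.0), 2))   # one unit of ShapVal
print(round(shap_to_odds_ratio(-1.0), 2))  # negative ShapVal lowers the odds
print(round(shift_probability(0.10, 1.0), 3))  # hypothetical 10% baseline risk
```

Because the scale is multiplicative in the odds, a ShapVal of 0 leaves the odds unchanged, and the effect on the absolute probability depends on the baseline risk.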
Top interacting pairs of variables ranked by ShapVal (full model).
Risk factor 1 | Risk factor 2 | Value |
Cohort A |
Waist-to-hip ratio | Age | 150 |
Treatments taken count | Age | 149 |
HDLa cholesterol | Age | 86 |
Age | Hypertension | 85 |
Cystatin C | Age | 84 |
Cohort B |
Testosterone | Age | 195 |
Waist circumference | Age | 95 |
BMI | Age | 82 |
Treatments taken count | Age | 63 |
Pulse rate | Age | 57 |
Cohort C |
Waist-to-hip ratio | Age | 709 |
Waist-to-hip ratio | Treatments taken count | 494 |
Townsend deprivation index | Treatments taken count | 481 |
Townsend deprivation index | Waist-to-hip ratio | 450 |
Albumin | Waist-to-hip ratio | 407 |
Cohort D |
Waist circumference | Age | 859 |
Testosterone | Age | 780 |
Townsend deprivation index | Age | 725 |
Waist-to-hip ratio | Age | 603 |
Age | Hypertension | 585 |
aHDL: high-density lipoprotein.
ShapVal interaction plots of the full model for the top 4 interacting pairs of cohorts A, B, C, and D, respectively.
The top 5 contributing features by ShapVal included age, number of medications received (cnt_tx), cystatin C, TDI, and WHR, followed by HbA1c. Higher levels of these risk factors generally lead to higher disease severity among the infected. Interestingly, Shapley dependence plots revealed potential
If we consider the “
Regarding interactions between variables, most of the top interacting pairs involved age (
The top 5 contributing variables by ShapVal included age, testosterone (which may reflect the effect of gender), cnt_tx, WC, and red blood cell distribution width (RDW), which were followed by cystatin C, TDI, pulse rate, systolic blood pressure (SBP), and percentage of lymphocytes. Again, certain nonlinear and “threshold” effects appeared to be present for many top-ranked features. For age, the risk for mortality was more marked beyond 65 years. Higher levels of all the above risk factors (RFs) (except percentage of lymphocytes, which showed a U-shaped relationship) were associated with higher mortality, but the effects were nonlinear. Regarding the top results based on PermImp, 8 out of 10 predictors ranked high by ShapVal also had the lowest
Variable importance based on gain yielded similar results (Figures S5a and S5b in
Based on ShapVal, WHR was the top contributing variable and WC was ranked fifth, suggesting that central obesity may be a stronger predictor for severe disease than BMI alone (BMI was ranked 13th by ShapVal). As before, TDI and age ranked near the top. Somewhat unexpectedly, age showed a U-shaped curve, suggesting the lowest risk in the 65-70 age group. Note that model C may also capture RFs related to susceptibility to infection. It is possible, for instance, that younger individuals had higher risks of exposure due to work or social interactions. Among the top 10, two reflect overall comorbidity burden (cnt_tx and cnt_noncancer). Increased cystatin C and lower apolipoprotein A were also associated with higher susceptibility to severe infections, and HT and T2DM were also among the top 10. Using PermImp as the ranking criterion, COPD, depression, and dementia were observed to have the lowest permutation
The interaction plot (
Based on ShapVal, age was the top feature, followed by TDI, WHR, number of drugs taken, and WC. Other top features included cystatin C, testosterone, HT, RDW, and pulse rate. Higher levels of these features (or presence of comorbidity) generally lead to higher mortality risks. Based on PermImp, T2DM, dementia, and COPD were the most highly ranked (ignoring features that are already listed in the top 10 by ShapVal).
Shapley interaction analysis suggested that the top interacting pairs involved age and some of the top contributing features (
As for important variables from the sex-stratified analysis, the top variables were similar which included age, WC/WHR, cystatin C, number of medications received, socioeconomic status (as reflected by TDI), among others (Table S3c in
Overall, the PermImp measure tends to rank binary traits higher than ShapVal does. Of note, several diseases were consistently ranked near the top by PermImp across the 4 cohorts (though some were not highlighted by ShapVal), including CAD, atrial fibrillation (AF), T2DM, and dementia, which were among the top 10 in at least 3 cohorts in
Here we highlight top contributing features for the “lite” models consisting of 27 predictors (Table S3b in
Using PermImp as the ranking criterion (with ties broken by ShapVal), age, cnt_tx, and WC were still highly ranked, appearing among the top 5 in at least 3 cohorts (Table S3b in
As discussed above, we primarily focused on the XGboost ML model as it can capture nonlinear relationships and interactions between predictors. Here we also performed our analyses with logistic regression (LR) for comparison. For prediction performance (Table S7a in
While prediction is one of our goals, uncovering important contributing factors and their relationship to COVID-19 severity is a major objective of this study. In fact, the latter is considered our primary objective when considering the analyses within patients infected (cohorts A and B). As LR assumes linearity on a log-odds scale, it could not capture nonlinear relationships or “threshold effects” of variables on disease severity.
We also showed individual Shapley decision plot for 3 individuals with the highest, median, and lowest predicted risks in each cohort (Figure S10 in
To facilitate further research and studies on risk-prediction models, we also constructed an online risk calculation tool (for “lite” model) [
In this study we have performed 4 sets of analysis, predicting severe or fatal COVID-19 infection among affected individuals or in the population. We observed good predictive power from the XGboost ML models, especially for the prediction of mortality. We also identified risk factors for increased severity or mortality, and uncovered possible nonlinear effects of some features, which may be clinically relevant and shed light on disease mechanisms.
In general, our prediction models achieved reasonably good predictive power. The models predicted mortality (AUC 81%-83%) better than severity of disease. As discussed earlier, in the absence of better alternatives, hospitalization (test performed as inpatient) was used as a proxy for severity. However, reasons or criteria for hospitalization may vary across individuals or hospitals, and some tests may be performed in inpatients for surveillance or due to other confirmed/suspected cases in the ward. As a result, hospitalized patients could also include some with mild or moderate illnesses, which may also impair the prediction performance. By contrast, mortality from infection is a more objective outcome. Other studies (eg, [
By assessing the proportion of cases explained by those at the top k% of predicted risks, we observed in general a strong enrichment of cases among those with high predicted risks, indicating good discriminative ability of the models and suggesting that targeted prevention or treatment could focus on the highest-risk group. A similar strong enrichment was also observed for the lite model with fewer predictors. We also observed large RRs of the actual outcomes when comparing individuals at high and low percentiles of predicted risks. For example, for the prediction of mortality among the infected, the RR was up to 158 (~20% vs 0.1%) when comparing the top and bottom deciles using the full model, and 28.38 (~21% vs 0.8%) when considering the simplified model. These results suggest that the prediction models may be used for risk stratification and for prioritizing those at higher risk of deterioration for early medical attention or admission. As the “lite” model only relies on demographic data and information on comorbidities, risk stratification may be conducted even at the start of the illness without other blood or imaging results.
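The two summaries described above (the share of cases captured by the top k% of predicted risks, and the RR between the top and bottom deciles) can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names are hypothetical, and the toy risks and outcomes are invented for demonstration and are not UKBB data.

```python
from typing import List, Sequence, Tuple


def top_fraction_capture(risks: Sequence[float],
                         outcomes: Sequence[int],
                         fraction: float) -> float:
    """Share of all observed cases that fall in the top `fraction` of predicted risks."""
    # Rank individuals from highest to lowest predicted risk.
    order = sorted(range(len(risks)), key=lambda i: risks[i], reverse=True)
    n_top = max(1, round(fraction * len(risks)))
    cases_in_top = sum(outcomes[i] for i in order[:n_top])
    total_cases = sum(outcomes)
    return cases_in_top / total_cases if total_cases else float("nan")


def decile_relative_risk(risks: Sequence[float],
                         outcomes: Sequence[int]) -> Tuple[float, float, float]:
    """Observed event rates in the top and bottom risk deciles, and their ratio (RR)."""
    # Rank individuals from lowest to highest predicted risk.
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    size = max(1, len(risks) // 10)
    bottom, top = order[:size], order[-size:]
    rate_top = sum(outcomes[i] for i in top) / size
    rate_bottom = sum(outcomes[i] for i in bottom) / size
    # RR is undefined when the bottom decile has no events; report infinity.
    rr = rate_top / rate_bottom if rate_bottom > 0 else float("inf")
    return rate_top, rate_bottom, rr


# Toy data (invented): 20 individuals with predicted risks 0.05, 0.10, ..., 1.00
# and observed binary outcomes (1 = event).
toy_risks: List[float] = [i / 20 for i in range(1, 21)]
toy_outcomes: List[int] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
                           0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

capture_top20 = top_fraction_capture(toy_risks, toy_outcomes, 0.20)
rate_top, rate_bottom, rr = decile_relative_risk(toy_risks, toy_outcomes)

print(f"Cases captured by top 20% of predicted risks: {capture_top20:.2f}")
print(f"Top-decile rate {rate_top:.2f} vs bottom-decile rate {rate_bottom:.2f}; RR = {rr:.2f}")
```

In this toy example the top 20% of predicted risks contain half of all cases, and the top-decile event rate is twice that of the bottom decile. With rare outcomes the bottom-decile rate can be very small or zero, so the RR estimate may be unstable; the sketch returns infinity in the zero-event case rather than attempting a correction.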
A number of studies have focused on prediction of severity/mortality of COVID-19 (corresponding to our prediction in cohorts A and B) and were reviewed in [
For cohorts C and D, the general population (with no known infection) was treated as “controls.” Compared with cohorts A and B, the identified risk factors may also increase the overall susceptibility to infection. The AUC for cohort C (severe/fatal disease) is about 70% but is much higher when mortality is considered as the outcome (AUC=~83%). To our knowledge, there are still very few predictive models built at a
Another very recent study (“QCOVID” study) from the UK [
A few other studies have investigated risk factors (especially comorbidities) for COVID-19 infection in the UKBB. For example, Atkins et al [
We performed LR for comparison with XGboost on cohorts A and B. The differences in predictive performance appeared to be small. The number of cases (especially fatalities) is relatively small in this data set, and this may limit the predictive performance of more complex models such as XGboost, which may be expected to improve with larger case numbers. An important advantage of XGboost over LR is that it can detect nonlinear relationships. In addition, XGboost may handle multicollinearity better than LR. Given 2 highly correlated features A and B, each tree will usually use only 1 of them and, as the trees are built sequentially, the model will usually focus on one feature rather than both [
Owing to space limitations, we highlight only the top 5-10 risk factors ranked by ShapVal here. Across the 4 cohorts, age and cardiometabolic risk factors dominate the top-ranked risk factors. Age and WHR/WC were ranked among the top 5 across all 4 cohorts. The number of medications taken (cnt_tx) was among the top 5 across all cohorts, and cystatin C (reflecting renal function) was among the top 10 across all cohorts. HbA1c was a top 10 risk factor for cohort A, and T2DM was also highly ranked across multiple cohorts, especially when PermImp was considered. TDI (reflecting socioeconomic status) was among the top 10 in most cohorts. As described above, results from the “lite” models were generally in line with those from the full models, with age, WC, and cnt_tx consistently ranked as the top 3.
Obesity has been observed to be a major risk factor for susceptibility or severity of infection in the UKBB [
Another major risk factor we identified is IRF, as reflected by elevated risks with raised urea and cystatin C. Several studies also suggested that IRF increases risk of mortality [
Other potential risk factors, briefly highlighted below, have been less frequently reported. As some were listed only once or twice among the top 10, and the ShapVal of some was close to that of other risk factors, further replication is required. For example, testosterone was top ranked by XGboost (for mortality), with higher levels associated with increased risk. This may partially reflect that males are at a higher risk of fatal infections, but it remains to be studied whether testosterone itself is involved in the pathophysiology of severe COVID-19, as the ML model chose this variable instead of sex. Studies have suggested that elevated or reduced testosterone levels may be associated with a more severe clinical course [
Among the diseases being included as covariates, T2DM is most consistently ranked among the top, no matter whether full or lite models are used, and regardless of ranking by ShapVal or PermImp (
We note that the simplified (lite) prediction model has very similar predictive performance (as assessed by AUC) to the “full” model with a larger panel of predictors. However, features associated with the outcome may not always improve predictive power. AUC is relatively insensitive to changes in predictive performance when additional risk factors are added [
For example, Pencina et al [
Nevertheless, it is still valuable to study variable importance (eg, ShapVal) from the ML model as it may shed light on the pathophysiology of the disease
Some limitations have been discussed above; for example, the use of hospitalization as a proxy for severity, and that the predictors were recorded prior to the pandemic. We briefly discuss other limitations here. The UKBB is a very large-scale study with detailed phenotypic data, but still the number of fatal cases is relatively small. In addition, the UKBB is not entirely representative of the UK population, as participants tend to be healthier and wealthier overall [
Regarding the ML model, XGboost is a state-of-the-art method that has been consistently shown to be the best or one of the best ML methods in supervised learning tasks/competitions [
In conclusion, we identified a number of baseline risk factors for severe/fatal infection by an ML approach. Shapley dependence plots revealed possible nonlinear and “threshold” effects of risk factors on the risks of infection or severity. To summarize, age, central obesity, IRF, multiple comorbidities, cardiometabolic abnormalities or disorders (especially T2DM), and low socioeconomic status may predispose to poorer outcomes, among other risk factors. The prediction models (of cohorts C/D) may be useful at a population level to identify those susceptible to developing severe/fatal infections, thereby facilitating targeted prevention strategies. Further replication and validation in independent cohorts are required to confirm our findings.
Supplementary tables S1-S8.
Simulation experiment to verify the validity of permutation testing for rare outcomes.
Supplementary figures.
atrial fibrillation
area under the precision–recall curve
area under the receiving-operating characteristic curve
coronary artery disease
chronic obstructive pulmonary disease
expected calibration error
high-density lipoprotein
hypertension
impaired renal function
logistic regression
Matthews correlation coefficient
maximum calibration error
multiple imputation by chained equation
machine learning
red blood cell distribution width
risk factor
relative risk
systolic blood pressure
type 1 diabetes mellitus
type 2 diabetes mellitus
Townsend deprivation index
UK Biobank
waist circumference
waist-to-hip ratio
This work was supported partially by the Lo Kwee Seong Biomedical Research Fund from The Chinese University of Hong Kong, the KIZ-CUHK Joint Laboratory of Bioresources and Molecular Research of Common Diseases, the Kunming Institute of Zoology, and The Chinese University of Hong Kong. We thank Professor Pak Sham for support on data access and analyses, and Ms Qiu Jinghong for formatting and editing of the manuscript.
H-CS contributed to the conception and design, as well as supervised the study; H-CS and KC-YW were responsible for the analytic methodology. KC-YW (main), YX, LY, and H-CS performed data analysis. H-CS and KC-YW performed data interpretation. H-CS, with input from KC-YW, drafted the manuscript.
None declared.