Using Machine Learning Techniques to Predict Factors Contributing to the Incidence of Metabolic Syndrome in Tehran: Cohort Study

Background: Metabolic syndrome (MetS), a major contributor to cardiovascular disease and diabetes, is considered to be among the most common public health problems worldwide.
Objective: We aimed to identify and rank the most important nutritional and nonnutritional factors contributing to the development of MetS using a data-mining method.
Methods: This prospective study was performed on 3048 adults (aged ≥20 years) who participated in the fifth follow-up examination of the Tehran Lipid and Glucose Study and were followed for 3 years. MetS was defined according to the modified definition of the National Cholesterol Education Program/Adult Treatment Panel III. Variable importance was estimated on the training set with a random forest model to determine the factors contributing most to the development of MetS.
Results: Among the 3048 participants, 701 (22.9%) developed MetS during the study period. The mean age of the participants was 44.3 years (SD 11.8). The total incidence rate of MetS was 229.9 (95% CI 278.6-322.9) per 1000 person-years and the mean follow-up time was 40.5 months (SD 7.3). The incidence of MetS was significantly (P<.001) higher in men than in women (27% vs 20%). Those who developed MetS were more likely to be older, married, and diabetic, and had lower levels of education and a higher BMI (P<.001). The percentage of hospitalized patients was higher among those with MetS than among healthy people, although this difference was statistically significant only in women (P=.02). Based on the variable importance and multiple logistic regression analyses, the most important determinants of MetS were history of diabetes (odds ratio [OR] 6.3, 95% CI 3.9-10.2, P<.001), BMI (OR 1.2, 95% CI 1.0-1.2, P<.001), age (OR 1.0, 95% CI 1.0-1.03, P<.001), female gender (OR 0.5, 95% CI 0.38-0.63, P<.001), and dietary monounsaturated fatty acid (OR 0.97, 95% CI 0.94-0.99, P=.04).
Hosseini-Esfahani et al. JMIR Public Health Surveill 2021 | vol. 7 | iss. 9 | e27304 | https://publichealth.jmir.org/2021/9/e27304


Introduction
Metabolic syndrome (MetS), a major contributor to cardiovascular disease and diabetes, is considered to be among the most common public health problems worldwide [1]. According to the World Health Organization and the International Diabetes Federation, MetS is defined as the simultaneous occurrence of at least three of the following five medical conditions: abdominal obesity, high blood pressure, hyperglycemia, high triglyceride levels, and low high-density lipoprotein cholesterol (HDL-C) levels [2].
The incidence of MetS is estimated to be 34% in the United States [3], 12%-37% in Asian countries [4], and 12%-26% in European populations [5]. In Iran, the overall pooled prevalence and incidence rate of MetS among the general population were reported to be 0.26 (95% CI 0.25-0.29) and 97.96 per 1000 person-years (95% CI 75.98-131.48), respectively, and were higher in women living in urban areas and in men living in rural areas.
The overall pooled prevalence of MetS was higher in urban areas compared to rural areas (0.39 vs 0.26) and the pooled prevalence of MetS was higher in women than in men (0.34 vs 0.22) [6].
According to previous studies, the etiology of MetS involves several risk factors, including abdominal obesity, insulin resistance, impaired glucose tolerance, hypertension, genetic factors, psychosocial stressors, and nutritional and dietary factors [7-11]. Previous studies have often investigated predictive factors using classical approaches that do not rank the explanatory variables, although some risk or protective factors contribute far more to the outcome than others. One of the simplest and most common ranking techniques is random forest (RF), a data-mining approach. Its main features are simplicity and interpretability, flexibility in handling a large number of predictor variables, the ability to work with large sample sizes, and identification of the variables most important for predicting the outcome. The RF model is also useful when predictor variables are nonlinearly related to the disease, because it imposes no assumption or constraint on the form of the relationships [12-14].
Considering the high prevalence of MetS and its importance in cardiovascular disease, identifying and ranking the most important nutritional and nonnutritional factors in the occurrence of MetS is an essential analysis with respect to public health. Data-mining methods are powerful tools for predicting outcomes while retaining interpretability. Hence, we aimed to identify and rank the most important nutritional and nonnutritional factors in the occurrence of MetS among the general population of Tehran, Iran, using the RF data-mining method.

Design and Participants
This prospective study (Code: IR.UMSHA.REC.1398.864) was performed under the framework of the Tehran Lipid and Glucose Study, a population-based study to determine risk factors for noncommunicable diseases in a sample of residents of District 13 of the Tehran metropolis [15,16]. The first examination survey was performed from 1999 to 2001 on 15,005 individuals aged ≥3 years. Subsequently, follow-up examinations were performed every 3 years (2002-2005, 2005-2008, 2008-2011, 2011-2014, and 2015-2018) to identify recently developed diseases (see Multimedia Appendix 1 for more details on the survey).
In the fifth follow-up examination (2011-2014), 4204 adults (aged ≥20 years) participated. These participants completed the Food Frequency Questionnaire (FFQ), and their dietary data were available. The exclusion criteria in this study were as follows: individuals diagnosed with MetS (n=635); people with missing data regarding MetS status (n=61); no follow-up (n=434); stroke, thyroid, or cancer complications (n=18); and following a specific dietary regimen (n=8). Finally, 3048 adults without MetS at baseline were included in the study (Figure 1). All participants provided written informed consent. The study was performed in adherence with the Declaration of Helsinki. The ethics committee of the Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences approved the study protocol.
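The participant flow described above can be checked with simple arithmetic. The short Python sketch below uses only the counts given in the text; the variable names and dictionary labels are illustrative and not part of the study.

```python
# Counts are taken from the paragraph above; labels are illustrative only.
screened = 4204
exclusions = {
    "MetS at baseline": 635,
    "missing MetS status": 61,
    "no follow-up": 434,
    "stroke, thyroid, or cancer complications": 18,
    "specific dietary regimen": 8,
}

remaining = screened
for reason, n in exclusions.items():
    remaining -= n  # apply each exclusion in the order listed in the text
    print(f"after excluding {reason} (n={n}): {remaining}")

total_excluded = sum(exclusions.values())
```

Applying the five exclusions in sequence leaves the 3048 adults included in the analysis.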

Outcomes
MetS was defined according to the modified definition of the National Cholesterol Education Program/Adult Treatment Panel III [17,18] as meeting at least three of the following criteria simultaneously: (1) abdominal obesity (waist circumference >90 cm in both genders); (2) serum HDL-C level <40 mg/dL in men and <50 mg/dL in women or taking HDL-C-elevating drugs; (3) hypertension (systolic blood pressure ≥130 mmHg, diastolic blood pressure ≥85 mmHg, or taking antihypertensive drugs); (4) hyperglycemia (fasting blood glucose ≥100 mg/dL or taking hypoglycemic drugs); and (5) hypertriglyceridemia (serum triglyceride level ≥150 mg/dL or taking triglyceride-lowering drugs).
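The definition above is a rule over five components, which can be written as a small classifier. The following Python sketch is an illustration only: the `Exam` record, its field names, and the example values are hypothetical, while the cutoffs are copied from the text.

```python
from dataclasses import dataclass


@dataclass
class Exam:
    """Hypothetical record for one examination; field names are assumptions."""
    sex: str            # "M" or "F"
    waist_cm: float
    hdl_mg_dl: float
    sbp_mmhg: float
    dbp_mmhg: float
    fbg_mg_dl: float
    tg_mg_dl: float
    on_hdl_drug: bool = False
    on_bp_drug: bool = False
    on_glucose_drug: bool = False
    on_tg_drug: bool = False


def mets_components(e: Exam) -> dict:
    """Evaluate the five components with the cutoffs stated in the text."""
    hdl_cutoff = 40 if e.sex == "M" else 50
    return {
        "abdominal_obesity": e.waist_cm > 90,
        "low_hdl": e.hdl_mg_dl < hdl_cutoff or e.on_hdl_drug,
        "hypertension": e.sbp_mmhg >= 130 or e.dbp_mmhg >= 85 or e.on_bp_drug,
        "hyperglycemia": e.fbg_mg_dl >= 100 or e.on_glucose_drug,
        "hypertriglyceridemia": e.tg_mg_dl >= 150 or e.on_tg_drug,
    }


def has_mets(e: Exam) -> bool:
    """MetS is present when at least three components are met."""
    return sum(mets_components(e).values()) >= 3


# Made-up example values, not study data.
man = Exam("M", waist_cm=95, hdl_mg_dl=38, sbp_mmhg=125, dbp_mmhg=80,
           fbg_mg_dl=105, tg_mg_dl=140)
woman = Exam("F", waist_cm=85, hdl_mg_dl=48, sbp_mmhg=120, dbp_mmhg=75,
             fbg_mg_dl=92, tg_mg_dl=130)
```

In this example the man meets three components (abdominal obesity, low HDL-C, hyperglycemia) and is classified as having MetS, whereas the woman meets only the low HDL-C criterion.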

Risk Factor Assessment
In this study, the FFQ was used to estimate usual food intake. The FFQ is a valid and reliable tool covering 147 food items (Multimedia Appendix 2) [18]. Trained nutritionists helped the participants to complete the questionnaires through face-to-face interviews. The usual average portion size of each food item was explained to each participant, and the frequency of consumption was recorded on a daily, weekly, or monthly basis [18,19]. Portion sizes were converted to grams using household measures. Due to the incompleteness of the Iranian food composition table, the United States Department of Agriculture food composition table was used to analyze foods in terms of their macro- and micronutrients [20,21]. A literature review was performed to select nutrients relevant to MetS [22-24].
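The conversion from reported frequency and portion size to a daily intake can be sketched as follows. This is a minimal Python illustration, assuming 7 days per week and 30 days per month; these conversion factors, the function names, and the example items are assumptions and are not taken from the study protocol.

```python
# Assumed conversion factors (not from the paper): 7 days/week, 30 days/month.
DAYS_PER_PERIOD = {"day": 1, "week": 7, "month": 30}


def daily_grams(portion_g: float, times: float, period: str) -> float:
    """Average grams per day for one food item."""
    return portion_g * times / DAYS_PER_PERIOD[period]


def total_daily_intake(items) -> float:
    """Sum the daily grams over (portion_g, times, period) tuples."""
    return sum(daily_grams(*item) for item in items)


# Hypothetical answers: 150 g three times a week, 30 g once a day,
# and 200 g twice a month.
items = [(150, 3, "week"), (30, 1, "day"), (200, 2, "month")]
intake = total_daily_intake(items)
```

Nutrient intakes would then follow by multiplying each item's daily grams by its composition values from the food composition table.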
Weight was measured to the nearest 100 g using digital scales (Seca, Hamburg, Germany) while subjects were minimally clothed and not wearing shoes. Height was measured to the nearest 0.5 centimeter using a stadiometer while the subjects were in a standing position, with their shoulders in normal alignment and without shoes. Information on age, gender, marital status (single, divorced, widowed), history of hospitalization in the previous 3 months, history of cancer, education (primary, intermediate, high school, and academic education), and smoking (never smoked, past smoker, current smoker) was collected using a general information questionnaire.

Statistical Analysis
The χ² test and t test were applied to explore differences in qualitative and quantitative variables between groups, respectively. Since the data-mining approach cannot reveal the direction of the association between each variable and the outcome, multiple logistic regression was used to estimate the adjusted effect of the variables. Variables were chosen by backward selection, with a P value threshold of .20 for removal from the model. R software (version 3.6.1) with the randomForest and caret packages was used for data analysis.
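As an illustration of the χ² test mentioned above, the following Python sketch computes the Pearson statistic for a 2×2 table without continuity correction. The analysis in this study was run in R; the function name and the counts below are hypothetical and chosen only to mimic a 27% vs 20% comparison.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its P value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, NOT the study data:
# group 1: 270 cases and 730 noncases; group 2: 200 cases and 800 noncases.
chi2, p_value = chi_square_2x2(270, 730, 200, 800)
print(f"chi2 = {chi2:.2f}, P = {p_value:.5f}")
```

With these made-up counts the statistic is about 13.6 and the P value falls below .001, the same order of significance as reported for the comparison between men and women.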

RF Analysis
RF, proposed by Leo Breiman [25], is an ensemble learning method that grows many classification trees. Each tree is constructed from a random sample drawn with replacement (a bootstrap sample) from the original training dataset. At each node, the algorithm searches only a random subset of the input variables to determine the best split. Finally, RF assigns the class that receives the most votes over all the trees in the forest [25]. RF has been reported to outperform other machine-learning methods such as support vector machine, artificial neural network, and k-nearest neighbor [26-28].
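The three ingredients described above (bootstrap sampling, a random subset of variables at each split, and majority voting) can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the randomForest implementation used in this study: each "tree" is a single-split stump, so the random variable subset is drawn once per tree, and the toy data, function names, and parameter values are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X, y, feature_ids):
    """Find the best single split among the candidate features."""
    majority = Counter(y).most_common(1)[0][0]
    best = None  # (weighted impurity, feature, threshold, left label, right label)
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no usable split: always predict the majority class
        return lambda row: majority
    _, j, t, left_label, right_label = best
    return lambda row: left_label if row[j] <= t else right_label


def fit_forest(X, y, n_trees=25, mtry=1, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and mtry random features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feats = rng.sample(range(p), mtry)           # random subset of variables
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return trees


def predict(trees, row):
    """Return the class with the most votes over all trees."""
    return Counter(tree(row) for tree in trees).most_common(1)[0][0]


# Toy data (hypothetical): columns are BMI and age; label 1 means MetS.
X = [[22, 30], [24, 35], [23, 28], [25, 40], [31, 55], [33, 60], [30, 52], [35, 65]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
```

A full RF grows deep trees and redraws the variable subset at every node; the stump version keeps the example short while preserving the sampling and voting logic.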
Moreover, although most machine-learning classifiers are useful for classification, they do not indicate which variables matter most in the derived classifier. RF, in contrast, provides variable importance measures that can be used in model interpretation [26]. The most common measures are the mean decrease in accuracy and the mean decrease in the Gini index [26,29].
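The idea behind the mean decrease in accuracy can be illustrated by permuting one variable at a time and recording how much the accuracy drops. The Python sketch below is a simplification: RF implementations typically compute this measure tree by tree on out-of-bag samples, whereas here it is computed for an arbitrary prediction function on a fixed dataset. The classifier, data, and function names are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Proportion of rows classified correctly."""
    return sum(predict(row) == yi for row, yi in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between variable j and the outcome
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy classifier (hypothetical): predicts MetS from BMI only and ignores age.
def toy_predict(row):
    return int(row[0] >= 30)


X = [[22, 30], [24, 35], [23, 28], [25, 40], [31, 55], [33, 60], [30, 52], [35, 65]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
importance = permutation_importance(toy_predict, X, y)
```

Because the toy classifier ignores the second column, shuffling it never changes the predictions and its importance is exactly zero, while shuffling the first column lowers the accuracy.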

Evaluation Criteria
Our dataset consisted of 2259 adults (after removing records with missing data), divided into training and test sets. We randomly chose 70% of the data as the training set and the remaining 30% as the test set. The RF classifier was trained using the training dataset, and the test dataset was used to evaluate its performance. The evaluation criteria were sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), negative likelihood ratio (LR-), and positive likelihood ratio (LR+) (see Multimedia Appendix 3).
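The split and the evaluation criteria listed above can be written out as follows. This Python sketch is illustrative only: the seed, the simple shuffle-based split (the caret package may partition differently), and the confusion-matrix counts are assumptions and are not results from this study.

```python
import random


def train_test_split_ids(n, train_fraction=0.7, seed=42):
    """Randomly split record indices 0..n-1 into training and test sets."""
    ids = list(range(n))
    random.Random(seed).shuffle(ids)
    cut = int(train_fraction * n)
    return ids[:cut], ids[cut:]


def diagnostic_metrics(tp, fp, fn, tn):
    """Evaluation criteria computed from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_plus": sensitivity / (1 - specificity),
        "lr_minus": (1 - sensitivity) / specificity,
    }


train_ids, test_ids = train_test_split_ids(2259)

# Hypothetical confusion matrix, NOT the study results.
metrics = diagnostic_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With 2259 records, a 70% cut gives 1581 training and 678 test records. Note that LR+ is undefined when specificity equals 1 and LR- is undefined when specificity equals 0; the sketch does not guard against these cases.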

Baseline Characteristics
The dataset included 3048 adults, 701 (22.9%) of whom developed MetS and 2347 (77.1%) of whom did not develop MetS. The mean age of the participants at baseline was 44.3 years (SD 11.8). The total MetS incidence rate was 229.98 (95% CI 278.6-322.9) per 1000 person-years. The incidence of MetS was significantly higher in men than in women (27% vs 20%). In both genders, those affected by MetS were older, more often married, more often had diabetes, and had a lower level of education (P<.001) than their counterparts. In men, smokers were more frequently affected by MetS (P=.05). The percentage of hospitalized subjects was higher among those with MetS than among healthy people, although this difference was statistically significant only in women (P=.02) (Table 1).
The distribution of the characteristics of subjects in the training and test datasets is presented in Table 2. The results showed no statistically significant differences between the training and test sets.
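The 229.98 figure reported in this section can be reproduced from the counts given above. The sketch below assumes that it is the number of new cases divided by the number of participants, scaled to 1000; this derivation is an assumption not stated in the text, and a person-time rate would instead require the total person-years of follow-up.

```python
# Counts from the Baseline Characteristics paragraph.
cases = 701
cohort = 3048

# Assumed derivation: new cases per 1000 participants over the follow-up period.
per_1000 = 1000 * cases / cohort
percent = 100 * cases / cohort

print(f"{per_1000:.2f} per 1000 participants ({percent:.1f}%)")
```

The result, about 229.99 per 1000 participants, agrees with the reported 229.98 up to truncation of the last digit.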

RF Model
The variable importance obtained from the training set using RF is presented in Table 3.
We obtained an overall out-of-bag correct classification of 98.67% (Table 4). The corresponding class-specific proportions for subjects with and without MetS were 99.24% and 96.55%, respectively. Finally, partial plots provided the marginal effect of predictors on MetS (Multimedia Appendix 5).

Principal Findings
In this prospective study, the total incidence rate of MetS was 229.98 per 1000 person-years. The most important determinants of MetS were a history of diabetes, increased BMI, older age, male gender, and low dietary monounsaturated fatty acid intake.
In this study, diabetes was identified as the most important risk factor (ranking first) for MetS. This finding is likely explained by risk factors shared by diabetes and MetS (eg, increased BMI, hypertension, high-fat diet, and insulin resistance-linked obesity). In addition, some analytical studies have shown that MetS predicts diabetes independently of other factors [30]. Another study showed that MetS was associated with a 3- to 5-fold increase in the risk of developing type 2 diabetes mellitus [31].
BMI was identified as the second most important risk factor for the incidence of MetS. Insulin resistance and inflammatory mediators, both linked to obesity, are among the most important mechanisms in the pathogenesis of MetS. Various studies have shown relationships among hyperinsulinemia, insulin resistance, and increased inflammatory mediators such as C-reactive protein with the development and progression of MetS [14,17,32].
Increased age was the third-ranking factor that was associated with MetS in this study. Aging usually leads to decreased physical activity, followed by an increase in BMI, which can contribute to MetS. Previous studies showed that less than 10% of people in their 20s and 30s were affected by MetS, whereas MetS affected 40% of those over 60 years of age [33,34].
Male gender was the fourth-ranking factor associated with MetS. We observed a significantly higher incidence of MetS among men than among women (27% vs 20%). Although previous studies in Iran showed that the prevalence of MetS was higher among women than among men [35,36], more recent studies confirm our findings, demonstrating the opposite pattern [7]. One reason behind this phenomenon may be the higher prevalence of basic MetS-related characteristics in the men of our study, such as hypertension, higher waist-hip ratio, and higher triglyceride levels.
Dietary monounsaturated fatty acid intake was identified as the fifth most important factor, with higher intake associated with lower odds of MetS (OR 0.97, 95% CI 0.94-0.99). Our result is consistent with a recent systematic review that reported that a diet with decreased monounsaturated fats was associated with improving lipoprotein profiles and triglyceride levels [37]. As mentioned earlier, hyperlipidemia is one of the components of MetS. Thus, this finding is consistent with other studies in this area.

Strengths and Limitations
This study used a population-based cohort (as the gold standard in observational studies) designed based on standard tools for measuring clinical and other variables. This study had some limitations. First, the role of socioeconomic status as an important factor influencing the dietary pattern of subjects was not determined; however, this study was performed on people living in District 13 of Tehran, which is classified as an area with an average income level.
Another limitation of this study was use of the FFQ. Completing a long list of foods consumed over the past year has the potential for recall bias and consequently measurement error, which may distort the results [38,39]. Another important factor for the incidence of MetS is physical activity status; this variable was not included in the analysis due to the large number of missing data.
Finally, the main strength of this study was that the most important risk factors and nutritional factors were ranked. In contrast, previous studies often investigated the predictive factors using classical approaches and neglected the importance of paying attention to risk/protective factors by considering the ranking of the impact of each factor on the outcome. Therefore, lifestyle modification (eg, having a balanced weight and healthy diet) is one of the most important ways to reduce the incidence of MetS.

Conclusion
In summary, our findings show that the incidence rate of MetS in Tehran was 229.98 per 1000 person-years. The most important determinants of MetS were history of diabetes, increased BMI, increased age, male gender, and decreased dietary monounsaturated fatty acid.