This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
National surveys in public health nutrition commonly record the weight of every food consumed by an individual. However, if the goal is to identify whether individuals are in compliance with the 5 main national nutritional guidelines (sodium, saturated fats, sugars, fruit and vegetables, and fats), much less information may be needed. A previous study showed that tracking only 2.89% of all foods (113/3911) was sufficient to accurately identify compliance. Further reducing the data needs could lower participation burden, thus decreasing the costs for monitoring national compliance with key guidelines.
This study aimed to assess whether national public health nutrition surveys can be further simplified by only recording whether a food was consumed, rather than having to weigh it.
Our dataset came from a generalized sample of inhabitants in the United Kingdom, more specifically from the National Diet and Nutrition Survey 2008-2012. After simplifying food consumptions to a binary value (1 if an individual consumed a food and 0 otherwise), we built and optimized decision trees to find whether the foods could accurately predict compliance with the major 5 nutritional guidelines.
When using decision trees of a similar size to previous studies (ie, involving as many foods), we were able to correctly infer compliance for the 5 guidelines with an average accuracy of 80.1%. This is an average increase of 2.6 percentage points over a previous study, showing that further simplifying the surveys can actually yield more robust estimates. When we allowed the new decision trees to use slightly more foods than in previous studies, we were able to optimize the performance with an average increase of 3.1 percentage points.
Although one may expect a further simplification of surveys to decrease accuracy, our study found that public health dietary surveys can be simplified (from accurately weighing items to simply checking whether they were consumed) while improving accuracy. One possibility is that the simplification reduced noise and made it easier for patterns to emerge. Using simplified surveys will allow public health nutrition to be monitored in a more cost-effective manner and may decrease the number of errors as participation burden is reduced.
Insufficient compliance with dietary guidelines can lead to several health problems, whereas following guidelines can have protective effects. Systematic reviews have linked excess salt consumption with increased blood pressure, which raises the risk for cardiovascular diseases [
Data mining is a computational technique (often equated with machine learning), which offers significant potential to alleviate that burden by finding key patterns in data. Among the different tasks performed in data mining, our focus is on
There are many algorithms to choose from when performing classification. Decision trees in particular have proven to be a popular approach [
Our overarching goal is to further simplify public health nutrition surveys. Building on previous work showing that only 2.89% (113/3911) of the items were necessary [
The principal contributions of this study can be summarized as follows:
We demonstrate that simplifying the information recorded in a specific dietary survey is not necessarily detrimental to identifying key public health outcomes.
The application of our work to dietary public health suggests that nutritional surveys may be simplified when the aim is to predict compliance with major nutritional guidelines. This simplification may reduce participation burden, lower study costs, or increase the sample size at the same cost.
The methodological part of our work illustrates the potential for data mining to contribute to public health not only by making predictions, but by identifying what part of the data is truly needed to form these predictions.
A decision tree starts at a root (top). For a given individual, we repeatedly compare the individual’s data with the questions in the tree. In this example, if the individual did not consume food 1, then the follow-up question is whether food 2 was consumed. Eventually, we reach a conclusion: whether the individual was in compliance with the guideline or not. Such trees are automatically built from the data.
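The traversal described above can be sketched in a few lines. This is only an illustration: the tree, the food names, and the outcomes at the leaves are hypothetical, and the `classify` helper is not taken from the study.

```python
# Hypothetical tree. Each internal node asks whether one food was consumed;
# each leaf is a boolean: True = in compliance, False = not in compliance.
TREE = {
    "food": "food 1",
    "yes": False,
    "no": {
        "food": "food 2",
        "yes": True,
        "no": False,
    },
}


def classify(tree, consumed):
    """Walk from the root to a leaf, asking one question per level.

    Returns the conclusion and the list of foods that had to be asked about.
    """
    node, asked = tree, []
    while isinstance(node, dict):
        asked.append(node["food"])
        node = node["yes"] if node["food"] in consumed else node["no"]
    return node, asked


# An individual who did not consume food 1 is then asked about food 2.
print(classify(TREE, {"food 2"}))  # (True, ['food 1', 'food 2'])
print(classify(TREE, {"food 1"}))  # (False, ['food 1'])
```

Only the foods along one path are ever asked about, which is why a tree involving a few dozen foods can translate into a much shorter questionnaire for any given individual.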
Our dataset came from a generalized sample of inhabitants from the United Kingdom: the National Diet and Nutrition Survey (NDNS) 2008-2012. The NDNS data were obtained from the UK Data Archive [
Within the dataset, food consumption at a daily level is recorded for participants over several days. To record portion sizes, common household measures (eg, one tablespoon, one cup) and weight in grams were used for the foods consumed throughout the study, including the consumption of liquids. Foods are described specifically and can be related to other foods in a subgroup or a group. For instance, the consumption of bananas can be described at 3 different levels of detail: as an individual food (eg, bananas raw flesh only), as a subfood group (eg, bananas), or as a food group (eg, fruit and vegetables).
The NDNS dataset only contains the foods consumed, their composition, and demographical information. It does not make any conclusion in regard to nutritional guidelines. The dataset was expanded in a previous study to include this information [
The NDNS dataset has 4156 participants including 1189 children younger than 11 years. First, for each of the 4156 participants, compute the mean daily intake of fruit and vegetables and sodium, as well as the mean daily percentage of energy derived from fat, saturated fat, and free sugars. Then, compare each individual's numbers with the corresponding nutritional recommendations to determine whether the individual is in compliance with the recommendation. UK recommendations on
The World Health Organization (WHO) recommends limitations on how much energy can be derived from each of the following categories: at most 30% from fat, at most 10% from saturated fat, and at most 10% from free sugars (sixth table in [
For each participant, our final dataset includes selected data from the NDNS survey (age, gender, and consumption for all of the 3911 individual foods) and additional data computed through the process above (whether or not they were in compliance for each of the 5 nutritional guidelines).
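As a rough sketch of how one participant's row could be assembled, the snippet below derives compliance labels from the mean daily share of energy and then reduces food weights to the binary "consumed or not" values studied here. Only the three energy limits quoted above (30%, 10%, 10%) are used, because the thresholds for sodium and for fruit and vegetables are not stated in this excerpt. The function names, the input layout, and the sample values are assumptions made for illustration.

```python
# WHO limits quoted above: maximum share of total energy per category (%).
WHO_MAX_ENERGY_SHARE = {"fat": 30.0, "saturated_fat": 10.0, "free_sugars": 10.0}


def mean(values):
    return sum(values) / len(values)


def energy_compliance(daily_shares):
    """daily_shares: one dict per recorded day, mapping category -> % of energy.

    Returns category -> True when the mean daily share is within the limit.
    """
    return {
        category: mean([day[category] for day in daily_shares]) <= limit
        for category, limit in WHO_MAX_ENERGY_SHARE.items()
    }


def binarize(grams_by_food, all_foods):
    """Replace weights with 1 (consumed at least once) or 0 (never consumed)."""
    return {food: int(grams_by_food.get(food, 0) > 0) for food in all_foods}


# Hypothetical participant with two recorded days.
days = [
    {"fat": 34.0, "saturated_fat": 9.0, "free_sugars": 12.0},
    {"fat": 30.0, "saturated_fat": 8.0, "free_sugars": 6.0},
]
grams = {"bananas raw flesh only": 118.0, "cola": 330.0}
foods = ["bananas raw flesh only", "cola", "olive oil"]

labels = energy_compliance(days)
features = binarize(grams, foods)
print(labels)    # {'fat': False, 'saturated_fat': True, 'free_sugars': True}
print(features)  # {'bananas raw flesh only': 1, 'cola': 1, 'olive oil': 0}
```

After binarization, the classifier no longer sees how much of a food was eaten, only whether it appears in the record at all.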
A
There are 2 types of classification: binary and multi-class. In a binary situation, the outcome we seek to predict can only have 2 different values. Conversely, in a multi-class problem, the outcome has 3 or more values. Our study focuses on a binary classification problem: for each of the 5 guidelines, we want to know whether or not the guideline is met.
The process to create a decision tree for binary classification has been detailed in numerous reference works such as Maimon and Rokach [
A portion of the data is not provided to the algorithm for building the tree and is instead held to evaluate the quality of the generated tree [
Highlighting recall and specificity is useful when the costs of making mistakes may be different: in health studies, giving someone an intervention that they do not need may be a very different issue from suggesting not to give them an intervention that they need. In addition, datasets are frequently imbalanced, that is, there can be many more cases for one outcome than the other. In this case, a high accuracy may be misleading as the tree may do well for the most common case, while being very inaccurate for the less common case. By providing the recall and specificity, our study supports public health officials in evaluating our performance by giving more or less weight to specific outcomes. As in previous work, our overall accuracy assumes that the error costs are similar [
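To make the point about misleading accuracy concrete, the following sketch computes accuracy, recall, and specificity from the standard confusion-matrix counts. The `evaluate` helper and the 10/90 split are hypothetical; "compliant" is treated as the positive class, in line with the tables that report recall for "Yes" and specificity for "No".

```python
def evaluate(actual, predicted):
    """Accuracy, recall, and specificity for boolean labels (True = compliant)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)              # compliant, predicted compliant
    tn = sum((not a) and (not p) for a, p in pairs)  # non-compliant, predicted so
    fp = sum((not a) and p for a, p in pairs)
    fn = sum(a and (not p) for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical imbalanced sample: 10 compliant and 90 non-compliant individuals,
# scored against a degenerate classifier that always answers "not compliant".
actual = [True] * 10 + [False] * 90
predicted = [False] * 100

metrics = evaluate(actual, predicted)
print(metrics)  # {'accuracy': 0.9, 'recall': 0.0, 'specificity': 1.0}
```

The 90% accuracy hides the fact that no compliant individual is ever identified, which is the situation that reporting recall and specificity separately is meant to expose.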
In general, class imbalance can be addressed by eliminating cases of the majority class (undersampling), creating new cases for the minority class (oversampling), or biasing the classification algorithm (eg, using nonuniform error costs on the classes) [
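The first two strategies can be illustrated with plain random resampling. This is a minimal sketch and not the procedure used in the study: the flow diagram names SMOTE, which synthesizes new minority cases instead of duplicating existing ones. The function names and the toy data are hypothetical.

```python
import random


def _split_by_size(labels):
    """Return (minority indices, majority indices) for boolean labels."""
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    minority, majority = sorted((pos, neg), key=len)
    return minority, majority


def undersample(rows, labels, seed=0):
    """Drop random majority cases until both classes have the same size."""
    rng = random.Random(seed)
    minority, majority = _split_by_size(labels)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def oversample(rows, labels, seed=0):
    """Duplicate random minority cases until both classes have the same size."""
    rng = random.Random(seed)
    minority, majority = _split_by_size(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    kept = sorted(minority + majority + extra)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical data: 2 compliant and 8 non-compliant participants.
rows = list(range(10))
labels = [True] * 2 + [False] * 8

under_rows, under_labels = undersample(rows, labels)
over_rows, over_labels = oversample(rows, labels)
print(len(under_labels), sum(under_labels))  # 4 2
print(len(over_labels), sum(over_labels))    # 16 8
```

Undersampling discards information from the majority class, whereas oversampling by duplication risks overfitting to repeated minority cases; synthetic approaches such as SMOTE are one way to soften the latter.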
Our process is summarized in
Like most classification algorithms, C4.5 (and its J48 implementation) takes parameters that can impose further constraints on the resulting tree. We tested different parameter values to either (1) find the most accurate decision tree with a similar structure (ie, number of foods) to the trees generated in the previous study using the exact weights of foods, or (2) identify the most accurate tree without consideration for the number of foods involved [
After each tree was built, we used 10-fold cross-validation [
A sample of our approach to explore the trade-off between the number of foods and accuracy is illustrated in
In the guiding example of
Flow diagram of our methodology, showing the acquisition, preprocessing, and mining of the data. NDNS: National Diet and Nutrition Survey; SMOTE: Synthetic Minority Over-Sampling Technique.
Sample outcome for the decision tree classifier on free sugars.
Study | Minimum number of instances | Accuracy (average) | Recall | Specificity | Number of factors |
Previous | 60 | 76.5 | 76.1 | 76.9 | 28 |
Current | 60 | 78.2 | 73.6 | 82.9 | 31 |
Current | 70 | 78.1 | 74.7 | 77.3 | 31 |
Current | 80 | 78.3 | 74.7 | 78.3 | 30 |
Current | 90 | 77.9 | 75.1 | 80.8 | 30 |
Current | 95 | 77.9 | 75.1 | 80.7 | 25 |
Current | 100 | 77.3 | 75.7 | 78.9 | 26 |
Current | 115 | 77.2 | 75.5 | 78.8 | 22 |
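One way to read the table above is as a constrained search: among the "Current" rows, keep those that use no more factors than the previous study's tree and pick the most accurate. The sketch below applies that rule to the rows as printed. Interpreting "similar structure" as "at most as many factors" is an assumption made for this example, and the helper name is hypothetical.

```python
# Number of factors in the previous study's tree (first row of the table).
PREVIOUS_FACTORS = 28

# "Current" rows: (minimum number of instances, accuracy in %, number of factors).
CURRENT_ROWS = [
    (60, 78.2, 31),
    (70, 78.1, 31),
    (80, 78.3, 30),
    (90, 77.9, 30),
    (95, 77.9, 25),
    (100, 77.3, 26),
    (115, 77.2, 22),
]


def best_similar_tree(rows, max_factors):
    """Most accurate configuration among those using at most max_factors factors."""
    eligible = [row for row in rows if row[2] <= max_factors]
    return max(eligible, key=lambda row: row[1])


best = best_similar_tree(CURRENT_ROWS, PREVIOUS_FACTORS)
print(best)  # (95, 77.9, 25)
```

The selected configuration (95 instances, 77.9% accuracy, 25 factors) agrees with the free sugars entry reported for the current study in the comparison table for trees of similar size.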
Our dataset can broadly be understood as consisting of participants (the rows) and their food consumptions (the columns). Demographic characteristics of the participants (regardless of food consumptions) are summarized in
The methods introduced in the previous section select a food if it helps to separate individuals in compliance versus those who are not. For instance, if eating bananas is highly prevalent in the population, then knowing whether a person ate bananas may not be useful to predict dietary compliance. Conversely, if a food was clearly associated with a healthier diet for a handful of individuals, the frequency may be too low to warrant its inclusion at the population level.
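The intuition in this paragraph can be made concrete with information gain on a binary food variable. This is a simplified sketch: C4.5 ranks candidate splits with a normalized variant (gain ratio), and the eight participants and three foods below are invented for illustration.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 compliance labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(consumed, compliant):
    """Reduction in label entropy after splitting on one binary food."""
    yes = [c for x, c in zip(consumed, compliant) if x]
    no = [c for x, c in zip(consumed, compliant) if not x]
    n = len(compliant)
    remainder = len(yes) / n * entropy(yes) + len(no) / n * entropy(no)
    return entropy(compliant) - remainder


# Hypothetical sample: 4 compliant and 4 non-compliant participants.
compliant = [1, 1, 1, 1, 0, 0, 0, 0]

gain_common = information_gain([1] * 8, compliant)                     # eaten by everyone
gain_separating = information_gain([1, 1, 1, 1, 0, 0, 0, 0], compliant)
gain_rare = information_gain([1, 0, 0, 0, 0, 0, 0, 0], compliant)      # one eater only

print(gain_common, gain_separating, round(gain_rare, 3))
```

A food eaten by everyone yields no gain, a food that splits the two groups cleanly yields the maximum of 1 bit, and a food eaten by a single compliant participant yields only a small gain even though it is perfectly "healthy" for that one person.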
Our new decision trees, built on simplified reporting of foods, were slightly more accurate than previous trees built using the exact weighted foods. This was found across all guidelines (
Across the 5 guidelines, our new decision trees had an average accuracy of 80.1%. That is, in 4 out of 5 cases, by only knowing whether foods were consumed, and using at most a few dozen foods, we can correctly conclude whether nutritional guidelines are met. This accuracy is 2.6 percentage points higher than the average for the previous decision trees (77.5%). In other words, not asking individuals to weigh foods made it easier, rather than harder, to tell whether they meet guidelines.
The optimized classifiers performed slightly better with an average accuracy of 80.6% on classified classes (
To better contrast optimized decision trees versus those limited in the number of foods,
In
Key characteristics of the National Diet and Nutrition Survey (NDNS) household dataset. All participants in the study were within the United Kingdom. There were several study waves, with around 1000 respondents per year.
Characteristics | Categorical count, n (%) | |
Male | 5034 (47.41) | |
Female | 5439 (52.57) | |
Free sugars | 1472 (35.41) | |
Salt | 2524 (60.73) | |
Fat | 1045 (25.14) | |
Saturated fat | 795 (19.13) | |
Fruits and vegetables | 656 (15.78) | |
English | 5036 (48.08) | |
Northern Irish | 3442 (32.86) | |
Scottish | 684 (6.53) | |
Welsh | 398 (3.80) | |
Irish | 194 (1.85) | |
Other | 719 (6.88) | |
Single (never married) | 6240 (59.57) | |
Married (living with partner) | 1960 (18.71) | |
Divorced | 261 (2.49) | |
Married (living separate) | 3 (0.06) | |
Widowed | 139 (1.32) | |
Other | 1870 (17.85) | |
Going to school full-time | 2974 (28.39) | |
Full or part time employment | 4440 (42.39) | |
Not working presently | 3039 (29.02) |
Main foods either by (a) contribution to caloric intake, or (b) prevalence among individuals.
Comparison of the best decision tree using the weight of foods (previous study, Giabbanelli and Adams, 2016 [
Study | Guidelines | Number of instances | Accuracy (%) | Recall | Specificity | Number of factors |
Previous | Free sugars | 60 | 76.5 | 76.1 | 76.9 | 28 |
Current | Free sugars | 95 | 77.9 | 75.1 | 80.7 | 25 |
Previous | Fat | 70 | 72.4 | 66.3 | 78.4 | 33 |
Current | Fat | 90 | 79.4 | 70.4 | 88.5 | 33 |
Previous | Fruits and vegetables | 50 | 83.1 | 82.5 | 83.8 | 11 |
Current | Fruits and vegetables | 90 | 82.2 | 82.3 | 82.2 | 10 |
Previous | Saturated fat | 20 | 79.7 | 75.8 | 83.6 | 28 |
Current | Saturated fat | 90 | 84.6 | 77.4 | 91.8 | 27 |
Previous | Salt | 15 | 75.8 | 81.9 | 69.8 | 28 |
Current | Salt | 55 | 76.3 | 79.5 | 73.2 | 26 |
Comparison of the best decision tree using the weight of foods (previous study, Giabbanelli and Adams, 2016 [
Study | Guidelines | Number of instances | Accuracy (%) | Recall | Specificity | Number of factors |
Previous | Free sugars | 60 | 76.5 | 76.1 | 76.9 | 28 |
Current | Free sugars | 60 | 78.2 | 73.6 | 82.9 | 31 |
Previous | Fat | 70 | 72.4 | 66.3 | 78.4 | 33 |
Current | Fat | 70 | 79.9 | 72.3 | 87.7 | 43 |
Previous | Fruits and vegetables | 50 | 83.1 | 82.5 | 83.8 | 11 |
Current | Fruits and vegetables | 50 | 83.5 | 84.9 | 82.2 | 16 |
Previous | Saturated fat | 20 | 79.7 | 75.8 | 83.6 | 28 |
Current | Saturated fat | 20 | 84.7 | 79.3 | 90.1 | 42 |
Previous | Salt | 15 | 75.8 | 81.9 | 69.8 | 28 |
Current | Salt | 50 | 76.6 | 79.9 | 73.2 | 25 |
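The averages quoted in the Results can be recomputed from the accuracy columns of the two comparison tables above. The snippet below is a small arithmetic check; the variable names are arbitrary and the values are copied from the tables.

```python
# Accuracy (%) per guideline, copied from the two comparison tables.
previous = {"Free sugars": 76.5, "Fat": 72.4, "Fruits and vegetables": 83.1,
            "Saturated fat": 79.7, "Salt": 75.8}
similar = {"Free sugars": 77.9, "Fat": 79.4, "Fruits and vegetables": 82.2,
           "Saturated fat": 84.6, "Salt": 76.3}
optimized = {"Free sugars": 78.2, "Fat": 79.9, "Fruits and vegetables": 83.5,
             "Saturated fat": 84.7, "Salt": 76.6}


def mean_accuracy(table):
    return sum(table.values()) / len(table)


prev_avg = mean_accuracy(previous)
sim_avg = mean_accuracy(similar)
opt_avg = mean_accuracy(optimized)

print(round(prev_avg, 1), round(sim_avg, 1), round(opt_avg, 1))  # 77.5 80.1 80.6
print(round(sim_avg - prev_avg, 1))  # 2.6
print(round(opt_avg - prev_avg, 1))  # 3.1
```

Per guideline, the only decrease for trees of similar size is on fruits and vegetables (83.1% to 82.2%); the other four guidelines improve.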
Accuracy, recall (“Yes”), and specificity (“No”) when (a) limiting the number of foods as in a previous study (Giabbanelli & Adams, 2016 [
Individual foods used as predictors at least 5 times in the trees generated using our 2 processes (similar/optimized) and for the 5 guidelines: Fruit and Vegetables, Fat, Saturated Fat, Salt, and Free Sugars. The frequency is the number of times that a food is used as a decision node across all trees (eg, if used 3 times in 5 trees each, it would be 15).
Variables | Similar decision tree | Optimized decision tree | Total frequency | |||||||||
FVb | Fat | SatFatc | Salt | Sugd | FV | Fat | SatFat | Salt | Sug | |||
Sausages | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 20 | |||||
Bananas raw | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 19 | |||||
Sausage roll | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 16 | |||||
Cheese cheddar | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 14 | |||||
Milk chocolate | ✓ | ✓ | ✓ | ✓ | 12 | |||||||
Butter salted | ✓ | ✓ | ✓ | ✓ | ✓ | 10 | ||||||
Cheese spreads | ✓ | ✓ | 8 | |||||||||
Ice cream | ✓ | ✓ | ✓ | 8 | ||||||||
Fruit drink | ✓ | ✓ | ✓ | 8 | ||||||||
Chicken pieces | ✓ | ✓ | 8 | |||||||||
Sex | ✓ | ✓ | ✓ | 7 | ||||||||
Potato crisps | ✓ | ✓ | 6 | |||||||||
Apples | ✓ | ✓ | 6 | |||||||||
Milk whole | ✓ | ✓ | ✓ | 6 | ||||||||
Beans baked | ✓ | ✓ | ✓ | ✓ | 6 | |||||||
Onions | ✓ | ✓ | 6 | |||||||||
Cola | ✓ | ✓ | 6 | |||||||||
Apple juice unsweetened UHTa | ✓ | ✓ | ✓ | ✓ | 6 | |||||||
Olive oil | ✓ | 6 | ||||||||||
Orange juice unsweetened | ✓ | ✓ | 6 | |||||||||
Orange juice unsweetened UHT | ✓ | 6 | ||||||||||
Bacon | ✓ | 6 | ||||||||||
Apple juice unsweetened | ✓ | ✓ | 5 | |||||||||
aUHT: Ultra-high-temperature processing.
bFV: fruits and vegetables.
cSatFat: saturated fat.
dSug: free sugars.
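The frequency column in the table above counts how often a variable appears as a decision node across all trees. A counting step of that kind could look like the sketch below, where each tree is represented simply as the list of variables tested at its decision nodes. This representation, the helper name, and the toy trees are assumptions for illustration.

```python
from collections import Counter


def predictor_frequency(trees, minimum=5):
    """Count decision-node uses per variable across trees, keeping frequent ones."""
    counts = Counter(variable for nodes in trees for variable in nodes)
    return {name: n for name, n in counts.most_common() if n >= minimum}


# Hypothetical example mirroring the caption: a food tested 3 times in each of
# 5 trees is counted 15 times; a food used in only 2 trees falls below the cutoff.
trees = [["sausages"] * 3 for _ in range(5)]
trees[0].append("cola")
trees[1].append("cola")

frequent = predictor_frequency(trees)
print(frequent)  # {'sausages': 15}
```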
Monitoring at the national level whether the population is in compliance with an array of nutritional guidelines currently requires an extensive data collection process, in which individuals report and weigh the exact foods that they consumed. Our previous study demonstrated that only 2.89% (113/3911) of the foods needed to be reported to predict with 77.5% accuracy (72%-83% across guidelines) whether individuals achieve key dietary recommendations regarding sodium, saturated fats, sugars, fruit/vegetables, and fats [
Although we may have expected a decreased accuracy as a consequence of removing information, our results paradoxically indicate that accuracy has improved to 80%. We observed that results were particularly improved when inferring compliance with the guidelines on fat and saturated fat, whereas there was a trade-off on free sugars and salt, where a decrease in recall was counterbalanced by a larger increase in specificity. Results were more nuanced on fruit and vegetables, where optimized decision trees were able to offset a loss of specificity with a higher gain in recall (thus resulting in higher accuracy), but nonoptimized decision trees resulted in a small loss of accuracy. Overall, these findings suggest that foods may not have to be weighed, but this may depend on (1) which food guidelines need to be monitored and (2) whether public health officials decide that recall is more important than specificity (or vice versa) instead of giving them equal weight.
The main applications of our results are twofold. First, we may simplify surveys not only by asking for few foods in adaptive questionnaires (as shown in [
There are several alternatives to the analysis conducted here. First, an index-based analysis consists of a scoring system based on a priori knowledge that researchers have about (1) dietary guidance and (2) the scores to assign for sets of dietary components based on the guidance. This analysis can be used to assess adherence to guidelines [
Second, one could perform a cluster analysis. As summarized by Reedy et al, “clusters are driven by the sample from which they are derived, so their applicability as a standard for evaluating diets of different populations is limited because of the number of factors that determine food selection” [
Finally, Food Frequency Questionnaires (FFQs) can provide a cost-effective approach to monitoring the health of a large population. Molag et al [
Our study aimed to determine the effects of reducing the level of details employed by a national dietary survey. The NDNS survey used here has been the subject of many publications and provides a wealth of high-quality data. However, several limitations stem from using this survey. First, the NDNS survey relies on self-reported food intake. Individuals may consciously, or unconsciously, misreport their consumption within a 24-hour time frame [
Second, this survey was specific to the population of the United Kingdom, as can be seen in the specific foods used as predictors. This limitation of the data entails that our conclusion may not be generalized to populations that have important differences in eating behaviors. In this case, our approach can be replicated by collecting the complete dataset (in the first study wave) and then using data mining to investigate the consequences of simplifying it (for future study waves). Replicating results across target populations is necessary before concluding that monitoring compliance to nutritional guidelines may generally be simplified.
Our study used the data mining technique of decision trees to automatically relate individual food consumption to meeting specific guidelines. This is a well-researched technique, which has been applied to problems arising in health on multiple occasions. One specific advantage of decision trees lies in their ability to produce a model that can easily be interpreted and used with limited training. For instance, in triage, decision trees provide a “flowchart” that lay participants as well as field specialists can use intuitively. That is, an adaptive questionnaire can be formed by following the rules induced by a tree (
We sought to determine whether identifying individual dietary compliance can be further simplified while remaining as informative and accurate. We found that reporting very few foods and only whether they were consumed was sufficient to correctly identify compliance with 5 major nutritional guidelines. Being able to reduce the detail of a dataset for national monitoring can make it easier to increase monitoring frequency or monitor more participants, thus increasing research participation without increasing study costs.
LISS: Longitudinal Internet Studies for the Social sciences
NDNS: National Diet and Nutrition Survey
SMOTE: Synthetic Minority Over-Sampling Technique
WHO: World Health Organization
The authors wish to thank the Department of Computer Science and the College of Liberal Arts and Sciences at Northern Illinois University for funding this study via start-up funds to PG. NR was also financially supported by the Office of Student Engagement and Experiential Learning at Northern Illinois University.
None declared.