Chapter 5 Body Fat Data
In the article Fitting Percentage of Body Fat to Simple Body Measurements, Johnson (1996) uses the data at http://jse.amstat.org/datasets/fat.dat.txt provided to him by Dr. A. Garth Fischer in a personal communication on October 5, 1994, as a multiple linear regression activity with his students. A subset of the variables in http://jse.amstat.org/datasets/fat.dat.txt is available in the R package mfp by Ambler and Benner (2022), and the data set is used frequently in the text Statistical Regression and Classification by Matloff (2017). The data set has also been used to illustrate multivariate outliers in Olive, Pelawa Watagoda, and Rupasinghe Arachchige Don (2015). One of the questions posed in Johnson (1996) is “\(\ldots\) Examine the data and note any unusual cases. Sort the cases, for example, by height, weight, and percentage of fat and note the distributions. What should be done, if anything, about these unusual cases? Suggest some rules for changing or deleting outliers.” This is a fantastic activity to introduce students to data cleaning and one we pursue in more depth before using a “cleaned” data set to illustrate model building with a unified framework via the caret package created by Kuhn (2022).
5.1 Getting the Original Data
The R code below uses the fread() function from the data.table package written by Dowle and Srinivasan (2022) to create the data frame bodyfat. The URLs for the actual data and the variable descriptions are, respectively:
http://jse.amstat.org/datasets/fat.dat.txt
and
http://jse.amstat.org/datasets/fat.txt.
library(data.table)  # Load data.table package
url <- "http://jse.amstat.org/datasets/fat.dat.txt"
bodyfat <- fread(url, col.names = c("case", "brozek", "siri",
                                    "density", "age",
                                    "weight_lbs",
                                    "height_in", "bmi",
                                    "fat_free_weight", "neck_cm",
                                    "chest_cm", "abdomen_cm",
                                    "hip_cm", "thigh_cm",
                                    "knee_cm", "ankle_cm",
                                    "biceps_cm", "forearm_cm",
                                    "wrist_cm"))
The first 10 rows and first eight columns of the data stored in the bodyfat object are displayed in Table 5.1. Start by examining the relationship between density and brozek. Note that the article by Johnson (1996) defines the body fat determined with the brozek and siri methods in (5.1) and (5.2), respectively. The fat_free_weight determination is defined in the description of the variables provided online and repeated in (5.3). The relationship in (5.1) is linear in 1/density, and 1/density is very nearly linear in density over the narrow range of values involved (roughly 1.0 to 1.1), so we should see an essentially straight line in a scatterplot of brozek versus density.
\[\begin{equation} \text{bodyfatBrozek} = \frac{457}{\text{density}} - 414.2 \tag{5.1} \end{equation}\]
\[\begin{equation} \text{bodyfatSiri} = \frac{495}{\text{density}} - 450 \tag{5.2} \end{equation}\]
\[\begin{equation} \text{FatFreeWeight} = \left(1 -\frac{\text{brozek}}{100}\right)\times \text{weight}\_\text{lbs} \tag{5.3} \end{equation}\]
case | brozek | siri | density | age | weight_lbs | height_in | bmi |
---|---|---|---|---|---|---|---|
1 | 12.6 | 12.3 | 1.0708 | 23 | 154.25 | 67.75 | 23.7 |
2 | 6.9 | 6.1 | 1.0853 | 22 | 173.25 | 72.25 | 23.4 |
3 | 24.6 | 25.3 | 1.0414 | 22 | 154.00 | 66.25 | 24.7 |
4 | 10.9 | 10.4 | 1.0751 | 26 | 184.75 | 72.25 | 24.9 |
5 | 27.8 | 28.7 | 1.0340 | 24 | 184.25 | 71.25 | 25.6 |
6 | 20.6 | 20.9 | 1.0502 | 24 | 210.25 | 74.75 | 26.5 |
7 | 19.0 | 19.2 | 1.0549 | 26 | 181.00 | 69.75 | 26.2 |
8 | 12.8 | 12.4 | 1.0704 | 25 | 176.00 | 72.50 | 23.6 |
9 | 5.1 | 4.1 | 1.0900 | 25 | 191.00 | 74.00 | 24.6 |
10 | 12.0 | 11.7 | 1.0722 | 23 | 198.25 | 73.50 | 25.8 |
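As a quick check of (5.1) and (5.2), the sketch below recomputes the Brozek and Siri values for the first three rows of Table 5.1 from their reported densities. The object names (tab, brozek_calc, siri_calc, brozek_ok, siri_ok) are illustrative and are not part of the original analysis.

```r
# First three rows of Table 5.1 (values copied from the table)
tab <- data.frame(case    = 1:3,
                  brozek  = c(12.6, 6.9, 24.6),
                  siri    = c(12.3, 6.1, 25.3),
                  density = c(1.0708, 1.0853, 1.0414))

# Equation (5.1): Brozek body fat from density
tab$brozek_calc <- round(457 / tab$density - 414.2, 1)
# Equation (5.2): Siri body fat from density
tab$siri_calc <- round(495 / tab$density - 450, 1)

# Compare reported and recomputed values using a small tolerance
# rather than exact equality of floating-point numbers
brozek_ok <- abs(tab$brozek - tab$brozek_calc) < 1e-8
siri_ok   <- abs(tab$siri - tab$siri_calc) < 1e-8

print(tab)
```

For these three rows the recomputed values agree with the reported ones after rounding to one decimal place.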
5.2 Graphing the Data
Figure 5.1 shows a scatterplot of reported brozek body fat values versus reported body density; not all values fall along a straight line. The plotly package written by Sievert et al. (2022) is used to obtain an interactive graph where the user can place their cursor over different points to obtain additional information. Consider using the R code below to obtain an interactive graph of brozek versus density.
library(ggplot2)
library(plotly)
p <- ggplot(data = bodyfat, aes(x = density, y = brozek,
                                color = case)) +
  geom_point() +
  theme_bw()
g <- ggplotly(p)
g
Use the mouse to hover over points in Figure 5.1 that do not fall along a straight line. Note that cases 48, 76, and 96 appear to be errors based on Figure 5.1, and case 182 has an estimated negative body fat that was truncated to zero per the help file. Looking at the outliers in Figure 5.2, case 42 is a man weighing over 200 pounds who is less than 3 feet tall. Working backward from the reported body fat values with (5.1), the listed densities for cases 48, 76, and 96 of 1.0665, 1.0666, and 1.0991 should be 1.0865, 1.0566, and 1.0591, respectively. Johnson (1996) suggests the height for case 42 is probably 69.5 inches. Since it is physically impossible for a human to live with no body fat, and equations (5.1) and (5.2) return zero body fat at a density of approximately 1.1, the density value of 1.1089 for case 182 is highly suspect. While it is possible that the density value for case 182 is a data entry error, it is not clear what the true value should be, so case 182 will be removed from the data set.
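The density limit mentioned above can be checked directly by setting (5.1) and (5.2) equal to zero and solving for density. The following is a minimal sketch; the object names are illustrative.

```r
# Density at which each equation returns exactly zero body fat
density_zero_brozek <- 457 / 414.2  # solves 457/density - 414.2 = 0
density_zero_siri   <- 495 / 450    # solves 495/density - 450 = 0

# Reported density for case 182 (value taken from the text)
density_182 <- 1.1089
brozek_182 <- round(457 / density_182 - 414.2, 1)
siri_182   <- round(495 / density_182 - 450, 1)

c(brozek = density_zero_brozek, siri = density_zero_siri)
c(brozek = brozek_182, siri = siri_182)
```

Because 1.1089 exceeds both zero-fat densities, both equations return a negative body fat percentage for case 182.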
library(dplyr)
bodyfat[c(48, 76, 96, 42, 182),
        c("density", "brozek", "siri", "height_in")]
density brozek siri height_in
1: 1.0665 6.4 5.6 71.25
2: 1.0666 18.3 18.5 67.50
3: 1.0991 17.3 17.4 77.75
4: 1.0250 31.7 32.9 29.50
5: 1.1089 0.0 0.0 68.00
bodyfat$density[c(48, 76, 96)] <- c(1.0865, 1.0566, 1.0591)
bodyfat <- bodyfat %>%
  mutate(siri_C = round(495/density - 450, 1),
         brozek_C = round(457/density - 414.2, 1),
         bmi_C = round((weight_lbs*0.453592) /
                         (height_in*2.54/100)^2, 1))
bodyfat[c(48, 76, 96, 42, 182), c("density", "brozek",
                                  "brozek_C", "siri_C",
                                  "siri")]
density brozek brozek_C siri_C siri
1: 1.0865 6.4 6.4 5.6 5.6
2: 1.0566 18.3 18.3 18.5 18.5
3: 1.0591 17.3 17.3 17.4 17.4
4: 1.0250 31.7 31.7 32.9 32.9
5: 1.1089 0.0 -2.1 -3.6 0.0
bodyfat[c(48, 76, 96, 42, 182), c("density", "bmi_C", "bmi",
                                  "height_in", "weight_lbs")]
density bmi_C bmi height_in weight_lbs
1: 1.0865 20.6 20.6 71.25 148.50
2: 1.0566 22.9 22.9 67.50 148.25
3: 1.0591 26.1 26.1 77.75 224.50
4: 1.0250 165.6 29.9 29.50 205.00
5: 1.1089 18.0 18.1 68.00 118.50
It seems reasonable to assume the reported bmi value for case 42 is correct and to compute a height_in value based on the bmi and weight_lbs values. Note that the units for the bmi variable are \(\mathtt{kg/m^2}\), and that there are 0.453592 kilograms per pound and 2.54 \(\mathtt{cm}\) per inch.
weight_k <- 205 * 0.453592
height_m <- sqrt(weight_k / 29.9)
height_m
[1] 1.763494
height_in <- height_m*100 / 2.54
height_in
[1] 69.4289
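The same unit conversions can be wrapped in two small helper functions so that the check can be repeated for any case. This is a sketch only; the function and constant names (bmi_from_lbs_in, height_in_from_bmi, LB_TO_KG, IN_TO_M) are hypothetical and do not appear in the original analysis.

```r
LB_TO_KG <- 0.453592  # kilograms per pound
IN_TO_M  <- 0.0254    # meters per inch (2.54 cm per inch)

# BMI in kg/m^2 from weight in pounds and height in inches
bmi_from_lbs_in <- function(weight_lbs, height_in) {
  (weight_lbs * LB_TO_KG) / (height_in * IN_TO_M)^2
}

# Height in inches implied by a BMI (kg/m^2) and a weight in pounds
height_in_from_bmi <- function(weight_lbs, bmi) {
  sqrt(weight_lbs * LB_TO_KG / bmi) / IN_TO_M
}

# Case 42: reported weight 205 lbs, reported bmi 29.9, recorded height 29.5 in
h42 <- height_in_from_bmi(205, 29.9)
bmi_recorded <- round(bmi_from_lbs_in(205, 29.5), 1)
h42
bmi_recorded
```

The implied height agrees with the 69.43 inches computed above, while the recorded height of 29.5 inches reproduces the implausible bmi_C value of 165.6 shown earlier.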
Since the computed height value for case 42 is 69.43 inches, it seems very likely that the recorded height of 29.5 inches was a data entry error and should be replaced with a value of 69.5 inches per the suggestion in Johnson (1996). At this point, four of the five questionable values are changed using the R code below. Case 182, the case with a body fat estimate of zero, will be dropped from the data set later.
bodyfat$density[c(48, 76, 96)] <- c(1.0865, 1.0566, 1.0591)
bodyfat$height_in[42] <- 69.5
# bodyfat <- bodyfat[-182, ]  # remove zero bodyfat case
Additional code and text explaining the decisions used to create a cleaned version of the data are provided in Section 5.3.
5.3 Further Cleaning
To evaluate the original computations, consider the R code below, which computes the siri and brozek body fat percentages using equations (5.2) and (5.1), respectively. Flagging every case whose reported and computed body fat values differ by more than 0.11 in absolute value yields eleven cases for further evaluation. Absolute differences less than 0.11 are most likely due to improper rounding. There appear to be errors in computing the brozek values for cases 11, 33, 49, 98, 152, and 235. The brozek values most likely should be 7.8, 12.1, 13.8, 11.7, 19.3, and 25.1, respectively (the brozek values computed using (5.1) with the reported density values). There also appear to be errors in computing the siri values for cases 169 and 200, which have reported siri values of 34.3 and 23.6 that should be 36.2 and 23.1, respectively. The reported brozek and siri values for case 6 appear to have been computed with a density of 1.0512 instead of the reported 1.0502, so the density value for case 6 will be changed to 1.0512. It is not obvious what causes the discrepancy for case 237.
bodyfat <- bodyfat %>%
  mutate(siri_C = round(495/density - 450, 1),
         brozek_C = round(457/density - 414.2, 1),
         bmi_C = round( (weight_lbs*0.453592) /
                          (height_in*2.54/100)^2, 1) )
bodyfatCH <- subset(bodyfat, abs(siri_C - siri) > 0.11 |
                      abs(brozek - brozek_C) > 0.11)
bodyfatCH[, c("siri", "siri_C", "brozek", "brozek_C", "density")]
siri siri_C brozek brozek_C density
1: 20.9 21.3 20.6 21.0 1.0502
2: 7.1 7.1 7.5 7.8 1.0830
3: 11.8 11.8 13.4 12.1 1.0719
4: 13.6 13.6 13.4 13.8 1.0678
5: 11.3 11.3 11.1 11.7 1.0730
6: 19.6 19.6 19.1 19.3 1.0542
7: 34.3 36.2 34.7 34.7 1.0180
8: 0.0 -3.6 0.0 -2.1 1.1089
9: 23.6 23.1 22.6 22.6 1.0462
10: 25.8 25.8 25.5 25.1 1.0403
11: 24.8 24.9 24.0 24.2 1.0424
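The flagging rule can be illustrated on a few rows without downloading the data. The sketch below uses the reported values for case 1 (from Table 5.1) and for the rows of the output above taken to be cases 11 and 169, an assignment based on the corrections listed in the text. The object names chk, tol, and flagged are illustrative.

```r
# Reported values copied from Table 5.1 (case 1) and from the output above
# (rows assumed to correspond to cases 11 and 169)
chk <- data.frame(case    = c(1, 11, 169),
                  siri    = c(12.3, 7.1, 34.3),
                  brozek  = c(12.6, 7.5, 34.7),
                  density = c(1.0708, 1.0830, 1.0180))

# Recompute body fat from density with (5.2) and (5.1)
chk$siri_C   <- round(495 / chk$density - 450, 1)
chk$brozek_C <- round(457 / chk$density - 414.2, 1)

# Flag a case when either reported value is more than tol from its
# recomputed value
tol <- 0.11
chk$flag <- abs(chk$siri_C - chk$siri) > tol |
  abs(chk$brozek - chk$brozek_C) > tol
flagged <- chk$case[chk$flag]
flagged
```

Case 1 passes both checks, case 11 is flagged through its brozek value, and case 169 is flagged through its siri value.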
bodyfat$brozek[11] <- 7.8
bodyfat$brozek[33] <- 12.1
bodyfat$brozek[49] <- 13.8
bodyfat$brozek[98] <- 11.7
bodyfat$brozek[152] <- 19.3
bodyfat$brozek[235] <- 25.1
bodyfat$siri[169] <- 36.2
bodyfat$siri[200] <- 23.1
bodyfat$density[6] <- 1.0512
bodyfat <- bodyfat %>%
  mutate(siri_C = round(495/density - 450, 1),
         brozek_C = round(457/density - 414.2, 1),
         bmi_C = round( (weight_lbs*0.453592) /
                          (height_in*2.54/100)^2, 1),
         fat_free_weight_B = round((1 - brozek_C/100) *
                                     weight_lbs, 1)
  )
bodyfatCH2 <- subset(bodyfat, abs(siri_C - siri) > 0.11 |
                       abs(brozek - brozek_C) > 0.11)
bodyfatCH2[, c("siri", "siri_C", "brozek", "brozek_C", "density")]
siri siri_C brozek brozek_C density
1: 0.0 -3.6 0 -2.1 1.1089
2: 24.8 24.9 24 24.2 1.0424
# Data Entry Errors?
bodyfat[which(abs(bodyfat$fat_free_weight -
                    bodyfat$fat_free_weight_B) > 0.101),
        c("fat_free_weight", "fat_free_weight_B")]
fat_free_weight fat_free_weight_B
1: 172.3 171.7
2: 142.5 147.7
3: 117.6 117.0
4: 127.8 128.0
5: 125.9 125.7
6: 151.2 151.7
7: 168.4 167.8
8: 159.3 159.0
9: 149.3 149.0
10: 141.7 141.4
11: 118.5 121.0
12: 151.3 133.8
13: 117.5 118.2
sum(abs(bodyfat$fat_free_weight -
          bodyfat$fat_free_weight_B) > 0.101)
[1] 13
# Data Entry and rounding errors
sum(bodyfat$fat_free_weight != bodyfat$fat_free_weight_B)
[1] 130
bodyfat <- bodyfat[-182, ]  # remove zero bodyfat case
# Number of rounding and data entry errors for fat free weight
sum(bodyfat$fat_free_weight != bodyfat$fat_free_weight_B)
[1] 129
# Number of rounding discrepancies
sum(bodyfat$siri != bodyfat$siri_C)
[1] 29
sum(bodyfat$bmi != bodyfat$bmi_C)
[1] 98
sum(bodyfat$brozek != bodyfat$brozek_C)
[1] 39
The R code above finds 129 data entry or rounding errors for fat_free_weight. For the variables siri, bmi, and brozek there are 29, 98, and 39 rounding errors, respectively.
Figure 5.3 shows a plot of corrected Siri body fat values versus corrected Brozek body fat values. Since the corrected body fat values using the Brozek and Siri methods are both computed from the measured density, and each is a linear function of 1/density, the values in Figure 5.3 should lie on a straight line. The R code below can be used to create an interactive plotly graph.
gp <- ggplot(data = bodyfat, aes(x = brozek_C, y = siri_C, color = case)) +
  geom_point() +
  theme_bw() +
  labs(x = "Brozek Body Fat Corrected", y = "Siri Body Fat Corrected")
ggplotly(gp)
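The straight-line relationship can also be verified algebraically: eliminating 1/density from (5.1) and (5.2) gives siri = (495/457)(brozek + 414.2) - 450. The sketch below checks this identity numerically on an illustrative grid of densities (the grid is an assumption made for the example, not data from the file), using unrounded values.

```r
# Illustrative grid of densities spanning roughly the range discussed in the text
density <- seq(1.00, 1.10, by = 0.01)

# Unrounded body fat values from (5.1) and (5.2)
brozek_u <- 457 / density - 414.2
siri_u   <- 495 / density - 450

# Slope and intercept implied by eliminating 1/density
slope     <- 495 / 457
intercept <- slope * 414.2 - 450

# Largest departure from the implied line (zero up to floating-point error)
max_gap <- max(abs(siri_u - (intercept + slope * brozek_u)))
c(slope = slope, intercept = intercept, max_gap = max_gap)
```

Because brozek_C and siri_C are each rounded to one decimal place, the plotted points can depart from this line slightly, on the order of 0.1 percentage points.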
Using the interactive graph in Figure 5.4, it appears that case 31 and case 86 have data entry errors for the ankle circumference. The ankle circumferences for cases 31 and 86 of 33.9 \(\mathtt{cm}\) and 33.7 \(\mathtt{cm}\) are most likely typos that should be 23.9 \(\mathtt{cm}\) and 23.7 \(\mathtt{cm}\). If one modifies the R code below by replacing y = ankle_cm with y = forearm_cm, case 159 appears to be a data entry error for the forearm circumference. The reported forearm circumference for case 159 is 34.9 \(\mathtt{cm}\), which most likely should be 24.9 \(\mathtt{cm}\) since the individual's reported biceps circumference is only 27 \(\mathtt{cm}\).
p <- ggplot(data = bodyfat,
            aes(y = ankle_cm, x = weight_lbs, color = case)) +
  geom_point() +
  theme_bw()
ggplotly(p)
# Fixing likely typos / data entry errors
bodyfat$ankle_cm[31] <- 23.9
bodyfat$ankle_cm[86] <- 23.7
bodyfat$forearm_cm[159] <- 24.9
The R code below removes the variables case, brozek, siri, density, bmi, fat_free_weight, siri_C, and fat_free_weight_B from bodyfat and creates the additional variables age_sq, abdomen_wrist, and am. Using the function write.csv(), the data frame bodyfatClean is written to the current working directory as a CSV file.
bodyfatClean <- bodyfat[, -c(1, 2, 3, 4, 8, 9, 20, 23)]
names(bodyfatClean)
[1] "age" "weight_lbs" "height_in" "neck_cm" "chest_cm"
[6] "abdomen_cm" "hip_cm" "thigh_cm" "knee_cm" "ankle_cm"
[11] "biceps_cm" "forearm_cm" "wrist_cm" "brozek_C" "bmi_C"
bodyfatClean <- bodyfatClean %>%
  mutate(age_sq = age^2, abdomen_wrist = abdomen_cm - wrist_cm,
         # pounds are converted to kilograms by multiplying by 0.453592
         am = (weight_lbs*0.453592)^1.2/(height_in*2.54/100)^3.3)
write.csv(bodyfatClean, "./bodyfatClean.csv",
          row.names = FALSE)
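Dropping columns by position is compact but depends on the exact column order. As a hedged alternative, the sketch below shows that dropping the same variables by name yields the same 15 columns. The vector cols simply lists the 23 column names in the order they are created in this chapter; the object names cols, drop, by_position, and by_name are illustrative.

```r
# Column names of bodyfat in order: the 19 original variables followed by
# the four computed ones (siri_C, brozek_C, bmi_C, fat_free_weight_B)
cols <- c("case", "brozek", "siri", "density", "age", "weight_lbs",
          "height_in", "bmi", "fat_free_weight", "neck_cm", "chest_cm",
          "abdomen_cm", "hip_cm", "thigh_cm", "knee_cm", "ankle_cm",
          "biceps_cm", "forearm_cm", "wrist_cm",
          "siri_C", "brozek_C", "bmi_C", "fat_free_weight_B")

# Variables to remove, listed by name
drop <- c("case", "brozek", "siri", "density", "bmi",
          "fat_free_weight", "siri_C", "fat_free_weight_B")

by_position <- cols[-c(1, 2, 3, 4, 8, 9, 20, 23)]  # approach used above
by_name     <- setdiff(cols, drop)                 # name-based alternative

identical(by_position, by_name)
by_name
```

With dplyr loaded, the same name-based selection could be written as select(bodyfat, -all_of(drop)), which does not need to be updated if columns are later reordered.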
5.4 Conclusion
Learning different syntax for multiple algorithms is challenging even for people who work with multiple algorithms on a daily basis, and even more challenging for students or researchers encountering the algorithms for the first time. Fortunately, using the caret package allows both students and researchers to fit models using a rich selection of algorithms with a unified syntax. Using cross validation is essential for building good predictive models, and caret simplifies the cross validation process across many algorithms with the trainControl() function. To illustrate some of the functionality of caret, a well-known data set, Boston from the MASS package, was used, where it was assumed that the data are stored correctly. Data from the wild is often of questionable quality, and textbooks seldom ask students to question the data authors or others provide. The body fat data set hosted by the Journal of Statistics Education provides an excellent opportunity for students to question and potentially re-code many of the entries from an original data set and subsequently propose a model to predict body fat that is appropriate for use in a clinical setting. The guided labs provide students an introduction to predictive model building using the caret suite of tools for both transformed and untransformed predictors. Using the body fat data set provides the reader with ample rationale to question many reported values and practice cleaning data in a guided fashion.