Chapter 5 Body Fat Data

In the article Fitting Percentage of Body Fat to Simple Body Measurements, Johnson (1996) uses the data at http://jse.amstat.org/datasets/fat.dat.txt, provided to him by Dr. A. Garth Fisher in a personal communication on October 5, 1994, as a multiple linear regression activity with his students. A subset of the variables in http://jse.amstat.org/datasets/fat.dat.txt is available in the R package mfp by Gareth Ambler and Axel Benner (2015), and the data set is used frequently in the text Statistical Regression and Classification by Matloff (2017). The data set has also been used to illustrate multivariate outliers in Olive, Pelawa Watagoda, and Rupasinghe Arachchige Don (2015). One of the questions posed in Johnson (1996) is “\(\ldots\) Examine the data and note any unusual cases. Sort the cases, for example, by height, weight, and percentage of fat and note the distributions. What should be done, if anything, about these unusual cases? Suggest some rules for changing or deleting outliers.” This is a fantastic activity for introducing students to data cleaning, and one we pursue in more depth before using a “cleaned” data set to illustrate model building with a unified framework via the caret package created by Kuhn (2021).

5.1 Getting the Original Data

The R code below uses the fread() function from the data.table package written by Dowle and Srinivasan (2021) to create the data frame bodyfat. The URLs for the actual data and the variable descriptions are, respectively:

http://jse.amstat.org/datasets/fat.dat.txt

and

http://jse.amstat.org/datasets/fat.txt.

library(data.table)     # Load data.table package
url <- "http://jse.amstat.org/datasets/fat.dat.txt"  
bodyfat <- fread(url, col.names = c("case", "brozek", "siri", 
                                    "density", "age", 
                                    "weight_lbs", 
                                    "height_in", "bmi", 
                                    "fat_free_weight", "neck_cm", 
                                    "chest_cm", "abdomen_cm", 
                                    "hip_cm", "thigh_cm", 
                                    "knee_cm", "ankle_cm", 
                                    "biceps_cm", "forearm_cm",
                                    "wrist_cm"))

The first 10 rows and eight columns of the data stored in the bodyfat object are displayed in Table 5.1. Start by examining the relationship between density and brozek. Johnson (1996) defines the body fat percentages determined with the brozek and siri methods in (5.1) and (5.2), respectively. The fat_free_weight determination is defined in the description of the variables provided online and repeated in (5.3). Strictly speaking, (5.1) is linear in 1/density rather than in density, but over the narrow range of densities in these data (roughly 1.0 to 1.1) the curve is nearly straight, so a scatterplot of brozek versus density should show points falling along what looks like a straight line.

\[\begin{equation} \text{bodyfatBrozek} = \frac{457}{\text{density}} - 414.2 \tag{5.1} \end{equation}\]

\[\begin{equation} \text{bodyfatSiri} = \frac{495}{\text{density}} - 450 \tag{5.2} \end{equation}\]

\[\begin{equation} \text{FatFreeWeight} = \left(1 -\frac{\text{brozek}}{100}\right)\times \text{weight\_lbs} \tag{5.3} \end{equation}\]

Table 5.1: The first 10 rows and eight columns of the bodyfat data
case brozek siri density age weight_lbs height_in bmi
1 12.6 12.3 1.0708 23 154.25 67.75 23.7
2 6.9 6.1 1.0853 22 173.25 72.25 23.4
3 24.6 25.3 1.0414 22 154.00 66.25 24.7
4 10.9 10.4 1.0751 26 184.75 72.25 24.9
5 27.8 28.7 1.0340 24 184.25 71.25 25.6
6 20.6 20.9 1.0502 24 210.25 74.75 26.5
7 19.0 19.2 1.0549 26 181.00 69.75 26.2
8 12.8 12.4 1.0704 25 176.00 72.50 23.6
9 5.1 4.1 1.0900 25 191.00 74.00 24.6
10 12.0 11.7 1.0722 23 198.25 73.50 25.8
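
As a quick check of (5.1) and (5.2), the sketch below recomputes the brozek and siri values for cases 1 through 3 from the densities shown in Table 5.1. The helper function names are hypothetical and are not used elsewhere in the chapter. Fat-free weight is computed as (1 - brozek/100) times weight_lbs, which is how fat_free_weight_B is computed later in the chapter.

```r
# Hypothetical helpers (names are illustrative, not from the original text)
brozek_from_density <- function(density) 457 / density - 414.2   # equation (5.1)
siri_from_density   <- function(density) 495 / density - 450     # equation (5.2)
fat_free_weight_lbs <- function(brozek, weight_lbs) {             # equation (5.3)
  (1 - brozek / 100) * weight_lbs
}

# Densities and weights for cases 1-3, copied from Table 5.1
density    <- c(1.0708, 1.0853, 1.0414)
weight_lbs <- c(154.25, 173.25, 154.00)

brozek_check <- round(brozek_from_density(density), 1)
siri_check   <- round(siri_from_density(density), 1)
ffw_check    <- round(fat_free_weight_lbs(brozek_check, weight_lbs), 1)

brozek_check  # compare with the brozek column of Table 5.1
siri_check    # compare with the siri column of Table 5.1
ffw_check     # fat-free weight in pounds implied by (5.3)
```

For these three cases the recomputed brozek and siri values agree with the reported values in Table 5.1 to one decimal place.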

5.2 Graphing the Data

Figure 5.1 shows a scatterplot of reported brozek body fat values versus reported body density; however, not all values fall along a straight line. The plotly package written by Sievert et al. (2021) is used to obtain an interactive graph where the user can place their cursor over different points to obtain additional information. Consider using the R code below to obtain an interactive graph of brozek versus density.

library(ggplot2)
library(plotly)
p <- ggplot(data = bodyfat, aes(x = density, y = brozek, 
                                color = case)) +
  geom_point() + 
  theme_bw()
g <- ggplotly(p)
g

Figure 5.1: Scatterplot of brozek bodyfat versus body density

Figure 5.2: Scatterplot of weight_lbs versus height_in

Use the mouse to hover over points in Figure 5.1 that do not fall along a straight line. Cases 48, 76, and 96 appear to be errors based on Figure 5.1, and case 182 has an estimated negative body fat that was truncated to zero per the help file. Among the outliers in Figure 5.2, case 42 is a man weighing over 200 pounds who is less than 3 feet tall. The listed densities for cases 48, 76, and 96 are 1.0665, 1.0666, and 1.0991; the reported brozek and siri values for these cases are instead consistent with densities of 1.0865, 1.0566, and 1.0591, respectively, each of which differs from the listed value in a single digit. Johnson (1996) suggests the height for case 42 is probably 69.5 inches. Since it is physically impossible for a human to live with no body fat, and equations (5.1) and (5.2) return zero or negative body fat once density exceeds approximately 1.1, the density value of 1.1089 for case 182 is highly suspect. While it is possible that the density value for case 182 is a data entry error, it is not clear what the true value should be, so case 182 will be removed from the data set.

library(dplyr)
bodyfat[c(48, 76, 96, 42, 182), 
        c("density", "brozek", "siri", "height_in")]
   density brozek siri height_in
1:  1.0665    6.4  5.6     71.25
2:  1.0666   18.3 18.5     67.50
3:  1.0991   17.3 17.4     77.75
4:  1.0250   31.7 32.9     29.50
5:  1.1089    0.0  0.0     68.00
bodyfat$density[c(48, 76, 96)] <- c(1.0865, 1.0566, 1.0591)
bodyfat <- bodyfat %>% 
  mutate(siri_C = round(495/density - 450, 1),
         brozek_C = round(457/density - 414.2, 1),
         bmi_C = round((weight_lbs*0.453592) /
                       (height_in*2.54/100)^2, 1))
bodyfat[c(48, 76, 96, 42, 182), c("density", "brozek", 
                                  "brozek_C", "siri_C", 
                                   "siri")]
   density brozek brozek_C siri_C siri
1:  1.0865    6.4      6.4    5.6  5.6
2:  1.0566   18.3     18.3   18.5 18.5
3:  1.0591   17.3     17.3   17.4 17.4
4:  1.0250   31.7     31.7   32.9 32.9
5:  1.1089    0.0     -2.1   -3.6  0.0
bodyfat[c(48, 76, 96, 42, 182), c("density", "bmi_C", "bmi", 
                                  "height_in", "weight_lbs")]
   density bmi_C  bmi height_in weight_lbs
1:  1.0865  20.6 20.6     71.25     148.50
2:  1.0566  22.9 22.9     67.50     148.25
3:  1.0591  26.1 26.1     77.75     224.50
4:  1.0250 165.6 29.9     29.50     205.00
5:  1.1089  18.0 18.1     68.00     118.50
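
The corrected densities for cases 48, 76, and 96 can be cross-checked by solving (5.1) for density and plugging in the reported brozek values. The sketch below is one way to do this; the function name density_from_brozek is hypothetical. It also computes the density at which each formula returns exactly zero percent body fat.

```r
# Hypothetical helper: equation (5.1) solved for density
density_from_brozek <- function(brozek) 457 / (brozek + 414.2)

# Reported brozek values for cases 48, 76, and 96 (from the output above)
reported_brozek <- c(`48` = 6.4, `76` = 18.3, `96` = 17.3)
implied_density <- round(density_from_brozek(reported_brozek), 4)
implied_density

# Density at which (5.1) and (5.2) give zero percent body fat
zero_fat_density <- c(brozek = 457 / 414.2, siri = 495 / 450)
round(zero_fat_density, 4)
```

The implied densities match the replacement values 1.0865, 1.0566, and 1.0591, and the zero-fat densities are roughly 1.1033 for the brozek formula and 1.1 for the siri formula, which is why a density of 1.1089 produces negative estimates under both.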

It seems reasonable to assume the reported bmi value for case 42 is correct and to compute a height_in value based on the bmi and weight_lbs values. Note that the units for the bmi variable are \(\mathtt{kg/m^2}\), and that there are 0.453592 kilograms per pound and 2.54 \(\mathtt{cm}\) per inch.

weight_k <- 205 * 0.453592        
height_m <- sqrt(weight_k / 29.9)
height_m
[1] 1.763494
height_in <- height_m*100 / 2.54
height_in
[1] 69.4289

Since the computed height value for case 42 is 69.43 inches, it seems very likely that the recorded height of 29.5 inches was a data entry error and should be replaced with a value of 69.5 inches per the suggestion in Johnson (1996). At this point, four of the five questionable values are changed using the R code below. Case 182, the case with an estimated body fat of zero, will be dropped from the data set later.

bodyfat$density[c(48, 76, 96)] <- c(1.0865, 1.0566, 1.0591)
bodyfat$height_in[42] <- 69.5
# bodyfat <- bodyfat[-182, ]  # remove zero bodyfat case
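
The height reconstruction for case 42 can also be wrapped in a pair of small helper functions, which makes it easy to check that the BMI and height calculations are inverses of each other. This is only a sketch: the function and constant names below are hypothetical, and the conversion factors are the ones stated above.

```r
LB_TO_KG <- 0.453592   # kilograms per pound
IN_TO_M  <- 0.0254     # meters per inch (2.54 cm per inch)

# BMI in kg/m^2 from weight in pounds and height in inches
bmi_from_imperial <- function(weight_lbs, height_in) {
  (weight_lbs * LB_TO_KG) / (height_in * IN_TO_M)^2
}

# Height in inches implied by a BMI value and a weight in pounds
height_in_from_bmi <- function(bmi, weight_lbs) {
  sqrt(weight_lbs * LB_TO_KG / bmi) / IN_TO_M
}

round(bmi_from_imperial(205, 29.5), 1)   # BMI implied by the recorded height
round(height_in_from_bmi(29.9, 205), 2)  # height implied by the reported BMI
round(bmi_from_imperial(205, 69.5), 1)   # BMI after the proposed correction
```

The first two calls reproduce the bmi_C value of 165.6 and the height of about 69.43 inches obtained above. With the corrected height of 69.5 inches the BMI comes out at about 29.8, within rounding of the reported 29.9.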

Additional code and text explaining the decisions used to create a cleaned version of the data are provided in Section 5.3.

5.3 Further Cleaning

To evaluate the original computations, consider the R code below, which computes the siri and brozek body fat percentages using equations (5.2) and (5.1), respectively. Flagging cases where the absolute difference between the reported and computed body fat values exceeds 0.11 identifies eleven cases for further evaluation. Absolute differences smaller than 0.11 are most likely due to improper rounding. There appear to be errors in computing the brozek values for cases 11, 33, 49, 98, 152, and 235. The brozek values most likely should be 7.8, 12.1, 13.8, 11.7, 19.3, and 25.1, respectively (the values computed from (5.1) with the reported density values). There also appear to be errors in computing the siri values for cases 169 and 200, which have reported siri values of 34.3 and 23.6 that should be 36.2 and 23.1, respectively. The reported brozek and siri values for case 6 appear to have been computed with a density of 1.0512 instead of the reported 1.0502, so the density value for case 6 will be changed to 1.0512. It is not obvious what the discrepancy is for case 237.

bodyfat <- bodyfat %>% 
  mutate(siri_C = round(495/density - 450, 1),
         brozek_C = round(457/density - 414.2, 1),
         bmi_C = round( (weight_lbs*0.453592) / 
                          (height_in*2.54/100)^2, 1) )
bodyfatCH <- subset(bodyfat, abs(siri_C - siri) > 0.11 | 
                   abs(brozek - brozek_C) > 0.11)
bodyfatCH[, c("siri", "siri_C", "brozek", "brozek_C", "density")]
    siri siri_C brozek brozek_C density
 1: 20.9   21.3   20.6     21.0  1.0502
 2:  7.1    7.1    7.5      7.8  1.0830
 3: 11.8   11.8   13.4     12.1  1.0719
 4: 13.6   13.6   13.4     13.8  1.0678
 5: 11.3   11.3   11.1     11.7  1.0730
 6: 19.6   19.6   19.1     19.3  1.0542
 7: 34.3   36.2   34.7     34.7  1.0180
 8:  0.0   -3.6    0.0     -2.1  1.1089
 9: 23.6   23.1   22.6     22.6  1.0462
10: 25.8   25.8   25.5     25.1  1.0403
11: 24.8   24.9   24.0     24.2  1.0424
bodyfat$brozek[11] <- 7.8
bodyfat$brozek[33] <- 12.1
bodyfat$brozek[49] <- 13.8
bodyfat$brozek[98] <- 11.7
bodyfat$brozek[152] <- 19.3
bodyfat$brozek[235] <- 25.1
bodyfat$siri[169] <- 36.2
bodyfat$siri[200] <- 23.1
bodyfat$density[6] <- 1.0512
bodyfat <- bodyfat %>% 
  mutate(siri_C = round(495/density - 450, 1),
         brozek_C = round(457/density - 414.2, 1),
         bmi_C = round( (weight_lbs*0.453592) / 
                          (height_in*2.54/100)^2, 1),
         fat_free_weight_B = round((1 - brozek_C/100) *
                                     weight_lbs ,1)
)
bodyfatCH2 <- subset(bodyfat, abs(siri_C - siri) > 0.11 | 
                   abs(brozek - brozek_C) > 0.11)
bodyfatCH2[,c("siri", "siri_C", "brozek", "brozek_C", "density")]
   siri siri_C brozek brozek_C density
1:  0.0   -3.6      0     -2.1  1.1089
2: 24.8   24.9     24     24.2  1.0424
# Data Entry Errors?
bodyfat[which(abs(bodyfat$fat_free_weight - 
                    bodyfat$fat_free_weight_B) > 0.101), 
        c("fat_free_weight", "fat_free_weight_B")]
    fat_free_weight fat_free_weight_B
 1:           172.3             171.7
 2:           142.5             147.7
 3:           117.6             117.0
 4:           127.8             128.0
 5:           125.9             125.7
 6:           151.2             151.7
 7:           168.4             167.8
 8:           159.3             159.0
 9:           149.3             149.0
10:           141.7             141.4
11:           118.5             121.0
12:           151.3             133.8
13:           117.5             118.2
sum(abs(bodyfat$fat_free_weight - 
        bodyfat$fat_free_weight_B) > 0.101)
[1] 13
# Data Entry and rounding errors
sum(bodyfat$fat_free_weight != bodyfat$fat_free_weight_B)
[1] 129
bodyfat <- bodyfat[-182, ]  # remove zero bodyfat case
# Number of rounding and data entry errors for fat free weight
sum(bodyfat$fat_free_weight != bodyfat$fat_free_weight_B)
[1] 128
# Number of rounding discrepancies
sum(bodyfat$siri != bodyfat$siri_C)
[1] 29
sum(bodyfat$bmi != bodyfat$bmi_C)
[1] 98
sum(bodyfat$brozek != bodyfat$brozek_C)
[1] 39

According to the R code above, 128 of the reported fat_free_weight values differ from the recomputed values because of data entry or rounding errors. For the variables siri, bmi, and brozek, there are 29, 98, and 39 rounding discrepancies, respectively.
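
The counts above rely on exact comparisons (!=) between double-precision values. Exact comparison of doubles can, in principle, flag values that differ only by representation error, so a tolerance-based count is a common alternative. The sketch below illustrates the idea with a hypothetical helper, count_mismatch(); whether switching to a tolerance would change any of the counts reported above has not been checked here.

```r
# Hypothetical helper: count entries that differ by more than a tolerance.
# A tolerance of 0.05 suits values recorded to one decimal place.
count_mismatch <- function(reported, computed, tol = 0.05) {
  sum(abs(reported - computed) > tol)
}

# Classic floating-point illustration: equal on paper, not identical in memory
(0.1 + 0.2) != 0.3
count_mismatch(0.1 + 0.2, 0.3, tol = 1e-8)

# Reported and recomputed siri values from the first three flagged rows above
siri   <- c(20.9, 7.1, 11.8)
siri_C <- c(21.3, 7.1, 11.8)
count_mismatch(siri, siri_C)
```

In this small example only the first pair (20.9 versus 21.3) is counted as a mismatch.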

Figure 5.3 shows a plot of corrected Siri body fat values versus corrected Brozek body fat values. Since the corrected body fat values using the Brozek and Siri method are computed from the measured density, the values in Figure 5.3 should lie on a straight line. The R code below can be used to create an interactive plotly graph.

gp <- ggplot(data = bodyfat, aes(x = brozek_C, y = siri_C, color = case)) +
  geom_point() +
  theme_bw() +
  labs(x = "Brozek Body Fat Corrected", y = "Siri Body Fat Corrected")
ggplotly(gp)

Figure 5.3: Scatterplot of siri bodyfat versus brozek bodyfat
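
To see why the points should fall on a line, solve (5.1) for 1/density and substitute the result into (5.2). Before rounding to one decimal place, this gives an exact linear relationship between the two estimates (the numerical coefficients below are rounded):

\[\text{bodyfatSiri} = \frac{495}{457}\left(\text{bodyfatBrozek} + 414.2\right) - 450 \approx 1.0832 \times \text{bodyfatBrozek} - 1.359\]

Because siri_C and brozek_C are each rounded to one decimal place, the plotted points can deviate from this line by a small amount.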

Using the interactive graph in Figure 5.4, it appears that case 31 and case 86 have data entry errors for the circumference of the ankle. The ankle circumferences for cases 31 and 86 of 33.9 \(\mathtt{cm}\) and 33.7 \(\mathtt{cm}\) are most likely typos that should be 23.9 \(\mathtt{cm}\) and 23.7 \(\mathtt{cm}\). If one modifies the R code below by replacing y = ankle_cm with y = forearm_cm, case 159 appears to be a data entry error for the circumference of the forearm. The reported forearm circumference for case 159 is 34.9 \(\mathtt{cm}\), which most likely should be 24.9 \(\mathtt{cm}\) since the individual's reported biceps circumference is only 27 \(\mathtt{cm}\).

p <- ggplot(data = bodyfat, 
            aes(y = ankle_cm, x = weight_lbs, color = case)) + 
  geom_point() +
  theme_bw()
ggplotly(p)

Figure 5.4: Scatterplot of ankle_cm versus weight_lbs

# Fixing likely typos / data entry errors
bodyfat$ankle_cm[31] <- 23.9
bodyfat$ankle_cm[86] <- 23.7
bodyfat$forearm_cm[159] <- 24.9

The R code below removes the variables case, brozek, siri, density, bmi, fat_free_weight, siri_C, and fat_free_weight_B from bodyfat and creates the additional variables age_sq, abdomen_wrist, and am. Using the function write.csv(), the data frame bodyfatClean is written to the current working directory as a CSV file.

bodyfatClean <- bodyfat[, -c(1, 2, 3, 4, 8, 9, 20, 23)]
names(bodyfatClean)
 [1] "age"        "weight_lbs" "height_in"  "neck_cm"    "chest_cm"  
 [6] "abdomen_cm" "hip_cm"     "thigh_cm"   "knee_cm"    "ankle_cm"  
[11] "biceps_cm"  "forearm_cm" "wrist_cm"   "brozek_C"   "bmi_C"     
bodyfatClean <- bodyfatClean %>% 
  mutate(age_sq = age^2, abdomen_wrist = abdomen_cm - wrist_cm,
         # pounds are converted to kilograms by multiplying by 0.453592
         am = (weight_lbs*0.453592)^1.2/(height_in*2.54/100)^3.3)
write.csv(bodyfatClean, "./bodyfatClean.csv",
          row.names = FALSE)
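
Dropping columns by position, as in the code above, depends on the column order staying fixed. A sketch of the same idea using column names is shown below. It uses a one-row stand-in data frame (bodyfat_demo, a hypothetical name) built from case 1 in Table 5.1, so only the first eight columns are represented.

```r
# One-row stand-in for bodyfat using case 1 from Table 5.1
bodyfat_demo <- data.frame(case = 1, brozek = 12.6, siri = 12.3,
                           density = 1.0708, age = 23,
                           weight_lbs = 154.25, height_in = 67.75,
                           bmi = 23.7)

# Columns to discard, listed by name instead of by position
drop_cols <- c("case", "brozek", "siri", "density", "bmi")
keep_cols <- setdiff(names(bodyfat_demo), drop_cols)

bodyfat_demo_clean <- bodyfat_demo[, keep_cols, drop = FALSE]
names(bodyfat_demo_clean)
```

For a data.table such as the one returned by fread(), the analogous call would presumably be bodyfat[, keep_cols, with = FALSE], with drop_cols extended to include fat_free_weight, siri_C, and fat_free_weight_B.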

5.4 Conclusion

Learning different syntax for multiple algorithms is challenging even for people who work with multiple algorithms on a daily basis and even more challenging for students or researchers encountering the algorithms for the first time. Fortunately, using the caret package allows both students and researchers to fit models using a rich selection of algorithms with a unified syntax. Using cross validation is essential for building good predictive models, and caret simplifies the cross validation process across many algorithms with the trainControl() function. To illustrate some of the functionality of caret, the well-known Boston data set from the MASS package was used, where it was assumed that the data are stored correctly. Data from the wild are often of questionable quality, and textbooks seldom ask students to question the data that authors or others provide. The body fat data set hosted by the Journal of Statistics Education provides an excellent opportunity for students to question and potentially re-code many of the entries from an original data set and subsequently propose a model to predict body fat that is appropriate for use in a clinical setting. The guided labs provide students an introduction to predictive model building using the caret suite of tools for both transformed and untransformed predictors. Using the body fat data set provides the reader with ample rationale to question many reported values and practice cleaning data in a guided fashion.

References

Dowle, Matt, and Arun Srinivasan. 2021. Data.table: Extension of ‘Data.frame’. https://CRAN.R-project.org/package=data.table.
Ambler, Gareth, and Axel Benner. 2015. Mfp: Multivariable Fractional Polynomials. https://CRAN.R-project.org/package=mfp.
Johnson, Roger W. 1996. “Fitting Percentage of Body Fat to Simple Body Measurements.” Journal of Statistics Education 4 (1). https://doi.org/10.1080/10691898.1996.11910505.
Kuhn, Max. 2021. Caret: Classification and Regression Training. https://github.com/topepo/caret/.
Matloff, Norman. 2017. Statistical Regression and Classification: From Linear Models to Machine Learning. 1st ed. Boca Raton: Chapman & Hall/CRC.
Olive, David J., Lasanthi C. R. Pelawa Watagoda, and Hasthika S. Rupasinghe Arachchige Don. 2015. “Visualizing and Testing the Multivariate Linear Regression Model.” International Journal of Statistics and Probability 4 (1): 126. https://doi.org/10.5539/ijsp.v4n1p126.
Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, and Pedro Despouy. 2021. Plotly: Create Interactive Web Graphics via Plotly.js. https://CRAN.R-project.org/package=plotly.