R resources

Body fat dataset

Information about this dataset can be found here and here.

You can access the bodyfat.csv file on bCourses or here. Make sure it is in your current working directory.

Loading the dataset

# make sure .csv file is in your working directory
bodyfat <- read.csv("bodyfat.csv")
head(bodyfat)
##   Density bodyfat Age Weight Height Neck Chest Abdomen   Hip Thigh Knee
## 1  1.0708    12.3  23 154.25  67.75 36.2  93.1    85.2  94.5  59.0 37.3
## 2  1.0853     6.1  22 173.25  72.25 38.5  93.6    83.0  98.7  58.7 37.3
## 3  1.0414    25.3  22 154.00  66.25 34.0  95.8    87.9  99.2  59.6 38.9
## 4  1.0751    10.4  26 184.75  72.25 37.4 101.8    86.4 101.2  60.1 37.3
## 5  1.0340    28.7  24 184.25  71.25 34.4  97.3   100.0 101.9  63.2 42.2
## 6  1.0502    20.9  24 210.25  74.75 39.0 104.5    94.4 107.8  66.0 42.0
##   Ankle Biceps Forearm Wrist
## 1  21.9   32.0    27.4  17.1
## 2  23.4   30.5    28.9  18.2
## 3  24.0   28.8    25.2  16.6
## 4  22.8   32.4    29.4  18.2
## 5  24.0   32.2    27.7  17.7
## 6  25.6   35.7    30.6  18.8

What do the numbers mean?

names(bodyfat)
##  [1] "Density" "bodyfat" "Age"     "Weight"  "Height"  "Neck"    "Chest"  
##  [8] "Abdomen" "Hip"     "Thigh"   "Knee"    "Ankle"   "Biceps"  "Forearm"
## [15] "Wrist"

From the description, the data are

estimates of the percentage of body fat determined by underwater weighing and various body circumference measurements for 252 men.

The variables are

  • Density determined from underwater weighing (g/cm^3)
  • Percent body fat from Siri’s (1956) equation
  • Age (years)
  • Weight (lbs)
  • Height (inches)
  • Neck circumference (cm)
  • Chest circumference (cm)
  • Abdomen 2 circumference (cm)
  • Hip circumference (cm)
  • Thigh circumference (cm)
  • Knee circumference (cm)
  • Ankle circumference (cm)
  • Biceps (extended) circumference (cm)
  • Forearm circumference (cm)
  • Wrist circumference (cm)

Histograms and box plots

Histograms and boxplots are good ways to visualize the distribution of the data.

hist(bodyfat$Biceps)

boxplot(bodyfat$Biceps)

Boxplots are good at showing outliers. Let us find the outlier in the above plot.

idx <- which.max(bodyfat$Biceps) # find index of the largest point
bodyfat[idx, ]
##    Density bodyfat Age Weight Height Neck Chest Abdomen   Hip Thigh Knee
## 39  1.0202    35.2  46 363.15  72.25 51.2 136.2   148.1 147.7  87.3 49.1
##    Ankle Biceps Forearm Wrist
## 39  29.6     45      29  21.4

Removing extreme outliers can sometimes help your analysis of the data. In this case we do not have much reason to remove this point, since we are only looking at one variable. But in general, if idx is a vector containing the row numbers of the datapoints you want to remove, then bodyfat[-idx,] will remove those rows.
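As a small self-contained illustration of this negative indexing (using a toy data frame rather than the bodyfat data):

```r
# toy data frame standing in for a real dataset
df <- data.frame(x = 1:5, y = letters[1:5])
idx <- c(2, 4)          # row numbers we want to drop
df_clean <- df[-idx, ]  # negative indices remove those rows
df_clean$x              # 1 3 5
```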

See Section 3.1 of Fox for more detail about histograms and boxplots (in particular, Section 3.1.4 for how box plots are constructed).

Scatter plots and smoothing

Scatter plots can show relationships between two variables.

plot(bodyfat$Biceps, bodyfat$Forearm)

If we want to summarize the relationship between the variables, one option is locally weighted regression (loess). Section 2.3 of Fox describes local averaging, where the value of the fitted curve at some \(x\) is the average of the \(y_i\) values for the datapoints \((x_i, y_i)\) near \(x\) (hence “local”). Loess is similar, but at each point \(x\) it fits a polynomial in a neighborhood of \(x\), weighted so that nearby points have more influence than distant ones. The function scatter.smooth uses loess.

par(mfrow=c(1,2))
scatter.smooth(bodyfat$Biceps, bodyfat$Forearm, col='gray')
title("Loess smoothing")

beta <- coef(lm(Forearm ~ Biceps, bodyfat))
plot(bodyfat$Biceps, bodyfat$Forearm, col='gray')
abline(beta[1], beta[2])
title("Linear regression")

In this case the loess fit is essentially a line, and it is almost the same as the linear regression fit.

Let us see how loess behaves in other settings.

n <- 10
x <- seq(-1,1, length.out = n)
y <- x + rnorm(n, sd=0.5)
par(mfrow=c(1,2))
scatter.smooth(x,y)
abline(0,1, lty=2)
title("Loess smoothing")

beta <- coef(lm(y ~ x))
plot(x,y)
abline(beta[1], beta[2])
abline(0,1, lty=2)
title("Linear regression")

Here the data are a noisy realization of a linear relationship (the dotted line). Because there are so few datapoints, loess overfits, following the noise. Linear regression seems to do fairly well even with few datapoints.
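How aggressively loess follows the data is controlled by its span argument, the fraction of the datapoints used in each local fit (both loess() and scatter.smooth accept it). A larger span averages over a wider neighborhood and gives a smoother curve. A minimal sketch on the same kind of simulated data:

```r
set.seed(1)  # simulated data, as above
n <- 10
x <- seq(-1, 1, length.out = n)
y <- x + rnorm(n, sd = 0.5)
# smaller span -> each local fit uses fewer points -> wigglier curve
fit_small <- loess(y ~ x, span = 2/3, degree = 1)
fit_large <- loess(y ~ x, span = 1,   degree = 1)
par(mfrow = c(1, 2))
plot(x, y); lines(x, fitted(fit_small)); title("span = 2/3")
plot(x, y); lines(x, fitted(fit_large)); title("span = 1")
```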

If we increase the number of data points, loess improves.

n <- 500
x <- seq(-1,1, length.out = n)
y <- x + rnorm(n, sd=0.5)
par(mfrow=c(1,2))
scatter.smooth(x,y, col='gray')
abline(0,1, lty=2)
title("Loess smoothing")

beta <- coef(lm(y ~ x))
plot(x,y, col='gray')
abline(beta[1], beta[2])
abline(0,1, lty=2)
title("Linear regression")

What happens if the underlying relationship is not linear?

n <- 100
x <- seq(-1,1, length.out = n)
y <- x^2 + rnorm(n, sd=0.1)
par(mfrow=c(1,2))
scatter.smooth(x,y, col='gray')
lines(x,x^2, lty=2)
title("Loess smoothing")

beta <- coef(lm(y ~ x))
plot(x,y, col='gray')
abline(beta[1], beta[2])
lines(x,x^2, lty=2)
title("Linear regression")

Of course, a linear fit fails to summarize the data when the true relationship is far from linear. However, loess does a fair job (provided there are enough datapoints).

As we have seen, loess can capture more complex relationships than a straight line. However, it requires more datapoints to produce good fits. Another severe drawback is that the curve it produces is not easily represented by a mathematical formula.
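To illustrate that last point: a loess fit is stored as a fitted object rather than as a small set of coefficients, so values of the curve at new points have to be obtained through predict(). A short sketch:

```r
set.seed(1)
x <- seq(-1, 1, length.out = 100)
y <- x^2 + rnorm(100, sd = 0.1)
fit <- loess(y ~ x)
# no closed-form equation to report; we can only query the fitted object
predict(fit, newdata = data.frame(x = c(-0.5, 0, 0.5)))
```

Contrast this with coef(lm(y ~ x)), which summarizes the whole fit as an intercept and a slope.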

Read Section 2.3 of Fox.

Pairs plot

We can visualize all possible scatter plots at once using pairs.

pairs(bodyfat)

Coplot

A coplot is a way to visualize the relationship between two variables conditioned on a third.

Let’s do an example where we condition on a categorical variable. Since the bodyfat dataset does not have categorical variables, we will look at the tips dataset for the moment. First, let us observe the relationship between tip amount and the bill total with a scatter plot.

# you may need to install the "reshape2" package
data(tips, package="reshape2")
names(tips)
## [1] "total_bill" "tip"        "sex"        "smoker"     "day"       
## [6] "time"       "size"
plot(tips$total_bill, tips$tip)

We can then separate the datapoints based on the sex of the customer, i.e., condition on the sex variable.

coplot(tip ~ total_bill | sex, data=tips)

We can also condition on two variables. Here we additionally condition on the time of the meal.

coplot(tip ~ total_bill | sex * time, data=tips)

What happens when we condition on quantitative variables? Let us return to the body fat dataset and try to reproduce and interpret the example in the lecture notes. First, let us reproduce the pairs plot for three variables: body fat percentage, neck circumference, and abdominal circumference.

pairs(~bodyfat + Neck + Abdomen, data=bodyfat)

Now we reproduce the coplot from the lecture notes. We plot the relationship between body fat percentage and neck circumference, given abdominal circumference.

coplot(bodyfat ~ Neck | Abdomen, data=bodyfat)

One major source of confusion with the above plot is that it is unclear which scatter plot corresponds to which range of abdominal circumference. According to this Stack Overflow answer, it is left to right, then bottom to top.

The confusion can be avoided entirely by forcing the scatter plots onto a single row.

coplot(bodyfat ~ Neck | Abdomen, data=bodyfat, rows=1)

This is much clearer. So the first scatter plot contains only data from the men in the study with abdominal circumference in the first interval, and so on.

By default, the number (the number of conditioning intervals) and overlap (the fraction of overlap between adjacent intervals) arguments of the coplot() function are 6 and 0.5, respectively. We can play around with these options. For example, if we want 8 non-overlapping intervals, we have the following.

coplot(bodyfat ~ Neck | Abdomen, data=bodyfat, rows=1, number=8, overlap=0)

The pairs plot shows that both neck circumference and abdominal circumference seem like good predictors for body fat percentage. But the coplots show that if we know the abdominal circumference, the neck circumference does not give much more information, since each scatter plot in the coplot does not exhibit an obvious trend. This was discussed in lecture.
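The same phenomenon can be seen numerically in a simulated sketch (simulated data, not the bodyfat dataset itself): if y depends only on x1, and x2 is merely correlated with x1, then in a regression on both variables the coefficient of x2 comes out close to zero.

```r
set.seed(1)
n <- 252
x1 <- rnorm(n)                 # plays the role of Abdomen
x2 <- x1 + rnorm(n, sd = 0.3)  # plays the role of Neck: correlated with x1
y  <- 2 * x1 + rnorm(n)        # plays the role of bodyfat: depends on x1 only
coef(lm(y ~ x1 + x2))          # x2's coefficient is near 0
```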

Working with lm()

Let’s look at the bodyfat vs. Density plot.

plot(bodyfat$Density, bodyfat$bodyfat, cex=0.5)

This looks suspiciously linear. Let’s try fitting a line.

\[\text{bodyfat} = \beta_0 + \beta_1 \,\text{Density}\]

fit <- lm(bodyfat ~ Density, data=bodyfat)
summary(fit)
## 
## Call:
## lm(formula = bodyfat ~ Density, data = bodyfat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8049 -0.2394 -0.1645  0.0411 17.1552 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  477.650      4.576   104.4   <2e-16 ***
## Density     -434.360      4.334  -100.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.307 on 250 degrees of freedom
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9756 
## F-statistic: 1.004e+04 on 1 and 250 DF,  p-value: < 2.2e-16

Let us see what we could extract from the model.

names(fit)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
fit$coefficients # alternatively you can use coef(fit)
## (Intercept)     Density 
##    477.6499   -434.3600
head(fit$residuals) # alternatively you can use residuals(fit)
##            1            2            3            4            5 
## -0.237144946 -0.138924366 -0.007330123 -0.269396774  0.178405581 
##            6 
## -0.584961771
plot(bodyfat$Density, bodyfat$bodyfat, cex=0.5)
abline(coef(fit))

Looks like a good fit. However, let’s check the residual plot.

plot(bodyfat$Density, residuals(fit), cex=0.5)

Let’s remove the more egregious outliers to make the curve clearer.

idx <- which(abs(residuals(fit)) > 1)
plot(bodyfat$Density[-idx], residuals(fit)[-idx], cex=0.5)

Is this good or bad?

We are in a rare situation where we do know the relationship between the variables.

From the description, the body fat percentage is actually a function of density!

\[\text{body fat} = \frac{495}{\text{Density}} - 450\]
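We can sanity-check this against the head() output above: plugging the first two Density values into Siri’s equation recovers the recorded bodyfat values.

```r
siri <- function(d) 495 / d - 450
round(siri(c(1.0708, 1.0853)), 1)  # 12.3  6.1, matching rows 1 and 2
```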

To see what happens, let us perform linear regression again, but with the correct transformation.

\[\text{bodyfat} = \beta_0 + \beta_1 \frac{1}{\text{Density}}\]

plot(1/bodyfat$Density, bodyfat$bodyfat, cex=0.5)

fitinv <- lm(bodyfat ~ I(1/Density), data=bodyfat)
coef(fitinv)
##  (Intercept) I(1/Density) 
##    -439.3035     483.7755

This is fairly close to the true coefficients of \(-450\) and \(495\).

The following figure compares the true relationship (solid), the linear fit (dashed), and the second fit after the appropriate transformation (dotted). If we did not check the residuals of our first fit and stayed with that model, what bad things might happen?

x <- seq(0.8, 1.2, length.out=100)
plot(x, 495/x - 450, type="l", ylab="body fat", xlab="density")
points(x, coef(fit)[1] + coef(fit)[2] * x, type="l", lty=2)
points(x, coef(fitinv)[1] + coef(fitinv)[2] * 1/x, type="l", lty=3)