bodyfat <- read.csv("bodyfat.csv")
head(bodyfat)
##   Density bodyfat Age Weight Height Neck Chest Abdomen   Hip Thigh Knee
## 1  1.0708    12.3  23 154.25  67.75 36.2  93.1    85.2  94.5  59.0 37.3
## 2  1.0853     6.1  22 173.25  72.25 38.5  93.6    83.0  98.7  58.7 37.3
## 3  1.0414    25.3  22 154.00  66.25 34.0  95.8    87.9  99.2  59.6 38.9
## 4  1.0751    10.4  26 184.75  72.25 37.4 101.8    86.4 101.2  60.1 37.3
## 5  1.0340    28.7  24 184.25  71.25 34.4  97.3   100.0 101.9  63.2 42.2
## 6  1.0502    20.9  24 210.25  74.75 39.0 104.5    94.4 107.8  66.0 42.0
##   Ankle Biceps Forearm Wrist
## 1  21.9   32.0    27.4  17.1
## 2  23.4   30.5    28.9  18.2
## 3  24.0   28.8    25.2  16.6
## 4  22.8   32.4    29.4  18.2
## 5  24.0   32.2    27.7  17.7
## 6  25.6   35.7    30.6  18.8

Let us model body fat percentage as a linear function of weight, height, and abdomen circumference.

y <- bodyfat$bodyfat
# design matrix: a column of 1s for the intercept, then the three predictors
X <- cbind(1, bodyfat[, c("Weight", "Height", "Abdomen")])
p <- 3           # number of predictors (not counting the intercept)
n <- length(y)   # number of observations
X <- as.matrix(X)
# solve the normal equations (X^T X) beta = X^T y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
# check beta_hat
beta_hat
##                [,1]
## 1       -36.6147193
## Weight   -0.1307606
## Height   -0.1270307
## Abdomen   0.9515631
y_hat <- X %*% beta_hat          # fitted values
e <- y - y_hat                   # residuals
y_bar <- mean(y)
RSS <- sum(e^2)                  # residual sum of squares
RegSS <- sum((y_hat - y_bar)^2)  # regression sum of squares
TSS <- sum((y - y_bar)^2)        # total sum of squares

# check if RegSS + RSS = TSS
RegSS + RSS
## [1] 17578.99
TSS
## [1] 17578.99
# check R^2
R2 <- RegSS / TSS
R2
## [1] 0.7210902
fit <- lm(bodyfat ~ Weight + Height + Abdomen, data=bodyfat)
summary(fit)
## 
## Call:
## lm(formula = bodyfat ~ Weight + Height + Abdomen, data = bodyfat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4481  -3.1699  -0.0009   3.0830  10.1897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.61472    7.03842  -5.202 4.13e-07 ***
## Weight       -0.13076    0.02402  -5.443 1.26e-07 ***
## Height       -0.12703    0.08898  -1.428    0.155    
## Abdomen       0.95156    0.06253  15.218  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.446 on 248 degrees of freedom
## Multiple R-squared:  0.7211, Adjusted R-squared:  0.7177 
## F-statistic: 213.7 on 3 and 248 DF,  p-value: < 2.2e-16
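
summary() also reports an adjusted R-squared of 0.7177, which penalizes \(R^2\) for the number of predictors. As a small side check, here is a sketch of the usual degrees-of-freedom adjustment, reusing the quantities computed above; it should agree with the value stored in the summary object.

R2_adj <- 1 - (RSS / (n - p - 1)) / (TSS / (n - 1))
R2_adj                       # should match the 0.7177 reported above
summary(fit)$adj.r.squared   # the value lm's summary stores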

\(t\)-test for individual coefficients

We showed that under the normal model \(y = X\beta^* + \epsilon\) with \(\epsilon \sim N_n(0, \sigma^2 I_n)\), the mean of \(\hat{\beta}\) is \(\beta^*\), and the covariance matrix of \(\hat{\beta}\) is \(\sigma^2 (X^\top X)^{-1}\).

Thus the variance of \(\hat{\beta}_j\) is \(\sigma^2 v_{jj}\), where \(v_{jj}\) is the \(j\)th diagonal element of \((X^\top X)^{-1}\).

Therefore, if we want to test the null hypothesis

\[H_0 : \beta_j^* = \beta_j^{(0)},\] then the statistic

\[\frac{\hat{\beta}_j - \beta_j^{(0)}}{\sigma \sqrt{v_{jj}}}\]

is standard normal under the null hypothesis \(H_0\). If we knew \(\sigma\), we could compare this statistic to the standard normal distribution (a \(Z\)-test), but we typically do not know \(\sigma\).

In lecture notes and in lab today, we showed that

\[\frac{e^\top e}{n - p - 1} = \frac{\text{RSS}}{n - p - 1}\] is an unbiased estimator of \(\sigma^2\), so the standard error of \(\hat{\beta}_j\) is

\[\sqrt{\frac{e^\top e}{n - p - 1}} \sqrt{v_{jj}}.\]

It turns out (proof omitted) that \[t = \frac{\hat{\beta}_j - \beta_j^{(0)}}{\sqrt{\frac{e^\top e}{n - p - 1}} \sqrt{v_{jj}}}\] follows a \(t\)-distribution with \(n - p - 1\) degrees of freedom. Thus we can use this statistic to test the null hypothesis: if \(|t|\) is very large, we reject \(H_0\).
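
To make "very large" concrete: at the 5% level, the two-sided rejection threshold is the 0.975 quantile of the \(t\)-distribution with \(n - p - 1 = 248\) degrees of freedom, which is roughly 1.97.

qt(0.975, n - p - 1)   # two-sided 5% critical value; reject H_0 when |t| exceeds this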

lm() does a \(t\)-test for \(H_0 : \beta_j^* = 0\) for each \(j\). Let us manually do this for the Weight coefficient. We can compute the standard error, \(t\)-statistic, and p-value.

V <- solve(t(X) %*% X)
rse <- sqrt(RSS / (n - p - 1))   # residual standard error: estimate of sigma
rse
## [1] 4.446343
se <- rse * sqrt(V[2, 2])        # standard error of the Weight coefficient (intercept is first)
se
## [1] 0.0240232
t <- beta_hat[2] / se            # t-statistic for H_0: beta_Weight = 0
t
## [1] -5.443097
pval <- 2 * pt(-abs(t), n - p - 1)   # two-sided p-value
pval
## [1] 1.257085e-07

Let’s check with lm().

summary(fit)
## 
## Call:
## lm(formula = bodyfat ~ Weight + Height + Abdomen, data = bodyfat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4481  -3.1699  -0.0009   3.0830  10.1897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.61472    7.03842  -5.202 4.13e-07 ***
## Weight       -0.13076    0.02402  -5.443 1.26e-07 ***
## Height       -0.12703    0.08898  -1.428    0.155    
## Abdomen       0.95156    0.06253  15.218  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.446 on 248 degrees of freedom
## Multiple R-squared:  0.7211, Adjusted R-squared:  0.7177 
## F-statistic: 213.7 on 3 and 248 DF,  p-value: < 2.2e-16
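
The standard error, \(t\)-statistic, and p-value for Weight all match. We only did this for one coefficient, but the same computation works for every coefficient at once. Here is a sketch that rebuilds the whole coefficient table from V, rse, and beta_hat and compares it with the table stored in the summary object.

b <- drop(beta_hat)                       # estimates as a plain vector
se_all <- rse * sqrt(diag(V))             # standard errors of all four coefficients
t_all <- b / se_all                       # t-statistics for H_0: beta_j = 0
p_all <- 2 * pt(-abs(t_all), n - p - 1)   # two-sided p-values
cbind(b, se_all, t_all, p_all)
summary(fit)$coefficients                 # lm's version of the same table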

\(F\)-test for all slopes

lm() also tests the global null hypothesis that all slopes are zero.

\[H_0 : \beta^*_1 = \beta^*_2 = \cdots = \beta^*_p = 0.\]

Under this null hypothesis, the statistic \[F_0 = \frac{\text{RegSS}/p}{\text{RSS}/(n-p-1)}\] follows an \(F\)-distribution with \(p\) and \(n - p - 1\) degrees of freedom. [See Section 6.2.1 of Fox.] If \(F_0\) is too large, we can reject the null hypothesis.

Fstat <- (RegSS / p) / (RSS / (n - p - 1))
Fstat
## [1] 213.7255
pval <- 1 - pf(Fstat, p, n - p - 1)   # underflows to 0 for an F statistic this large
pval
## [1] 0
summary(fit)
## 
## Call:
## lm(formula = bodyfat ~ Weight + Height + Abdomen, data = bodyfat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4481  -3.1699  -0.0009   3.0830  10.1897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.61472    7.03842  -5.202 4.13e-07 ***
## Weight       -0.13076    0.02402  -5.443 1.26e-07 ***
## Height       -0.12703    0.08898  -1.428    0.155    
## Abdomen       0.95156    0.06253  15.218  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.446 on 248 degrees of freedom
## Multiple R-squared:  0.7211, Adjusted R-squared:  0.7177 
## F-statistic: 213.7 on 3 and 248 DF,  p-value: < 2.2e-16
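
The F statistic matches, and our manual p-value printed as 0 only because 1 - pf(...) underflows; summary() reports < 2.2e-16 for the same test. Asking pf() for the upper tail directly avoids the cancellation, and the F statistic is also stored in the summary object.

pf(Fstat, p, n - p - 1, lower.tail = FALSE)   # upper-tail p-value, computed without 1 - pf(...)
summary(fit)$fstatistic                       # F statistic with its numerator and denominator df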

\(F\)-test for subset of slopes

Instead of testing whether all of the slopes are zero, we can test whether a subset of them is zero.

\[H_0 : \beta^*_1 = \beta^*_2 = \cdots = \beta^*_q = 0.\]

Let \(\text{RSS}_1\), \(\text{RegSS}_1\), and \(\text{TSS}_1\) denote the sums of squares for the full model, and \(\text{RSS}_0\), \(\text{RegSS}_0\), and \(\text{TSS}_0\) the corresponding quantities for the null model.

Why are the following true? \[\text{RSS}_0 \ge \text{RSS}_1\] \[\text{TSS}_0 = \text{TSS}_1\] \[\text{RSS}_0 - \text{RSS}_1 = \text{RegSS}_1 - \text{RegSS}_0\] \[\text{RegSS}_0 \le \text{RegSS}_1\]

The \(F\)-statistic to test this null hypothesis is \[F = \frac{(\text{RegSS}_1 - \text{RegSS}_0) / q}{\text{RSS}_1 / (n - p - 1)}.\] If this is too large, we can reject the null.

[Note that the “all slopes” \(F\)-statistic is a special case of this one: take \(q = p\), and note that \(\text{RegSS}_0 = 0\).]

Let us test whether the coefficients on weight and abdomen are both zero, so that the null model contains only height (and the intercept). We have already fitted the full model, so we just need to find \(\text{RegSS}_0\).

q <- 2                                       # number of slopes set to zero under H_0
fit0 <- lm(bodyfat ~ Height, data=bodyfat)   # null model: intercept and Height only
y_hat0 <- fitted(fit0)
RegSS0 <- sum((y_hat0 - y_bar)^2)
RegSS0
## [1] 140.7976
RegSS
## [1] 12676.04
Fstat <- ((RegSS - RegSS0) / q) / (RSS / (n - p - 1))
Fstat
## [1] 317.0273
pval <- 1 - pf(Fstat, q, n - p - 1)   # again underflows to 0
pval
## [1] 0
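
As a cross-check, R's anova() compares two nested lm fits with exactly this kind of F-test, so it should reproduce the F statistic of about 317 (and report a very small p-value rather than an exact 0).

anova(fit0, fit)   # nested-model comparison: Height-only model vs. the full model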