lm()
mean?bodyfat <- read.csv("bodyfat.csv")
head(bodyfat)
## Density bodyfat Age Weight Height Neck Chest Abdomen Hip Thigh Knee
## 1 1.0708 12.3 23 154.25 67.75 36.2 93.1 85.2 94.5 59.0 37.3
## 2 1.0853 6.1 22 173.25 72.25 38.5 93.6 83.0 98.7 58.7 37.3
## 3 1.0414 25.3 22 154.00 66.25 34.0 95.8 87.9 99.2 59.6 38.9
## 4 1.0751 10.4 26 184.75 72.25 37.4 101.8 86.4 101.2 60.1 37.3
## 5 1.0340 28.7 24 184.25 71.25 34.4 97.3 100.0 101.9 63.2 42.2
## 6 1.0502 20.9 24 210.25 74.75 39.0 104.5 94.4 107.8 66.0 42.0
## Ankle Biceps Forearm Wrist
## 1 21.9 32.0 27.4 17.1
## 2 23.4 30.5 28.9 18.2
## 3 24.0 28.8 25.2 16.6
## 4 22.8 32.4 29.4 18.2
## 5 24.0 32.2 27.7 17.7
## 6 25.6 35.7 30.6 18.8
Let us model body fat percentage as a linear combination of weight, height, and abdomen.
y <- bodyfat$bodyfat
X <- cbind(1, bodyfat[, c("Weight", "Height", "Abdomen")])
p <- 3
n <- length(y)
X <- as.matrix(X)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
# check beta_hat
beta_hat
## [,1]
## 1 -36.6147193
## Weight -0.1307606
## Height -0.1270307
## Abdomen 0.9515631
y_hat <- X %*% beta_hat
e <- y - y_hat
y_bar <- mean(y)
RSS <- sum(e^2)
RegSS <- sum((y_hat - y_bar)^2)
TSS <- sum((y - y_bar)^2)
# check if RegSS + RSS = TSS
RegSS + RSS
## [1] 17578.99
TSS
## [1] 17578.99
# check R^2
R2 <- RegSS / TSS
R2
## [1] 0.7210902
fit <- lm(bodyfat ~ Weight + Height + Abdomen, data=bodyfat)
summary(fit)
##
## Call:
## lm(formula = bodyfat ~ Weight + Height + Abdomen, data = bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4481 -3.1699 -0.0009 3.0830 10.1897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.61472 7.03842 -5.202 4.13e-07 ***
## Weight -0.13076 0.02402 -5.443 1.26e-07 ***
## Height -0.12703 0.08898 -1.428 0.155
## Abdomen 0.95156 0.06253 15.218 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.446 on 248 degrees of freedom
## Multiple R-squared: 0.7211, Adjusted R-squared: 0.7177
## F-statistic: 213.7 on 3 and 248 DF, p-value: < 2.2e-16
We showed that under the normal model \(y = X\beta^* + \epsilon\) with \(\epsilon \sim N_n(0, \sigma^2 I_n)\), the mean of \(\hat{\beta}\) is \(\beta^*\), and the covariance matrix of \(\hat{\beta}\) is \(\sigma^2 (X^\top X)^{-1}\).
Thus the variance of \(\hat{\beta}_j\) is \(\sigma^2 v_{jj}\), where \(v_{jj}\) is the \(j\)th diagonal element of \((X^\top X)^{-1}\).
Therefore, if we want to test the null hypothesis
\[H_0 : \beta_j^* = \beta_j^{(0)},\] then the statistic
\[\frac{\hat{\beta}_j - \beta^*_j}{\sigma \sqrt{v_{jj}}},\]
is standard normal under the null hypothesis \(H_0\). If we knew \(\sigma\), we could do a \(Z\)-test of this statistic with the standard normal distribution, but we typically do not know \(\sigma\).
In lecture notes and in lab today, we showed that
\[\frac{e^\top e}{n - p - 1} = \frac{\text{RSS}}{n - p - 1}\] is an unbiased estimator of \(\sigma^2\), so the standard error of \(\hat{\beta}_j\) is
\[\sqrt{\frac{e^\top e}{n - p - 1}} \sqrt{v_{jj}}.\]
It turns out (proof omitted) that \[t = \frac{\hat{\beta}_j - \beta^*_j}{\sqrt{\frac{e^\top e}{n - p - 1}} \sqrt{v_{jj}}},\] follows a \(t\)-distribution with \(n - p - 1\) degrees of freedom. Thus you can use this to test the null hypothesis. If \(|t|\) is very large, then we can reject the null hypothesis.
lm()
does a \(t\)-test for \(H_0 : \beta_j^* = 0\) for each \(j\). Let us manually do this for the Weight
coefficient. We can compute the standard error, \(t\)-statistic, and p-value.
V <- solve(t(X) %*% X)
rse <- sqrt(RSS / (n - p - 1))
rse
## [1] 4.446343
se <- rse * sqrt(V[2, 2])
se
## [1] 0.0240232
t <- beta_hat[2] / se
t
## [1] -5.443097
pval <- 2 * pt(t, n - p - 1)
pval
## [1] 1.257085e-07
Let’s check with lm()
.
summary(fit)
##
## Call:
## lm(formula = bodyfat ~ Weight + Height + Abdomen, data = bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4481 -3.1699 -0.0009 3.0830 10.1897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.61472 7.03842 -5.202 4.13e-07 ***
## Weight -0.13076 0.02402 -5.443 1.26e-07 ***
## Height -0.12703 0.08898 -1.428 0.155
## Abdomen 0.95156 0.06253 15.218 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.446 on 248 degrees of freedom
## Multiple R-squared: 0.7211, Adjusted R-squared: 0.7177
## F-statistic: 213.7 on 3 and 248 DF, p-value: < 2.2e-16
lm()
also tests the global null hypothesis that all slopes are zero.
\[H_0 : \beta^*_1 = \beta^*_2 = \cdots = \beta^*_p = 0.\]
Under this null hypothesis, the statistic \[F_0 = \frac{\text{RegSS}/p}{\text{RSS}/(n-p-1)}\] follows an \(F\)-distribution with \(p\) and \(n - p - 1\) degrees of freedom. [See Section 6.2.1 of Fox.] If \(F_0\) is too large, we can reject the null hypothesis.
Fstat <- (RegSS / p) / (RSS / (n - p - 1))
Fstat
## [1] 213.7255
pval <- 1 - pf(Fstat, p, n - p - 1)
pval
## [1] 0
summary(fit)
##
## Call:
## lm(formula = bodyfat ~ Weight + Height + Abdomen, data = bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4481 -3.1699 -0.0009 3.0830 10.1897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.61472 7.03842 -5.202 4.13e-07 ***
## Weight -0.13076 0.02402 -5.443 1.26e-07 ***
## Height -0.12703 0.08898 -1.428 0.155
## Abdomen 0.95156 0.06253 15.218 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.446 on 248 degrees of freedom
## Multiple R-squared: 0.7211, Adjusted R-squared: 0.7177
## F-statistic: 213.7 on 3 and 248 DF, p-value: < 2.2e-16
Instead of testing for all slopes being zero, we can test for a subset being zero.
\[H_0 : \beta^*_1 = \beta^*_2 = \cdots = \beta^*_q = 0.\]
Let \(\text{RSS}_1\), \(\text{RegSS}_1\), and \(\text{TSS}_1\) be for the full model, and let \(\text{RSS}_0\), \(\text{RegSS}_0\), and \(\text{TSS}_0\) be for the null model.
Why are the following true? \[\text{RSS}_0 \ge \text{RSS}_1\] \[\text{TSS}_0 = \text{TSS}_1\] \[\text{RSS}_0 - \text{RSS}_1 = \text{RegSS}_1 - \text{RegSS}_0\] \[\text{RegSS}_0 \le \text{RegSS}_1\]
The \(F\)-statistic to test this null hypothesis is \[F = \frac{(\text{RegSS}_1 - \text{RegSS}_0) / q}{\text{RSS}_1 / (n - p - 1)}.\] If this is too large, we can reject the null.
[Note that the “all slopes” \(F\)-statistic is a special case of this one: take \(q = p\), and note that \(\text{RegSS}_0 = 0\).]
Let us test if both weight and abdomen are zero. We have already fitted the full model, so we just need to find \(\text{RegSS}_0\).
q <- 2
fit0 <- lm(bodyfat ~ Height, data=bodyfat)
y_hat0 <- fitted(fit0)
RegSS0 <- sum((y_hat0 - y_bar)^2)
RegSS0
## [1] 140.7976
RegSS
## [1] 12676.04
Fstat <- ((RegSS - RegSS0) / q) / (RSS / (n - p - 1))
Fstat
## [1] 317.0273
pval <- 1 - pf(Fstat, q, n - p - 1)
pval
## [1] 0