library(faraway)
library(SignifReg)
library(leaps)
library(ggplot2)
data(seatpos)
help(seatpos)
head(seatpos)
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## 1 46 180 187.2 184.9 95.2 36.1 45.3 41.3 -206.300
## 2 31 175 167.5 165.5 83.8 32.9 36.5 35.9 -178.210
## 3 23 100 153.6 152.2 82.9 26.0 36.6 31.0 -71.673
## 4 19 185 190.3 187.4 97.3 37.4 44.1 41.0 -257.720
## 5 23 159 178.0 174.1 93.9 29.5 40.1 36.9 -173.230
## 6 47 170 178.7 177.0 92.4 36.0 43.2 37.4 -185.150
n <- dim(seatpos)[1]
p <- dim(seatpos)[2] - 1
I am not releasing the .Rmd file because some of the code would be what you do on your homework. Below, some code blocks have been hidden.
alpha <- 0.15
Hidden code using SignifReg.
mod.backward
##
## Call:
## lm(formula = reg, data = data)
##
## Coefficients:
## (Intercept) Age HtShoes Leg
## 456.2137 0.5998 -2.3023 -6.8297
Hidden code implementing manual backward elimination. Pseudo-code: at each step, fit the current lm, look at the p-values in the summary output, and remove the least significant variable if its p-value exceeds alpha. The loop runs until it terminates early (see above) or until all variables have been eliminated.
## Removing Ht
## Removing Weight
## Removing Seated
## Removing Arm
## Removing Thigh
mod.backward.manual
##
## Call:
## lm(formula = f, data = seatpos)
##
## Coefficients:
## (Intercept) Age HtShoes Leg
## 456.2137 0.5998 -2.3023 -6.8297
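For reference, here is a minimal sketch of this kind of loop. This is not the hidden code; the variable names and the exact stopping rule are my own assumptions, using the same alpha threshold as above.
# Sketch of p-value-based backward elimination (assumed stopping rule: stop
# once every remaining variable has p-value <= alpha)
f <- "hipcenter ~ ."
repeat {
  mod <- lm(as.formula(f), data=seatpos)
  pv <- summary(mod)$coefficients[, 4]     # named vector of p-values
  pv <- pv[names(pv) != "(Intercept)"]     # ignore the intercept
  if (length(pv) == 0 || max(pv) <= alpha) break
  worst <- names(pv)[which.max(pv)]        # least significant variable
  cat("Removing", worst, "\n")
  f <- paste0(f, " - ", worst)             # e.g. "hipcenter ~ . - Ht"
}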
Hidden code using SignifReg.
mod.forward
##
## Call:
## lm(formula = reg, data = data)
##
## Coefficients:
## (Intercept) Ht Leg Age
## 452.1976 -2.3254 -6.7390 0.5807
Hidden code implementing manual forward selection. Pseudo-code: at each step, fit a candidate lm for each variable not yet in the model, look at its p-value in the summary, and add the most significant candidate if its p-value is below alpha. The outer loop runs until it terminates early (see above) or until all variables have been added.
## Adding Ht
## Adding Leg
## Adding Age
mod.forward.manual
##
## Call:
## lm(formula = f, data = seatpos)
##
## Coefficients:
## (Intercept) Ht Leg Age
## 452.1976 -2.3254 -6.7390 0.5807
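Similarly, a minimal sketch of the forward loop (again not the hidden code; the names and stopping rule are my own assumptions):
# Sketch of p-value-based forward selection (assumed stopping rule: stop once
# no candidate variable has p-value <= alpha when added to the current model)
f <- "hipcenter ~ 1"
remaining <- setdiff(names(seatpos), "hipcenter")
repeat {
  if (length(remaining) == 0) break
  pv <- sapply(remaining, function(v) {
    mod <- lm(as.formula(paste0(f, " + ", v)), data=seatpos)
    summary(mod)$coefficients[v, 4]        # p-value of the candidate variable
  })
  if (min(pv) > alpha) break
  best <- names(pv)[which.min(pv)]
  cat("Adding", best, "\n")
  f <- paste0(f, " + ", best)              # e.g. "hipcenter ~ 1 + Ht"
  remaining <- setdiff(remaining, best)
}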
Tips for the homework:
- SignifReg: some of the default settings may not be what you want.
- The formula you give to lm can be a string. You can use paste0 to append to strings. For example, in backward selection you might start with the full model "bodyfat ~ .", and then if you decide to remove “Age” you can edit this formula to become "bodyfat ~ . - Age". Similarly, in forward selection you might start with "bodyfat ~ 1" and append to get "bodyfat ~ 1 + Age".
- You can get the coefficient table from the summary of an lm object by using coef() or $coefficients on the summary object (see the short example after this list).
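For instance, a small illustration of the last two tips, using hipcenter here in place of the homework's bodyfat:
f <- "hipcenter ~ ."
f <- paste0(f, " - Age")            # drop Age: "hipcenter ~ . - Age"
mod <- lm(as.formula(f), data=seatpos)
coef(summary(mod))                  # coefficient table (estimate, SE, t value, p-value)
# equivalently: summary(mod)$coefficients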
If we have \(p\) explanatory variables, how many subsets of variables are there?
Exhaustive search: for all linear models coming from the possible subsets of variables, compute the value of the chosen criterion and pick the model with the highest/lowest value.
Before we review the criteria mentioned in lecture, let us remind ourselves of the log likelihood in the linear model \(y = X \beta + \epsilon\) with \(\epsilon \sim N(0, \sigma^2 I)\).
\[ \log \text{likelihood}(\beta, \sigma^2) = \log p(y ; \beta, \sigma^2) = - \frac{n}{2} \log(2 \pi \sigma^2) - \frac{\|y - X\beta\|^2}{2 \sigma^2} \] Check that this is maximized with \(\widehat{\beta}_{MLE} = (X^\top X)^{-1} X^\top y\) and \(\widehat{\sigma}^2_{MLE} = \frac{\text{RSS}}{n}\). By plugging this back into the likelihood, we see that the maximum of the log likelihood is \[-\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \left(\frac{\text{RSS}}{n}\right) - \frac{n}{2} = - \frac{n}{2}\left(\log (2\pi e) + \log \left(\frac{\text{RSS}}{n}\right)\right) \]
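A quick sanity check of this in R: logLik on an lm fit should return exactly this maximized value, since it also uses \(\widehat{\sigma}^2 = \text{RSS}/n\).
# Check the formula on the full model
mod <- lm(hipcenter ~ ., data=seatpos)
rss <- sum(resid(mod)^2)
-n/2 * (log(2*pi*exp(1)) + log(rss/n))   # the formula above
logLik(mod)                              # should agree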
We now list our criteria. Here the model \(m\) is the linear model built from some subset of \(p(m)\) variables, which form the columns of its design matrix \(X(m)\).
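(The forms below are one common convention, reconstructed here for reference; your lecture notes may state them with different additive constants, or with the \(\log(2\pi e)\) term included, which does not change which model is selected.)
\[
\begin{aligned}
\text{adjusted } R^2(m) &= 1 - \frac{\text{RSS}(m)/(n - p(m) - 1)}{\text{TSS}/(n-1)}, \\
C_p(m) &= \frac{\text{RSS}(m)}{\widehat{\sigma}^2_{\text{full}}} + 2\,(p(m)+1) - n, \\
\text{AIC}(m) &= n \log\left(\frac{\text{RSS}(m)}{n}\right) + 2\,(p(m)+1) + \text{const}, \\
\text{BIC}(m) &= n \log\left(\frac{\text{RSS}(m)}{n}\right) + \log(n)\,(p(m)+1) + \text{const},
\end{aligned}
\]
where TSS is the total sum of squares of the response and \(\widehat{\sigma}^2_{\text{full}}\) is the usual variance estimate from the full model.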
Suppose we pick one of these criteria and want to do an exhaustive search over all subsets of explanatory variables and find the best model (largest adjusted \(R^2\), or smallest AIC/BIC/\(C_p\)). Note that all four criteria are functions of \(\text{RSS}(m)\) and \(p(m)\), and these are the only places \(m\) enters the criteria. Thus, we can do the optimization in two stages:
Step 1: for each model size \(k = 1, \dots, p\), find the subset of \(k\) variables with the smallest RSS (for a fixed size, every criterion prefers smaller RSS).
Step 2: among these \(p\) candidate models, one per size, evaluate the criterion and pick the best.
Step 1 is automatically done for you by the regsubsets function in the leaps package.
rs <- regsubsets(hipcenter ~ ., data=seatpos, nvmax=p)
names(summary(rs))
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
summary(rs)$which
## (Intercept) Age Weight HtShoes Ht Seated Arm Thigh Leg
## 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 3 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 4 TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## 5 TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## 6 TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
summary(rs)$rss
## [1] 47615.79 44834.69 41938.09 41485.01 41313.00 41277.90 41266.80 41261.78
On your homework, you may use regsubsets for Step 1. For Step 2, I encourage you to do it manually, using the RSS values computed by regsubsets. For adjusted \(R^2\) and Mallows’s \(C_p\), you will be able to check your work with the regsubsets output.
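For example, here is the kind of check you can do for adjusted \(R^2\), a sketch using the formula above (as stated, an assumed convention) and the RSS values from regsubsets:
rss <- summary(rs)$rss
tss <- sum((seatpos$hipcenter - mean(seatpos$hipcenter))^2)
adjr2 <- 1 - (rss / (n - (1:p) - 1)) / (tss / (n - 1))   # p(m) = 1, ..., p
cbind(adjr2, summary(rs)$adjr2)                          # the two columns should agree
which.max(adjr2)                                         # size of the best model by adjusted R^2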
For BIC, you will find that computing it manually gives you different values from regsubsets for some reason. However, they will only differ by an additive constant.
Hidden code computing BIC manually using the above formulas
# Using the BIC() function
bicfn <- rep(0, p)
bicfn.baseline <- BIC(lm(hipcenter~1, data=seatpos))
for (i in 1:p) {
  # columns of the best size-i model found by regsubsets, plus hipcenter (the last column)
  dat.tmp <- seatpos[, c(summary(rs)$which[i,-1], T)]
  mod <- lm(hipcenter~., data=dat.tmp)
  bicfn[i] <- BIC(mod)
}
mybic # above formula without 2 * log(2 * pi * e)
## [1] 278.3418 279.6925 280.7922 284.0170 287.4967 291.1020 294.7293 298.3623
mybic2 # above formula with 2 * log(2 * pi * e)
## [1] 386.1812 387.5318 388.6315 391.8563 395.3360 398.9413 402.5687 406.2016
bicfn
## [1] 389.8188 391.1694 392.2691 395.4939 398.9736 402.5789 406.2062 409.8392
summary(rs)$bic
## [1] -31.36698 -30.01632 -28.91667 -25.69185 -22.21215 -18.60686 -14.97950
## [8] -11.34653
diff(mybic)
## [1] 1.350661 1.099650 3.224823 3.479693 3.605290 3.627363 3.632969
diff(mybic2)
## [1] 1.350661 1.099650 3.224823 3.479693 3.605290 3.627363 3.632969
diff(bicfn)
## [1] 1.350661 1.099650 3.224823 3.479693 3.605290 3.627363 3.632969
diff(summary(rs)$bic)
## [1] 1.350661 1.099650 3.224823 3.479693 3.605290 3.627363 3.632969
mybic - mybic.baseline
## [1] -35.00457 -33.65391 -32.55426 -29.32943 -25.84974 -22.24445 -18.61709
## [8] -14.98412
mybic2 - mybic2.baseline
## [1] -35.00457 -33.65391 -32.55426 -29.32943 -25.84974 -22.24445 -18.61709
## [8] -14.98412
bicfn - bicfn.baseline
## [1] -35.00457 -33.65391 -32.55426 -29.32943 -25.84974 -22.24445 -18.61709
## [8] -14.98412
The last few lines seem to indicate that the BIC computed by regsubsets is a “drop” in BIC from some baseline. This post suggests that the baseline is the BIC of the intercept-only model, but our computations do not seem to match this exactly… In the end, it does not really matter since they all differ by an additive constant.
Unfortunately, regsubsets does not compute AIC for you. Lucky for us, there are two built-in functions that compute seemingly different values for AIC. However, they only differ by an additive constant. Again, I encourage you to compute it manually.
Hidden code computing AIC using the above formula
# Using AIC() and extractAIC() functions
aicfn <- rep(0, p)
aicfn2 <- rep(0, p)
for (i in 1:p) {
  # same subsetting as in the BIC loop above
  dat.tmp <- seatpos[, c(summary(rs)$which[i,-1], T)]
  mod <- lm(hipcenter~., data=dat.tmp)
  aicfn[i] <- AIC(mod)
  aicfn2[i] <- extractAIC(mod)[2]  # extractAIC() drops constant terms that AIC() keeps
}
myaic # using the above formula without 2 * log(2 * pi * e)
## [1] 275.0667 274.7798 274.2418 275.8291 277.6712 279.6389 281.6286 283.6240
myaic2 # using the above formula with 2 * log(2 * pi * e)
## [1] 280.7424 280.4555 279.9176 281.5048 283.3469 285.3146 287.3044 289.2998
aicfn
## [1] 384.9060 384.6191 384.0811 385.6684 387.5105 389.4782 391.4680 393.4634
aicfn2
## [1] 275.0667 274.7798 274.2418 275.8291 277.6712 279.6389 281.6286 283.6240
diff(myaic)
## [1] -0.2869252 -0.5379362 1.5872367 1.8421065 1.9677041 1.9897765
## [7] 1.9953826
diff(myaic2)
## [1] -0.2869252 -0.5379362 1.5872367 1.8421065 1.9677041 1.9897765
## [7] 1.9953826
diff(aicfn)
## [1] -0.2869252 -0.5379362 1.5872367 1.8421065 1.9677041 1.9897765
## [7] 1.9953826
diff(aicfn2)
## [1] -0.2869252 -0.5379362 1.5872367 1.8421065 1.9677041 1.9897765
## [7] 1.9953826
See this answer for an in-depth criticism of stepwise methods, and more generally about the perils of data dredging (developing and confirming a model based on the same dataset). But maybe pretend you didn’t read this when you do your homework…
“If you torture the data long enough, it will confess.”—Ronald Coase
Standard normal:
n <- 1000
z <- rnorm(n)
qqnorm(z)
plot(qnorm((1:n)/(n+1)), sort(z), xlab="Normal quantiles", ylab="Sample quantiles")
abline(a=0, b=1)
How does changing the mean and variance affect the line?
z <- rnorm(n, mean=5, sd=10)
qqnorm(z)
plot(qnorm((1:n)/(n+1)), sort(z), xlab="Normal quantiles", ylab="Sample quantiles")
abline(a=5, b=10)
z <- c(rnorm(n/2, -3), rnorm(n/2, 3))
hist(z)
qqnorm(z)
abline(a=0, b=1)
z <- c(-3-rexp(n/4, rate=0.1), rnorm(n/2, sd=0.5), 3+rexp(n/4, rate=0.1))
hist(z)
qqnorm(z)
abline(a=0, b=1)
Q-Q plot with respect to another distribution
z <- rchisq(n, 1)
x <- qchisq((1:n)/(n+1), 1)
plot(x, sort(z), xlab="Chi squared (df=1) quantiles", ylab="Sample quantiles", cex=0.2)
abline(a=0, b=1)
n <- dim(seatpos)[1]
z <- rstudent(lm(hipcenter ~ ., data=seatpos))
x <- qt((1:n)/(n+1), n - p - 2)
plot(x, sort(z), xlab="t (df=n-p-2) quantiles", ylab="Sample quantiles")
abline(a=0, b=1)