library(faraway)
library(SignifReg)
library(leaps)
library(ggplot2)
data(seatpos)
help(seatpos)
head(seatpos)
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## 1 46 180 187.2 184.9 95.2 36.1 45.3 41.3 -206.300
## 2 31 175 167.5 165.5 83.8 32.9 36.5 35.9 -178.210
## 3 23 100 153.6 152.2 82.9 26.0 36.6 31.0 -71.673
## 4 19 185 190.3 187.4 97.3 37.4 44.1 41.0 -257.720
## 5 23 159 178.0 174.1 93.9 29.5 40.1 36.9 -173.230
## 6 47 170 178.7 177.0 92.4 36.0 43.2 37.4 -185.150
n <- dim(seatpos)[1]
p <- dim(seatpos)[2] - 1
I am not releasing the .Rmd file because some of the code would be what you do on your homework. Below, some code blocks have been hidden.
alpha <- 0.15
Hidden code using SignifReg.
mod.backward
##
## Call:
## lm(formula = reg, data = data)
##
## Coefficients:
## (Intercept) Age HtShoes Leg
## 456.2137 0.5998 -2.3023 -6.8297
Hidden code implementing manual backward elimination. Pseudo-code: at each step, fit the current lm, look at the p-values in the summary output, and remove the least significant variable if its p-value exceeds alpha. The loop runs until it terminates early (see above) or until all variables have been eliminated.
## Removing Ht
## Removing Weight
## Removing Seated
## Removing Arm
## Removing Thigh
mod.backward.manual
##
## Call:
## lm(formula = f, data = seatpos)
##
## Coefficients:
## (Intercept) Age HtShoes Leg
## 456.2137 0.5998 -2.3023 -6.8297
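For reference, here is a minimal sketch of this kind of loop. This is not the hidden code; the variable names and the exact stopping rule are my own assumptions, using the same alpha threshold as above.
# Sketch of p-value-based backward elimination (assumed stopping rule: stop
# once every remaining variable has p-value <= alpha)
f <- "hipcenter ~ ."
repeat {
  mod <- lm(as.formula(f), data=seatpos)
  pv <- summary(mod)$coefficients[, 4]     # named vector of p-values
  pv <- pv[names(pv) != "(Intercept)"]     # ignore the intercept
  if (length(pv) == 0 || max(pv) <= alpha) break
  worst <- names(pv)[which.max(pv)]        # least significant variable
  cat("Removing", worst, "\n")
  f <- paste0(f, " - ", worst)             # e.g. "hipcenter ~ . - Ht"
}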
Hidden code using SignifReg.
mod.forward
##
## Call:
## lm(formula = reg, data = data)
##
## Coefficients:
## (Intercept) Ht Leg Age
## 452.1976 -2.3254 -6.7390 0.5807
Hidden code implementing manual forward selection. Pseudo-code: at each step, fit a candidate lm for each variable not yet in the model, look at its p-value in the summary, and add the most significant candidate if its p-value is below alpha. The outer loop runs until it terminates early (see above) or until all variables have been added.
## Adding Ht
## Adding Leg
## Adding Age
mod.forward.manual
##
## Call:
## lm(formula = f, data = seatpos)
##
## Coefficients:
## (Intercept) Ht Leg Age
## 452.1976 -2.3254 -6.7390 0.5807
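Similarly, a minimal sketch of the forward loop (again not the hidden code; the names and stopping rule are my own assumptions):
# Sketch of p-value-based forward selection (assumed stopping rule: stop once
# no candidate variable has p-value <= alpha when added to the current model)
f <- "hipcenter ~ 1"
remaining <- setdiff(names(seatpos), "hipcenter")
repeat {
  if (length(remaining) == 0) break
  pv <- sapply(remaining, function(v) {
    mod <- lm(as.formula(paste0(f, " + ", v)), data=seatpos)
    summary(mod)$coefficients[v, 4]        # p-value of the candidate variable
  })
  if (min(pv) > alpha) break
  best <- names(pv)[which.min(pv)]
  cat("Adding", best, "\n")
  f <- paste0(f, " + ", best)              # e.g. "hipcenter ~ 1 + Ht"
  remaining <- setdiff(remaining, best)
}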
Tips for the homework:
- SignifReg: some of the default settings may not be what you want.
- The formula you give to lm can be a string. You can use paste0 to append to strings. For example, in backward selection you might start with the full model "bodyfat ~ .", and then if you decide to remove “Age” you can edit this formula to become "bodyfat ~ . - Age". Similarly, in forward selection you might start with "bodyfat ~ 1" and append to get "bodyfat ~ 1 + Age".
- You can get the coefficient table from the summary of an lm object by using coef() or $coefficients on the summary object (see the short example after this list).
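For instance, a small illustration of the last two tips, using hipcenter here in place of the homework's bodyfat:
f <- "hipcenter ~ ."
f <- paste0(f, " - Age")            # drop Age: "hipcenter ~ . - Age"
mod <- lm(as.formula(f), data=seatpos)
coef(summary(mod))                  # coefficient table (estimate, SE, t value, p-value)
# equivalently: summary(mod)$coefficients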
If we have \(p\) explanatory variables, how many subsets of variables are there?
Exhaustive search: for all linear models coming from the possible subsets of variables, compute the value of the chosen criterion and pick the model with the highest/lowest value.
Before we review the criteria mentioned in lecture, let us remind ourselves of the log likelihood in the linear model \(y = X \beta + \epsilon\) with \(\epsilon \sim N(0, \sigma^2 I)\).
\[ \log \text{likelihood}(\beta, \sigma^2) = \log p(y ; \beta, \sigma^2) = - \frac{n}{2} \log(2 \pi \sigma^2) - \frac{\|y - X\beta\|^2}{2 \sigma^2} \] Check that this is maximized with \(\widehat{\beta}_{MLE} = (X^\top X)^{-1} X^\top y\) and \(\widehat{\sigma}^2_{MLE} = \frac{\text{RSS}}{n}\). By plugging this back into the likelihood, we see that the maximum of the log likelihood is \[-\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \left(\frac{\text{RSS}}{n}\right) - \frac{n}{2} = - \frac{n}{2}\left(\log (2\pi e) + \log \left(\frac{\text{RSS}}{n}\right)\right) \]
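A quick sanity check of this in R: logLik on an lm fit should return exactly this maximized value, since it also uses \(\widehat{\sigma}^2 = \text{RSS}/n\).
# Check the formula on the full model
mod <- lm(hipcenter ~ ., data=seatpos)
rss <- sum(resid(mod)^2)
-n/2 * (log(2*pi*exp(1)) + log(rss/n))   # the formula above
logLik(mod)                              # should agree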
We now list our criteria. Here the model \(m\) is the linear model built from some subset of \(p(m)\) variables, which form the columns of its design matrix \(X(m)\).
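(The forms below are one common convention, reconstructed here for reference; your lecture notes may state them with different additive constants, or with the \(\log(2\pi e)\) term included, which does not change which model is selected.)
\[
\begin{aligned}
\text{adjusted } R^2(m) &= 1 - \frac{\text{RSS}(m)/(n - p(m) - 1)}{\text{TSS}/(n-1)}, \\
C_p(m) &= \frac{\text{RSS}(m)}{\widehat{\sigma}^2_{\text{full}}} + 2\,(p(m)+1) - n, \\
\text{AIC}(m) &= n \log\left(\frac{\text{RSS}(m)}{n}\right) + 2\,(p(m)+1) + \text{const}, \\
\text{BIC}(m) &= n \log\left(\frac{\text{RSS}(m)}{n}\right) + \log(n)\,(p(m)+1) + \text{const},
\end{aligned}
\]
where TSS is the total sum of squares of the response and \(\widehat{\sigma}^2_{\text{full}}\) is the usual variance estimate from the full model.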
Suppose we pick one of these criteria and want to do an exhaustive search over all subsets of explanatory variables and find the best model (largest adjusted \(R^2\), or smallest AIC/BIC/\(C_p\)). Note that all four criteria are functions of \(\text{RSS}(m)\) and \(p(m)\), and these are the only places \(m\) enters the criteria. Thus, we can do the optimization in two stages:
Step 1: for each model size \(k = 1, \dots, p\), find the subset of \(k\) variables with the smallest RSS (for a fixed size, every criterion prefers smaller RSS).
Step 2: among these \(p\) candidate models, one per size, evaluate the criterion and pick the best.
Step 1 is automatically done for you by the regsubsets function in the leaps package.
rs <- regsubsets(hipcenter ~ ., data=seatpos, nvmax=p)
names(summary(rs))
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
summary(rs)$which
## (Intercept) Age Weight HtShoes Ht Seated Arm Thigh Leg
## 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 3 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 4 TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## 5 TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## 6 TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
summary(rs)$rss
## [1] 47615.79 44834.69 41938.09 41485.01 41313.00 41277.90 41266.80 41261.78
On your homework, you may use regsubsets for Step 1. For Step 2, I encourage you to do it manually, using the RSS values computed by regsubsets. For adjusted \(R^2\) and Mallows’s \(C_p\), you will be able to check your work with the regsubsets output.
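For example, here is the kind of check you can do for adjusted \(R^2\), a sketch using the formula above (as stated, an assumed convention) and the RSS values from regsubsets:
rss <- summary(rs)$rss
tss <- sum((seatpos$hipcenter - mean(seatpos$hipcenter))^2)
adjr2 <- 1 - (rss / (n - (1:p) - 1)) / (tss / (n - 1))   # p(m) = 1, ..., p
cbind(adjr2, summary(rs)$adjr2)                          # the two columns should agree
which.max(adjr2)                                         # size of the best model by adjusted R^2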
For BIC, you will find that computing it manually gives you different values from regsubsets for some reason. However, they will only differ by an additive constant.
Hidden code computing BIC manually using the above formulas
# Using the BIC() function
bicfn <- rep(0, p)
bicfn.baseline <- BIC(lm(hipcenter~1, data=seatpos))
for (i in 1:p) {
  # columns of the best size-i model found by regsubsets, plus hipcenter (the last column)
  dat.tmp <- seatpos[, c(summary(rs)$which[i,-1], T)]
  mod <- lm(hipcenter~., data=dat.tmp)
  bicfn[i] <- BIC(mod)
}
mybic # above formula without 2 * log(2 * pi * e)
## [1] 278.3418 279.6925 280.7922 284.0170 287.4967 291.1020 294.7293 298.3623
mybic2 # above formula with 2 * log(2 * pi * e)
## [1] 386.1812 387.5318 388.6315 391.8563 395.3360 398.9413 402.5687 406.2016
bicfn
## [1] 389.8188 391.1694 392.2691 395.4939 398.9736 402.5789 406.2062 409.8392
summary(rs)$bic
## [1] -31.36698 -30.01632 -28.91667 -25.69185 -22.21215 -18.60686 -14.97950
## [8] -11.34653
diff(mybic)
## [1] 1.350661 1.099650 3.224823 3.479693 3.605290 3.627363 3.632969
diff(mybic2)
## [1] 1.350661 1.099650 3.224823 3.479693 3.605290 3.627363 3.632969
diff(bicfn)
## [1] 1.350661 1.099650 3.224823 3.479693 3.605290 3.627363 3.632969
diff(summary(rs)$bic)
## [1] 1.350661 1.099650 3.224823 3.479693 3.605290 3.627363 3.632969
mybic - mybic.baseline
## [1] -35.00457 -33.65391 -32.55426 -29.32943 -25.84974 -22.24445 -18.61709
## [8] -14.98412
mybic2 - mybic2.baseline
## [1] -35.00457 -33.65391 -32.55426 -29.32943 -25.84974 -22.24445 -18.61709
## [8] -14.98412
bicfn - bicfn.baseline
## [1] -35.00457 -33.65391 -32.55426 -29.32943 -25.84974 -22.24445 -18.61709
## [8] -14.98412
The last few lines seem to indicate that the BIC computed by regsubsets is a “drop” in BIC from some baseline. This post suggests that the baseline is the BIC of the intercept-only model, but our computations do not seem to match this exactly… In the end, it does not really matter since they all differ by an additive constant.
Unfortunately, regsubsets does not compute AIC for you. Lucky for us, there are two built-in functions that compute seemingly different values for AIC. However, they only differ by an additive constant. Again, I encourage you to compute it manually.
Hidden code computing AIC using the above formula
# Using AIC() and extractAIC() functions
aicfn <- rep(0, p)
aicfn2 <- rep(0, p)
for (i in 1:p) {
  # same subsetting as in the BIC loop above
  dat.tmp <- seatpos[, c(summary(rs)$which[i,-1], T)]
  mod <- lm(hipcenter~., data=dat.tmp)
  aicfn[i] <- AIC(mod)
  aicfn2[i] <- extractAIC(mod)[2]  # extractAIC() drops constant terms that AIC() keeps
}
myaic # using the above formula without 2 * log(2 * pi * e)
## [1] 275.0667 274.7798 274.2418 275.8291 277.6712 279.6389 281.6286 283.6240
myaic2 # using the above formula with 2 * log(2 * pi * e)
## [1] 280.7424 280.4555 279.9176 281.5048 283.3469 285.3146 287.3044 289.2998
aicfn
## [1] 384.9060 384.6191 384.0811 385.6684 387.5105 389.4782 391.4680 393.4634
aicfn2
## [1] 275.0667 274.7798 274.2418 275.8291 277.6712 279.6389 281.6286 283.6240
diff(myaic)
## [1] -0.2869252 -0.5379362 1.5872367 1.8421065 1.9677041 1.9897765
## [7] 1.9953826
diff(myaic2)
## [1] -0.2869252 -0.5379362 1.5872367 1.8421065 1.9677041 1.9897765
## [7] 1.9953826
diff(aicfn)
## [1] -0.2869252 -0.5379362 1.5872367 1.8421065 1.9677041 1.9897765
## [7] 1.9953826
diff(aicfn2)
## [1] -0.2869252 -0.5379362 1.5872367 1.8421065 1.9677041 1.9897765
## [7] 1.9953826
See this answer for an in-depth criticism of stepwise methods, and more generally about the perils of data dredging (developing and confirming a model based on the same dataset). But maybe pretend you didn’t read this when you do your homework…
“If you torture the data long enough, it will confess.”—Ronald Coase
Standard normal:
n <- 1000
z <- rnorm(n)
qqnorm(z)
plot(qnorm((1:n)/(n+1)), sort(z), xlab="Normal quantiles", ylab="Sample quantiles")
abline(a=0, b=1)
How does changing the mean and variance affect the line?
z <- rnorm(n, mean=5, sd=10)
qqnorm(z)
plot(qnorm((1:n)/(n+1)), sort(z), xlab="Normal quantiles", ylab="Sample quantiles")
abline(a=5, b=10)
z <- c(rnorm(n/2, -3), rnorm(n/2, 3))
hist(z)
qqnorm(z)
abline(a=0, b=1)
z <- c(-3-rexp(n/4, rate=0.1), rnorm(n/2, sd=0.5), 3+rexp(n/4, rate=0.1))
hist(z)
qqnorm(z)
abline(a=0, b=1)
Q-Q plot with respect to another distribution
z <- rchisq(n, 1)
x <- qchisq((1:n)/(n+1), 1)
plot(x, sort(z), xlab="Chi squared (df=1) quantiles", ylab="Sample quantiles", cex=0.2)
abline(a=0, b=1)
n <- dim(seatpos)[1]
z <- rstudent(lm(hipcenter ~ ., data=seatpos))
x <- qt((1:n)/(n+1), n - p - 2)
plot(x, sort(z), xlab="t (df=n-p-2) quantiles", ylab="Sample quantiles")
abline(a=0, b=1)