Please fill out the course evaluation at https://course-evaluations.berkeley.edu/. It is important for the department to get your feedback so that we can improve the department’s courses for future students.

library(ggplot2)
library(reshape2)

Classification trees

The rpart vignette has all the details about rpart. Some of the notes below are adapted from there.

In a tree estimator, you can think of your dataset as a crowd of people that start at the top of the tree and walk downward. At each node is a sign telling people to go right or left depending on their characteristics (e.g., go left if you are under 18 years old, otherwise go right).

In the case of classification, you have some unknown categorical variable (not necessarily binary), say, “favorite color,” and you hope that at the bottom of the tree, each terminal node has a group of people that mostly like the same color. In other words, the broad goal is to choose good splitting criteria at each node so that the “purity” of the groups increases after the split.

Impurity measures

How do we measure impurity at a node \(G\) [before a split]? Suppose the response variable has \(1,\ldots,C\) categories, and the proportions of the \(n_G\) people [at this node \(G\)] in each category are \(p_1, \ldots, p_C\), respectively. [These are different from the \(\bar{p}_1\) and \(\bar{p}_2\) in your lecture notes, which are binary classification proportions in two different groups.]

The impurity of the group of datapoints at this node is \[I(G) = f(p_1) + f(p_2) + \cdots + f(p_C),\] for some concave function \(f\) with \(f(0)=f(1)=0\). (This latter condition ensures \(I(G)=0\) if \(G\) is pure.)

In lecture, we only considered binary classification (label \(0\) or \(1\)). There, if \(p\) denotes the proportion of the \(n_G\) people with label \(1\), the impurity is simply \[I(G) = f(p) + f(1-p).\]

Examples of \(f\):

  • Gini index: \(f(p) = p(1-p)\). In general, \(I(G) = \sum_{i=1}^C p_i(1-p_i) = 1 - \sum_{i=1}^C p_i^2\). In binary classification, \(I(G)=2p(1-p)\).
  • Cross-entropy, deviance, information index: \(f(p) = - p\log p\). In general, \(I(G) = -\sum_{i=1}^C p_i \log p_i\). In binary classification, \(I(G) = - p \log p - (1-p) \log(1-p)\).
  • Misclassification error: \(f(p) = \min\{p,1-p\}\). In general, \(I(G) = \sum_{i=1}^C \min\{p_i, 1- p_i\}\). In binary classification, \(I(G) = 2 \min\{p, 1-p\}\). This is not a good splitting criterion; see below.

Above, the various factors of \(2\) do not really matter, since we only compare impurity values across different values of \(p\), and scaling by a constant does not change those comparisons.
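
To make these formulas concrete, here is a small helper (our own, not part of rpart) that computes each impurity measure from a vector of class proportions \(p_1,\ldots,p_C\):

gini <- function(p) sum(p * (1 - p))                      # equals 1 - sum(p^2)
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0)) # treat 0*log(0) as 0
misclass <- function(p) sum(pmin(p, 1 - p))

# example: a node whose class proportions are (0.7, 0.2, 0.1)
props <- c(0.7, 0.2, 0.1)
c(gini = gini(props), entropy = entropy(props), misclass = misclass(props))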

The following is a plot of the various impurity measures in the case of binary classification.

p <- seq(1e-3, 1 - 1e-3, length.out=100)
# binary-classification impurities, scaled so that all three curves equal 1/2 at p = 1/2
dat <- data.frame(p=p,
                  Gini=2*p*(1-p),
                  Entropy=-(p*log(p)+(1-p)*log(1-p))/log(2)/2,
                  Class=pmin(p, 1-p))
dat.melt <- melt(dat, id.vars="p", variable.name="Impurity")
ggplot(dat.melt, aes(x=p, y=value)) +
  geom_line(aes(group=Impurity, linetype=Impurity)) +
  theme_classic()

Again, the heights of the curves do not really matter, since we can scale each curve by any positive constant to get an equivalent impurity measure. It is the shape of the curve that matters. For visualization’s sake, I have chosen the scaling constants so that all three curves have the same value at \(p = 1/2\).

Suppose we split \(G\) into two groups \(G_L\) and \(G_R\) of size \(n_L\) and \(n_R\) respectively. The impurity reduction of this split is \[\Delta I := \frac{1}{n}\left[n_G I(G) - n_L I(G_L) - n_R I(G_R)\right],\] where \(n\) is the number of datapoints in the full dataset. We want to choose a split (i.e., choose a variable \(X_j\) and a cutoff \(c\)) that maximizes this impurity reduction, which is equivalent to minimizing \[n_L I(G_L) + n_R I(G_R).\]

Suppose we are in a binary classification situation, and we let \(p_L\) and \(p_R\) denote the proportion of \(1\)s in \(G_L\) and \(G_R\) respectively. [These are the \(\bar{p}_1\) and \(\bar{p}_2\) from your lecture notes.]

  • Gini: we want to minimize \(2 [n_L p_L(1-p_L) + n_R p_R(1-p_R)]\). Note that you showed that this is equivalent to the RSS in a regression tree if you treat the response variable as numerical in \(\{0,1\}\).
  • Cross-entropy: we want to minimize \(-n_L[p_L \log p_L + (1-p_L) \log (1-p_L)] - n_R [p_R \log p_R + (1-p_R) \log (1-p_R)]\).
  • Misclassification error: we want to minimize \(2 [n_L \min\{p_L, 1-p_L\} + n_R \min\{p_R, 1-p_R\}]\).
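
To make the split search concrete, here is a brute-force sketch (our own function, not rpart’s internals) that finds the Gini-minimizing cutoff on a single numeric variable for a binary response:

# find the cutoff on a numeric variable x minimizing n_L I(G_L) + n_R I(G_R)
# for a binary response y in {0, 1}, using the (scaled) Gini impurity
best_gini_split <- function(x, y) {
  gini <- function(p) 2 * p * (1 - p)
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between observed values
  score <- sapply(cuts, function(cc) {
    left <- y[x < cc]; right <- y[x >= cc]
    length(left) * gini(mean(left)) + length(right) * gini(mean(right))
  })
  cuts[which.min(score)]
}

# simulated example where the class probability jumps at x = 0
x.sim <- rnorm(200)
y.sim <- rbinom(200, 1, ifelse(x.sim < 0, 0.1, 0.8))
best_gini_split(x.sim, y.sim)   # should be close to 0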

By default, rpart uses Gini. You can also use the entropy by passing in an argument (see below).

Using misclassification error as a splitting criterion has drawbacks, so I don’t think rpart offers an option to use it.

Suppose in the full dataset we have a binary response variable that is \(1\) for \(80\) of the datapoints, and \(0\) for the other \(20\). If our first split gives us \(G_L\) with \(40\) datapoints all labeled with \(1\), while \(G_R\) has \(40\) datapoints labeled \(1\) and \(20\) datapoints labeled \(0\), then what is the impurity reduction \(\Delta I\)?

  • For misclassification error it is \(\Delta I = \frac{1}{100}\left[100\cdot \frac{20}{100} - 40 \cdot 0 - 60 \cdot \frac{20}{60} \right] = 0\).
  • For Gini and entropy, you can check that \(\Delta I > 0\).

So, despite this split being quite good, the misclassification error impurity measure does not really recognize it. The issue is the linearity of the misclassification error impurity function. Note that we are measuring the impurity of a split by a weighted average of two impurities. The strict concavity of the Gini and entropy functions ensures that after a split, the impurity strictly decreases. The picture below (taken from here) visualizes this.
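
Here is a quick numerical check of this example (with the factors of \(2\) omitted, as discussed above):

# proportions of 1s: 0.8 at the root, 1 in G_L, 2/3 in G_R
dI <- function(f) (100 * f(0.8) - 40 * f(1) - 60 * f(2/3)) / 100
c(gini     = dI(function(p) p * (1 - p)),
  entropy  = dI(function(p) ifelse(p %in% c(0, 1), 0, -p * log(p) - (1 - p) * log(1 - p))),
  misclass = dI(function(p) min(p, 1 - p)))
# Gini and entropy report a strictly positive reduction; misclassification error reports 0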

Stopping criteria and pruning

We have discussed how to choose splits, but when do we stop?

For a nonnegative complexity parameter \(\alpha\) (denoted cp in rpart), we define the cost complexity of a tree \(T\) by \[C_\alpha(T) := C(T) + \alpha C(T_{\text{root}}) |T|,\] where \(C(\cdot)\) returns some notion of error/risk of a tree, and \(T_{\text{root}}\) is the tree with no splits (just the root node).

  • In the case of regression trees, we used \(C(T)=\text{RSS}(T)\) and thus \(C(T_{\text{root}})=\text{TSS}\), which gives the formula in your lecture notes.
  • For classification, we use \(C(T)=\frac{\#\text{errors}}{n}\). Exercise. In words, what is \(C(T_{\text{root}})\)?

Warning: sometimes you will see \(C_\alpha(T)\) defined as \(C(T) + \alpha |T|\) instead (e.g., compare pages 12 and 24 of the rpart vignette). This does not really matter too much, since \(C(T_{\text{root}})\) is a constant that does not depend on \(T\). But cp in rpart is defined using the earlier definition.

rpart has various stopping criteria. See help(rpart.control) for some of these parameters.

  • minsplit (default is 20): The minimum number of observations that must exist in a node in order for a split to be attempted. This parameter can save computation time, since smaller nodes are almost always pruned away by cross validation.
  • minbucket (default is minsplit/3): The minimum number of observations in a terminal node.
  • cp (default is 0.01): The threshold complexity parameter.

rpart stops splitting a node when the minsplit or minbucket conditions would be violated, or when a split does not decrease the cost \(C_\alpha\) of the tree. Specifically, suppose \(T\) is a tree and \(T'\) is a tree with one more split that we are considering. Then \[C_\alpha(T) - C_\alpha(T') = [C(T) - C(T')] - \alpha C(T_{\text{root}}).\] This gives a nice interpretation of \(\alpha\): rpart will not make a split if \(C(T) - C(T')\) (the decrease in error) is smaller than \(\alpha C(T_{\text{root}})\). Thus if we use \(\alpha = 0\), then rpart will grow the entire tree (subject to the minsplit and minbucket conditions). If we use \(\alpha=1\), then rpart will not split beyond the root.

So after calling rpart with some fixed value of \(\alpha\) (a.k.a. cp, default value is 0.01), we obtain some tree. How can we prune with rpart? If we use printcp(), it shows what happens if we consider sub-trees that result from using values of \(\alpha\) larger than whatever the cp value was in our original call to rpart(), along with cross-validation scores for each value of \(\alpha\). [Why do larger values of \(\alpha\) yield smaller trees?] As you continuously increase \(\alpha\) to \(1\), you will get a finite number of sub-trees, and I believe printcp() lists one value of \(\alpha\) for each sub-tree. You can then use prune() to produce a sub-tree using one of the other values of \(\alpha\).

Example

In lecture you looked at some code for a regression tree example. Here we will look at a classification tree example.

library(DAAG)
data(spam7)
help(spam7)
spam = spam7
library(rpart)
sprt = rpart(yesno ~ ., method = "class", data = spam)
plot(sprt, margin=0.1)
text(sprt)
sprt
## n= 4601 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 4601 1813 n (0.6059552 0.3940448)  
##    2) dollar< 0.0555 3471  816 n (0.7649092 0.2350908)  
##      4) bang< 0.0915 2420  246 n (0.8983471 0.1016529) *
##      5) bang>=0.0915 1051  481 y (0.4576594 0.5423406)  
##       10) crl.tot< 85.5 535  175 n (0.6728972 0.3271028)  
##         20) bang< 0.7735 418  106 n (0.7464115 0.2535885) *
##         21) bang>=0.7735 117   48 y (0.4102564 0.5897436)  
##           42) crl.tot< 17 43   12 n (0.7209302 0.2790698) *
##           43) crl.tot>=17 74   17 y (0.2297297 0.7702703) *
##       11) crl.tot>=85.5 516  121 y (0.2344961 0.7655039) *
##    3) dollar>=0.0555 1130  133 y (0.1176991 0.8823009) *

Make sure you understand how to read every part of the text output.
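
Related: for a classification tree, predict() returns the class probabilities by default (this is what we use below); you can also ask for hard class labels.

head(predict(sprt, spam, type = "prob"))
head(predict(sprt, spam, type = "class"))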

As mentioned earlier, the default call to rpart uses Gini. To use entropy, you need to pass another argument, as follows.

isprt = rpart(yesno ~ ., method = "class", parms = list(split = 'information'), data = spam)
plot(isprt, margin=0.1)
text(isprt)
isprt
## n= 4601 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 4601 1813 n (0.6059552 0.3940448)  
##    2) dollar< 0.0555 3471  816 n (0.7649092 0.2350908)  
##      4) bang< 0.0875 2407  242 n (0.8994599 0.1005401) *
##      5) bang>=0.0875 1064  490 y (0.4605263 0.5394737)  
##       10) crl.tot< 85.5 541  175 n (0.6765250 0.3234750)  
##         20) bang< 0.7735 424  106 n (0.7500000 0.2500000) *
##         21) bang>=0.7735 117   48 y (0.4102564 0.5897436)  
##           42) crl.tot< 17 43   12 n (0.7209302 0.2790698) *
##           43) crl.tot>=17 74   17 y (0.2297297 0.7702703) *
##       11) crl.tot>=85.5 523  124 y (0.2370937 0.7629063) *
##    3) dollar>=0.0555 1130  133 y (0.1176991 0.8823009) *

\[\begin{array}{c|cc} & \text{predict $0$} & \text{predict $1$} \\ \hline \text{actual $0$} & a & b \\ \text{actual $1$} & c & d \end{array}\]

  • Precision is \(\frac{d}{b+d}\)
  • Recall is \(\frac{d}{c+d}\)
# confusion() tabulates, for each threshold, the counts a, b, c, d from the
# 2x2 table above, where yhat holds predicted probabilities of class 1
confusion <- function (y, yhat, thres)
{
  n <- length(thres)
  conf <- matrix(0, nrow=n, ncol=4)
  colnames(conf) <- c("a","b","c","d")
  for (i in 1:n)
  {
    a <- sum((!y) & (yhat<=thres[i]))  # actual 0, predict 0
    b <- sum((!y) & (yhat>thres[i]))   # actual 0, predict 1
    c <- sum((y) & (yhat<=thres[i]))   # actual 1, predict 0
    d <- sum((y) & (yhat>thres[i]))    # actual 1, predict 1
    conf[i,] <- c(a,b,c,d)
  }
  return(conf)
}
y.tr = predict(sprt, spam)[,2]
thres <- seq(0.05, 0.95, by = 0.05)
y = as.numeric(spam$yesno == "y")
tree.conf = confusion(y, y.tr, thres)
qplot(thres, tree.conf[,2]+tree.conf[,3], xlab="threshold", ylab="b+c", geom="line") + theme_classic()
# pick threshold 0.5
conf <- as.vector(tree.conf[which(thres==0.5),])
precision <- conf[4] / (conf[2] + conf[4])
recall <- conf[4] / (conf[3] + conf[4])
c(precision, recall)
## [1] 0.8424419 0.7992278

You can modify the \(\alpha\) value you want to use in the original call to rpart.

set.seed(0)
dense.sprt <- rpart(yesno ~ ., method="class", data=spam, cp=0.001)
plot(dense.sprt, margin=0.1)
text(dense.sprt)

Be careful when reading the printcp output. The rel error and xerror columns are not the misclassification error [on the full dataset] and the cross-validation error themselves: you need to multiply each column by the Root node error above the table to get the actual error rates. The scaling is chosen so that the first row always has rel error and xerror equal to 1.0, for readability.

We can pick the value of \(\alpha\) that has smallest cross-validation error. (Remember, the cross-validation errors are random.) In practice one might instead follow the “1 standard error” rule; see discussion of lambda.1se in glmnet below.

table.cp <- printcp(dense.sprt)
## 
## Classification tree:
## rpart(formula = yesno ~ ., data = spam, method = "class", cp = 0.001)
## 
## Variables actually used in tree construction:
## [1] bang    crl.tot dollar  money   n000   
## 
## Root node error: 1813/4601 = 0.39404
## 
## n= 4601 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.4765582      0   1.00000 1.00000 0.018282
## 2  0.0755654      1   0.52344 0.54937 0.015408
## 3  0.0115830      3   0.37231 0.39162 0.013516
## 4  0.0104799      4   0.36073 0.38058 0.013358
## 5  0.0063431      5   0.35025 0.36790 0.013172
## 6  0.0055157     10   0.31660 0.35301 0.012947
## 7  0.0044126     11   0.31109 0.33922 0.012732
## 8  0.0038610     12   0.30667 0.33536 0.012670
## 9  0.0027579     16   0.29123 0.32377 0.012482
## 10 0.0022063     17   0.28847 0.32101 0.012436
## 11 0.0019305     18   0.28627 0.32488 0.012500
## 12 0.0016547     20   0.28240 0.32929 0.012572
## 13 0.0010000     25   0.27413 0.32763 0.012545
cp.x <- table.cp[which.min(table.cp[,4]), 1]
cp.x
## [1] 0.002206288
fsprt <- rpart(yesno ~ ., method="class", data=spam, cp=cp.x)
plot(fsprt, margin=0.1)
text(fsprt)
fconf <- table(spam$yesno, as.numeric(predict(fsprt)[,2] >= 0.5))
fconf
##    
##        0    1
##   n 2630  158
##   y  365 1448
fconf <- as.vector(t(fconf))
fprecision <- fconf[4] / (fconf[2] + fconf[4])
frecall <- fconf[4] / (fconf[3] + fconf[4])
c(fprecision, frecall)
## [1] 0.9016189 0.7986762
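
Instead of refitting from scratch, we could equivalently prune the dense tree at the chosen complexity value; this should give essentially the same subtree.

psprt <- prune(dense.sprt, cp = cp.x)
plot(psprt, margin = 0.1)
text(psprt)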

Compare with our original model with default cp=0.01.

plot(sprt, margin=0.1)
text(sprt)
c(precision, recall)
## [1] 0.8424419 0.7992278

An example with categorical explanatory variables

trn <- read.csv("train_titanic.csv", header = TRUE)
rt <- rpart(Survived ~ as.factor(Pclass)+ Sex + SibSp + Parch + Fare + Embarked,
            method="class", data=trn)  
plot(rt, margin=0.1)
text(rt)
rt
## n= 891 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 891 342 0 (0.61616162 0.38383838)  
##     2) Sex=male 577 109 0 (0.81109185 0.18890815) *
##     3) Sex=female 314  81 1 (0.25796178 0.74203822)  
##       6) as.factor(Pclass)=3 144  72 0 (0.50000000 0.50000000)  
##        12) Fare>=23.35 27   3 0 (0.88888889 0.11111111) *
##        13) Fare< 23.35 117  48 1 (0.41025641 0.58974359)  
##          26) Embarked=S 63  31 0 (0.50793651 0.49206349)  
##            52) Fare< 10.825 37  15 0 (0.59459459 0.40540541) *
##            53) Fare>=10.825 26  10 1 (0.38461538 0.61538462)  
##             106) Fare>=17.6 10   3 0 (0.70000000 0.30000000) *
##             107) Fare< 17.6 16   3 1 (0.18750000 0.81250000) *
##          27) Embarked=C,Q 54  16 1 (0.29629630 0.70370370) *
##       7) as.factor(Pclass)=1,2 170   9 1 (0.05294118 0.94705882) *

Shrinkage: Ridge regression, LASSO, elastic-net

Timeline of concepts we have studied in this class:

Motivation

Let us center and normalize (divide by standard deviation) the columns of our design matrix \(X\) to put all the variables on the same scale, and let us center our response variable \(y\). Then we will not have an intercept term.

At the beginning of the course, I mentioned the following result which holds for any \(n \times p\) matrix \(X\). \[X^\top X \text{ is invertible} \iff \text{rank}(X) = p.\] Also recall the fact \(\text{rank}(X) \le \min\{n, p\}\). Exercise. If \(p > n\), how many solutions \(\widehat{\beta}\) are there to the normal equation \(X^\top X \widehat{\beta} = X^\top y\)?

Some notation: the \(\ell_q\) norm is defined by \(\|x\|_q = \left(\sum_i |x_i|^q\right)^{1/q}\). So the \(\ell_2\) norm is \(\|x\|_2 = \sqrt{\sum_i x_i^2}\), also known as the Euclidean norm. Also, the \(\ell_1\) norm is \(\|x\|_1 = \sum_i |x_i|\).
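
For example, for \(x = (3, -4)\):

x <- c(3, -4)
c(l2 = sqrt(sum(x^2)), l1 = sum(abs(x)))   # Euclidean norm 5, l1 norm 7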

Ridge regression has two equivalent formulations. One is the constrained form with a radius \(c\) as a tuning parameter. \[\widehat{\beta}_{RR} = \arg\min_{\beta \in \mathbb{R}^p : \|\beta\|_2^2 \le c} \|y - X\beta\|_2^2\] The other is the Lagrangian/penalized form with penalty parameter \(\lambda\). \[\widehat{\beta}_{RR} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|^2 + \lambda \|\beta\|_2^2\] Each \(c > 0\) corresponds to some value of \(\lambda \ge 0\) and vice versa, but this is not a one-to-one correspondence, since all sufficiently large values of \(c\) correspond to \(\lambda = 0\). However, there is a one-to-one correspondence if you restrict \(\lambda\) to be strictly positive and \(c < \|\widehat{\beta}_{OLS}\|_2^2\). See this Piazza post for more explanation.

Exercise. Qualitatively, how do \(c\) and \(\lambda\) depend on each other? Specifically, does increasing \(c\) correspond to increasing \(\lambda\) or decreasing \(\lambda\)?

The constrained formulation is good for intuition (see pictures below), but we typically focus on the penalized formulation in practice.

Exercise. Show that \(X^\top X + \lambda I_p\) is invertible. (Hint 1: What is an equivalent condition for invertibility? Hint 2: How do the eigenvalues of \(X^\top X\) relate to the eigenvalues of \(X^\top X + \lambda I_p\)? Hint 3: \(X^\top X\) is positive semi-definite, i.e. it has nonnegative eigenvalues.)

Exercise. Show that (with the penalized formulation of ridge regression), \[\widehat{\beta} = (X^\top X + \lambda I_p)^{-1} X^\top y.\] (Hint: set the gradient of the objective function to zero.) How many solutions are there?

Exercise. The fitted value \(\widehat{y} = X \widehat{\beta}_{RR}\) can be written as \(\widehat{y} = H_\lambda y\), i.e., this is a linear estimator. Write down \(H_\lambda\). Its trace is the effective degrees of freedom of this estimator; check that this is \(p\) when \(\lambda=0\).

Exercise. Roughly argue why this is a biased estimator of \(\beta\).
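
Here is a quick numerical illustration of the closed-form solution and the effective degrees of freedom, on simulated data (the choice \(\lambda = 2\) is arbitrary):

set.seed(1)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))                     # centered, normalized columns
y <- drop(scale(X %*% (1:p) + rnorm(n), scale = FALSE))    # centered response
lambda <- 2
beta.rr <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)   # (X^T X + lambda I)^{-1} X^T y
H <- X %*% solve(t(X) %*% X + lambda * diag(p)) %*% t(X)      # hat matrix H_lambda
sum(diag(H))   # effective degrees of freedom; equals p when lambda = 0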

LASSO is defined similarly. One is the constrained form with a radius \(c\) as a tuning parameter. \[\widehat{\beta}_{LASSO} = \arg\min_{\beta \in \mathbb{R}^p : \|\beta\|_1 \le c} \|y - X\beta\|_2^2\] The other is the Lagrangian/penalized form with penalty parameter \(\lambda\). \[\widehat{\beta}_{LASSO} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|^2 + \lambda \|\beta\|_1\] Again, we focus on the penalized formulation.

Here are a few properties regarding the solution set of this optimization problem (see Lemma 1 of this paper).

  • The LASSO solution is either unique or there are infinitely many solutions.
  • Every LASSO solution \(\widehat{\beta}\) gives the same fitted value \(X \widehat{\beta}\).
  • If \(\lambda > 0\), then every LASSO solution \(\widehat{\beta}\) has the same \(\ell_1\) norm \(\|\widehat{\beta}\|_1\).

Let us visualize the constrained formulations of ridge regression and LASSO in the case \(p=2\). In both cases, the objective function that we seek to minimize is \(f(\beta) = \|y - X \beta\|^2\), which is a quadratic function in the vector \(\beta\), so it can be shown to have elliptical contours. The constraint sets \(\|\beta\|_2^2 \le c\) and \(\|\beta\|_1 \le c\) are a circle and a square respectively.

Exercise. In terms of the contours and the two constraint sets, describe where the ridge regression and LASSO solutions appear in the above pictures. In particular, what is \(\widehat{\beta}_1\) in the LASSO solution?

Exercise. Explain why in the above pictures \(n \ge p\).

Exercise. Compare the ridge and LASSO solutions with the OLS solution. Why are these methods sometimes referred to as “shrinkage” estimators?

Exercise. When the radius \(c\) is very large, what are the solutions to each optimization problem?

The above picture attempts to illustrate an attractive property of the LASSO estimator: it tends to return solutions that are sparse (having few nonzero entries). This suggests another possible advantage of LASSO beyond being able to deal with the case \(p > n\): if the true parameter \(\beta\) is sparse (i.e., only a few of the \(p\) explanatory variables actually enter our linear model \(y = X \beta + \epsilon\)), then maybe LASSO can do a good job estimating \(\beta\).

Why don’t we do best subset selection instead? We could ask for \[\min_{\beta : \sum_{j=1}^p I(\beta_j \ne 0) \le k} \|y - X \beta \|^2,\] i.e., look for the \(\beta\) that gives the best fit among all \(\beta\) that have at most \(k\) nonzero components. (Recall that this is essentially what regsubsets does.) However, the number of subsets you need to explore grows exponentially in \(k\). Even the toy example below, with the modest sizes \(n=100\), \(p=150\), and \(k\approx p/4\), takes a very long time. (Note that \(\binom{p}{k}\) here is already on the order of \(10^{35}\).) Imagine what happens when \(p\) is even larger.
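
You can compute this count directly:

choose(150, 37)   # on the order of 10^35, as noted above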

library(leaps)
library(glmnet)
n <- 100
p <- 150
k <- floor(p / 4)
X <- matrix(rnorm(n*p), n, p)
beta <- rep(0, p)
idx <- sample(1:p, k)
beta[idx] <- 1
beta
##   [1] 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
##  [36] 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1
##  [71] 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
## [106] 1 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 1 0
## [141] 0 0 1 0 0 0 0 0 0 1
y <- X %*% beta + rnorm(n, sd=0.5)

# run these lines to freeze your computer
# rs <- regsubsets(x=X, y=y, nvmax=k, really.big=T)
# summary(rs)

It turns out that it is possible to solve the LASSO optimization problem using “hill climbing” methods; in R, glmnet uses coordinate descent. Running glmnet on the above example is almost instantaneous.

Side note: ridge regression and LASSO arise naturally from a Bayesian perspective, as posterior modes when you place a Gaussian prior or a Laplace prior, respectively, on the coefficients \(\beta_j\).

Using glmnet

The glmnet vignette is very useful. The following examples are from here.

glmnet solves the following problem \[\min_{\beta_0, \beta} \frac{1}{n}\sum_{i=1}^n (y_i - \beta_0 - \beta^\top x_i)^2 + \lambda \left[\frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1\right],\]

where \(\alpha \in [0,1]\) is the parameter alpha passed into the call to glmnet(). By default alpha=1, which gives the LASSO. To do ridge regression, choose alpha=0. Intermediate values of alpha give a mixture of the two penalties; this general estimator is called elastic net regularization.

A call to glmnet will solve this optimization problem for many values of \(\lambda\). You can manually specify how many values of \(\lambda\) to try by changing nlambda (default is 100), or even pass in your own sequence of \(\lambda\) values to lambda.

library(glmnet)

Let us consider an example where \(p=2000\) but only \(k=20\) entries of \(\beta\) are nonzero.

set.seed(0)
n <- 1000
p <- 2000
k <- 20
X <- matrix(rnorm(n*p), n, p)
beta <- c(rep(1, k), rep(0, p-k))
y <- X %*% beta + rnorm(n)

idx <- sample(1:n, 0.66*n)
X.train <- X[idx,]
y.train <- y[idx]
X.test <- X[-idx,]
y.test <- y[-idx]
# LASSO
fit <- glmnet(X.train, y.train)
plot(fit)
plot(fit, xvar="lambda")
fit
## 
## Call:  glmnet(x = X.train, y = y.train) 
## 
##         Df     %Dev  Lambda
##   [1,]   0 0.000000 1.38400
##   [2,]   1 0.007732 1.32100
##   [3,]   3 0.018680 1.26100
##   [4,]   3 0.036120 1.20400
##   [5,]   6 0.058230 1.14900
##   [6,]   7 0.088860 1.09700
##   [7,]   9 0.126700 1.04700
##   [8,]  14 0.171100 0.99960
##   [9,]  14 0.220400 0.95410
##  [10,]  17 0.268700 0.91080
##  [11,]  17 0.319500 0.86940
##  [12,]  18 0.368400 0.82980
##  [13,]  18 0.413800 0.79210
##  [14,]  20 0.460300 0.75610
##  [15,]  20 0.504400 0.72180
##  [16,]  20 0.544600 0.68900
##  [17,]  20 0.581200 0.65760
##  [18,]  20 0.614600 0.62770
##  [19,]  20 0.645000 0.59920
##  [20,]  20 0.672700 0.57200
##  [21,]  20 0.697900 0.54600
##  [22,]  20 0.720900 0.52120
##  [23,]  20 0.741900 0.49750
##  [24,]  20 0.761000 0.47490
##  [25,]  20 0.778400 0.45330
##  [26,]  20 0.794200 0.43270
##  [27,]  20 0.808700 0.41300
##  [28,]  20 0.821800 0.39420
##  [29,]  20 0.833800 0.37630
##  [30,]  20 0.844800 0.35920
##  [31,]  20 0.854700 0.34290
##  [32,]  20 0.863800 0.32730
##  [33,]  20 0.872100 0.31240
##  [34,]  20 0.879600 0.29820
##  [35,]  20 0.886500 0.28470
##  [36,]  20 0.892700 0.27170
##  [37,]  20 0.898400 0.25940
##  [38,]  20 0.903600 0.24760
##  [39,]  20 0.908300 0.23630
##  [40,]  20 0.912600 0.22560
##  [41,]  20 0.916600 0.21530
##  [42,]  20 0.920100 0.20560
##  [43,]  20 0.923400 0.19620
##  [44,]  20 0.926400 0.18730
##  [45,]  20 0.929100 0.17880
##  [46,]  20 0.931600 0.17070
##  [47,]  20 0.933800 0.16290
##  [48,]  20 0.935800 0.15550
##  [49,]  20 0.937700 0.14840
##  [50,]  23 0.939500 0.14170
##  [51,]  24 0.941200 0.13520
##  [52,]  26 0.942800 0.12910
##  [53,]  27 0.944400 0.12320
##  [54,]  27 0.945800 0.11760
##  [55,]  29 0.947200 0.11230
##  [56,]  34 0.948600 0.10720
##  [57,]  37 0.949900 0.10230
##  [58,]  42 0.951300 0.09766
##  [59,]  45 0.952600 0.09322
##  [60,]  61 0.954000 0.08898
##  [61,]  71 0.955600 0.08494
##  [62,]  84 0.957200 0.08108
##  [63,]  97 0.958900 0.07739
##  [64,] 107 0.960500 0.07387
##  [65,] 117 0.962100 0.07052
##  [66,] 126 0.963700 0.06731
##  [67,] 144 0.965300 0.06425
##  [68,] 156 0.966900 0.06133
##  [69,] 172 0.968500 0.05854
##  [70,] 186 0.970000 0.05588
##  [71,] 196 0.971500 0.05334
##  [72,] 205 0.972900 0.05092
##  [73,] 213 0.974200 0.04860
##  [74,] 229 0.975500 0.04640
##  [75,] 239 0.976700 0.04429
##  [76,] 257 0.978000 0.04227
##  [77,] 276 0.979200 0.04035
##  [78,] 294 0.980400 0.03852
##  [79,] 306 0.981500 0.03677
##  [80,] 311 0.982600 0.03510
##  [81,] 327 0.983600 0.03350
##  [82,] 337 0.984600 0.03198
##  [83,] 343 0.985500 0.03052
##  [84,] 354 0.986400 0.02914
##  [85,] 362 0.987200 0.02781
##  [86,] 373 0.988000 0.02655
##  [87,] 389 0.988800 0.02534
##  [88,] 399 0.989500 0.02419
##  [89,] 405 0.990200 0.02309
##  [90,] 421 0.990900 0.02204
##  [91,] 433 0.991500 0.02104
##  [92,] 442 0.992100 0.02008
##  [93,] 450 0.992600 0.01917
##  [94,] 460 0.993100 0.01830
##  [95,] 473 0.993600 0.01747
##  [96,] 480 0.994000 0.01667
##  [97,] 496 0.994500 0.01592
##  [98,] 507 0.994900 0.01519
##  [99,] 517 0.995200 0.01450
## [100,] 529 0.995600 0.01384

You can pick a particular value of \(\lambda\) to use for prediction by using the s parameter.

# Manually pick some lambda
lambda <- 0.6
head(coef(fit, s=lambda), 31)
## 31 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) -0.06033661
## V1           0.35783795
## V2           0.49382138
## V3           0.48093568
## V4           0.48699787
## V5           0.40389194
## V6           0.34392391
## V7           0.20942318
## V8           0.45099496
## V9           0.39386595
## V10          0.41449659
## V11          0.23196174
## V12          0.39679559
## V13          0.50657539
## V14          0.32789130
## V15          0.50899255
## V16          0.39449459
## V17          0.29856477
## V18          0.61545884
## V19          0.49073542
## V20          0.58281857
## V21          .         
## V22          .         
## V23          .         
## V24          .         
## V25          .         
## V26          .         
## V27          .         
## V28          .         
## V29          .         
## V30          .
yhat.lambda <- predict(fit, newx=X.test, s=lambda)
mean((yhat.lambda - y.test)^2)
## [1] 8.08269

Alternatively, you can pick \(\lambda\) by cross-validation using cv.glmnet.

cvfit <- cv.glmnet(X.train, y.train)
plot(cvfit)
log(cvfit$lambda.min)
## [1] -2.698428
log(cvfit$lambda.1se)
## [1] -2.233259

lambda.min is the choice of \(\lambda\) that minimizes the cross-validation error. lambda.1se is the largest lambda (resulting in more shrinkage) such that the cross-validation error is within one standard error of the minimum. Apparently lambda.1se is a standard choice; erring on the side of parsimony (a simpler model) helps mitigate the fact that the cross-validation errors are only estimates of the risk curves. [See here for more discussion.]
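
You can verify this description directly from the components of the cv.glmnet object:

i <- which.min(cvfit$cvm)                                       # position of lambda.min
max(cvfit$lambda[cvfit$cvm <= cvfit$cvm[i] + cvfit$cvsd[i]])    # should match cvfit$lambda.1se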

head(coef(cvfit, s=cvfit$lambda.1se), 31)
## 31 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) 0.006803268
## V1          0.929112497
## V2          0.846626145
## V3          0.907562891
## V4          0.965250547
## V5          0.919974322
## V6          0.852307175
## V7          0.801137450
## V8          0.866816067
## V9          0.897085107
## V10         0.891687225
## V11         0.821478764
## V12         0.889524147
## V13         0.926431562
## V14         0.826475507
## V15         0.936252930
## V16         0.859245430
## V17         0.876292023
## V18         0.948283398
## V19         0.890718534
## V20         0.972174863
## V21         .          
## V22         .          
## V23         .          
## V24         .          
## V25         .          
## V26         .          
## V27         .          
## V28         .          
## V29         .          
## V30         .
yhat.cvlambda <- predict(cvfit, newx=X.test, s=cvfit$lambda.1se)
mean((yhat.cvlambda - y.test)^2)
## [1] 1.392953

Let’s now look at ridge regression.

# Ridge
fit <- glmnet(X.train, y.train, alpha=0)
plot(fit)
plot(fit, xvar="lambda")

cvfit <- cv.glmnet(X.train, y.train, alpha=0)
plot(cvfit)
head(coef(cvfit, s=cvfit$lambda.1se), 31)
## 31 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -0.1241521163
## V1           0.1035240482
## V2           0.1324763659
## V3           0.1207788741
## V4           0.1202793742
## V5           0.1111811138
## V6           0.0961574424
## V7           0.0834096124
## V8           0.1224087312
## V9           0.1099697054
## V10          0.1156369448
## V11          0.0835027180
## V12          0.1086204498
## V13          0.1270814365
## V14          0.1112939820
## V15          0.1287499578
## V16          0.1085714298
## V17          0.0965797599
## V18          0.1438644679
## V19          0.1265657453
## V20          0.1348795572
## V21         -0.0009848824
## V22         -0.0129740714
## V23          0.0114050240
## V24         -0.0101038484
## V25          0.0050290544
## V26         -0.0025458250
## V27         -0.0232819059
## V28          0.0127580150
## V29         -0.0005004794
## V30         -0.0013555446
yhat.cvlambda <- predict(cvfit, newx=X.test, s=cvfit$lambda.1se)
mean((yhat.cvlambda - y.test)^2)
## [1] 17.41077

Let’s compare LASSO, ridge, and some other choices of the elastic net parameter \(\alpha\). Recall that in our example \(p=2000\) but only \(k=20\) entries of \(\beta\) are nonzero, and we have \(n=1000\) datapoints.

t <- 5
fits <- list()
MSEs <- rep(0, t+1)
for (i in 0:t) {
  fits[[i+1]] <- cv.glmnet(X.train, y.train, alpha=i/t)
  yhat <- predict(fits[[i+1]], newx=X.test, s=fits[[i+1]]$lambda.1se)
  MSEs[i+1] <- mean((yhat - y.test)^2)
}
MSEs
## [1] 16.885109  1.662305  1.482336  1.419935  1.415476  1.410864
(which.min(MSEs) - 1)/t
## [1] 1

LASSO does best in this case because few components of the true \(\beta\) are nonzero.

Let’s now consider an example where there are a lot more nonzero coefficients.

set.seed(0)
n <- 1000
p <- 2000
k <- 500
X <- matrix(rnorm(n*p), n, p)
beta <- c(rep(1, k), rep(0, p-k))
y <- X %*% beta + rnorm(n)

idx <- sample(1:n, 0.66*n)
X.train <- X[idx,]
y.train <- y[idx]
X.test <- X[-idx,]
y.test <- y[-idx]
fit.l <- glmnet(X.train, y.train)
plot(fit.l, xvar="lambda")
fit.r <- glmnet(X.train, y.train, alpha=0)
plot(fit.r, xvar="lambda")

t <- 5
fits <- list()
MSEs <- rep(0, t+1)
for (i in 0:t) {
  fits[[i+1]] <- cv.glmnet(X.train, y.train, alpha=i/t)
  yhat <- predict(fits[[i+1]], newx=X.test, s=fits[[i+1]]$lambda.1se)
  MSEs[i+1] <- mean((yhat - y.test)^2)
}
MSEs
## [1] 362.4554 421.8606 429.0294 437.6898 430.0722 433.3925
(which.min(MSEs) - 1)/t
## [1] 0

Here ridge regression does best. The true \(\beta\) is not sparse enough for LASSO to have better prediction error than ridge regression (and, judging from the path plots above, LASSO probably does not find the right variables). However, despite its better prediction error, the ridge regression estimate is not very interpretable, since it involves all \(p=2000\) variables.

Ridge and elastic net also tend to perform better than LASSO when the explanatory variables are correlated/collinear. See here for some intuition, and here for a simulation.
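
Here is a small simulation sketch in that spirit (our own setup: every column of X shares a common factor, so the predictors are highly correlated, and the true \(\beta\) is dense):

set.seed(1)
n <- 200; p <- 50
z <- rnorm(n)
X <- matrix(rnorm(n * p), n, p) + 2 * z      # adds the same factor z to every column
y <- X %*% rep(0.5, p) + rnorm(n)
idx <- sample(1:n, 0.66 * n)
mses <- sapply(c(lasso = 1, net = 0.5, ridge = 0), function(a) {
  cvf <- cv.glmnet(X[idx, ], y[idx], alpha = a)
  yhat <- predict(cvf, newx = X[-idx, ], s = cvf$lambda.1se)
  mean((yhat - y[-idx])^2)
})
mses   # ridge and elastic net typically beat LASSO in this setting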

Logistic with LASSO

As suggested by the name of the package, glmnet can also do regularization with other GLM models. We simply replace the least squares term \(\|y - X \beta\|^2\) with the negative log likelihood of whichever GLM we want.

In the case of logistic regression, it is \[\min_{\beta_0, \beta} -\left[\frac{1}{n} \sum_{i=1}^n y_i (\beta_0 + x_i^\top \beta) - \log(1+\exp(\beta_0 + x_i^\top\beta))\right] + \lambda \left[\frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1\right].\]

We can ask glmnet to use this by adding family="binomial" to the call. Here is an example on the Titanic dataset.

titanic <- read.csv("train_titanic.csv")
titanic <- na.omit(titanic)

# need to use model.matrix() to create design matrix for glmnet
dd <- titanic[, c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")]
dd$Pclass <- as.factor(dd$Pclass)
X <- model.matrix(Survived~., dd)
X <- X[,-1]

fit <- glmnet(X, titanic$Survived, family="binomial")
plot(fit, xvar="lambda")
cvfit <- cv.glmnet(X, titanic$Survived, family="binomial")
plot(cvfit)
coef(cvfit, s=cvfit$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  2.072540422
## Pclass2     -0.232963212
## Pclass3     -1.368366406
## Sexmale     -2.172555695
## Age         -0.017343599
## SibSp       -0.106753585
## Parch        .          
## Fare         0.002004744
## EmbarkedC    0.364434572
## EmbarkedQ    .          
## EmbarkedS    .