Please fill out the course evaluation at https://course-evaluations.berkeley.edu/. It is important for the department to get your feedback so that we can improve the department’s courses for future students.
library(ggplot2)
library(reshape2)
The rpart vignette has all the details about rpart. Some of the notes below are adapted from there.
In a tree estimator, you can think of your dataset as a crowd of people that start at the top of the tree and walk downward. At each node is a sign telling people to go right or left depending on their characteristics (e.g., go left if you are under 18 years old, otherwise go right).
In the case of classification, you have some unknown categorical variable (not necessarily binary), say, “favorite color,” and you hope that at the bottom of the tree, each terminal node has a group of people that mostly like the same color. In other words, the broad goal is to choose good splitting criteria at each node so that the “purity” of the groups increases after the split.
How do we measure impurity at a node \(G\) [before a split]? Suppose the response variable has categories \(1,\ldots,C\), and the proportions of the \(n_G\) people [at this node \(G\)] in each category are \(p_1, \ldots, p_C\) respectively. [This is different from the \(\bar{p}_1\) and \(\bar{p}_2\) in your lecture notes, which correspond to binary classification proportions in two different groups.]
The impurity of the group of datapoints at this node is \[I(G) = f(p_1) + f(p_2) + \cdots + f(p_C),\] for some concave function \(f\) with \(f(0)=f(1)=0\). (This latter condition ensures \(I(G)=0\) if \(G\) is pure.)
In lecture, we only considered binary classification (label \(0\) or \(1\)). There, if \(p\) denotes the proportion of the \(n_G\) people with label \(1\), the impurity is simply \[I(G) = f(p) + f(1-p).\]
Examples of \(f\): the Gini index \(f(p) = p(1-p)\) (so in the binary case \(I(G) = 2p(1-p)\)), the entropy \(f(p) = -p \log p\) (binary case \(I(G) = -(p\log p + (1-p)\log(1-p))\)), and the misclassification error, for which \(I(G) = \min(p, 1-p)\) in the binary case.
Above, the various factors of \(2\) do not really matter, since we are concerned with comparing the value of the impurity for different values of \(p\).
The following is a plot of the various impurity measures in the case of binary classification.
p <- seq(1e-3, 1 - 1e-3, length.out=100)
dat <- data.frame(p=p, Gini=2*p*(1-p), Entropy=-(p*log(p)+(1-p)*log(1-p))/log(2)/2, Class=pmin(p, 1-p))
dat.melt <- melt(dat, id.vars="p", variable.name="Impurity")
ggplot(dat.melt, aes(x=p, y=value)) + geom_line(aes(group=Impurity, linetype=Impurity)) + theme_classic()
Again, the exact values of each curve do not really matter, since we can scale each curve by any constant to get an equivalent impurity measure. It is the shape of the curve that matters. For visualization’s sake, I have chosen the scaling constants so that all three curves have the same value at \(p = 1/2\).
Suppose we split \(G\) into two groups \(G_L\) and \(G_R\) of size \(n_L\) and \(n_R\) respectively. The impurity reduction of this split is \[\Delta I := \frac{1}{n}\left[n_G I(G) - n_L I(G_L) - n_R I(G_R)\right],\] where \(n\) is the number of datapoints in the full dataset. We want to choose a split (i.e., choose a variable \(X_j\) and a cutoff \(c\)) that maximizes this impurity reduction, which is equivalent to minimizing \[n_L I(G_L) + n_R I(G_R).\]
Suppose we are in a binary classification situation, and we let \(p_L\) and \(p_R\) denote the proportion of \(1\)s in \(G_L\) and \(G_R\) respectively. [These are the \(\bar{p}_1\) and \(\bar{p}_2\) from your lecture notes.]
By default, rpart uses Gini. You can also use the entropy by passing in an argument (see below). Using misclassification error as a splitting criterion has drawbacks, so I don’t think rpart offers an option to use it.
Suppose in the full dataset we have a binary response variable that is \(1\) for \(80\) of the datapoints, and \(0\) for the other \(20\). If our first split gives us \(G_L\) with \(40\) datapoints all labeled with \(1\), while \(G_R\) has \(40\) datapoints labeled \(1\) and \(20\) datapoints labeled \(0\), then what is the impurity reduction \(\Delta I\)?
So, despite this split being quite good, the misclassification error impurity measure does not really recognize it. The issue is the linearity of the misclassification error impurity function. Note that we are measuring the impurity of a split by a weighted average of two impurities. The strict concavity of the Gini and entropy functions ensures that after a split, the impurity strictly decreases. The picture below (taken from here) visualizes this.
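To make the exercise concrete, here is a quick numerical check (a small sketch using the binary impurity functions above): the misclassification impurity gives zero reduction for this split, while Gini gives a strictly positive reduction.
gini <- function(p) 2 * p * (1 - p)        # binary Gini impurity I(G)
misclass <- function(p) pmin(p, 1 - p)     # binary misclassification impurity I(G)
# n_G = 100 with 80 ones; G_L: 40 datapoints, all ones; G_R: 40 ones and 20 zeros
delta.I <- function(f) (100 * f(80/100) - 40 * f(40/40) - 60 * f(40/60)) / 100
delta.I(misclass)   # 0: misclassification error sees no improvement from this split
delta.I(gini)       # about 0.053: Gini rewards the split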
We have discussed how to choose splits, but when do we stop?
For a nonnegative complexity parameter \(\alpha\) (denoted cp in rpart), we define the cost complexity of a tree \(T\) by \[C_\alpha(T) := C(T) + \alpha C(T_{\text{root}}) |T|,\] where \(C(\cdot)\) returns some notion of error/risk of a tree, \(T_{\text{root}}\) is the tree with no splits (just the root node), and \(|T|\) is the number of splits in \(T\).
Warning: sometimes you will see \(C_\alpha(T)\) defined as \(C(T) + \alpha |T|\) instead (e.g., compare pages 12 and 24 of the rpart vignette). This does not matter too much, since \(C(T_{\text{root}})\) is a constant that does not depend on \(T\). But cp in rpart is defined using the earlier definition.
rpart has various stopping criteria. See help(rpart.control) for some of these parameters.
minsplit (default is 20): The minimum number of observations that must exist in a node in order for a split to be attempted. This parameter can save computation time, since smaller nodes are almost always pruned away by cross-validation.
minbucket (default is minsplit/3): The minimum number of observations in a terminal node.
cp (default is 0.01): The threshold complexity parameter.
rpart continues splitting until the minsplit or minbucket conditions would be violated, or until a split does not decrease the cost \(C_\alpha\) of the tree. Specifically, suppose \(T\) is a tree and \(T'\) is a tree with one more split that we are considering. Then \[C_\alpha(T) - C_\alpha(T') = [C(T) - C(T')] - \alpha C(T_{\text{root}}).\] This gives a nice interpretation of \(\alpha\): rpart will not consider a split if \(C(T) - C(T')\) (the decrease in error) is smaller than \(\alpha C(T_{\text{root}})\). Thus if we use \(\alpha = 0\), then rpart will construct the entire tree (up to the minsplit and minbucket conditions). If we use \(\alpha=1\), then rpart will not split beyond the root.
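These stopping parameters can be set through rpart.control in the call to rpart. A minimal sketch (here mydata and the formula y ~ . are placeholders, not data from these notes):
library(rpart)
# A sketch: override the default stopping parameters
fit <- rpart(y ~ ., data = mydata, method = "class",
             control = rpart.control(minsplit = 40,   # need at least 40 obs to attempt a split
                                     minbucket = 10,  # at least 10 obs in each terminal node
                                     cp = 0.005))     # smaller cp => larger tree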
So after calling rpart with some fixed value of \(\alpha\) (a.k.a. cp, default value 0.01), we obtain some tree. How can we prune with rpart? If we use printcp(), it shows what happens if we consider sub-trees that result from using values of \(\alpha\) larger than whatever the cp value was in our original call to rpart(), along with cross-validation scores for each value of \(\alpha\). [Why do larger values of \(\alpha\) yield smaller trees?] As you continuously increase \(\alpha\) to \(1\), you will get a finite number of sub-trees, and I believe printcp() lists one value of \(\alpha\) for each sub-tree. You can then use prune() to produce a sub-tree using one of the other values of \(\alpha\).
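In code, the pattern looks something like this (a sketch; fit is a placeholder rpart object and cp = 0.02 an arbitrary illustrative value):
printcp(fit)                      # cp table with cross-validation error for each sub-tree
plotcp(fit)                       # visualize cross-validation error against cp
pruned <- prune(fit, cp = 0.02)   # sub-tree corresponding to the larger cp value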
In lecture you looked at some code for a regression tree example. Here we will look at a classification tree example.
library(DAAG)
data(spam7)
help(spam7)
spam = spam7
library(rpart)
sprt = rpart(yesno ~ ., method = "class", data = spam)
plot(sprt, margin=0.1)
text(sprt)
sprt
## n= 4601
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4601 1813 n (0.6059552 0.3940448)
## 2) dollar< 0.0555 3471 816 n (0.7649092 0.2350908)
## 4) bang< 0.0915 2420 246 n (0.8983471 0.1016529) *
## 5) bang>=0.0915 1051 481 y (0.4576594 0.5423406)
## 10) crl.tot< 85.5 535 175 n (0.6728972 0.3271028)
## 20) bang< 0.7735 418 106 n (0.7464115 0.2535885) *
## 21) bang>=0.7735 117 48 y (0.4102564 0.5897436)
## 42) crl.tot< 17 43 12 n (0.7209302 0.2790698) *
## 43) crl.tot>=17 74 17 y (0.2297297 0.7702703) *
## 11) crl.tot>=85.5 516 121 y (0.2344961 0.7655039) *
## 3) dollar>=0.0555 1130 133 y (0.1176991 0.8823009) *
Make sure you understand how to read every part of the text output.
As mentioned earlier, the default call to rpart uses Gini. To use entropy, you need to pass another argument, as follows.
isprt = rpart(yesno ~ ., method = "class", parms = list(split = 'information'), data = spam)
plot(isprt, margin=0.1)
text(isprt)
isprt
## n= 4601
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4601 1813 n (0.6059552 0.3940448)
## 2) dollar< 0.0555 3471 816 n (0.7649092 0.2350908)
## 4) bang< 0.0875 2407 242 n (0.8994599 0.1005401) *
## 5) bang>=0.0875 1064 490 y (0.4605263 0.5394737)
## 10) crl.tot< 85.5 541 175 n (0.6765250 0.3234750)
## 20) bang< 0.7735 424 106 n (0.7500000 0.2500000) *
## 21) bang>=0.7735 117 48 y (0.4102564 0.5897436)
## 42) crl.tot< 17 43 12 n (0.7209302 0.2790698) *
## 43) crl.tot>=17 74 17 y (0.2297297 0.7702703) *
## 11) crl.tot>=85.5 523 124 y (0.2370937 0.7629063) *
## 3) dollar>=0.0555 1130 133 y (0.1176991 0.8823009) *
To evaluate these classifiers, we will compute a confusion matrix of actual versus predicted labels, with entries labeled as follows. The function below computes these counts for a range of classification thresholds.
\[\begin{array}{c|cc} & \text{predict $0$} & \text{predict $1$} \\ \hline \text{actual $0$} & a & b \\ \text{actual $1$} & c & d \end{array}\]
# Confusion matrix counts (a, b, c, d) for each threshold:
# a = true negatives, b = false positives, c = false negatives, d = true positives
confusion <- function(y, yhat, thres)
{
  n <- length(thres)
  conf <- matrix(0, length(thres), ncol=4)
  colnames(conf) <- c("a", "b", "c", "d")
  for (i in 1:n)
  {
    a <- sum((!y) & (yhat <= thres[i]))
    b <- sum((!y) & (yhat > thres[i]))
    c <- sum((y) & (yhat <= thres[i]))
    d <- sum((y) & (yhat > thres[i]))
    conf[i,] <- c(a, b, c, d)
  }
  return(conf)
}
y.tr = predict(sprt, spam)[,2]
thres <- seq(0.05, 0.95, by = 0.05)
y = as.numeric(spam$yesno == "y")
tree.conf = confusion(y, y.tr, thres)
qplot(thres, tree.conf[,2]+tree.conf[,3], xlab="threshold", ylab="b+c", geom="line") + theme_classic()
# pick threshold 0.5
conf <- as.vector(tree.conf[which(thres==0.5),])
precision <- conf[4] / (conf[2] + conf[4])
recall <- conf[4] / (conf[3] + conf[4])
c(precision, recall)
## [1] 0.8424419 0.7992278
You can modify the \(\alpha\) value you want to use in the original call to rpart.
set.seed(0)
dense.sprt <- rpart(yesno ~ ., method="class", data=spam, cp=0.001)
plot(dense.sprt, margin=0.1)
text(dense.sprt)
Be careful when reading the printcp output. The rel error and xerror columns are not the misclassification errors [on the full dataset] and cross-validation errors. You need to multiply each column by the Root node error above the table to get the actual errors. The scaling is chosen so that the first row always has rel error and xerror equal to 1.0, for readability.
We can pick the value of \(\alpha\) that has the smallest cross-validation error. (Remember, the cross-validation errors are random.) In practice one might instead follow the “1 standard error” rule; see the discussion of lambda.1se in glmnet below.
table.cp <- printcp(dense.sprt)
##
## Classification tree:
## rpart(formula = yesno ~ ., data = spam, method = "class", cp = 0.001)
##
## Variables actually used in tree construction:
## [1] bang crl.tot dollar money n000
##
## Root node error: 1813/4601 = 0.39404
##
## n= 4601
##
## CP nsplit rel error xerror xstd
## 1 0.4765582 0 1.00000 1.00000 0.018282
## 2 0.0755654 1 0.52344 0.54937 0.015408
## 3 0.0115830 3 0.37231 0.39162 0.013516
## 4 0.0104799 4 0.36073 0.38058 0.013358
## 5 0.0063431 5 0.35025 0.36790 0.013172
## 6 0.0055157 10 0.31660 0.35301 0.012947
## 7 0.0044126 11 0.31109 0.33922 0.012732
## 8 0.0038610 12 0.30667 0.33536 0.012670
## 9 0.0027579 16 0.29123 0.32377 0.012482
## 10 0.0022063 17 0.28847 0.32101 0.012436
## 11 0.0019305 18 0.28627 0.32488 0.012500
## 12 0.0016547 20 0.28240 0.32929 0.012572
## 13 0.0010000 25 0.27413 0.32763 0.012545
cp.x <- table.cp[which.min(table.cp[,4]), 1]
cp.x
## [1] 0.002206288
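As a concrete illustration of the rescaling described above (a quick check, using the Root node error printed with the table):
root.err <- 1813/4601            # the "Root node error" printed above the table
min(table.cp[, 4]) * root.err    # column 4 is xerror; roughly 0.13 misclassification by CV
table.cp[, 3] * root.err         # column 3 is rel error, i.e. error on the full dataset per sub-tree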
fsprt <- rpart(yesno ~ ., method="class", data=spam, cp=cp.x)
plot(fsprt, margin=0.1)
text(fsprt)
fconf <- table(spam$yesno, as.numeric(predict(fsprt)[,2] >= 0.5))
fconf
##
## 0 1
## n 2630 158
## y 365 1448
fconf <- as.vector(t(fconf))
fprecision <- fconf[4] / (fconf[2] + fconf[4])
frecall <- fconf[4] / (fconf[3] + fconf[4])
c(fprecision, frecall)
## [1] 0.9016189 0.7986762
Compare with our original model with the default cp=0.01.
plot(sprt, margin=0.1)
text(sprt)
c(precision, recall)
## [1] 0.8424419 0.7992278
trn <- read.csv("train_titanic.csv", header = TRUE)
rt <- rpart(Survived ~ as.factor(Pclass)+ Sex + SibSp + Parch + Fare + Embarked,
method="class", data=trn)
plot(rt, margin=0.1)
text(rt)
rt
## n= 891
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 891 342 0 (0.61616162 0.38383838)
## 2) Sex=male 577 109 0 (0.81109185 0.18890815) *
## 3) Sex=female 314 81 1 (0.25796178 0.74203822)
## 6) as.factor(Pclass)=3 144 72 0 (0.50000000 0.50000000)
## 12) Fare>=23.35 27 3 0 (0.88888889 0.11111111) *
## 13) Fare< 23.35 117 48 1 (0.41025641 0.58974359)
## 26) Embarked=S 63 31 0 (0.50793651 0.49206349)
## 52) Fare< 10.825 37 15 0 (0.59459459 0.40540541) *
## 53) Fare>=10.825 26 10 1 (0.38461538 0.61538462)
## 106) Fare>=17.6 10 3 0 (0.70000000 0.30000000) *
## 107) Fare< 17.6 16 3 1 (0.18750000 0.81250000) *
## 27) Embarked=C,Q 54 16 1 (0.29629630 0.70370370) *
## 7) as.factor(Pclass)=1,2 170 9 1 (0.05294118 0.94705882) *
Timeline of concepts we have studied in this class:
Let us center and normalize (divide by standard deviation) the columns of our design matrix \(X\) to put all the variables on the same scale, and let us center our response variable \(y\). Then we will not have an intercept term.
At the beginning of the course, I mentioned the following result which holds for any \(n \times p\) matrix \(X\). \[X^\top X \text{ is invertible} \iff \text{rank}(X) = p.\] Also recall the fact \(\text{rank}(X) \le \min\{n, p\}\). Exercise. If \(p > n\), how many solutions \(\widehat{\beta}\) are there to the normal equation \(X^\top X \widehat{\beta} = X^\top y\)?
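A quick numerical illustration of these facts (a sketch with simulated data): when \(p > n\), \(X^\top X\) has rank at most \(n < p\), so it is not invertible.
set.seed(1)
n <- 5; p <- 8
X <- matrix(rnorm(n * p), n, p)
qr(t(X) %*% X)$rank      # rank is at most n = 5, strictly less than p = 8
# solve(t(X) %*% X)      # would fail: the matrix is computationally singular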
Some notation: the \(\ell_q\) norm is defined by \(\|x\|_q = \left(\sum_i |x_i|^q\right)^{1/q}\). So the \(\ell_2\) norm is \(\|x\|_2 = \sqrt{\sum_i x_i^2}\), also known as the Euclidean norm. Also, the \(\ell_1\) norm is \(\|x\|_1 = \sum_i |x_i|\).
Ridge regression has two equivalent formulations. One is the constrained form with a radius \(c\) as a tuning parameter. \[\widehat{\beta}_{RR} = \arg\min_{\beta \in \mathbb{R}^p : \|\beta\|_2^2 \le c} \|y - X\beta\|_2^2\] The other is the Lagrangian/penalized form with penalty parameter \(\lambda\). \[\widehat{\beta}_{RR} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|^2 + \lambda \|\beta\|_2^2\] Each \(c > 0\) corresponds to some value of \(\lambda \ge 0\) and vice versa, but this is not a one-to-one correspondence, since all large values of \(c\) correspond to \(\lambda = 0\). However, there is a one-to-one correspondence if you restrict to strictly positive \(\lambda > 0\) and to \(c < \|\widehat{\beta}_{OLS}\|^2\). See this Piazza post for more explanation.
Exercise. Qualitatively, how do \(c\) and \(\lambda\) depend on each other? Specifically, does increasing \(c\) correspond to increasing \(\lambda\) or decreasing \(\lambda\)?
The constrained formulation is good for intuition (see pictures below), but we typically focus on the penalized formulation in practice.
Exercise. Show that \(X^\top X + \lambda I_p\) is invertible. (Hint 1: What is an equivalent condition for invertibility? Hint 2: How do the eigenvalues of \(X^\top X\) relate to the eigenvalues of \(X^\top X + \lambda I_p\)? Hint 3: \(X^\top X\) is positive semi-definite, i.e. it has nonnegative eigenvalues.)
Exercise. Show that (with the penalized formulation of ridge regression), \[\widehat{\beta} = (X^\top X + \lambda I_p)^{-1} X^\top y.\] (Hint: set the gradient of the objective function to zero.) How many solutions are there?
Exercise. The fitted value \(\widehat{y} = X \widehat{\beta}_{RR}\) can be written as \(\widehat{y} = H_\lambda y\), i.e., this is a linear estimator. Write down \(H_\lambda\). Its trace is the effective degrees of freedom of this estimator; check that this is \(p\) when \(\lambda=0\).
Exercise. Roughly argue why this is a biased estimator of \(\beta\).
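A small numerical sketch related to the closed-form and hat-matrix exercises above (simulated data; not a derivation):
set.seed(1)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))                           # centered, normalized columns
y <- as.vector(scale(X %*% rnorm(p) + rnorm(n), scale = FALSE))  # centered response
lambda <- 10
beta.rr <- solve(t(X) %*% X + lambda * diag(p)) %*% t(X) %*% y   # closed-form ridge solution
H <- X %*% solve(t(X) %*% X + lambda * diag(p)) %*% t(X)         # hat matrix H_lambda
sum(diag(H))                                                     # effective degrees of freedom, less than p
sum(diag(X %*% solve(t(X) %*% X) %*% t(X)))                      # lambda = 0 recovers p = 5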
LASSO is defined similarly, again with two equivalent formulations. One is the constrained form with a radius \(c\) as a tuning parameter. \[\widehat{\beta}_{LASSO} = \arg\min_{\beta \in \mathbb{R}^p : \|\beta\|_1 \le c} \|y - X\beta\|_2^2\] The other is the Lagrangian/penalized form with penalty parameter \(\lambda\). \[\widehat{\beta}_{LASSO} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|^2 + \lambda \|\beta\|_1\] Again, we focus on the penalized formulation.
Here are a few properties regarding the solution set of this optimization problem (see Lemma 1 of this paper).
Let us visualize the constrained formulations of ridge regression and LASSO in the case \(p=2\). In both cases, the objective function that we seek to minimize is \(f(\beta) = \|y - X \beta\|^2\), which is a quadratic function in the vector \(\beta\), so it can be shown to have elliptical contours. The constraint sets \(\|\beta\|_2^2 \le c\) and \(\|\beta\|_1 \le c\) are a circle and a square respectively.
Exercise. In terms of the contours and the two constraint sets, describe where the solution to ridge regression and LASSO appear in the above pictures. In particular, what is \(\widehat{\beta}_1\) in the LASSO solution?
Exercise. Explain why in the above pictures \(n \ge p\).
Exercise. Compare the ridge and LASSO solutions with the OLS solution. Why are these methods sometimes referred to as “shrinkage” estimators?
Exercise. When the radius \(c\) is very large, what are the solutions to each optimization problem?
The above picture attempts to illustrate an attractive property of the LASSO estimator: it tends to return solutions that are sparse (having few nonzero entries). This suggests another possible advantage of LASSO beyond being able to deal with the case \(p > n\): if the true parameter \(\beta\) is sparse (i.e., only a few of the \(p\) explanatory variables actually enter our linear model \(y = X \beta + \epsilon\)), then maybe LASSO can do a good job estimating \(\beta\).
Why don’t we do best subset selection instead? We could ask for \[\min_{\beta : \sum_{j=1}^p I(\beta_j \ne 0) \le k} \|y - X \beta \|^2,\] i.e., look for the \(\beta\) that gives the best fit among all \(\beta\) that have \(\le k\) nonzero components. (Recall that this is essentially what regsubsets does.) However, the number of subsets you need to explore grows exponentially in \(k\). Even the toy example below with the modest sizes \(n=100\), \(p=150\), and \(k\approx p/4\) takes a long time. (Note that \(\binom{p}{k}\) here is already on the order of \(10^{35}\).) Imagine what happens when \(p\) is even larger.
library(leaps)
library(glmnet)
n <- 100
p <- 150
k <- floor(p / 4)
X <- matrix(rnorm(n*p), n, p)
beta <- rep(0, p)
idx <- sample(1:p, k)
beta[idx] <- 1
beta
## [1] 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [36] 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1
## [71] 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
## [106] 1 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 1 0
## [141] 0 0 1 0 0 0 0 0 0 1
y <- X %*% beta + rnorm(n, sd=0.5)
# run these lines to freeze your computer
# rs <- regsubsets(x=X, y=y, nvmax=k, really.big=T)
# summary(rs)
It turns out that it is possible to solve the LASSO optimization problem using “hill climbing” methods; in R, glmnet uses coordinate descent. Running glmnet on the above example is almost instantaneous.
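For instance, on the X and y simulated above (timing will vary by machine, but it should take well under a second):
system.time(fit.lasso <- glmnet(X, y))   # full LASSO path over ~100 values of lambda
dim(coef(fit.lasso))                     # (p + 1) coefficients for each lambda on the path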
Side note: ridge regression and LASSO arise naturally from a Bayesian perspective by considering the Gaussian prior and the Laplace prior respectively, on the coefficients \(\beta_j\).
glmnet
The glmnet vignette is very useful. The following examples are from here.
glmnet solves the following problem \[\min_{\beta_0, \beta} \frac{1}{n}\sum_{i=1}^n (y_i - \beta_0 - \beta^\top x_i)^2 + \lambda \left[\frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1\right],\] where \(\alpha \in [0,1]\) is a parameter alpha to be passed into the call to glmnet(). By default alpha=1, which gives LASSO. To do ridge regression, choose alpha=0. Intermediate values of alpha give a mixture of the two penalties; this general estimator is called elastic net regularization.
A call to glmnet will solve this optimization problem for many values of \(\lambda\). You can manually specify how many values of \(\lambda\) to try by changing nlambda (default is 100), or even pass in your own sequence of \(\lambda\) values to lambda.
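For example, both options look like this (a sketch; X and y stand for a design matrix and response):
fit.coarse <- glmnet(X, y, nlambda = 50)                             # 50 automatically chosen lambdas
fit.manual <- glmnet(X, y, lambda = 10^seq(0, -3, length.out = 50))  # a user-supplied decreasing grid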
library(glmnet)
Let us consider an example where \(p=2000\) but only \(k=20\) entries of \(\beta\) are nonzero.
set.seed(0)
n <- 1000
p <- 2000
k <- 20
X <- matrix(rnorm(n*p), n, p)
beta <- c(rep(1, k), rep(0, p-k))
y <- X %*% beta + rnorm(n)
idx <- sample(1:n, 0.66*n)
X.train <- X[idx,]
y.train <- y[idx]
X.test <- X[-idx,]
y.test <- y[-idx]
# LASSO
fit <- glmnet(X.train, y.train)
plot(fit)
plot(fit, xvar="lambda")
fit
##
## Call: glmnet(x = X.train, y = y.train)
##
## Df %Dev Lambda
## [1,] 0 0.000000 1.38400
## [2,] 1 0.007732 1.32100
## [3,] 3 0.018680 1.26100
## [4,] 3 0.036120 1.20400
## [5,] 6 0.058230 1.14900
## [6,] 7 0.088860 1.09700
## [7,] 9 0.126700 1.04700
## [8,] 14 0.171100 0.99960
## [9,] 14 0.220400 0.95410
## [10,] 17 0.268700 0.91080
## [11,] 17 0.319500 0.86940
## [12,] 18 0.368400 0.82980
## [13,] 18 0.413800 0.79210
## [14,] 20 0.460300 0.75610
## [15,] 20 0.504400 0.72180
## [16,] 20 0.544600 0.68900
## [17,] 20 0.581200 0.65760
## [18,] 20 0.614600 0.62770
## [19,] 20 0.645000 0.59920
## [20,] 20 0.672700 0.57200
## [21,] 20 0.697900 0.54600
## [22,] 20 0.720900 0.52120
## [23,] 20 0.741900 0.49750
## [24,] 20 0.761000 0.47490
## [25,] 20 0.778400 0.45330
## [26,] 20 0.794200 0.43270
## [27,] 20 0.808700 0.41300
## [28,] 20 0.821800 0.39420
## [29,] 20 0.833800 0.37630
## [30,] 20 0.844800 0.35920
## [31,] 20 0.854700 0.34290
## [32,] 20 0.863800 0.32730
## [33,] 20 0.872100 0.31240
## [34,] 20 0.879600 0.29820
## [35,] 20 0.886500 0.28470
## [36,] 20 0.892700 0.27170
## [37,] 20 0.898400 0.25940
## [38,] 20 0.903600 0.24760
## [39,] 20 0.908300 0.23630
## [40,] 20 0.912600 0.22560
## [41,] 20 0.916600 0.21530
## [42,] 20 0.920100 0.20560
## [43,] 20 0.923400 0.19620
## [44,] 20 0.926400 0.18730
## [45,] 20 0.929100 0.17880
## [46,] 20 0.931600 0.17070
## [47,] 20 0.933800 0.16290
## [48,] 20 0.935800 0.15550
## [49,] 20 0.937700 0.14840
## [50,] 23 0.939500 0.14170
## [51,] 24 0.941200 0.13520
## [52,] 26 0.942800 0.12910
## [53,] 27 0.944400 0.12320
## [54,] 27 0.945800 0.11760
## [55,] 29 0.947200 0.11230
## [56,] 34 0.948600 0.10720
## [57,] 37 0.949900 0.10230
## [58,] 42 0.951300 0.09766
## [59,] 45 0.952600 0.09322
## [60,] 61 0.954000 0.08898
## [61,] 71 0.955600 0.08494
## [62,] 84 0.957200 0.08108
## [63,] 97 0.958900 0.07739
## [64,] 107 0.960500 0.07387
## [65,] 117 0.962100 0.07052
## [66,] 126 0.963700 0.06731
## [67,] 144 0.965300 0.06425
## [68,] 156 0.966900 0.06133
## [69,] 172 0.968500 0.05854
## [70,] 186 0.970000 0.05588
## [71,] 196 0.971500 0.05334
## [72,] 205 0.972900 0.05092
## [73,] 213 0.974200 0.04860
## [74,] 229 0.975500 0.04640
## [75,] 239 0.976700 0.04429
## [76,] 257 0.978000 0.04227
## [77,] 276 0.979200 0.04035
## [78,] 294 0.980400 0.03852
## [79,] 306 0.981500 0.03677
## [80,] 311 0.982600 0.03510
## [81,] 327 0.983600 0.03350
## [82,] 337 0.984600 0.03198
## [83,] 343 0.985500 0.03052
## [84,] 354 0.986400 0.02914
## [85,] 362 0.987200 0.02781
## [86,] 373 0.988000 0.02655
## [87,] 389 0.988800 0.02534
## [88,] 399 0.989500 0.02419
## [89,] 405 0.990200 0.02309
## [90,] 421 0.990900 0.02204
## [91,] 433 0.991500 0.02104
## [92,] 442 0.992100 0.02008
## [93,] 450 0.992600 0.01917
## [94,] 460 0.993100 0.01830
## [95,] 473 0.993600 0.01747
## [96,] 480 0.994000 0.01667
## [97,] 496 0.994500 0.01592
## [98,] 507 0.994900 0.01519
## [99,] 517 0.995200 0.01450
## [100,] 529 0.995600 0.01384
You can pick a particular value of \(\lambda\) to use for prediction by using the s parameter.
# Manually pick some lambda
lambda <- 0.6
head(coef(fit, s=lambda), 31)
## 31 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.06033661
## V1 0.35783795
## V2 0.49382138
## V3 0.48093568
## V4 0.48699787
## V5 0.40389194
## V6 0.34392391
## V7 0.20942318
## V8 0.45099496
## V9 0.39386595
## V10 0.41449659
## V11 0.23196174
## V12 0.39679559
## V13 0.50657539
## V14 0.32789130
## V15 0.50899255
## V16 0.39449459
## V17 0.29856477
## V18 0.61545884
## V19 0.49073542
## V20 0.58281857
## V21 .
## V22 .
## V23 .
## V24 .
## V25 .
## V26 .
## V27 .
## V28 .
## V29 .
## V30 .
yhat.lambda <- predict(fit, newx=X.test, s=lambda)
mean((yhat.lambda - y.test)^2)
## [1] 8.08269
Alternatively, you can pick \(\lambda\) by cross-validation by using cv.glmnet.
cvfit <- cv.glmnet(X.train, y.train)
plot(cvfit)
log(cvfit$lambda.min)
## [1] -2.698428
log(cvfit$lambda.1se)
## [1] -2.233259
lambda.min is the choice of \(\lambda\) that minimizes the cross-validation error. lambda.1se is the largest \(\lambda\) (resulting in more shrinkage) such that the cross-validation error is within one standard error of the minimum. Apparently lambda.1se is a standard choice; erring on the side of parsimony (a simpler model) helps mitigate the fact that the cross-validation errors are only estimates of the risk curves. [See here for more discussion.]
head(coef(cvfit, s=cvfit$lambda.1se), 31)
## 31 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.006803268
## V1 0.929112497
## V2 0.846626145
## V3 0.907562891
## V4 0.965250547
## V5 0.919974322
## V6 0.852307175
## V7 0.801137450
## V8 0.866816067
## V9 0.897085107
## V10 0.891687225
## V11 0.821478764
## V12 0.889524147
## V13 0.926431562
## V14 0.826475507
## V15 0.936252930
## V16 0.859245430
## V17 0.876292023
## V18 0.948283398
## V19 0.890718534
## V20 0.972174863
## V21 .
## V22 .
## V23 .
## V24 .
## V25 .
## V26 .
## V27 .
## V28 .
## V29 .
## V30 .
yhat.cvlambda <- predict(cvfit, newx=X.test, s=cvfit$lambda.1se)
mean((yhat.cvlambda - y.test)^2)
## [1] 1.392953
Let’s look at ridge regression.
# Ridge
fit <- glmnet(X.train, y.train, alpha=0)
plot(fit)
plot(fit, xvar="lambda")
cvfit <- cv.glmnet(X.train, y.train, alpha=0)
plot(cvfit)
head(coef(cvfit, s=cvfit$lambda.1se), 31)
## 31 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.1241521163
## V1 0.1035240482
## V2 0.1324763659
## V3 0.1207788741
## V4 0.1202793742
## V5 0.1111811138
## V6 0.0961574424
## V7 0.0834096124
## V8 0.1224087312
## V9 0.1099697054
## V10 0.1156369448
## V11 0.0835027180
## V12 0.1086204498
## V13 0.1270814365
## V14 0.1112939820
## V15 0.1287499578
## V16 0.1085714298
## V17 0.0965797599
## V18 0.1438644679
## V19 0.1265657453
## V20 0.1348795572
## V21 -0.0009848824
## V22 -0.0129740714
## V23 0.0114050240
## V24 -0.0101038484
## V25 0.0050290544
## V26 -0.0025458250
## V27 -0.0232819059
## V28 0.0127580150
## V29 -0.0005004794
## V30 -0.0013555446
yhat.cvlambda <- predict(cvfit, newx=X.test, s=cvfit$lambda.1se)
mean((yhat.cvlambda - y.test)^2)
## [1] 17.41077
Let’s compare LASSO, ridge, and some other choices of the elastic net parameter \(\alpha\). Recall that in our example \(p=2000\) but only \(k=20\) entries of \(\beta\) are nonzero, and we have \(n=1000\) datapoints.
t <- 5
fits <- list()
MSEs <- rep(0, t+1)
for (i in 0:t) {
fits[[i+1]] <- cv.glmnet(X.train, y.train, alpha=i/t)
yhat <- predict(fits[[i+1]], newx=X.test, s=fits[[i+1]]$lambda.1se)
MSEs[i+1] <- mean((yhat - y.test)^2)
}
MSEs
## [1] 16.885109 1.662305 1.482336 1.419935 1.415476 1.410864
(which.min(MSEs) - 1)/t
## [1] 1
LASSO does best in this case because few components of the true \(\beta\) are nonzero.
Let’s now consider an example where there are a lot more nonzero coefficients.
set.seed(0)
n <- 1000
p <- 2000
k <- 500
X <- matrix(rnorm(n*p), n, p)
beta <- c(rep(1, k), rep(0, p-k))
y <- X %*% beta + rnorm(n)
idx <- sample(1:n, 0.66*n)
X.train <- X[idx,]
y.train <- y[idx]
X.test <- X[-idx,]
y.test <- y[-idx]
fit.l <- glmnet(X.train, y.train)
plot(fit.l, xvar="lambda")
fit.r <- glmnet(X.train, y.train, alpha=0)
plot(fit.r, xvar="lambda")
t <- 5
fits <- list()
MSEs <- rep(0, t+1)
for (i in 0:t) {
fits[[i+1]] <- cv.glmnet(X.train, y.train, alpha=i/t)
yhat <- predict(fits[[i+1]], newx=X.test, s=fits[[i+1]]$lambda.1se)
MSEs[i+1] <- mean((yhat - y.test)^2)
}
MSEs
## [1] 362.4554 421.8606 429.0294 437.6898 430.0722 433.3925
(which.min(MSEs) - 1)/t
## [1] 0
Here ridge regression does best. The true \(\beta\) is not sparse enough for LASSO to have better prediction error than ridge regression (and LASSO probably does not find the right variables, judging from the paths plot above). However, despite its better prediction error, the ridge regression estimate is not very interpretable, since it involves all \(p=2000\) variables.
Ridge and elastic net also tend to perform better than LASSO when the explanatory variables are correlated/collinear. See here for some intuition, and here for a simulation.
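Here is a minimal sketch of such a simulation (my own illustration, not the one linked above): the columns of X share a common latent factor, so they are strongly correlated, and every coefficient is small but nonzero.
set.seed(0)
n <- 200; p <- 50
z <- rnorm(n)
X.cor <- matrix(rnorm(n * p), n, p) + 2 * z       # common factor z makes the columns correlated
y.cor <- X.cor %*% rep(0.2, p) + rnorm(n)
cv.ridge <- cv.glmnet(X.cor, y.cor, alpha = 0)
cv.lasso <- cv.glmnet(X.cor, y.cor, alpha = 1)
c(ridge = min(cv.ridge$cvm), lasso = min(cv.lasso$cvm))   # compare cross-validated MSEs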
As suggested by the name of the package, glmnet can also do regularization with other GLM models. We simply replace the least squares term \(\|y - X \beta\|^2\) with the negative log-likelihood of whichever GLM we want.
In the case of logistic regression, it is \[\min_{\beta_0, \beta} -\left[\frac{1}{n} \sum_{i=1}^n y_i (\beta_0 + x_i^\top \beta) - \log(1+\exp(\beta_0 + x_i^\top\beta))\right] + \lambda \left[\frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1\right].\]
We can ask glmnet to use this by adding family="binomial" to the call. Here is an example on the Titanic dataset.
titanic <- read.csv("train_titanic.csv")
titanic <- na.omit(titanic)
# need to use model.matrix() to create design matrix for glmnet
dd <- titanic[, c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")]
dd$Pclass <- as.factor(dd$Pclass)
X <- model.matrix(Survived~., dd)
X <- X[,-1]
fit <- glmnet(X, titanic$Survived, family="binomial")
plot(fit, xvar="lambda")
cvfit <- cv.glmnet(X, titanic$Survived, family="binomial")
plot(cvfit)
coef(cvfit, s=cvfit$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.072540422
## Pclass2 -0.232963212
## Pclass3 -1.368366406
## Sexmale -2.172555695
## Age -0.017343599
## SibSp -0.106753585
## Parch .
## Fare 0.002004744
## EmbarkedC 0.364434572
## EmbarkedQ .
## EmbarkedS .
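As a quick follow-up (a sketch; note these are predictions on the same data used for fitting, so they are optimistic), we can extract predicted survival probabilities and form a confusion matrix at a 0.5 threshold:
phat <- predict(cvfit, newx = X, s = cvfit$lambda.1se, type = "response")  # fitted probabilities
table(actual = titanic$Survived, predicted = as.numeric(phat >= 0.5))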