So far we have mainly considered the model \(y_i \overset{\text{indep}}{\sim} N(\mu_i, \sigma^2)\), where \(\mu_i = x_i^\top \beta\). In the last few weeks you learned about a much more general family of models (generalized linear models) that contains this linear model as a special case. The generality of GLMs helps address deficiencies of the Gaussian linear model by allowing other response distributions and by letting \(x_i^\top \beta\) model a parameter of the distribution that need not be the mean.
Roughly, a GLM is specified by two things.
We assume the \(y_i\) are independent and each follow a distribution from the exponential dispersion family (EDF), a family that includes many common distributions. The PDF/PMF of \(y_i\) is of the form \[f(y_i ; \theta_i ,\phi_i) = h(y_i, \phi_i) \exp \left(\frac{y_i \theta_i - b(\theta_i)}{\phi_i}\right).\] Let’s meet the cast: \(\theta_i\) is the canonical parameter, \(\phi_i\) is the dispersion parameter, \(b\) is a known function of \(\theta_i\) (its role will become clear below), and \(h\) is a normalizing function that may depend on \(y_i\) and \(\phi_i\) but not on \(\theta_i\).
Note that the canonical parameter \(\theta_i\) appears only in two places in the form of the PDF/PMF.
Refer to your lecture notes for examples of common distributions (Bernoulli, binomial, Poisson, Gamma) in this family. You should try to write them in EDF form on your own for practice. Remember to follow the rules (\(h\) cannot depend on \(\theta_i\), etc.).
We have specified a distribution for \(y_i\), but we have not yet specified how \(x_i^\top \beta\) enters the model. In our linear model, \[\mu_i \equiv \mathbb{E}[y_i] = x_i^\top \beta,\] but we might instead set \(g(\mu_i) = x_i^\top \beta\) for some function \(g\) (the link function), so that \(x_i^\top \beta\) models a transformation of the mean rather than the mean itself.
What is the mean of this distribution, i.e. what is \(\mathbb{E}[y_i]\)? A very important property of \(b\) is \[\mu_i \equiv \mathbb{E}[y_i] = b'(\theta_i).\] We can rewrite this as \[g(\mu_i) = \theta_i,\] where \(g := (b')^{-1}\) is called the canonical link function. If we use this link function, then \(x_i^\top \beta\) enters the model directly as \[\theta_i = g(\mu_i) = x_i^\top \beta.\]
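Why does this property of \(b\) hold? Here is a quick sketch, assuming we may differentiate under the integral sign (replace the integral by a sum for discrete \(y_i\)): since \(\int f(y_i; \theta_i, \phi_i)\, dy_i = 1\) for every \(\theta_i\), differentiating both sides in \(\theta_i\) gives \[0 = \int \frac{\partial}{\partial \theta_i} f(y_i; \theta_i, \phi_i)\, dy_i = \int \frac{y_i - b'(\theta_i)}{\phi_i}\, f(y_i; \theta_i, \phi_i)\, dy_i = \frac{\mathbb{E}[y_i] - b'(\theta_i)}{\phi_i},\] so \(\mathbb{E}[y_i] = b'(\theta_i)\).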
The Bernoulli PMF can be written as \[f(y_i; p_i) = p_i^{y_i} (1 - p_i)^{1 - y_i} = \exp \left(y_i \log \frac{p_i}{1 - p_i} + \log(1 - p_i) \right), \qquad y_i \in \{0,1\}.\] (Note that \(p_i = \mu_i\).) So, the canonical parameter is the log-odds \(\theta_i = \log \frac{p_i}{1 - p_i}\), and if we choose to use the canonical link function, our model would have \(\log \frac{p_i}{1 - p_i} = x_i^\top \beta\) or equivalently \[p_i = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)}.\] This is logistic regression.
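To connect this with the general EDF form, note the identification (a good self-check): \[\theta_i = \log \frac{p_i}{1-p_i}, \qquad b(\theta_i) = -\log(1-p_i) = \log(1 + e^{\theta_i}), \qquad \phi_i = 1, \qquad h(y_i, \phi_i) = 1,\] and indeed \(b'(\theta_i) = \frac{e^{\theta_i}}{1 + e^{\theta_i}} = p_i = \mu_i\), as the mean property promises.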
In probit regression, one uses the link function \(g = \Phi^{-1}\) instead of the canonical link function \(g(p_i) = \log \frac{p_i}{1 - p_i}\). This leads to \[p_i = \Phi(x_i^\top \beta)\] instead.
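In R, the link is specified inside the family argument. Here is a minimal sketch, where y and x are placeholder names for a binary response and a predictor (not from our data):
fit.logit  <- glm(y ~ x, family = binomial)                  # canonical (logit) link
fit.probit <- glm(y ~ x, family = binomial(link = "probit")) # probit link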
The Poisson PMF can be written as \[f(y_i; \lambda_i) = e^{- \lambda_i} \frac{\lambda_i^{y_i}}{y_i!} = \frac{1}{y_i!} \exp(y_i \log \lambda_i - \lambda_i), \qquad y_i \in \{0, 1, 2, \ldots\}.\] (Note \(\mu_i = \lambda_i\).) Thus if we use the canonical link function, we would have \(\theta_i := \log \lambda_i = x_i^\top \beta\), or equivalently \[\lambda_i = \exp(x_i^\top \beta).\]
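The EDF ingredients here are \[\theta_i = \log \lambda_i, \qquad b(\theta_i) = e^{\theta_i}, \qquad \phi_i = 1, \qquad h(y_i, \phi_i) = \frac{1}{y_i!},\] and \(b'(\theta_i) = e^{\theta_i} = \lambda_i = \mu_i\), again matching the mean property. (In R, family = poisson uses this log link by default.)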
One definition of residual deviance for a model is \[- 2 \log (\text{maximized likelihood of our model}).\]
The null deviance is \[-2 \log (\text{maximized likelihood of the intercept-only model}).\]
Roughly, the larger the difference between residual deviance and null deviance, the better your model explains the variation in the data.
Warning: sometimes these deviance terms are defined with an added constant. For example, your textbook adds an extra term \(+ 2 \log (\text{maximized likelihood of the saturated model})\) (see 15.1.1 or 15.3.3 of Fox) to both definitions. For the sake of comparing the residual deviance and null deviance, this does not really change anything, since the difference between the two remains the same with or without this added constant. However, this extra term does “standardize” things in some sense; for example, in the linear model examples above, it precisely gets rid of the annoying \(n \log (2 \pi \sigma^2)\) term.
In R, we fit GLMs with glm and inspect the fit with summary. This is an excerpt of code from yesterday’s lecture.
library(DAAG)
## Loading required package: lattice
data(frogs)
help(frogs)
frogs.glm0 <- glm(formula = pres.abs ~ altitude + log(distance) +
log(NoOfPools) + NoOfSites + avrain + meanmin + meanmax,
family = binomial, data = frogs)
summary(frogs.glm0)
##
## Call:
## glm(formula = pres.abs ~ altitude + log(distance) + log(NoOfPools) +
## NoOfSites + avrain + meanmin + meanmax, family = binomial,
## data = frogs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9795 -0.7193 -0.2785 0.7964 2.5658
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.090e+01 1.327e+02 0.308 0.757845
## altitude -6.648e-03 3.866e-02 -0.172 0.863466
## log(distance) -7.593e-01 2.554e-01 -2.973 0.002945 **
## log(NoOfPools) 5.727e-01 2.162e-01 2.649 0.008083 **
## NoOfSites -8.979e-04 1.074e-01 -0.008 0.993330
## avrain -6.793e-03 5.999e-02 -0.113 0.909848
## meanmin 5.305e+00 1.543e+00 3.439 0.000584 ***
## meanmax -3.173e+00 4.839e+00 -0.656 0.512048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 279.99 on 211 degrees of freedom
## Residual deviance: 197.62 on 204 degrees of freedom
## AIC: 213.62
##
## Number of Fisher Scoring iterations: 5
n <- dim(frogs)[1]
p <- 7
y <- frogs$pres.abs
ybar <- mean(y)
phat <- fitted(frogs.glm0)
# residual deviance
my.res.dev <- -2 * sum(y * log(phat) + (1 - y) * log(1 - phat))
c(my.res.dev, summary(frogs.glm0)$deviance)
## [1] 197.6249 197.6249
# residual degrees of freedom
c(n - p - 1, summary(frogs.glm0)$df.residual)
## [1] 204 204
# null deviance
my.null.dev <- -2 * n * (ybar * log(ybar) + (1 - ybar) * log(1 - ybar))
c(my.null.dev, summary(frogs.glm0)$null.deviance)
## [1] 279.987 279.987
# null degrees of freedom
c(n - 1, summary(frogs.glm0)$df.null)
## [1] 211 211
# AIC is residual deviance + 2 * (p + 1)
c(my.res.dev + 2 * (p + 1), summary(frogs.glm0)$aic)
## [1] 213.6249 213.6249
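As a sanity check tying back to the warning about the saturated-model constant: for binary 0/1 responses the saturated model fits each observation exactly, so its maximized log likelihood is 0 and R’s reported residual deviance coincides with \(-2\) times the maximized log likelihood of our model. A quick check with the logLik and deviance extractors (both numbers should equal the 197.62 reported above):
# -2 * (maximized log likelihood) vs. R's reported residual deviance
c(-2 * as.numeric(logLik(frogs.glm0)), deviance(frogs.glm0))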
Before we turn to our application, let us briefly compare these fitted probabilities with what linear regression gives.
frogs.lm <- lm(formula = pres.abs ~ altitude + log(distance) +
log(NoOfPools) + NoOfSites + avrain + meanmin + meanmax,
data = frogs)
phat.lm <- fitted(frogs.lm)
plot(phat.lm, phat, xlab="linear model fitted prob.", ylab="logistic fitted prob.", ylim=c(0,1))
abline(0, 1)
Suppose we get data for a new site and want to predict whether frogs will be at the site or not. If we use our fitted model, we will obtain an estimate for the probability of frogs being present. But how do we turn this into a prediction? Should we use a cutoff of \(0.5\), and predict yes if the estimated probability is \(\ge 0.5\)?
Consider the following confusion matrix. \[\begin{array}{c|cc} & \text{predict $0$} & \text{predict $1$} \\ \hline \text{actual $0$} & a & b \\ \text{actual $1$} & c & d \end{array}\]
We would like \(a\) and \(d\) to be large, and we would like \(b\) and \(c\) to be small.
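For a single threshold, say \(0.5\), one way to tabulate these counts in R (using the y and phat computed above) is:
# confusion matrix for the logistic fit at threshold 0.5:
# rows are actual 0/1, columns are predicted 0/1
table(actual = y, predicted = as.numeric(phat >= 0.5))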
Let us compare the values of \(b+c\) on our given data, for various threshold values.
thres.vec <- seq(0, 1, by=0.05)
conf <- matrix(0, nrow=length(thres.vec), ncol=4)
conf.lm <- matrix(0, nrow=length(thres.vec), ncol=4)
for (i in 1:length(thres.vec)) {
thres <- thres.vec[i]
conf[i, 1] <- sum((!y) & (phat < thres))
conf[i, 2] <- sum((!y) & (phat >= thres))
conf[i, 3] <- sum(y & (phat < thres))
conf[i, 4] <- sum(y & (phat >= thres))
}
for (i in 1:length(thres.vec)) {
thres <- thres.vec[i]
conf.lm[i, 1] <- sum((!y) & (phat.lm < thres))
conf.lm[i, 2] <- sum((!y) & (phat.lm >= thres))
conf.lm[i, 3] <- sum(y & (phat.lm < thres))
conf.lm[i, 4] <- sum(y & (phat.lm >= thres))
}
bplusc <- conf[,2] + conf[,3]
bplusc.lm <- conf.lm[,2] + conf.lm[,3]
matplot(thres.vec, cbind(bplusc, bplusc.lm), xlab="threshold", ylab="b+c", type='l', col='black')
thres.vec[which.min(bplusc)]
## [1] 0.55
legend(x=0.8, y=120, lty=1:2, col='black', c("glm","lm")) # match the black lines drawn by matplot above
We see that \(0.55\) is the best threshold for logistic regression if we want to minimize \(b+c\), the number of errors.
Our choice of \(b+c\) was arbitrary, and assumed that we consider both types of errors (false positive, false negative) as equally bad.
For example, suppose a bank wants to decide whether to accept a customer’s request for a loan. There are two types of errors: rejecting a customer who would have repaid the loan, and granting a loan to a customer who then defaults.
The second error is more severe for the bank, so in that situation one might put more weight on one error over the other.
As another example, consider a medical test for detecting a lethal disease; the two errors are flagging the disease in a healthy patient (false positive) and failing to detect the disease in a patient who has it (false negative).
Depending on your perspective, you may put more weight on one error over the other.
Let us see what happens if we use \(b + 3c\) for our frog example instead. (You really want to find frogs, so you would be quite sad if you did not visit a site that had frogs only because your model mistakenly predicted that there are no frogs there.)
bplus3c <- conf[,2] + 3*conf[,3]
bplus3c.lm <- conf.lm[,2] + 3*conf.lm[,3]
matplot(thres.vec, cbind(bplus3c, bplus3c.lm), xlab="threshold", ylab="b+3c", type='l')
thres.vec[which.min(bplus3c)]
## [1] 0.25
legend(x=0.8, y=120, lty=1:2, col=1:2, c("glm","lm")) # matplot above uses the default colors 1:2
As expected, the best threshold (on our data) is lower (\(0.25\)), since we now prefer to over-label sites as “yes” to avoid missing sites that have frogs.
The receiver operating characteristic (ROC) curve is a plot of the true positive rate \(\frac{d}{c+d}\) against the false positive rate \(\frac{b}{a+b}\) as the threshold varies.
tpr <- conf[,4] / (conf[,3] + conf[,4])
fpr <- conf[,2] / (conf[,1] + conf[,2])
tpr.lm <- conf.lm[,4] / (conf.lm[,3] + conf.lm[,4])
fpr.lm <- conf.lm[,2] / (conf.lm[,1] + conf.lm[,2])
matplot(cbind(fpr, fpr.lm), cbind(tpr, tpr.lm), type='l', col='black', xlim=c(0,1), ylim=c(0,1),
xlab="false positive rate", ylab="true positive rate", main="ROC curve")
legend(0.8, 0.2, lty=1:2, c("glm", "lm"))
We now turn to the spam7 data. Some exploratory data analysis suggested log transforms for some of the variables.
library(DAAG)
data(spam7)
help(spam7)
spam <- spam7
head(spam)
## crl.tot dollar bang money n000 make yesno
## 1 278 0.000 0.778 0.00 0.00 0.00 y
## 2 1028 0.180 0.372 0.43 0.43 0.21 y
## 3 2259 0.184 0.276 0.06 1.16 0.06 y
## 4 191 0.000 0.137 0.00 0.00 0.00 y
## 5 191 0.000 0.135 0.00 0.00 0.00 y
## 6 54 0.000 0.000 0.00 0.00 0.00 y
n <- dim(spam)[1]
s <- 1e-3 # small offset to avoid log(0), which is -Inf
spam.glm <- glm(yesno~ log(crl.tot) + log(dollar+s) + log(bang+s)
+log(money+s) + log(n000+s) + log(make+s),
family=binomial, data=spam)
summary(spam.glm)
##
## Call:
## glm(formula = yesno ~ log(crl.tot) + log(dollar + s) + log(bang +
## s) + log(money + s) + log(n000 + s) + log(make + s), family = binomial,
## data = spam)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1657 -0.4367 -0.2863 0.3609 2.7152
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.11947 0.36342 11.335 < 2e-16 ***
## log(crl.tot) 0.30228 0.03693 8.185 2.71e-16 ***
## log(dollar + s) 0.32586 0.02365 13.777 < 2e-16 ***
## log(bang + s) 0.40984 0.01597 25.661 < 2e-16 ***
## log(money + s) 0.34563 0.02800 12.345 < 2e-16 ***
## log(n000 + s) 0.18947 0.02931 6.463 1.02e-10 ***
## log(make + s) -0.11418 0.02206 -5.177 2.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6170.2 on 4600 degrees of freedom
## Residual deviance: 3245.1 on 4594 degrees of freedom
## AIC: 3259.1
##
## Number of Fisher Scoring iterations: 6
spam.lm <- lm(as.numeric(yesno=="y")~ log(crl.tot) + log(dollar+s)
+ log(bang+s) +log(money+s) + log(n000+s) + log(make+s)
,data=spam)
summary(spam.lm)
##
## Call:
## lm(formula = as.numeric(yesno == "y") ~ log(crl.tot) + log(dollar +
## s) + log(bang + s) + log(money + s) + log(n000 + s) + log(make +
## s), data = spam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10937 -0.13830 -0.05674 0.15262 1.05619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.078531 0.034188 31.547 < 2e-16 ***
## log(crl.tot) 0.028611 0.003978 7.193 7.38e-13 ***
## log(dollar + s) 0.054878 0.002934 18.703 < 2e-16 ***
## log(bang + s) 0.064522 0.001919 33.619 < 2e-16 ***
## log(money + s) 0.039776 0.002751 14.457 < 2e-16 ***
## log(n000 + s) 0.018530 0.002815 6.582 5.16e-11 ***
## log(make + s) -0.017380 0.002370 -7.335 2.61e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3391 on 4594 degrees of freedom
## Multiple R-squared: 0.5193, Adjusted R-squared: 0.5186
## F-statistic: 827.1 on 6 and 4594 DF, p-value: < 2.2e-16
p <- 6
y <- as.numeric(spam$yesno) - 1
phat <- fitted(spam.glm)
phat.lm <- fitted(spam.lm)
We can use the same code as before to consider different thresholds for prediction.
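Here is that reuse spelled out, assuming we again minimize \(b+c\) over the same threshold grid for the logistic fit:
thres.vec <- seq(0, 1, by=0.05)
conf <- matrix(0, nrow=length(thres.vec), ncol=4)
for (i in 1:length(thres.vec)) {
  thres <- thres.vec[i]
  conf[i, 1] <- sum((!y) & (phat < thres))   # a: non-spam predicted non-spam
  conf[i, 2] <- sum((!y) & (phat >= thres))  # b: non-spam predicted spam
  conf[i, 3] <- sum(y & (phat < thres))      # c: spam predicted non-spam
  conf[i, 4] <- sum(y & (phat >= thres))     # d: spam predicted spam
}
bplusc <- conf[,2] + conf[,3]
# threshold minimizing b+c for the logistic fit
thres.vec[which.min(bplusc)]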
## [1] 0.5