Brief review of GLMs so far

So far we have mainly considered the model \(y_i \overset{\text{indep}}{\sim} N(\mu_i, \sigma^2)\), where \(\mu_i = x_i^\top \beta\). In the last few weeks you learned about a much more general family of models (generalized linear models) that contains this linear model as a special case. The generality of GLMs helps address modeling deficiencies of the Gaussian linear model by allowing other response distributions and by letting \(x_i^\top \beta\) model a parameter of the distribution that need not be the mean.

Roughly, a GLM is specified by two things.

  1. Distribution (e.g., from the exponential dispersion family).
  2. Link function: how does \(x_i^\top \beta\) enter the model? (e.g., canonical link function)
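
For example, combining the Bernoulli distribution with its canonical (logit) link gives logistic regression: \(y_i \sim \text{Bern}(p_i)\) with \(\log \frac{p_i}{1 - p_i} = x_i^\top \beta\), which we will use repeatedly below.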

Exponential dispersion model

We assume the \(y_i\) are independent and each follow a distribution from the exponential dispersion family (EDF), which is a family of distributions that includes many common distributions. The PDF/PMF of \(y_i\) is of the form \[f(y_i ; \theta_i ,\phi_i) = h(y_i, \phi_i) \exp \left(\frac{y_i \theta_i - b(\theta_i)}{\phi_i}\right).\] Let’s meet the cast.

  • canonical parameter \(\theta_i\) (might not be the mean of the distribution)
  • cumulant function / normalization function \(b(\theta_i)\)
  • dispersion parameter \(\phi_i\) (often assumed to be the same for all \(i\))
  • base measure \(h(y_i, \phi_i)\) does not depend on \(\theta_i\)

Note that the canonical parameter \(\theta_i\) appears in only two places in the PDF/PMF: in the product \(y_i \theta_i\) and inside the cumulant function \(b(\theta_i)\).

Refer to your lecture notes for examples of common distributions (Bernoulli, binomial, Poisson, Gamma) in this family. You should try to write them in EDF form on your own for practice. Remember to follow the rules (\(h\) cannot depend on \(\theta_i\), etc.).
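
For instance, here is one such example worked out (check it against the rules above): the Poisson distribution with mean \(\mu_i\) has PMF \[f(y_i; \theta_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!} = \frac{1}{y_i!} \exp\left(y_i \theta_i - e^{\theta_i}\right), \qquad \text{where } \theta_i = \log \mu_i,\] so the canonical parameter is \(\theta_i = \log \mu_i\), the cumulant function is \(b(\theta_i) = e^{\theta_i}\), the dispersion is \(\phi_i = 1\), and the base measure is \(h(y_i, \phi_i) = 1/y_i!\).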

Deviances

One definition of residual deviance for a model is \[- 2 \log (\text{maximized likelihood of our model}).\]

  • Example: linear model. Consider the usual linear model \(y \sim N(X \beta, \sigma^2 I_n)\), where \(\sigma^2\) is assumed to be known. One can check (exercise) that the residual deviance here is \[- 2 \ell(\widehat{\beta}) = \frac{\|y - X \widehat{\beta}\|^2}{{\sigma}^2} + n \log (2 \pi {\sigma}^2) = \frac{\text{RSS}}{{\sigma}^2} + n \log(2 \pi {\sigma}^2).\] So residual deviance is in some sense a generalization of \(\text{RSS}\). (A quick numerical check of this identity appears in the sketch after this list.)
  • Example: logistic model. In the logistic model \(y_i \sim \text{Bern}(p_i)\) with \(\log \frac{p_i}{1 - p_i} = x_i^\top \beta\), this would be \[- 2 \ell(\widehat{\beta}) = - 2 \sum_{i=1}^n \left[y_i \log \widehat{p}_i + (1 - y_i) \log (1 - \widehat{p}_i)\right]\] where \(\widehat{p}_i\) satisfies \(\log \frac{\widehat{p}_i}{1 - \widehat{p}_i} = x_i^\top \widehat{\beta}\) and \(\widehat{\beta}\) is the MLE, for which we do not have a closed-form expression.
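
As a quick numerical check of the Gaussian identity in the first example, here is a minimal sketch on simulated data (the variable names are just for illustration); it treats \(\sigma^2\) as known and compares \(-2 \ell(\widehat{\beta})\), computed directly from the normal density, with \(\text{RSS}/\sigma^2 + n \log(2 \pi \sigma^2)\).

# Simulate a small Gaussian linear model with known sigma^2
set.seed(1)
n.sim <- 50
x.sim <- rnorm(n.sim)
sigma <- 2
y.sim <- 1 + 2 * x.sim + rnorm(n.sim, sd = sigma)
fit.sim <- lm(y.sim ~ x.sim)   # the MLE of beta (with sigma^2 known) is still least squares

# -2 * log-likelihood at beta-hat, holding sigma^2 fixed at its known value
minus2loglik <- -2 * sum(dnorm(y.sim, mean = fitted(fit.sim), sd = sigma, log = TRUE))
rss <- sum(resid(fit.sim)^2)
c(minus2loglik, rss / sigma^2 + n.sim * log(2 * pi * sigma^2))  # the two should agree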

The null deviance is \[-2 \log (\text{maximized likelihood of the intercept-only model}).\]

  • Example: linear model. Consider the linear model, \(y_i \sim N(\beta_0, \sigma^2)\), where \(\sigma^2\) is assumed to be known. We know the MLE is \(\widehat{\beta}_0 = \overline{y}\), so the null deviance is \[\frac{\|y - \overline{y}\|^2}{\sigma^2} + n \log (2 \pi \sigma^2) = \frac{\text{TSS}}{\sigma^2} + n \log (2 \pi \sigma^2)\] So null deviance is in some sense a generalization of \(\text{TSS}\).
  • Example: logistic model. In the logistic model with only an intercept, we have \(y_i \sim \text{Bern}(p) = \text{Bern}\left(\frac{\exp(\beta_0)}{1 + \exp(\beta_0)}\right)\) (i.e., each \(y_i\) is Bernoulli with a common parameter \(p\)), so one can check (exercise) that the likelihood is maximized when \(\widehat{\beta}_0 = \log \frac{\overline{y}}{1 - \overline{y}}\) (or equivalently, when \(\widehat{p} = \overline{y}\)). Thus, the null deviance is \[- 2 \sum_{i=1}^n \left[y_i \log \overline{y} + (1 - y_i) \log (1 - \overline{y})\right] = - 2 n \left[\overline{y} \log \overline{y} + (1 - \overline{y}) \log (1 - \overline{y})\right].\]
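
To verify the claim about \(\widehat{p}\), note that with a common success probability \(p\) the log likelihood is \[\ell(p) = \sum_{i=1}^n \left[y_i \log p + (1 - y_i) \log (1 - p)\right] = n \left[\overline{y} \log p + (1 - \overline{y}) \log (1 - p)\right],\] and setting \(\ell'(p) = n \left[\frac{\overline{y}}{p} - \frac{1 - \overline{y}}{1 - p}\right] = 0\) gives \(\widehat{p} = \overline{y}\).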

Roughly, the larger the drop from the null deviance to the residual deviance, the more of the variation in the data your model explains.

Warning: sometimes these deviance terms are defined with an added constant. For example, your textbook adds an extra term \(+ 2 \log (\text{maximized likelihood of the saturated model})\) (see 15.1.1 or 15.3.3 of Fox) to both definitions. For the sake of comparing the residual deviance and null deviance, this does not really change anything, since the difference between the two will remain the same with or without this added constant. However, this extra term does “standardize” things in some sense; for example in the linear model examples above, it precisely gets rid of the annoying \(n \log (2 \pi \sigma^2)\) term.
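
For instance, in the Gaussian examples the saturated model fits \(\widehat{\mu}_i = y_i\), so its maximized likelihood is \((2 \pi \sigma^2)^{-n/2}\) and the extra term equals \(- n \log (2 \pi \sigma^2)\), cancelling the constant exactly. In the Bernoulli examples the saturated model fits \(\widehat{p}_i = y_i\), so its maximized likelihood is \(1\) and the extra term is \(0\); this is consistent with the glm computations below, where our by-hand formulas that omit the term match R's reported deviances.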

Deviance computations in glm summary

This is an excerpt of the code from yesterday’s lecture:

library(DAAG)
## Loading required package: lattice
data(frogs)
help(frogs)
frogs.glm0 <- glm(formula = pres.abs ~ altitude + log(distance) +
                    log(NoOfPools) + NoOfSites + avrain + meanmin + meanmax,
                  family = binomial, data = frogs)
summary(frogs.glm0)
## 
## Call:
## glm(formula = pres.abs ~ altitude + log(distance) + log(NoOfPools) + 
##     NoOfSites + avrain + meanmin + meanmax, family = binomial, 
##     data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9795  -0.7193  -0.2785   0.7964   2.5658  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     4.090e+01  1.327e+02   0.308 0.757845    
## altitude       -6.648e-03  3.866e-02  -0.172 0.863466    
## log(distance)  -7.593e-01  2.554e-01  -2.973 0.002945 ** 
## log(NoOfPools)  5.727e-01  2.162e-01   2.649 0.008083 ** 
## NoOfSites      -8.979e-04  1.074e-01  -0.008 0.993330    
## avrain         -6.793e-03  5.999e-02  -0.113 0.909848    
## meanmin         5.305e+00  1.543e+00   3.439 0.000584 ***
## meanmax        -3.173e+00  4.839e+00  -0.656 0.512048    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.62  on 204  degrees of freedom
## AIC: 213.62
## 
## Number of Fisher Scoring iterations: 5
n <- dim(frogs)[1]
p <- 7
y <- frogs$pres.abs
ybar <- mean(y)
phat <- fitted(frogs.glm0)

# residual deviance
my.res.dev <- -2 * sum(y * log(phat) + (1 - y) * log(1 - phat))
c(my.res.dev, summary(frogs.glm0)$deviance)
## [1] 197.6249 197.6249
# residual degrees of freedom
c(n - p - 1, summary(frogs.glm0)$df.residual)
## [1] 204 204
# null deviance
my.null.dev <- -2 * n * (ybar * log(ybar) + (1 - ybar) * log(1 - ybar))
c(my.null.dev, summary(frogs.glm0)$null.deviance)
## [1] 279.987 279.987
# null degrees of freedom
c(n - 1, summary(frogs.glm0)$df.null)
## [1] 211 211
# AIC is residual deviance + 2 * (p + 1)
c(my.res.dev + 2 * (p + 1), summary(frogs.glm0)$aic)
## [1] 213.6249 213.6249

Before we turn to our application, let us glance at how the logistic fitted probabilities compare with the fitted values from an ordinary linear regression on the same data.

frogs.lm <- lm(formula = pres.abs ~ altitude + log(distance) +
                    log(NoOfPools) + NoOfSites + avrain + meanmin + meanmax,
                  data = frogs)
phat.lm <- fitted(frogs.lm)
plot(phat.lm, phat, xlab="linear model fitted prob.", ylab="logistic fitted prob.", ylim=c(0,1))
abline(0, 1)
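
One difference worth checking directly (this check is not part of the lecture output) is that the logistic fitted probabilities are guaranteed to lie in \([0, 1]\), while the linear-model fitted values are not:

range(phat)     # logistic fitted probabilities, always inside [0, 1]
range(phat.lm)  # lm fitted values can fall outside [0, 1]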

Prediction application in logistic regression

Suppose we get data for a new site and want to predict whether frogs will be at the site or not. If we use our fitted model, we will obtain an estimate for the probability of frogs being present. But how do we turn this into a prediction? Should we use a cutoff of \(0.5\), and predict yes if the estimated probability is \(\ge 0.5\)?

Consider the following confusion matrix. \[\begin{array}{c|cc} & \text{predict $0$} & \text{predict $1$} \\ \hline \text{actual $0$} & a & b \\ \text{actual $1$} & c & d \end{array}\]

We would like \(a\) and \(d\) to be large, and we would like \(b\) and \(c\) to be small.

Let us compare the values of \(b+c\) on our given data, for various threshold values.

thres.vec <- seq(0, 1, by=0.05)
conf <- matrix(0, nrow=length(thres.vec), ncol=4)
conf.lm <- matrix(0, nrow=length(thres.vec), ncol=4)
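# Each row of conf (and conf.lm) holds the confusion-matrix counts (a, b, c, d) at one threshold:
#   column 1 = a (actual 0, predict 0),  column 2 = b (actual 0, predict 1),
#   column 3 = c (actual 1, predict 0),  column 4 = d (actual 1, predict 1)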
for (i in 1:length(thres.vec)) {
  thres <- thres.vec[i]
  conf[i, 1] <- sum((!y) & (phat < thres))
  conf[i, 2] <- sum((!y) & (phat >= thres))
  conf[i, 3] <- sum(y & (phat < thres))
  conf[i, 4] <- sum(y & (phat >= thres))
}

for (i in 1:length(thres.vec)) {
  thres <- thres.vec[i]
  conf.lm[i, 1] <- sum((!y) & (phat.lm < thres))
  conf.lm[i, 2] <- sum((!y) & (phat.lm >= thres))
  conf.lm[i, 3] <- sum(y & (phat.lm < thres))
  conf.lm[i, 4] <- sum(y & (phat.lm >= thres))
}

bplusc <- conf[,2] + conf[,3]
bplusc.lm <- conf.lm[,2] + conf.lm[,3]
matplot(thres.vec, cbind(bplusc, bplusc.lm), xlab="threshold", ylab="b+c", type='l', col='black')
thres.vec[which.min(bplusc)]
## [1] 0.55
legend(x=0.8, y=120, lty=1:2, col='black', c("glm","lm"))

We see that \(0.55\) is the best threshold for logistic regression if we want to minimize \(b+c\), the number of errors.

Our choice of \(b+c\) was arbitrary, and assumed that we consider both types of errors (false positive, false negative) as equally bad.

For example, suppose a bank wants to decide whether to accept a customer’s request for a loan. There are two types of errors.

  1. The bank denies the loan request, but the customer would have repaid the loan.
  2. The bank grants the loan request, but the customer goes bankrupt.

The second error is more severe for the bank, so in that situation one might put more weight on one error over the other.

As another example, consider a medical test for detecting a lethal disease.

  1. The test indicates the patient has the disease, but the patient is actually healthy.
  2. The patient actually has the disease, but the test does not detect it.

Depending on your perspective, you may put more weight on one error over the other.

Let us see what happens if we use \(b + 3c\) for our frog example instead. (You really want to find frogs, so you would be really sad if you did not visit a site that had frogs only because your model mistakenly predicted that there are no frogs there.)

bplus3c <- conf[,2] + 3*conf[,3]
bplus3c.lm <- conf.lm[,2] + 3*conf.lm[,3]
matplot(thres.vec, cbind(bplus3c, bplus3c.lm), xlab="threshold", ylab="b+3c", type='l')
thres.vec[which.min(bplus3c)]
## [1] 0.25
legend(x=0.8, y=120, lty=1:2, col=1:2, c("glm","lm"))

As expected, the best threshold (on our data) is lower (\(0.25\)), since we now prefer to over-label sites as “yes” in order to avoid missing sites that have frogs.

The receiver operating characteristic (ROC) curve is a plot of the true positive rate \(\frac{d}{c+d}\) against the false positive rate \(\frac{b}{a+b}\) as the threshold varies.

tpr <- conf[,4] / (conf[,3] + conf[,4])
fpr <- conf[,2] / (conf[,1] + conf[,2])
tpr.lm <- conf.lm[,4] / (conf.lm[,3] + conf.lm[,4])
fpr.lm <- conf.lm[,2] / (conf.lm[,1] + conf.lm[,2])
matplot(cbind(fpr, fpr.lm), cbind(tpr, tpr.lm), type='l', col='black', xlim=c(0,1), ylim=c(0,1),
        xlab="false positive rate", ylab="true positive rate", main="ROC curve")
legend(0.8, 0.2, lty=1:2, c("glm", "lm"))

Another example with spam data

Some exploratory data analysis suggested log transforms for some of the variables.

library(DAAG)
data(spam7)
help(spam7)
spam <- spam7
head(spam)
##   crl.tot dollar  bang money n000 make yesno
## 1     278  0.000 0.778  0.00 0.00 0.00     y
## 2    1028  0.180 0.372  0.43 0.43 0.21     y
## 3    2259  0.184 0.276  0.06 1.16 0.06     y
## 4     191  0.000 0.137  0.00 0.00 0.00     y
## 5     191  0.000 0.135  0.00 0.00 0.00     y
## 6      54  0.000 0.000  0.00 0.00 0.00     y
n <- dim(spam)[1]

s <- 1e-3 # small offset to avoid taking log(0), which is -Inf

spam.glm <- glm(yesno~ log(crl.tot) + log(dollar+s) + log(bang+s)
                +log(money+s) + log(n000+s) + log(make+s),
                family=binomial, data=spam)
summary(spam.glm)
## 
## Call:
## glm(formula = yesno ~ log(crl.tot) + log(dollar + s) + log(bang + 
##     s) + log(money + s) + log(n000 + s) + log(make + s), family = binomial, 
##     data = spam)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1657  -0.4367  -0.2863   0.3609   2.7152  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      4.11947    0.36342  11.335  < 2e-16 ***
## log(crl.tot)     0.30228    0.03693   8.185 2.71e-16 ***
## log(dollar + s)  0.32586    0.02365  13.777  < 2e-16 ***
## log(bang + s)    0.40984    0.01597  25.661  < 2e-16 ***
## log(money + s)   0.34563    0.02800  12.345  < 2e-16 ***
## log(n000 + s)    0.18947    0.02931   6.463 1.02e-10 ***
## log(make + s)   -0.11418    0.02206  -5.177 2.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6170.2  on 4600  degrees of freedom
## Residual deviance: 3245.1  on 4594  degrees of freedom
## AIC: 3259.1
## 
## Number of Fisher Scoring iterations: 6
spam.lm <-  lm(as.numeric(yesno=="y")~ log(crl.tot) + log(dollar+s)
               + log(bang+s) +log(money+s) + log(n000+s) + log(make+s)
               ,data=spam)
summary(spam.lm)
## 
## Call:
## lm(formula = as.numeric(yesno == "y") ~ log(crl.tot) + log(dollar + 
##     s) + log(bang + s) + log(money + s) + log(n000 + s) + log(make + 
##     s), data = spam)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10937 -0.13830 -0.05674  0.15262  1.05619 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.078531   0.034188  31.547  < 2e-16 ***
## log(crl.tot)     0.028611   0.003978   7.193 7.38e-13 ***
## log(dollar + s)  0.054878   0.002934  18.703  < 2e-16 ***
## log(bang + s)    0.064522   0.001919  33.619  < 2e-16 ***
## log(money + s)   0.039776   0.002751  14.457  < 2e-16 ***
## log(n000 + s)    0.018530   0.002815   6.582 5.16e-11 ***
## log(make + s)   -0.017380   0.002370  -7.335 2.61e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3391 on 4594 degrees of freedom
## Multiple R-squared:  0.5193, Adjusted R-squared:  0.5186 
## F-statistic: 827.1 on 6 and 4594 DF,  p-value: < 2.2e-16
p <- 6
y <- as.numeric(spam$yesno) - 1
phat <- fitted(spam.glm)
phat.lm <- fitted(spam.lm)

We can use the same code as before to consider different thresholds for prediction.

## [1] 0.5
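
For reference, here is one way the threshold search from the frogs example could be wrapped into a small helper and reused here; the function name bplusc.curve and its arguments are just for illustration, and only the \(b + c\) criterion is shown.

# Sketch: b + c (false positives + false negatives) over a grid of thresholds,
# mirroring the loop used for the frogs data
bplusc.curve <- function(y, phat, thres.vec) {
  sapply(thres.vec, function(thres) {
    fp <- sum((!y) & (phat >= thres))  # b: actual 0, predicted 1
    fn <- sum(y & (phat < thres))      # c: actual 1, predicted 0
    fp + fn
  })
}
thres.vec <- seq(0, 1, by = 0.05)
thres.vec[which.min(bplusc.curve(y, phat, thres.vec))]     # logistic fit
thres.vec[which.min(bplusc.curve(y, phat.lm, thres.vec))]  # linear model fit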