library(car)
data(Duncan)

Review of yesterday’s lecture

Relevant reading: 11.1, 11.2, 11.8.1-2

Hat matrix: \[H := X(X^\top X)^{-1} X^\top\] Fitted values: \[\widehat{y} = Hy\]

Writing out the above yields \[\begin{align} \widehat{y}_1 &= h_{1,1} y_1 + h_{1,2} y_2 + \cdots + h_{1,n} y_n \\ &\vdots \\ \widehat{y}_n &= h_{n,1} y_1 + h_{n,2} y_2 + \cdots + h_{n,n} y_n. \end{align}\]

How much does \(y_2\) contribute to all the fitted values \(\widehat{y}_1, \ldots, \widehat{y}_n\)? One could consider \[h_{12}^2 + h_{22}^2 + h_{32}^2 + \cdots + h_{n2}^2.\] By symmetry of \(H\), this can be viewed as the dot product of the second row of \(H\) with the second column of \(H\), which is the \((2,2)\) entry of \(H^2\). By idempotence of \(H\), we have \(H^2=H\), so the above quantity is simply \(h_2 \equiv h_{2,2}\). More generally, how much \(y_i\) contributes to all the fitted values \(\widehat{y}_1, \ldots, \widehat{y}_n\) can be measured by \[h_i \equiv h_{ii} = h_{1,i}^2 + \cdots + h_{n,i}^2.\] Furthermore, we can go back to the formula \(H = X(X^\top X)^{-1} X^\top\) and note that the \(i\)th diagonal entry can be written as \[h_i = x_i^\top (X^\top X)^{-1} x_i,\tag{1}\] where \(x_i^\top\) is the \(i\)th row of the design matrix. We call this the leverage of the \(i\)th observation. As emphasized in lecture, it depends only on the observed explanatory variables \(x_1, \ldots, x_n\), not on the response.
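A quick numerical check of the identity \(h_i = h_{1,i}^2 + \cdots + h_{n,i}^2\), using a small simulated design matrix (the names Xsim and Hsim are mine, not from lecture):

set.seed(1)
Xsim <- cbind(1, matrix(rnorm(20), nrow=10))   # 10 observations: intercept plus 2 predictors
Hsim <- Xsim %*% solve(t(Xsim) %*% Xsim) %*% t(Xsim)
all.equal(diag(Hsim), colSums(Hsim^2))         # TRUE: each h_i equals the sum of squared entries in column i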

  1. \(h_i\) measures how far \(x_i\) (the explanatory variables for the \(i\)th observation) is from the other observed explanatory variables. (See discussion below.)

  2. \(h_i\) measures how much potential the \(i\)th subject has to influence the fitted line/plane.

Does this quantity (1) look familiar?

Let \(X\) be \(n \times (p+1)\), with an intercept column and full column rank. We now review the following facts from lecture.

A distance by any other name would smell as sweet

Let \[S := \frac{1}{n-1} \sum_{j=1}^n (x_j - \overline{x})(x_j - \overline{x})^\top = \frac{1}{n-1} (X^*)^\top (X^*)\] be the sample covariance matrix of the explanatory variables. Remember, here \(x_j^\top\) denotes the \(j\)th row of \(X\) without the intercept entry (the explanatory variables for the \(j\)th person), and \(\overline{x} := \frac{1}{n} \sum_{i=1}^n x_i\). Note that \(X^*\) is the \(n \times p\) matrix whose columns are the non-intercept columns of \(X\), each centered around its mean.

[If we think of the \(x_i\) as i.i.d. variables coming from some distribution, then the covariance matrix of such a random vector \(x\) is \(\text{Cov}(x) = \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top]\). By comparing this with the definition of \(S\), we see that \(S\) is an estimate of \(\text{Cov}(x)\).]
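As a quick sanity check of the matrix identity above (using the Duncan data already loaded; this check is mine, not part of the notes), cov agrees with \((X^*)^\top (X^*)/(n-1)\):

Xstar0 <- scale(as.matrix(Duncan[, c("income", "education")]), scale=FALSE)  # centered columns
all.equal(crossprod(Xstar0) / (nrow(Duncan) - 1), cov(Duncan[, c("income", "education")]))  # TRUE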

Using \(S\), we can define a norm (“length” or “size”) for any vector \(v \in \mathbb{R}^p\) by \[\sqrt{v^\top S^{-1} v}.\tag{$*$}\] Special case: taking \(S\) to be the identity matrix recovers the usual Euclidean norm.

Consider the following examples for \(S\).

\[ \begin{bmatrix} 1 \\ & 1 \end{bmatrix}, \begin{bmatrix} 2 \\ & 2 \end{bmatrix}, \begin{bmatrix} 1 \\ & 2 \end{bmatrix}, \begin{bmatrix} 2 \\ & 1 \end{bmatrix}, \begin{bmatrix} 1 & 0.1 \\ 0.1 & 1 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \begin{bmatrix} 1 & -0.1 \\ -0.1 & 1 \end{bmatrix}, \begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix} \] The contours of constant distance [from the origin] appear as follows.
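One way to sketch such contours is with the ellipse function from car, which draws the set \(\{v : v^\top S^{-1} v = r^2\}\) for a given shape matrix \(S\) and radius \(r\); the selection of matrices stored in S.list below is my own subset of the examples above.

S.list <- list(diag(2), 2*diag(2), diag(c(1, 2)),
               matrix(c(1, 0.5, 0.5, 1), 2), matrix(c(1, -0.5, -0.5, 1), 2))
plot(0, 0, type="n", xlim=c(-2.5, 2.5), ylim=c(-2.5, 2.5), asp=1, xlab="v1", ylab="v2")
for (k in seq_along(S.list)) {
  # unit-distance contour {v : v^T S^{-1} v = 1} for the k-th example matrix
  ellipse(c(0, 0), shape=S.list[[k]], radius=1, col=k, lty=1, center.cex=0)
}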

The Mahalanobis distance for the \(i\)th observation in the design matrix \(X\) is \[\sqrt{(x_i - \overline{x})^\top S^{-1} (x_i - \overline{x})},\] where \(S\) is as defined above. It measures how far \(x_i\) (the \(i\)th data point) is from the centroid \(\overline{x}\), using the norm defined in (\(*\)) above.

How is this related to the leverage \(h_i = x_i^\top (X^\top X)^{-1} x_i\)? Your lecture notes mention the following equality. \[\underbrace{(x_i - \overline{x})^\top S^{-1} (x_i - \overline{x})}_{\Gamma_i} = (n - 1) (h_i - \frac{1}{n})\] or equivalently \[h_i = \frac{1}{n} + \underbrace{(x_i - \overline{x})^\top [(X^*)^\top (X^*)]^{-1} (x_i - \overline{x})}_{\frac{1}{n-1} \Gamma_i}\] (See page 290 of the textbook.) Again, what does this relationship between [squared] Mahalanobis distance \(\Gamma_i\) and leverage \(h_i\) imply qualitatively?
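A numerical check of this identity, reusing the simulated Xsim and Hsim from the earlier check (stats::mahalanobis returns the squared distance \(\Gamma_i\)):

n.sim <- nrow(Xsim)
Gamma <- mahalanobis(Xsim[, -1], center=colMeans(Xsim[, -1]), cov=cov(Xsim[, -1]))
all.equal(Gamma, (n.sim - 1) * (diag(Hsim) - 1/n.sim))  # TRUE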

Optional proof: If we take the design matrix \(X\) (with intercept column and full column rank) and center the non-intercept columns (i.e. replace them with the columns of \(X^*\)), then the column space does not change, so the hat matrix \(H = X(X^\top X)^{-1} X^\top\) is unchanged. So without loss of generality we may assume \[X = \left[\begin{array}{c|c} 1 & \\ \vdots & X^* \\ 1 & \end{array}\right].\] Then \[X^\top X = \begin{bmatrix} n & \\ & (X^*)^\top (X^*) \end{bmatrix}\] is block diagonal (because each column of \(X^*\) sums to zero), so its inverse can be found by inverting each block. Finally, the \(i\)th row of this centered design matrix is \((1, (x_i - \overline{x})^\top)\), so the \(i\)th diagonal element of \(H\) is \[h_i = \begin{bmatrix}1 & (x_i - \overline{x})^\top\end{bmatrix} \begin{bmatrix} \frac{1}{n} & \\ & [(X^*)^\top (X^*)]^{-1} \end{bmatrix} \begin{bmatrix}1 \\ x_i - \overline{x}\end{bmatrix} = \frac{1}{n} + (x_i - \overline{x})^\top [(X^*)^\top (X^*)]^{-1} (x_i - \overline{x}),\] which is the result.
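The block-diagonal structure is easy to verify numerically (again with the simulated Xsim from above; scale centers the non-intercept columns):

Xc <- cbind(1, scale(Xsim[, -1], scale=FALSE))                     # centered design matrix
round(t(Xc) %*% Xc, 10)                                            # block diagonal: off-diagonal block is zero
all.equal(diag(Xc %*% solve(t(Xc) %*% Xc) %*% t(Xc)), diag(Hsim))  # TRUE: hat values unchanged by centering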

Example: Duncan’s prestige data

Reproducing Figure 11.5 in section 11.2.

head(Duncan)
##            type income education prestige
## accountant prof     62        86       82
## pilot      prof     72        76       83
## architect  prof     75        92       90
## author     prof     55        90       76
## chemist    prof     64        86       90
## minister   prof     21        84       87
mod <- lm(prestige ~ income + education, data=Duncan)
X <- cbind(1, Duncan[,c("income", "education")])
X <- unname(as.matrix(X))
hat(X) # hat function gives leverage values
##  [1] 0.05092832 0.05732001 0.06963699 0.06489441 0.05132528 0.17305816
##  [7] 0.06449273 0.08785184 0.05439356 0.05932285 0.04678919 0.07925452
## [13] 0.07765314 0.07828447 0.08258799 0.19454165 0.04325517 0.04325478
## [19] 0.02628592 0.07224159 0.07975595 0.02412984 0.03132603 0.03270490
## [25] 0.04550181 0.04089141 0.26908963 0.03630630 0.04640845 0.06920206
## [31] 0.04969910 0.08030881 0.06065016 0.06445993 0.05858763 0.04959881
## [37] 0.04844212 0.04785227 0.06565969 0.05370328 0.05695616 0.04746090
## [43] 0.06866938 0.02468161 0.07058120
# manually computing hat values
H <- X %*% solve(t(X) %*% X) %*% t(X)
diag(H)
##  [1] 0.05092832 0.05732001 0.06963699 0.06489441 0.05132528 0.17305816
##  [7] 0.06449273 0.08785184 0.05439356 0.05932285 0.04678919 0.07925452
## [13] 0.07765314 0.07828447 0.08258799 0.19454165 0.04325517 0.04325478
## [19] 0.02628592 0.07224159 0.07975595 0.02412984 0.03132603 0.03270490
## [25] 0.04550181 0.04089141 0.26908963 0.03630630 0.04640845 0.06920206
## [31] 0.04969910 0.08030881 0.06065016 0.06445993 0.05858763 0.04959881
## [37] 0.04844212 0.04785227 0.06565969 0.05370328 0.05695616 0.04746090
## [43] 0.06866938 0.02468161 0.07058120
lev <- diag(H)
n <- dim(Duncan)[1]
p <- 2
plot(1:n, lev)

lev.sorted <- sort(lev, decreasing=T, index.return=T)
lev.sorted$x[1:3]
## [1] 0.2690896 0.1945416 0.1730582
rownames(Duncan)[lev.sorted$ix[1:3]]
## [1] "RR.engineer" "conductor"   "minister"
plot(1:n, lev)
for(i in lev.sorted$ix[1:3]) {
  text(i, lev[i]-0.02, rownames(Duncan)[i])
}
hbar <- (p+1)/n          # average leverage value is (p+1)/n
abline(h=2*hbar, lty=2)  # common rule-of-thumb cutoffs at 2*hbar and 3*hbar
abline(h=3*hbar, lty=2)

plot(Duncan$education, Duncan$income, xlim=c(-20, 120), ylim=c(-40, 120), xlab="education", ylab="income")
i = lev.sorted$ix[1]
text(Duncan$education[i]-20, Duncan$income[i], rownames(Duncan)[i], cex=0.8)
i = lev.sorted$ix[2]
text(Duncan$education[i]-20, Duncan$income[i], rownames(Duncan)[i], cex=0.8)
i = lev.sorted$ix[3]
text(Duncan$education[i]+20, Duncan$income[i], rownames(Duncan)[i], cex=0.8)
centroid <- c(mean(Duncan$education), mean(Duncan$income))
points(centroid[1], centroid[2], pch=15)
Xstar <- scale(X[,2:3], scale=F) # centering the income and education columns
# (X*)^T (X*); reorder rows/columns so the shape matrix matches the (education, income) plot axes
A <- (t(Xstar) %*% Xstar)[2:1, 2:1]
ellipse(centroid, shape=A, radius=sqrt(2*hbar - 1/n), col=1, lty=2, center.cex=0)
ellipse(centroid, shape=A, radius=sqrt(3*hbar - 1/n), col=1, lty=2, center.cex=0)

Added variable plots: residuals vs. residuals

Relevant reading: 11.6.1 (also see the note I uploaded to bCourses before the midterm)

For each subject \(i=1, \ldots, n\) we have a response variable \(y_i\) and explanatory variables \(x_{i1}, \ldots, x_{ip}\). We can do a least squares fit to get coefficients \(\widehat{\beta}_0, \ldots, \widehat{\beta}_p\) that give fitted values \[\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_{i1} + \cdots + \widehat{\beta}_p x_{ip}.\] Suppose instead we do the following three regressions. (Note that here \(x_j\) denotes the \(j\)th column of \(X\), unlike the earlier sections where \(x_i\) denoted a row.)

  a. Regress \(y\) onto all variables except the first (columns \(x_2, \ldots, x_p\)) to get fitted values \(\widetilde{\widehat{y}}\) and residuals \(y - \widetilde{\widehat{y}}\).
  b. Regress the first variable’s column \(x_1\) onto the other variables (columns \(x_2, \ldots, x_p\)) to get fitted values \(\widehat{x}_1\) and residuals \(x_1 - \widehat{x}_1\).
  c. Do a simple regression of the residuals \(y - \widetilde{\widehat{y}}\) onto the residuals \(x_1 - \widehat{x}_1\).

A plot of the last regression is called an added variable plot or a partial regression plot.

Why not just plot \(y\) against \(x_1\)? Such a plot still contains the effects of the other variables, so it is not useful for assessing the relationship between \(x_1\) and \(y\) holding all other variables constant. For example, strong correlation between \(x_1\) and the other variables may cause a \(y\) vs. \(x_1\) plot to show a negative trend, even if there is a true linear model with \(\beta_1 > 0\); see the small simulation below. [See this stackexchange answer for more detail.]
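A tiny simulation (mine, not from the text) illustrating the point: the true \(\beta_1\) is positive, but because \(x_1\) is strongly correlated with \(x_2\) and \(x_2\) has a large negative coefficient, the marginal regression of \(y\) on \(x_1\) alone has a negative slope.

set.seed(2)
x2 <- rnorm(200)
x1 <- x2 + rnorm(200, sd=0.3)           # x1 strongly correlated with x2
y  <- 1 + 0.5*x1 - 2*x2 + rnorm(200)    # true beta1 = 0.5 > 0
coef(lm(y ~ x1))["x1"]                  # negative: the marginal slope is misleading
coef(lm(y ~ x1 + x2))["x1"]             # close to the true 0.5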

The regressions (a) and (b) try to extract the parts of \(y\) and \(x_1\) that are not explained by the other variables \(x_2, \ldots, x_p\), and then (c) looks at the relationship between \(y\) and \(x_1\) after the effect of the other variables has been removed.

We have the following interesting properties.

  1. The slope from the simple regression (c) is precisely \(\widehat{\beta}_1\) from the original multiple regression. [This simple regression also has zero intercept; see the numerical check after this list.]

  2. The residuals from the simple regression (c) are the same as the residuals from the original multiple regression.

  3. Added variable plots are useful for detecting points of high leverage and/or high influence on the corresponding coefficient.

  4. Since the residuals are the same, the plot may also be good for identifying violations of model assumptions (heteroscedasticity, nonlinearity, etc.).
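A quick numerical check of properties 1 and 2, carrying out regressions (a)-(c) by hand for the income coefficient in the Duncan fit mod from above:

res.y <- residuals(lm(prestige ~ education, data=Duncan))     # (a): prestige adjusted for education
res.x <- residuals(lm(income ~ education, data=Duncan))       # (b): income adjusted for education
av.fit <- lm(res.y ~ res.x)                                   # (c): residuals on residuals
coef(av.fit)                                                  # slope matches coef(mod)["income"]; intercept is ~0
all.equal(unname(residuals(av.fit)), unname(residuals(mod)))  # TRUE: same residuals as the full fit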

avPlot(mod, "income", id.method="x", id.n=3)
avPlot(mod, "education", id.method="x", id.n=3)

How to interpret high leverage points on such plots?

coef(mod)
## (Intercept)      income   education 
##  -6.0646629   0.5987328   0.5458339
rownames(Duncan)[lev.sorted$ix[1:3]]
## [1] "RR.engineer" "conductor"   "minister"
Duncan_del <- Duncan[-lev.sorted$ix[2:3],]  # drop the 2nd and 3rd highest-leverage points (conductor, minister)
mod_del <- lm(prestige ~ income + education, data=Duncan_del)
coef(mod_del)
## (Intercept)      income   education 
##  -6.4089856   0.8673986   0.3322408
avPlot(mod_del, "income", id.method="x", id.n=1)
avPlot(mod_del, "education", id.method="x", id.n=1)

Component plus residual plots

Relevant reading: 12.3

Residual plots are not enough to identify the type of nonlinearity. Plot (a) shows a monotone nonlinearity that might be fixed by using \(Y = \alpha + \beta \sqrt{X} + \epsilon\), while the nonlinearity in plot (b) might be fixed by \(Y = \alpha + \beta_1 X + \beta_2 X^2 + \epsilon\), but both have the same residual plot.

Added variable plots can also show nonlinearity, but they are not useful for locating a transformation to correct it: the plot shows \(X_j\) adjusted for the other variables, whereas it is the unadjusted \(X_j\) that would be transformed.

Component plus residual plots (a.k.a. partial residual plots) offer a useful alternative. (But they are not as suitable for analyzing leverage and influence on coefficients.)

We define the partial residuals for the \(j\)th explanatory variable by \[\begin{align} e^{(j)}_i &= e_i + \widehat{\beta}_j x_{ij} \\ &= [y_i - (\widehat{\beta}_0 + \widehat{\beta}_1 x_{i1} + \cdots + \widehat{\beta}_p x_{ip})] + \widehat{\beta}_j x_{ij} \end{align}\]

(Add back the linear component)

The component plus residual plot is the plot of \(e^{(j)}\) vs \(x_j\).

data("SLID")
SLID <- na.omit(SLID)
mod <- lm(log(wages) ~ sex + age + education, data=SLID)
coef(mod)
## (Intercept)     sexMale         age   education 
##  1.12018663  0.22425674  0.01765090  0.05493486
summary(mod)$r.squared
## [1] 0.3094527
crPlot(mod, variable="age")
crPlot(mod, variable="education")

The shape of a component plus residual plot can suggest how to transform the variable.

mod2 <- lm(log(wages) ~ sex + age + I(age^2) + I(education^2), data=SLID)
coef(mod2)
##    (Intercept)        sexMale            age       I(age^2) I(education^2) 
##   0.3898309249   0.2215444630   0.0834579140  -0.0008578935   0.0017642484
summary(mod2)$r.squared
## [1] 0.3837784
#crPlot(mod2, variable="age")
crPlot(mod2, variable="I(education^2)")