As always, feel free to leave anonymous feedback here: https://goo.gl/forms/fKjLeKItix2Djg5l2
Lecture review, notation/terminology clarification
Relevant reading: Parts of sections 11.3 and 11.4
- Residuals: \(\widehat{e} := y - X \widehat{\beta} = (I - H)y\) where \(H=X (X^\top X)^{-1} X^\top\). Exercises: Under our standard Gaussian model, what is the distribution of \(\widehat{e}\)? What is the distribution of \(\widehat{e}_i\)?
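As a quick sanity check, here is a minimal R sketch (on simulated data, so all variable names below are illustrative) that builds the hat matrix explicitly and verifies \(\widehat{e} = (I - H)y\) against lm():

```r
set.seed(1)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n)         # p = 2 predictors
y <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)                 # n x (p+1) design matrix
H <- X %*% solve(crossprod(X), t(X))   # hat matrix H = X (X'X)^{-1} X'
e_hat <- drop((diag(n) - H) %*% y)     # residuals via (I - H) y
all.equal(e_hat, resid(fit), check.attributes = FALSE)  # should be TRUE
```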
- Standardized residuals: the \(i\)th standardized residual is \[r_i := \frac{\widehat{e}_i}{\widehat{\sigma} \sqrt{1 - h_i}}\] (denoted as \(E'_i\) in the textbook). Exercises: What is the intuition behind why we might define this quantity this way? What does it measure? What is a drawback of this quantity?
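Continuing the sketch above, the standardized residuals computed from the definition match R's rstandard():

```r
p <- 2                                        # predictors, so p + 1 = 3 parameters
h <- hatvalues(fit)                           # leverages h_i = H[i, i]
sigma_hat <- sqrt(sum(resid(fit)^2) / (n - p - 1))  # sqrt(RSS / (n - p - 1))
r <- resid(fit) / (sigma_hat * sqrt(1 - h))   # standardized residuals r_i
all.equal(r, rstandard(fit))                  # should be TRUE
```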
- Predicted residuals: the \(i\)th predicted residual is \[\widehat{e}_{[i]} := y_i - x_i^\top \widehat{\beta}_{[i]},\] where \(\widehat{\beta}_{[i]}\) is the least squares coefficient vector fit on the reduced dataset \(y_{[i]}\) and \(X_{[i]}\), obtained by deleting the \(i\)th entry of \(y\) and the \(i\)th row of \(X\) respectively (effectively, removing the \(i\)th observation from the dataset). Some related facts:
- The Woodbury matrix formula allows us to write down an expression for \(\widehat{\beta}_{[i]}\) in terms of quantities in the original regression, which avoids having to do a regression for every possible leave-one-out reduced dataset. \[\widehat{\beta}_{[i]} = \widehat{\beta} - \frac{\widehat{e}_i}{1 - h_i} (X^\top X)^{-1} x_i \tag{1}\]
- This in turn leads to a formula for the predicted residual that again avoids using quantities from the regressions on the reduced datasets. \[\widehat{e}_{[i]} = \frac{\widehat{e}_i}{1 - h_i}.\] Exercise: What is the variance of \(\widehat{e}_{[i]}\)?
- Using equation (1) above along with some tedious computation, one can also prove the following useful expression that appears in your lecture notes: \[\text{RSS}_{[i]} = \text{RSS} - \frac{\widehat{e}_i^2}{1 - h_i},\] where \(\text{RSS}_{[i]}\) is the RSS from the fit on the reduced dataset. (All three of these leave-one-out identities are checked numerically in the sketch below.)
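Continuing the illustrative fit from above, each shortcut can be compared against an explicit refit with observation \(i\) deleted:

```r
i <- 7                                 # an arbitrary observation to delete
fit_i <- lm(y ~ x1 + x2, subset = -i)  # explicit leave-one-out refit

# Equation (1): beta_hat_{[i]} without refitting
beta_i <- coef(fit) - resid(fit)[i] / (1 - h[i]) * solve(crossprod(X)) %*% X[i, ]
all.equal(drop(beta_i), coef(fit_i), check.attributes = FALSE)   # TRUE

# Predicted residual shortcut: e_{[i]} = e_i / (1 - h_i)
e_pred <- unname(resid(fit)[i] / (1 - h[i]))
all.equal(e_pred, y[i] - sum(X[i, ] * coef(fit_i)))              # TRUE

# RSS_{[i]} = RSS - e_i^2 / (1 - h_i)
RSS_i <- sum(resid(fit)^2) - unname(resid(fit)[i]^2 / (1 - h[i]))
all.equal(RSS_i, sum(resid(fit_i)^2))                            # TRUE
```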
- Standardized predicted residuals: \[t_i = \frac{\widehat{e}_{[i]} \sqrt{1-h_i}}{\sqrt{\text{RSS}_{[i]} / (n - p - 2)}}\]
- Your textbook and R call these studentized residuals (the lecture notes instead call them externally studentized residuals, reserving "studentized residuals" for the standardized residuals above); the textbook uses the notation \(E^*_i\) and gives a seemingly different formula: \[E^*_i := \frac{\widehat{e}_i}{\widehat{\sigma}_{[i]} \sqrt{1-h_i}},\] where \(\widehat{\sigma}_{[i]} := \sqrt{\text{RSS}_{[i]} / (n - p - 2)}\) is the estimate of \(\sigma\) from the fit on the reduced dataset. Exercise: Show that indeed \(E^*_i = t_i\).
- As its name suggests, \(t_i\) follows a \(t\)-distribution with \(n-p-2\) degrees of freedom. To see this, note that \(\widehat{e}_{[i]} = e_i + x_i^\top (\beta - \widehat{\beta}_{[i]})\) depends only on the noise \(e_i\) and on \(\widehat{\beta}_{[i]}\), both of which are independent of \(\widehat{\sigma}_{[i]}\): the former because observation \(i\) does not appear in the reduced dataset, and the latter because the least squares coefficients are independent of the RSS in any Gaussian regression. Together with the variance computation from the earlier exercise, this exhibits \(t_i\) as a standard normal divided by the square root of an independent \(\chi^2_{n-p-2} / (n-p-2)\).
- Your textbook and your homework also provide the following relationship between the standardized predicted residuals and the standardized residuals. \[t_i = r_i \sqrt{\frac{n - p - 2}{n - p - 1 - r_i^2}}\]
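On the same illustrative fit, the three expressions for \(t_i\) above agree with one another and with R's rstudent():

```r
e_pred_all <- resid(fit) / (1 - h)                      # all predicted residuals
RSS_loo <- sum(resid(fit)^2) - resid(fit)^2 / (1 - h)   # all RSS_{[i]}
sigma_loo <- sqrt(RSS_loo / (n - p - 2))                # all sigma_hat_{[i]}

t1 <- e_pred_all * sqrt(1 - h) / sigma_loo              # definition of t_i
t2 <- resid(fit) / (sigma_loo * sqrt(1 - h))            # textbook's E*_i
t3 <- r * sqrt((n - p - 2) / (n - p - 1 - r^2))         # via standardized residuals
all.equal(t1, rstudent(fit))                            # should be TRUE
all.equal(t1, t2); all.equal(t2, t3)                    # both TRUE
```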
- Cook’s distance: measuring influence by how much \(\widehat{\beta}\) differs from \(\widehat{\beta}_{[i]}\)
- In lecture: perhaps try \(\|\widehat{\beta} - \widehat{\beta}_{[i]}\|^2\)? But this ignores the correlation structure of \(\widehat{\beta}\). Correct this by using Mahalanobis distance instead.
- Equivalent way of understanding the “derivation” of Cook’s distance: inspiration from the \(F\)-statistic. Suppose we want to test the “hypothesis” \[\beta = \widehat{\beta}_{[i]}.\] This can be written in the form \(L \beta = c\) with \(L = I_{p+1}\) and \(c = \widehat{\beta}_{[i]}\), so if we use our long formula for the \(F\)-statistic for the general linear hypothesis \(L \beta = c\), we get \[\frac{(L\widehat{\beta} - c)^\top [L(X^\top X)^{-1} L^\top]^{-1} (L \widehat{\beta} - c)\; /\; (p+1)}{\widehat{\sigma}^2} = \frac{(\widehat{\beta} - \widehat{\beta}_{[i]})^\top (X^\top X) (\widehat{\beta} - \widehat{\beta}_{[i]})}{(p+1) \widehat{\sigma}^2} =: C_i,\] which is the quantity from lecture. Exercise: Why is this not really an \(F\)-statistic?
- The textbook and your lecture notes offer another formula for Cook’s distance. \[C_i = \frac{r_i^2}{p+1} \cdot \frac{h_i}{1 - h_i}.\] Exercise: show that both formulas are the same.
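Both formulas for \(C_i\) agree with R's cooks.distance() on the illustrative fit:

```r
# Mahalanobis form, using equation (1) for beta_hat - beta_hat_{[i]}
C1 <- sapply(seq_len(n), function(i) {
  d <- resid(fit)[i] / (1 - h[i]) * solve(crossprod(X)) %*% X[i, ]
  drop(t(d) %*% crossprod(X) %*% d) / ((p + 1) * sigma_hat^2)
})
C2 <- unname(r^2 / (p + 1) * h / (1 - h))     # leverage form
all.equal(C1, C2)                             # should be TRUE
all.equal(C1, unname(cooks.distance(fit)))    # should be TRUE
```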
For #3 on the homework, you may use lm() to compute familiar quantities (residuals, fitted values), but please use them to manually compute the new quantities (std. residuals, pred. residuals, std. pred. residuals, Cook’s distance). You may use R functions to check that your computations are correct.