library(faraway)
library(car)
library(ggplot2)

Reading artifical residual plots

n <- 50
for(i in 1:9) {x <- runif(n) ; print(qplot(x, rnorm(n)) + xlab("") + ylab(""))}

n <- 50
for(i in 1:9) {x <- runif(n) ; print(qplot(x, x * rnorm(n)) + xlab("") + ylab(""))}

n <- 50
for(i in 1:9) {x <- runif(n) ; print(qplot(x, sqrt(x) * rnorm(n)) + xlab("") + ylab(""))}

n <- 50
for(i in 1:9) {x <- runif(n, min=-1, max=1) ; print(qplot(x, cos(x*pi/25)+rnorm(n,sd=0.005)) + xlab("") + ylab(""))}

Non-constant variance

Variance-stabilizing transformations

The delta method yields \[\text{Var}(h(y)) \approx h'(\mathbb{E} y)^2 \text{Var}(y).\] This suggests choosing \(h\) to satisfy \[h'(\mathbb{E} y) \propto \frac{1}{\sqrt{\text{Var}(y)}} = \frac{1}{\text{SD}(y)}.\]

If \(\text{SD}(y) \propto \mathbb{E} y\), what is a variance stabilizing \(h\)?

If \(\text{SD}(y) \propto (\mathbb{E} y)^b\) for \(b \ne 1\), what is a variance stabilizing \(h\)? If we were to make a plot of log spread (log SD) vs. log mean, what would be the slope of the fitted line?

Species diversity example (counts)

data(gala)
# help(gala)
head(gala)
##              Species Endemics  Area Elevation Nearest Scruz Adjacent
## Baltra            58       23 25.09       346     0.6   0.6     1.84
## Bartolome         31       21  1.24       109     0.6  26.3   572.33
## Caldwell           3        3  0.21       114     2.8  58.7     0.78
## Champion          25        9  0.10        46     1.9  47.4     0.18
## Coamano            2        1  0.05        77     1.9   1.9   903.82
## Daphne.Major      18       11  0.34       119     8.0   8.0     1.84
mod <- lm(Species ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)

qplot(fitted(mod), resid(mod)) +
  geom_hline(yintercept=0, linetype=2) +
  xlab("fitted values") +
  ylab("residuals") +
  ggtitle("Residuals vs. fitted; orig. model")

qplot(fitted(mod), rstudent(mod)) +
  geom_hline(yintercept=0, linetype=2) +
  xlab("fitted values") +
  ylab("studentized residuals") +
  ggtitle("Stud. residuals vs. fitted; orig. model")

mod.sqrt <- lm(sqrt(Species) ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)

qplot(fitted(mod.sqrt), resid(mod.sqrt)) +
  geom_hline(yintercept=0, linetype=2) +
  xlab("fitted values") +
  ylab("residuals") +
  ggtitle("Residuals vs. fitted; sqrt model")

qplot(fitted(mod.sqrt), rstudent(mod.sqrt)) +
  geom_hline(yintercept=0, linetype=2) +
  xlab("fitted values") +
  ylab("studentized residuals") +
  ggtitle("Stud. residuals vs. fitted; sqrt model")

SLID example

From Fox Section 12.2.

data(SLID)
# help(SLID)
head(SLID)
##   wages education age    sex language
## 1 10.56      15.0  40   Male  English
## 2 11.00      13.2  19   Male  English
## 3    NA      16.0  49   Male    Other
## 4 17.76      14.0  46   Male    Other
## 5    NA       8.0  71   Male  English
## 6 14.00      16.0  50 Female  English
SLID <- na.omit(SLID)

mod <- lm(wages ~ sex + age + education, data=SLID)
qplot(fitted(mod), rstudent(mod)) +
  geom_hline(yintercept=0, linetype=2) +
  geom_smooth(method='loess', se=F, method.args=list(degree=1)) +
  xlab("fitted values") +
  ylab("studentized residuals") +
  ggtitle("Residuals vs. fitted; orig. model")

# omit a few points with negative fitted values
idx <- which(fitted(mod) > 0)
logval <- log(fitted(mod)[idx])
logspread <- log(abs(rstudent(mod)[idx]))
b <- coef(lm(logspread ~ logval))[2]
b
##    logval 
## 0.9517869
qplot(logval, logspread) +
  geom_smooth(method='lm', se=F) +
  xlab("log fitted values") +
  ylab("log abs. studentized residuals") +
  ggtitle("log stud. res. vs. log fitted; sqrt model")


mod.log <- lm(log(wages) ~ sex + age + education, data=SLID)
qplot(fitted(mod.log), rstudent(mod.log)) +
  geom_hline(yintercept=0, linetype=2) +
  geom_smooth(method='loess', se=F, method.args=list(degree=1)) +
  xlab("fitted values") +
  ylab("studentized residuals") +
  ggtitle("Stud. residuals vs. fitted; log model")

Transforming the response variable may introduce (or fix) nonlinearity issues between the response variable and the explanatory variables.

Normality

Reading Q-Q plots

n = 50
for(i in 1:3) qqnorm(rnorm(n))
for(i in 1:3) qqnorm(exp(rnorm(n)))
for(i in 1:3) qqnorm(rt(n, 2))
for(i in 1:3) qqnorm(runif(n))

SLID example

From your lecture notes, one type of Q-Q plot that is recommended is sorted studentized residual vs. \(\frac{1}{n+1}, \frac{2}{n+1}, \ldots, \frac{n}{n+1}\) quantiles of a \(t_{n - p - 2}\) distribution.

From Fox Section 12.1.

mod <- lm(wages ~ sex + age + education, data=SLID)
n <- dim(SLID)[1]
res.df <- df.residual(mod)
tq <- qt((1:n) / (n+1), res.df)

qplot(tq, sort(rstudent(mod))) + geom_abline(slope=1)

mod.log <- lm(log(wages) ~ sex + age + education, data=SLID)
qplot(tq, sort(rstudent(mod.log))) + geom_abline(slope=1)

mod.cuberoot <- lm(I(wages^(1/3)) ~ sex + age + education, data=SLID)
qplot(tq, sort(rstudent(mod.cuberoot))) + geom_abline(slope=1)

From Faraway textbook:

When non-normality is found, the resolution depends on the type of problem found. For short-tailed distributions, the consequences of non-normality are not serious and can reasonably be ignored. For skewed errors, a transformation of the response may solve the problem. For long-tailed errors, we might just accept the non-normality and base the inference on the assumption of another distribution or use resampling methods such as the bootstrap or permutation tests. Alternatively, use robust methods, which give less weight to outlying observations but may again require resampling for the inference.

Discovering groups in diagnostic plots

data(savings)
mod <- lm(sr ~ ., data=savings)
summary(mod)
## 
## Call:
## lm(formula = sr ~ ., data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2422 -2.6857 -0.2488  2.4280  9.7509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
## pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
## pop75       -1.6914977  1.0835989  -1.561 0.125530    
## dpi         -0.0003369  0.0009311  -0.362 0.719173    
## ddpi         0.4096949  0.1961971   2.088 0.042471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.2797 
## F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904
# avPlots(mod)
crPlots(mod)

# crPlot(mod, variable='pop15', smoother=loessLine)
qplot(savings$pop15, resid(mod) + coef(mod)['pop15'] * savings$pop15) +
  geom_smooth(method='lm', se=F, col='red', linetype=2) +
  geom_smooth(method='loess', se=F, method.args=list(degree=1), col='green') +
  xlab("pop15") + ylab("component plus residual")

mod.sub1 <- lm(sr ~ ., data=savings, subset=(pop15 < 35))
mod.sub2 <- lm(sr ~ ., data=savings, subset=(pop15 > 35))
summary(mod)
## 
## Call:
## lm(formula = sr ~ ., data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2422 -2.6857 -0.2488  2.4280  9.7509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
## pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
## pop75       -1.6914977  1.0835989  -1.561 0.125530    
## dpi         -0.0003369  0.0009311  -0.362 0.719173    
## ddpi         0.4096949  0.1961971   2.088 0.042471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.2797 
## F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904
summary(mod.sub1)
## 
## Call:
## lm(formula = sr ~ ., data = savings, subset = (pop15 < 35))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5890 -1.5015  0.1165  1.8857  5.1466 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 23.9617950  8.0837502   2.964  0.00716 **
## pop15       -0.3858976  0.1953686  -1.975  0.06092 . 
## pop75       -1.3277421  0.9260627  -1.434  0.16570   
## dpi         -0.0004588  0.0007237  -0.634  0.53264   
## ddpi         0.8843944  0.2953405   2.994  0.00668 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.772 on 22 degrees of freedom
## Multiple R-squared:  0.5073, Adjusted R-squared:  0.4177 
## F-statistic: 5.663 on 4 and 22 DF,  p-value: 0.002734
summary(mod.sub2)
## 
## Call:
## lm(formula = sr ~ ., data = savings, subset = (pop15 > 35))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5511 -3.5101  0.0443  2.6764  8.4983 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4339689 21.1550278  -0.115    0.910
## pop15        0.2738537  0.4391910   0.624    0.541
## pop75       -3.5484769  3.0332806  -1.170    0.257
## dpi          0.0004208  0.0050001   0.084    0.934
## ddpi         0.3954742  0.2901012   1.363    0.190
## 
## Residual standard error: 4.454 on 18 degrees of freedom
## Multiple R-squared:  0.1558, Adjusted R-squared:  -0.03185 
## F-statistic: 0.8302 on 4 and 18 DF,  p-value: 0.5233

Transformations

Read Fox Chapter 4.

Key topics:

What to do with outliers (or unusual points)?

From Fox Section 11.7.

Cautionary tale from Faraway textbook:

It is dangerous to exclude outliers in an automatic manner. National Aeronautics and Space Administation (NASA) launched the Nimbus 7 satellite to record atmospheric information. After several years of operation in 1985, the British Antarctic Survey observed a large decrease in atmospheric ozone over the Antarctic. On further examination of the NASA data, it was found that the data processing program automatically discarded observations that were extremely low and assumed to be mistakes. Thus the discovery of the Antarctic ozone hole was delayed several years. Perhaps, if this had been known earlier, the chlorofluorocarbon (CFC) phaseout would have been agreed upon earlier and the damage could have been limited. See Stolarski et al. (1986) for more.