library(faraway)
library(car)
library(ggplot2)
n <- 50
for(i in 1:9) {x <- runif(n) ; print(qplot(x, rnorm(n)) + xlab("") + ylab(""))}
n <- 50
for(i in 1:9) {x <- runif(n) ; print(qplot(x, x * rnorm(n)) + xlab("") + ylab(""))}
n <- 50
for(i in 1:9) {x <- runif(n) ; print(qplot(x, sqrt(x) * rnorm(n)) + xlab("") + ylab(""))}
n <- 50
for(i in 1:9) {x <- runif(n, min=-1, max=1) ; print(qplot(x, cos(x*pi/25)+rnorm(n,sd=0.005)) + xlab("") + ylab(""))}
The delta method yields \[\text{Var}(h(y)) \approx h'(\mathbb{E} y)^2 \text{Var}(y).\] This suggests choosing \(h\) to satisfy \[h'(\mathbb{E} y) \propto \frac{1}{\sqrt{\text{Var}(y)}} = \frac{1}{\text{SD}(y)}.\]
If \(\text{SD}(y) \propto \mathbb{E} y\), what is a variance stabilizing \(h\)?
If \(\text{SD}(y) \propto (\mathbb{E} y)^b\) for \(b \ne 1\), what is a variance stabilizing \(h\)? If we were to make a plot of log spread (log SD) vs. log mean, what would be the slope of the fitted line?
data(gala)
# help(gala)
head(gala)
## Species Endemics Area Elevation Nearest Scruz Adjacent
## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
mod <- lm(Species ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)
qplot(fitted(mod), resid(mod)) +
geom_hline(yintercept=0, linetype=2) +
xlab("fitted values") +
ylab("residuals") +
ggtitle("Residuals vs. fitted; orig. model")
qplot(fitted(mod), rstudent(mod)) +
geom_hline(yintercept=0, linetype=2) +
xlab("fitted values") +
ylab("studentized residuals") +
ggtitle("Stud. residuals vs. fitted; orig. model")
mod.sqrt <- lm(sqrt(Species) ~ Area + Elevation + Scruz + Nearest + Adjacent, gala)
qplot(fitted(mod.sqrt), resid(mod.sqrt)) +
geom_hline(yintercept=0, linetype=2) +
xlab("fitted values") +
ylab("residuals") +
ggtitle("Residuals vs. fitted; sqrt model")
qplot(fitted(mod.sqrt), rstudent(mod.sqrt)) +
geom_hline(yintercept=0, linetype=2) +
xlab("fitted values") +
ylab("studentized residuals") +
ggtitle("Stud. residuals vs. fitted; sqrt model")
From Fox Section 12.2.
data(SLID)
# help(SLID)
head(SLID)
## wages education age sex language
## 1 10.56 15.0 40 Male English
## 2 11.00 13.2 19 Male English
## 3 NA 16.0 49 Male Other
## 4 17.76 14.0 46 Male Other
## 5 NA 8.0 71 Male English
## 6 14.00 16.0 50 Female English
SLID <- na.omit(SLID)
mod <- lm(wages ~ sex + age + education, data=SLID)
qplot(fitted(mod), rstudent(mod)) +
geom_hline(yintercept=0, linetype=2) +
geom_smooth(method='loess', se=F, method.args=list(degree=1)) +
xlab("fitted values") +
ylab("studentized residuals") +
ggtitle("Residuals vs. fitted; orig. model")
# omit a few points with negative fitted values
idx <- which(fitted(mod) > 0)
logval <- log(fitted(mod)[idx])
logspread <- log(abs(rstudent(mod)[idx]))
b <- coef(lm(logspread ~ logval))[2]
b
## logval
## 0.9517869
qplot(logval, logspread) +
geom_smooth(method='lm', se=F) +
xlab("log fitted values") +
ylab("log abs. studentized residuals") +
ggtitle("log stud. res. vs. log fitted; sqrt model")
mod.log <- lm(log(wages) ~ sex + age + education, data=SLID)
qplot(fitted(mod.log), rstudent(mod.log)) +
geom_hline(yintercept=0, linetype=2) +
geom_smooth(method='loess', se=F, method.args=list(degree=1)) +
xlab("fitted values") +
ylab("studentized residuals") +
ggtitle("Stud. residuals vs. fitted; log model")
Transforming the response variable may introduce (or fix) nonlinearity issues between the response variable and the explanatory variables.
n = 50
for(i in 1:3) qqnorm(rnorm(n))
for(i in 1:3) qqnorm(exp(rnorm(n)))
for(i in 1:3) qqnorm(rt(n, 2))
for(i in 1:3) qqnorm(runif(n))
From your lecture notes, one type of Q-Q plot that is recommended is sorted studentized residual vs. \(\frac{1}{n+1}, \frac{2}{n+1}, \ldots, \frac{n}{n+1}\) quantiles of a \(t_{n - p - 2}\) distribution.
From Fox Section 12.1.
mod <- lm(wages ~ sex + age + education, data=SLID)
n <- dim(SLID)[1]
res.df <- df.residual(mod)
tq <- qt((1:n) / (n+1), res.df)
qplot(tq, sort(rstudent(mod))) + geom_abline(slope=1)
mod.log <- lm(log(wages) ~ sex + age + education, data=SLID)
qplot(tq, sort(rstudent(mod.log))) + geom_abline(slope=1)
mod.cuberoot <- lm(I(wages^(1/3)) ~ sex + age + education, data=SLID)
qplot(tq, sort(rstudent(mod.cuberoot))) + geom_abline(slope=1)
From Faraway textbook:
When non-normality is found, the resolution depends on the type of problem found. For short-tailed distributions, the consequences of non-normality are not serious and can reasonably be ignored. For skewed errors, a transformation of the response may solve the problem. For long-tailed errors, we might just accept the non-normality and base the inference on the assumption of another distribution or use resampling methods such as the bootstrap or permutation tests. Alternatively, use robust methods, which give less weight to outlying observations but may again require resampling for the inference.
data(savings)
mod <- lm(sr ~ ., data=savings)
summary(mod)
##
## Call:
## lm(formula = sr ~ ., data = savings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2422 -2.6857 -0.2488 2.4280 9.7509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
## pop15 -0.4611931 0.1446422 -3.189 0.002603 **
## pop75 -1.6914977 1.0835989 -1.561 0.125530
## dpi -0.0003369 0.0009311 -0.362 0.719173
## ddpi 0.4096949 0.1961971 2.088 0.042471 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797
## F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904
# avPlots(mod)
crPlots(mod)
# crPlot(mod, variable='pop15', smoother=loessLine)
qplot(savings$pop15, resid(mod) + coef(mod)['pop15'] * savings$pop15) +
geom_smooth(method='lm', se=F, col='red', linetype=2) +
geom_smooth(method='loess', se=F, method.args=list(degree=1), col='green') +
xlab("pop15") + ylab("component plus residual")
mod.sub1 <- lm(sr ~ ., data=savings, subset=(pop15 < 35))
mod.sub2 <- lm(sr ~ ., data=savings, subset=(pop15 > 35))
summary(mod)
##
## Call:
## lm(formula = sr ~ ., data = savings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2422 -2.6857 -0.2488 2.4280 9.7509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
## pop15 -0.4611931 0.1446422 -3.189 0.002603 **
## pop75 -1.6914977 1.0835989 -1.561 0.125530
## dpi -0.0003369 0.0009311 -0.362 0.719173
## ddpi 0.4096949 0.1961971 2.088 0.042471 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797
## F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904
summary(mod.sub1)
##
## Call:
## lm(formula = sr ~ ., data = savings, subset = (pop15 < 35))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5890 -1.5015 0.1165 1.8857 5.1466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9617950 8.0837502 2.964 0.00716 **
## pop15 -0.3858976 0.1953686 -1.975 0.06092 .
## pop75 -1.3277421 0.9260627 -1.434 0.16570
## dpi -0.0004588 0.0007237 -0.634 0.53264
## ddpi 0.8843944 0.2953405 2.994 0.00668 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.772 on 22 degrees of freedom
## Multiple R-squared: 0.5073, Adjusted R-squared: 0.4177
## F-statistic: 5.663 on 4 and 22 DF, p-value: 0.002734
summary(mod.sub2)
##
## Call:
## lm(formula = sr ~ ., data = savings, subset = (pop15 > 35))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5511 -3.5101 0.0443 2.6764 8.4983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4339689 21.1550278 -0.115 0.910
## pop15 0.2738537 0.4391910 0.624 0.541
## pop75 -3.5484769 3.0332806 -1.170 0.257
## dpi 0.0004208 0.0050001 0.084 0.934
## ddpi 0.3954742 0.2901012 1.363 0.190
##
## Residual standard error: 4.454 on 18 degrees of freedom
## Multiple R-squared: 0.1558, Adjusted R-squared: -0.03185
## F-statistic: 0.8302 on 4 and 18 DF, p-value: 0.5233
Read Fox Chapter 4.
Key topics:
From Fox Section 11.7.
Cautionary tale from Faraway textbook:
It is dangerous to exclude outliers in an automatic manner. National Aeronautics and Space Administation (NASA) launched the Nimbus 7 satellite to record atmospheric information. After several years of operation in 1985, the British Antarctic Survey observed a large decrease in atmospheric ozone over the Antarctic. On further examination of the NASA data, it was found that the data processing program automatically discarded observations that were extremely low and assumed to be mistakes. Thus the discovery of the Antarctic ozone hole was delayed several years. Perhaps, if this had been known earlier, the chlorofluorocarbon (CFC) phaseout would have been agreed upon earlier and the damage could have been limited. See Stolarski et al. (1986) for more.