library(dplyr)
library(car)

One-way ANOVA

(Read section 8.1 of the textbook.)

data(tips, package="reshape2")
head(tips)
##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4
levels(tips$day)
## [1] "Fri"  "Sat"  "Sun"  "Thur"
Model:

\[\begin{align} \text{tip}_{i, \text{Fri}} &= \mu_{\text{Fri}} + \epsilon_{i, \text{Fri}} \\ \text{tip}_{i, \text{Sat}} &= \mu_{\text{Sat}} + \epsilon_{i, \text{Sat}} \\ \text{tip}_{i, \text{Sun}} &= \mu_{\text{Sun}} + \epsilon_{i, \text{Sun}} \\ \text{tip}_{i, \text{Thu}} &= \mu_{\text{Thu}} + \epsilon_{i, \text{Thu}} \end{align}\]

Recall that when calling lm(), R does not actually use this encoding. Rather, it “drops” the dummy variable for the first level (Friday), so the model looks like:

\[\begin{align} \text{tip}_{i, \text{Fri}} &= \alpha \phantom{{}+\beta_{\text{Sat}}{}} + \epsilon_{i, \text{Fri}} \\ \text{tip}_{i, \text{Sat}} &= \alpha + \beta_{\text{Sat}} + \epsilon_{i, \text{Sat}} \\ \text{tip}_{i, \text{Sun}} &= \alpha + \beta_{\text{Sun}} + \epsilon_{i, \text{Sun}} \\ \text{tip}_{i, \text{Thu}} &= \alpha + \beta_{\text{Thu}} + \epsilon_{i, \text{Thu}} \end{align}\]

In other words, \(\alpha = \mu_{\text{Fri}}\), \(\beta_{\text{Sat}} = \mu_{\text{Sat}} - \mu_{\text{Fri}}\), \(\beta_{\text{Sun}} = \mu_{\text{Sun}} - \mu_{\text{Fri}}\), and \(\beta_{\text{Thu}} = \mu_{\text{Thu}} - \mu_{\text{Fri}}\).
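
We can verify these identities directly by comparing the per-day mean tips to the lm() coefficients (a quick check using base R's tapply()):

mu <- tapply(tips$tip, tips$day, mean)  # per-day mean tips
mu - mu["Fri"]                          # differences from the Friday mean
coef(lm(tip ~ day, data=tips))          # intercept = mu["Fri"]; the rest match the differences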

Both of these are different from another encoding given in your lecture notes (see also page 158 of the textbook).

\[\begin{align} \text{tip}_{i, \text{Fri}} &= \mu + \tau_{\text{Fri}} + \epsilon_{i, \text{Fri}} \\ \text{tip}_{i, \text{Sat}} &= \mu + \tau_{\text{Sat}} + \epsilon_{i, \text{Sat}} \\ \text{tip}_{i, \text{Sun}} &= \mu + \tau_{\text{Sun}} + \epsilon_{i, \text{Sun}} \\ \text{tip}_{i, \text{Thu}} &= \mu - (\tau_{\text{Fri}} + \tau_{\text{Sat}} + \tau_{\text{Sun}}) + \epsilon_{i, \text{Thu}} \end{align}\]

(Think of \(\tau_{\text{Thu}} := -(\tau_{\text{Fri}} + \tau_{\text{Sat}} + \tau_{\text{Sun}})\) and note the constraint that the \(\tau\) coefficients sum to zero.)

However, all encodings give the same fitted values \(\widehat{\text{tip}}\). (Exercise: why?) We are only going to deal with quantities like \(\text{RSS}\) and \(\text{RegSS}\) today, so the choice of encoding will not matter.
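
For instance, the sum-to-zero encoding can be requested through contr.sum contrasts (a sketch; R's contr.sum expresses the last level, here Thur, through the others, matching the display above). The coefficients change, but the fitted values do not:

mod_treat <- lm(tip ~ day, data=tips)                                 # default drop-first encoding
mod_sum <- lm(tip ~ day, data=tips, contrasts=list(day="contr.sum"))  # sum-to-zero encoding
coef(mod_sum)                                  # (Intercept) = mu; day1, day2, day3 = tau's
all.equal(fitted(mod_treat), fitted(mod_sum))  # identical fitted values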

boxplot(tip ~ day, data=tips)

mod <- lm(tip ~ day, data=tips)
anova(mod)
## Analysis of Variance Table
## 
## Response: tip
##            Df Sum Sq Mean Sq F value Pr(>F)
## day         3   9.53  3.1753  1.6724 0.1736
## Residuals 240 455.69  1.8987

If you fit a linear model with one categorical variable and apply anova() to the fitted model, you get a one-way ANOVA table. Table 8.1 on page 160 defines the quantities in this table, and we will verify them here as well. We will see that this table computes the quantities needed for testing

\[H_0 : \mu_{\text{Fri}} = \mu_{\text{Sat}} = \mu_{\text{Sun}} = \mu_{\text{Thu}}.\]

Here, we let \(m = 4\) denote the number of levels of the single categorical variable. Then our design matrix will be \(n \times m\), no matter how we do the encoding (we can either drop the intercept term or drop the dummy variable for one of the levels).
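
We can confirm the size of the design matrix on the fitted model (R keeps the intercept and drops the Friday dummy, but the column count is \(m\) either way):

dim(model.matrix(mod))  # n rows and m = 4 columns (intercept + 3 day dummies)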

The second row contains familiar quantities: \(\text{RSS}\) and \(\text{RSS} / (n - m)\). The \(n - m\) is because \(m\) is the number of columns in our design matrix. (Compare with \(n - p - 1\) in regression, where \(p+1\) is the number of columns of the design matrix.)

n <- dim(tips)[1]  # number of observations
levels(tips$day)
## [1] "Fri"  "Sat"  "Sun"  "Thur"
m <- length(levels(tips$day))  # number of groups (days)
e <- residuals(mod)
RSS <- sum(e^2)
RSS
## [1] 455.6866
RSS / (n - m)
## [1] 1.898694

The first row contains \(\text{RegSS}\) and \(\text{RegSS} / (m - 1)\). The \(m - 1\) is due to the \(m - 1\) constraints imposed by the null hypothesis.

y <- tips$tip
y_bar <- mean(y)      # grand mean
y_hat <- fitted(mod)  # fitted values (the per-day means)
RegSS <- sum((y_hat - y_bar)^2)
RegSS
## [1] 9.525873
TSS <- sum((y - y_bar)^2)
TSS - RSS
## [1] 9.525873
RegSS / (m-1)
## [1] 3.175291

The \(F\)-statistic ends up being the ratio of these quantities. Recall that in general the \(F\)-statistic looks like

\[\frac{(\text{RegSS} - \text{RegSS}_0) / (m - 1)}{\text{RSS} / (n - m)} = \frac{(\text{RSS}_0 - \text{RSS}) / (m - 1)}{\text{RSS} / (n - m)}\]

In this case the null model looks like

\[\text{tip}_i = \mu + \epsilon_i,\]

so \(\widehat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\), \(\text{RegSS}_0 = 0\), and \(\text{RSS}_0 = \text{TSS}\).

So the numerator is just \(\text{RegSS} / (m - 1)\), which is what is computed in the table, and the \(F\)-statistic reduces to

\[\frac{\text{RegSS} / (m - 1)}{\text{RSS} / (n - m)}.\]

Fstat <- (RegSS / (m - 1)) / (RSS / (n - m))
Fstat
## [1] 1.672355
1 - pf(Fstat, m - 1, n - m)
## [1] 0.1735886
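
Equivalently, since the null model here is the intercept-only fit, passing both models to anova() reproduces the same \(F\)-statistic and \(p\)-value (a preview of the model-comparison use of anova() below):

mod_null <- lm(tip ~ 1, data=tips)  # null model: a single common mean
anova(mod_null, mod)                # same F and p-value as the one-way table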

Note that in lecture, you showed that the \(F\)-statistic has a special form:

\[\frac{\sum_{j = 1}^m n_j (\bar{y}_j - \bar{y})^2 / (m - 1)}{\sum_{j=1}^m \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 / (n - m)}.\]

Recall that in this example, \(j = 1, \ldots, 4\) indexes the four groups (the four days), and \(i\) indexes individual observations within each group.

Recall the decomposition \(\text{TSS} = \text{RSS} + \text{RegSS}\):

\[ \sum_{j=1}^m \sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2 = \sum_{j=1}^m \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 + \sum_{j = 1}^m n_j (\bar{y}_j - \bar{y})^2\]
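
Both the special form and this decomposition can be verified from the group sizes and group means (a sketch in base R; ave() repeats each group's mean for every observation in that group):

n_j <- tapply(y, tips$day, length)        # group sizes n_j
ybar_j <- tapply(y, tips$day, mean)       # group means
between <- sum(n_j * (ybar_j - y_bar)^2)  # equals RegSS
within <- sum((y - ave(y, tips$day))^2)   # equals RSS
all.equal(TSS, within + between)          # TSS = RSS + RegSS
(between / (m - 1)) / (within / (n - m))  # equals Fstat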

Sequential ANOVA table

What happens when you put an lm() object into anova()?

From the help file for anova.lm:

Specifying a single object gives a sequential analysis of variance table for that fit. That is, the reductions in the residual sum of squares as each term of the formula is added in turn are given as the rows of a table, plus the residual sum of squares.

The table will contain F statistics (and P values) comparing the mean square for the row to the residual mean square.

Optionally the table can include test statistics. Normally the F statistic is most appropriate, which compares the mean square for a row to the residual mean square for the largest model considered.

Example 1: quantitative variables

data(Prestige)
mod3 <- lm(prestige ~ education + income + women, data=Prestige)
anova(mod3)
## Analysis of Variance Table
## 
## Response: prestige
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## education  1 21608.4 21608.4 350.9741 < 2.2e-16 ***
## income     1  2248.1  2248.1  36.5153 2.739e-08 ***
## women      1     5.3     5.3   0.0858    0.7702    
## Residuals 98  6033.6    61.6                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We have nested models.

\[\begin{align} \text{prestige}_i &= \beta_0 \phantom{{}+ \beta_1 \text{education}_i + \beta_2 \text{income}_i + \beta_3 \text{women}_i{}} + \epsilon_i \tag{0} \\ \text{prestige}_i &= \beta_0 + \beta_1 \text{education}_i \phantom{{}+ \beta_2 \text{income}_i + \beta_3 \text{women}_i{}} + \epsilon_i \tag{1} \\ \text{prestige}_i &= \beta_0 + \beta_1 \text{education}_i + \beta_2 \text{income}_i \phantom{{}+ \beta_3 \text{women}_i{}} + \epsilon_i \tag{2} \\ \text{prestige}_i &= \beta_0 + \beta_1 \text{education}_i + \beta_2 \text{income}_i + \beta_3 \text{women}_i + \epsilon_i \tag{3} \end{align}\]

mod0 <- lm(prestige ~ 1, data=Prestige)
mod1 <- lm(prestige ~ education, data=Prestige)
mod2 <- lm(prestige ~ education + income, data=Prestige)
RSS0 <- sum(resid(mod0)^2)
RSS1 <- sum(resid(mod1)^2)
RSS2 <- sum(resid(mod2)^2)
RSS3 <- sum(resid(mod3)^2)
y <- Prestige$prestige
y_bar <- mean(y)
TSS <- sum((y - y_bar)^2)
c(RSS0, TSS)
## [1] 29895.43 29895.43

The “Sum Sq” column.

c(RSS0 - RSS1,
  RSS1 - RSS2,
  RSS2 - RSS3,
  RSS3)
## [1] 21608.436539  2248.139345     5.280592  6033.570191

The “Mean Sq” column.

n <- dim(Prestige)[1]
Msq <- c((RSS0 - RSS1) / 1,
         (RSS1 - RSS2) / 1,
         (RSS2 - RSS3) / 1,
         RSS3 / (n - 3 - 1))
Msq
## [1] 21608.436539  2248.139345     5.280592    61.567043

The F statistics.

Fval <- Msq[1:3] / Msq[4]
Fval
## [1] 350.97408561  36.51530500   0.08576979

The \(p\)-values.

1 - pf(Fval, 1, n - 3 - 1)
## [1] 0.000000e+00 2.739412e-08 7.702447e-01

So we see that the table computes the following statistics.

\[\begin{align} \frac{(\text{RSS}_0 - \text{RSS}_1)/1}{\text{RSS}_3 / (n - 3 - 1)} \tag{a} \\ \frac{(\text{RSS}_1 - \text{RSS}_2)/1}{\text{RSS}_3 / (n - 3 - 1)} \tag{b} \\ \frac{(\text{RSS}_2 - \text{RSS}_3)/1}{\text{RSS}_3 / (n - 3 - 1)} \tag{c} \end{align}\]

Let us focus on the first statistic (a). Under the model with \(\beta_1 = \beta_2 = \beta_3 = 0\), this actually follows an \(F\)-distribution with degrees of freedom \(1\) and \(n - 3 - 1\).

But we have seen how to do this starting with model (1), \(\text{prestige}_i = \beta_0 + \beta_1 \text{education}_i + \epsilon_i\), and testing \(\beta_1 = 0\). The \(F\)-test in this case is \[\frac{(\text{RSS}_0 - \text{RSS}_1)/1}{\text{RSS}_1 / (n - 1 - 1)}.\tag{a'}\] (The only difference is the denominator.) This also follows an \(F\)-distribution, but with degrees of freedom \(1\) and \(n - 1 - 1\). So both (a) and (a’) test whether \(\beta_1 = 0\) in this small model (1). But these are different \(F\)-statistics and will give different \(p\)-values, as we show below.

Similarly, under the hypothesis \(\beta_2 = \beta_3 = 0\), the second statistic above follows an \(F\)-distribution with degrees of freedom \(1\) and \(n - 3 - 1\). But if we consider model (2), \(\text{prestige}_i = \beta_0 + \beta_1 \text{education}_i + \beta_2 \text{income}_i + \epsilon_i\), and the hypothesis \(\beta_2 = 0\), we have seen that \[\frac{(\text{RSS}_1 - \text{RSS}_2)/1}{\text{RSS}_2 / (n - 2 - 1)}\tag{b'}\] follows the \(F\)-distribution with degrees of freedom \(1\) and \(n - 2 - 1\). So both (b) and (b’) test whether \(\beta_2 = 0\) in model (2).

In general these \(F\)-statistics will be different, and will also have different \(p\)-values.

# computing statistic (a') and p-value
Fval_alt1 <- ((RSS0 - RSS1) / 1) / (RSS1 / (n - 1 - 1))
Fval_alt1
## [1] 260.7513
1 - pf(Fval_alt1, 1, n - 1 - 1)
## [1] 0
# statistic (a) and p-value
Fval[1]
## [1] 350.9741
1 - pf(Fval[1], 1, n - 3 - 1)
## [1] 0
# computing statistic (b') and p-value
Fval_alt2 <- ((RSS1 - RSS2) / 1) / (RSS2 / (n - 2 - 1))
Fval_alt2
## [1] 36.85565
1 - pf(Fval_alt2, 1, n - 2 - 1)
## [1] 2.355196e-08
# statistic (b) and p-value
Fval[2]
## [1] 36.51531
1 - pf(Fval[2], 1, n - 3 - 1)
## [1] 2.739412e-08

Finally, statistic (c) is the same as what we have learned for testing \(\beta_3 = 0\) in the full model (3).
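
Because statistic (c) has one numerator degree of freedom, it equals the square of the usual \(t\)-statistic for women in summary(mod3), which gives a quick consistency check:

summary(mod3)$coefficients["women", "t value"]^2  # equals Fval[3]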

Custom sequence of models

The anova() function can take multiple lm objects. From the anova.lm help page:

If more than one object is specified, the table has a row for the residual degrees of freedom and sum of squares for each model. For all but the first model, the change in degrees of freedom and sum of squares is also given. (This only makes statistical sense if the models are nested.) It is conventional to list the models from smallest to largest, but this is up to the user.

If we input the sequence of all four models rather than just the largest model, we get a table with essentially the same values, but with an extra “RSS” column that contains \(\text{RSS}_0\), \(\text{RSS}_1\), \(\text{RSS}_2\), and \(\text{RSS}_3\).

anova(mod3)
## Analysis of Variance Table
## 
## Response: prestige
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## education  1 21608.4 21608.4 350.9741 < 2.2e-16 ***
## income     1  2248.1  2248.1  36.5153 2.739e-08 ***
## women      1     5.3     5.3   0.0858    0.7702    
## Residuals 98  6033.6    61.6                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod0, mod1, mod2, mod3)
## Analysis of Variance Table
## 
## Model 1: prestige ~ 1
## Model 2: prestige ~ education
## Model 3: prestige ~ education + income
## Model 4: prestige ~ education + income + women
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    101 29895.4                                    
## 2    100  8287.0  1   21608.4 350.9741 < 2.2e-16 ***
## 3     99  6038.9  1    2248.1  36.5153 2.739e-08 ***
## 4     98  6033.6  1       5.3   0.0858    0.7702    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that each “Sum of Sq” entry is the difference between the “RSS” entries in that row and the row above it.
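
These differences are easy to reproduce from the RSS values computed earlier:

-diff(c(RSS0, RSS1, RSS2, RSS3))  # the “Sum of Sq” column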

Suppose we want to skip model (2), and instead look at the sequence of models (0), (1), and (3).

anova(mod0, mod1, mod3)
## Analysis of Variance Table
## 
## Model 1: prestige ~ 1
## Model 2: prestige ~ education
## Model 3: prestige ~ education + income + women
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    101 29895.4                                  
## 2    100  8287.0  1   21608.4 350.97 < 2.2e-16 ***
## 3     98  6033.6  2    2253.4  18.30 1.765e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
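
The last row now compares models (1) and (3) directly: the drop in RSS is spread over \(2\) degrees of freedom (the two predictors added), and is divided by the residual mean square of the largest model. We can reproduce the statistic by hand:

Fskip <- ((RSS1 - RSS3) / 2) / (RSS3 / (n - 3 - 1))
Fskip                        # should match the F value 18.30 above
1 - pf(Fskip, 2, n - 3 - 1)  # should match Pr(>F) above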

Example 2: categorical variables

Since each categorical variable is represented by multiple dummy variables, the degrees of freedom column may have values \(> 1\).
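
Each factor contributes (number of levels \(- 1\)) dummy variables, which we can check against the “Df” column of the table below:

c(nlevels(tips$day), nlevels(tips$sex), nlevels(factor(tips$size))) - 1  # 3, 1, 5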

mod3 <- lm(tip ~ day + sex + factor(size), data=tips)
anova(mod3)
## Analysis of Variance Table
## 
## Response: tip
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## day            3   9.53  3.1753  2.1372   0.09624 .  
## sex            1   1.59  1.5946  1.0732   0.30128    
## factor(size)   5 106.43 21.2856 14.3266 3.077e-12 ***
## Residuals    234 347.66  1.4857                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We again have a sequence of four nested models; the largest, mod3, was fit above.

mod0 <- lm(tip ~ 1, data = tips)
mod1 <- lm(tip ~ day, data = tips)
mod2 <- lm(tip ~ day + sex, data = tips)
RSS3 <- sum(residuals(mod3)^2)
RSS2 <- sum(residuals(mod2)^2)
RSS1 <- sum(residuals(mod1)^2)
RSS0 <- sum(residuals(mod0)^2)
y <- tips$tip
y_bar <- mean(y)
TSS <- sum((y - y_bar)^2)
TSS
## [1] 465.2125
RSS0
## [1] 465.2125
n <- dim(tips)[1]
df1 <- 3  # day has 4 levels, so 3 dummy variables
df2 <- 1  # sex has 2 levels, so 1 dummy variable
df3 <- 5  # factor(size) has 6 levels, so 5 dummy variables

The “Sum Sq” column.

c(TSS - RSS1,
  RSS1 - RSS2,
  RSS2 - RSS3,
  RSS3)
## [1]   9.525873   1.594561 106.428235 347.663807

The “Mean Sq” column.

Msq <- c((TSS - RSS1) / df1,
         (RSS1 - RSS2) / df2,
         (RSS2 - RSS3) / df3,
         RSS3 / (n - df1 - df2 - df3 - 1))
Msq
## [1]  3.175291  1.594561 21.285647  1.485743
Msq[1:3] / Msq[4]
## [1]  2.137174  1.073242 14.326603
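
The “Pr(>F)” column again comes from the \(F\)-distribution, now with numerator degrees of freedom that vary by row:

Fval <- Msq[1:3] / Msq[4]
1 - pf(Fval, c(df1, df2, df3), n - df1 - df2 - df3 - 1)  # should match the Pr(>F) column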