I highly recommend you read Chapter 7 (of the Fox textbook, not Statistics for Dummies) if you want more detail about dummy variable regression.
# you may need to install the "reshape2" package
data(tips, package="reshape2")
head(tips)
## total_bill tip sex smoker day time size
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
Let us use this concrete example to understand exactly what \(y\) and \(X\) are. Suppose we consider tip as the response variable, and total_bill and size as the explanatory variables.
\(y\) is the vector with the tip values.
y <- tips$tip
y
## [1] 1.01 1.66 3.50 3.31 3.61 4.71 2.00 3.12 1.96 3.23 1.71
## [12] 5.00 1.57 3.00 3.02 3.92 1.67 3.71 3.50 3.35 4.08 2.75
## [23] 2.23 7.58 3.18 2.34 2.00 2.00 4.30 3.00 1.45 2.50 3.00
## [34] 2.45 3.27 3.60 2.00 3.07 2.31 5.00 2.24 2.54 3.06 1.32
## [45] 5.60 3.00 5.00 6.00 2.05 3.00 2.50 2.60 5.20 1.56 4.34
## [56] 3.51 3.00 1.50 1.76 6.73 3.21 2.00 1.98 3.76 2.64 3.15
## [67] 2.47 1.00 2.01 2.09 1.97 3.00 3.14 5.00 2.20 1.25 3.08
## [78] 4.00 3.00 2.71 3.00 3.40 1.83 5.00 2.03 5.17 2.00 4.00
## [89] 5.85 3.00 3.00 3.50 1.00 4.30 3.25 4.73 4.00 1.50 3.00
## [100] 1.50 2.50 3.00 2.50 3.48 4.08 1.64 4.06 4.29 3.76 4.00
## [111] 3.00 1.00 4.00 2.55 4.00 3.50 5.07 1.50 1.80 2.92 2.31
## [122] 1.68 2.50 2.00 2.52 4.20 1.48 2.00 2.00 2.18 1.50 2.83
## [133] 1.50 2.00 3.25 1.25 2.00 2.00 2.00 2.75 3.50 6.70 5.00
## [144] 5.00 2.30 1.50 1.36 1.63 1.73 2.00 2.50 2.00 2.74 2.00
## [155] 2.00 5.14 5.00 3.75 2.61 2.00 3.50 2.50 2.00 2.00 3.00
## [166] 3.48 2.24 4.50 1.61 2.00 10.00 3.16 5.15 3.18 4.00 3.11
## [177] 2.00 2.00 4.00 3.55 3.68 5.65 3.50 6.50 3.00 5.00 3.50
## [188] 2.00 3.50 4.00 1.50 4.19 2.56 2.02 4.00 1.44 2.00 5.00
## [199] 2.00 2.00 4.00 2.01 2.00 2.50 4.00 3.23 3.41 3.00 2.03
## [210] 2.23 2.00 5.16 9.00 2.50 6.50 1.10 3.00 1.50 1.44 3.09
## [221] 2.20 3.48 1.92 3.00 1.58 2.50 2.00 3.00 2.72 2.88 2.00
## [232] 3.00 3.39 1.47 3.00 1.25 1.00 1.17 4.67 5.92 2.00 2.00
## [243] 1.75 3.00
length(y)
## [1] 244
The design matrix \(X\) simply contains the columns of the explanatory variables, along with an intercept.
X <- cbind(1,tips[, c("total_bill", "size")])
head(X)
## 1 total_bill size
## 1 1 16.99 2
## 2 1 10.34 3
## 3 1 21.01 3
## 4 1 23.68 2
## 5 1 24.59 4
## 6 1 25.29 4
dim(X)
## [1] 244 3
So in general, each row of \(X\) represents one data point. Each column of \(X\) represents a variable in the model (or the intercept). The matrix multiplication \(X \beta\) will take each data point (row of \(X\)) and compute a linear combination of the variables using coefficients \(\beta_0,\ldots, \beta_p\).
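To make this concrete, here is a small sketch (with made-up coefficients, since we have not estimated anything yet) checking that entry \(i\) of \(X \beta\) is just the linear combination \(\beta_0 + \beta_1 \cdot \text{total\_bill}_i + \beta_2 \cdot \text{size}_i\):

```r
data(tips, package = "reshape2")
X <- as.matrix(cbind(1, tips[, c("total_bill", "size")]))
b <- c(0.5, 0.1, 0.2)              # arbitrary coefficients, for illustration only
Xb <- X %*% b                      # one linear combination per data point (row)
# check the first entry by hand: b0 + b1 * total_bill[1] + b2 * size[1]
all.equal(Xb[1], 0.5 + 0.1 * tips$total_bill[1] + 0.2 * tips$size[1])  # TRUE
```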
Now we can hypothetically do all the computations you have seen in lecture and on homework: solve the normal equation \(X^\top X \hat{\beta} = X^\top y\), etc.
X <- as.matrix(X)
beta <- solve(t(X) %*% X, t(X) %*% y)
beta
## [,1]
## 1 0.66894474
## total_bill 0.09271334
## size 0.19259779
Let’s just check that we get the same result as the Great lm().
lm(tip ~ total_bill + size, data=tips)
##
## Call:
## lm(formula = tip ~ total_bill + size, data = tips)
##
## Coefficients:
## (Intercept) total_bill size
## 0.66894 0.09271 0.19260
What about categorical variables? [Represented as factors in R.]
class(tips$tip)
## [1] "numeric"
class(tips$size)
## [1] "integer"
class(tips$sex)
## [1] "factor"
Let’s check what values the categorical variables take, called levels in R. I also refer to them as categories.
levels(tips$sex)
## [1] "Female" "Male"
levels(tips$smoker)
## [1] "No" "Yes"
levels(tips$day)
## [1] "Fri" "Sat" "Sun" "Thur"
levels(tips$time)
## [1] "Dinner" "Lunch"
How can we use these variables in a linear model? Let’s offer our full dataset (with categorical variables) to the Almighty lm() and accept its blessing blindly, in the hope that we may become enlightened.
lm(tip ~ ., data = tips) # the period tells lm() to use all other variables as explanatory variables
##
## Call:
## lm(formula = tip ~ ., data = tips)
##
## Coefficients:
## (Intercept) total_bill sexMale smokerYes daySat
## 0.80382 0.09449 -0.03244 -0.08641 -0.12146
## daySun dayThur timeLunch size
## -0.02548 -0.16226 0.06813 0.17599
What are all these things? Why is there a coefficient for each level of each categorical variable? And on closer inspection, why is exactly one level missing from each categorical variable?
[See Section 7.1 of Fox.]
Let’s try to form the design matrix naïvely.
X <- cbind(1, tips[,c("total_bill", "size", "sex")])
head(X)
## 1 total_bill size sex
## 1 1 16.99 2 Female
## 2 1 10.34 3 Male
## 3 1 21.01 3 Male
## 4 1 23.68 2 Male
## 5 1 24.59 4 Female
## 6 1 25.29 4 Male
Unfortunately, despite what we learn in sex ed, we can’t actually multiply or add using sex.
Maybe we can replace Male with \(1\) and Female with \(0\). (Or vice versa.)
tips$sex
## [1] Female Male Male Male Female Male Male Male Male Male
## [11] Male Female Male Male Female Male Female Male Female Male
## [21] Male Female Female Male Male Male Male Male Male Female
## [31] Male Male Female Female Male Male Male Female Male Male
## [41] Male Male Male Male Male Male Male Male Male Male
## [51] Male Female Female Male Male Male Male Female Male Male
## [61] Male Male Male Male Male Male Female Female Male Male
## [71] Male Female Female Female Female Male Male Male Male Male
## [81] Male Male Female Male Male Female Male Male Male Male
## [91] Male Male Female Female Female Male Male Male Male Male
## [101] Female Female Female Female Female Male Male Male Male Female
## [111] Male Female Male Male Female Female Male Female Female Female
## [121] Male Female Male Male Female Female Male Female Female Male
## [131] Male Female Female Female Female Female Female Female Male Female
## [141] Female Male Male Female Female Female Female Female Male Male
## [151] Male Male Male Male Male Female Male Female Female Male
## [161] Male Male Female Male Female Male Male Male Female Female
## [171] Male Male Male Male Male Male Male Male Female Male
## [181] Male Male Male Male Male Male Female Male Female Male
## [191] Male Female Male Male Male Male Male Female Female Male
## [201] Male Female Female Female Male Female Male Male Male Female
## [211] Male Male Male Female Female Female Male Male Male Female
## [221] Male Female Male Female Male Female Female Male Male Female
## [231] Male Male Male Male Male Male Male Male Female Male
## [241] Female Male Male Female
## Levels: Female Male
as.numeric(tips$sex) - 1
## [1] 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 0 0 1
## [36] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1
## [71] 1 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0
## [106] 1 1 1 1 0 1 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0
## [141] 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1
## [176] 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0
## [211] 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0
as.numeric(tips$sex == "Male")
## [1] 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 0 0 1
## [36] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1
## [71] 1 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0
## [106] 1 1 1 1 0 1 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0
## [141] 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1
## [176] 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0
## [211] 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0
sex <- as.numeric(tips$sex) - 1
Let’s make our design matrix again.
X <- cbind(1, tips[,c("total_bill", "size")], sex)
head(X)
## 1 total_bill size sex
## 1 1 16.99 2 0
## 2 1 10.34 3 1
## 3 1 21.01 3 1
## 4 1 23.68 2 1
## 5 1 24.59 4 0
## 6 1 25.29 4 1
Let’s do the linear regression manually and with lm().
X <- as.matrix(X)
beta <- solve(t(X) %*% X, t(X) %*% y)
beta
## [,1]
## 1 0.68187393
## total_bill 0.09292034
## size 0.19258767
## sex -0.02641868
lm(tip ~ total_bill + size + sex, data=tips)
##
## Call:
## lm(formula = tip ~ total_bill + size + sex, data = tips)
##
## Coefficients:
## (Intercept) total_bill size sexMale
## 0.68187 0.09292 0.19259 -0.02642
Great, so this is what lm() is doing too.
Let us understand exactly what kind of model we are fitting when we encode sex in this way.
Our model looks like \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 D + \epsilon\] where \(X_1\) and \(X_2\) are total_bill and size, while \(D\) takes on values \(0\) and \(1\) for Female and Male respectively.
We can view this as two models, one for each gender. For Female, \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon,\] and for Male, \[Y = (\beta_0 + \beta_3) + \beta_1 X_1 + \beta_2 X_2 + \epsilon.\]
So, by encoding the genders as a \(0\)-\(1\) variable, we are saying that there is a constant difference of \(\beta_3\) between the average tips received between the two genders, holding the other variables fixed. If you were to plot these functions (ignoring the noise), the graphs would be two parallel planes that differ only in their intercept.
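One way to see this “constant difference” interpretation directly is through predictions: two hypothetical diners with the same bill and party size but different sexes should have predicted tips that differ by exactly the sexMale coefficient. (A quick sketch; the values 20 and 2 are arbitrary.)

```r
data(tips, package = "reshape2")
fit <- lm(tip ~ total_bill + size + sex, data = tips)
# two hypothetical diners, identical except for sex
nd <- data.frame(total_bill = 20, size = 2,
                 sex = factor(c("Female", "Male"), levels = c("Female", "Male")))
p <- predict(fit, newdata = nd)
unname(p[2] - p[1])            # matches coef(fit)["sexMale"]
```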
For dichotomous factors, we can use other encodings (e.g., \(1\) and \(2\) instead of \(0\) and \(1\)) without much repercussion, but \(0\)-\(1\) coding is more natural and interpretable. However, we have to be careful when we move to polytomous factors.
[See Section 7.2 of Fox.]
Again let us naïvely construct our design matrix.
levels(tips$day)
## [1] "Fri" "Sat" "Sun" "Thur"
X <- cbind(1, tips[,c("total_bill", "size", "day")])
X[70:100,]
## 1 total_bill size day
## 70 1 15.01 2 Sat
## 71 1 12.02 2 Sat
## 72 1 17.07 3 Sat
## 73 1 26.86 2 Sat
## 74 1 25.28 2 Sat
## 75 1 14.73 2 Sat
## 76 1 10.51 2 Sat
## 77 1 17.92 2 Sat
## 78 1 27.20 4 Thur
## 79 1 22.76 2 Thur
## 80 1 17.29 2 Thur
## 81 1 19.44 2 Thur
## 82 1 16.66 2 Thur
## 83 1 10.07 1 Thur
## 84 1 32.68 2 Thur
## 85 1 15.98 2 Thur
## 86 1 34.83 4 Thur
## 87 1 13.03 2 Thur
## 88 1 18.28 2 Thur
## 89 1 24.71 2 Thur
## 90 1 21.16 2 Thur
## 91 1 28.97 2 Fri
## 92 1 22.49 2 Fri
## 93 1 5.75 2 Fri
## 94 1 16.32 2 Fri
## 95 1 22.75 2 Fri
## 96 1 40.17 4 Fri
## 97 1 27.28 2 Fri
## 98 1 12.03 2 Fri
## 99 1 21.01 2 Fri
## 100 1 12.46 2 Fri
For dichotomous factors, we simply replaced the categories with the numbers \(0\) and \(1\). Here, we have four levels. What if we just use \(0\), \(1\), \(2\), and \(3\)?
We need to consider what that model would be. \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 D + \epsilon.\] Then the models for the four days of the week are \[\begin{align} Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\ Y &= (\beta_0 + \beta_3) + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\ Y &= (\beta_0 + 2\beta_3) + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\ Y &= (\beta_0 + 3 \beta_3) + \beta_1 X_1 + \beta_2 X_2 + \epsilon \end{align}\]We have thus imposed some sort of ordering on the categories, which is undesirable.
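We can see this constraint in R by fitting both encodings (a sketch; day_num is a name I made up). The numeric coding forces the four intercept shifts to be \(0\), \(\beta_3\), \(2\beta_3\), \(3\beta_3\), i.e. equally spaced in the arbitrary order Fri, Sat, Sun, Thur, while dummy coding (coming next) lets each day have its own shift:

```r
data(tips, package = "reshape2")
tips$day_num <- as.numeric(tips$day) - 1      # Fri=0, Sat=1, Sun=2, Thur=3
fit_num <- lm(tip ~ total_bill + size + day_num, data = tips)
fit_dum <- lm(tip ~ total_bill + size + day, data = tips)
length(coef(fit_num))   # 4: one slope beta_3 controls all the day shifts
length(coef(fit_dum))   # 6: each non-baseline day gets its own coefficient
```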
For dummy coding, we will view each category as a binary vector of length \(4-1=3\). Fri will be represented by \((0,0,0)\), Sat by \((1,0,0)\), Sun by \((0,1,0)\), and Thur by \((0,0,1)\). In terms of the design matrix, this amounts to replacing the day column of our naïve design matrix with three columns, with entries according to the above coding scheme.
The models for Fri, Sat, Sun, and Thur respectively are
\[\begin{align}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon
\\
Y &= (\beta_0 + \beta_3) + \beta_1 X_1 + \beta_2 X_2 + \epsilon
\\
Y &= (\beta_0 + \beta_4) + \beta_1 X_1 + \beta_2 X_2 + \epsilon
\\
Y &= (\beta_0 + \beta_5) + \beta_1 X_1 + \beta_2 X_2 + \epsilon
\end{align}\]
Thus \(\beta_3\) represents the constant difference between an average tip on Sat and one on Fri, holding all other variables fixed; \(\beta_4\) and \(\beta_5\) represent constant differences relative to Fri as well. Geometrically, the graphs of these four models are parallel planes with different intercepts.
Note that the choice of which level gets encoded by the zero vector is arbitrary, but it establishes which level of the categorical variable is the “default” against which the other levels are compared; permuting the encoding will change the meaning/interpretation of the least squares coefficients.
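In R, the default level is the first level of the factor, and relevel() changes it. A quick sketch: refitting with Sun as the baseline changes the day coefficients (they are now differences relative to Sun) but leaves the fitted values untouched, since the two encodings describe the same model.

```r
data(tips, package = "reshape2")
fit_fri <- lm(tip ~ total_bill + size + day, data = tips)   # baseline: Fri
tips$day <- relevel(tips$day, ref = "Sun")
fit_sun <- lm(tip ~ total_bill + size + day, data = tips)   # baseline: Sun
coef(fit_sun)                          # coefficients differ from fit_fri's...
all.equal(fitted(fit_fri), fitted(fit_sun))   # ...but the fit is identical
```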
Let us make our design matrix now. [Code adapted from Michael Tong’s note on Piazza.]
X <- cbind(1, tips[,c("total_bill", "size")])
lev <- levels(tips$day)
num <- length(lev)
names <- colnames(X)
for (i in 2:num) {
X <- cbind(X, tips$day == lev[i])
}
colnames(X) <- c(names, lev[2:num])
X <- as.matrix(X)
X[70:100,]
## 1 total_bill size Sat Sun Thur
## 70 1 15.01 2 1 0 0
## 71 1 12.02 2 1 0 0
## 72 1 17.07 3 1 0 0
## 73 1 26.86 2 1 0 0
## 74 1 25.28 2 1 0 0
## 75 1 14.73 2 1 0 0
## 76 1 10.51 2 1 0 0
## 77 1 17.92 2 1 0 0
## 78 1 27.20 4 0 0 1
## 79 1 22.76 2 0 0 1
## 80 1 17.29 2 0 0 1
## 81 1 19.44 2 0 0 1
## 82 1 16.66 2 0 0 1
## 83 1 10.07 1 0 0 1
## 84 1 32.68 2 0 0 1
## 85 1 15.98 2 0 0 1
## 86 1 34.83 4 0 0 1
## 87 1 13.03 2 0 0 1
## 88 1 18.28 2 0 0 1
## 89 1 24.71 2 0 0 1
## 90 1 21.16 2 0 0 1
## 91 1 28.97 2 0 0 0
## 92 1 22.49 2 0 0 0
## 93 1 5.75 2 0 0 0
## 94 1 16.32 2 0 0 0
## 95 1 22.75 2 0 0 0
## 96 1 40.17 4 0 0 0
## 97 1 27.28 2 0 0 0
## 98 1 12.03 2 0 0 0
## 99 1 21.01 2 0 0 0
## 100 1 12.46 2 0 0 0
Let’s check that it matches what lm() does.
beta <- solve(t(X) %*% X, t(X) %*% y)
beta
## [,1]
## 1 0.74578683
## total_bill 0.09299361
## size 0.18713231
## Sat -0.12465824
## Sun -0.01349818
## Thur -0.07749322
lm(tip ~ total_bill + size + day, data=tips)
##
## Call:
## lm(formula = tip ~ total_bill + size + day, data = tips)
##
## Coefficients:
## (Intercept) total_bill size daySat daySun
## 0.74579 0.09299 0.18713 -0.12466 -0.01350
## dayThur
## -0.07749
Why did we not use the “one-hot” encoding of \((1,0,0,0)\), \((0,1,0,0)\), \((0,0,1,0)\), and \((0,0,0,1)\) (thus introducing four new variables to represent day)? If you go through and write down the model for each day, the intercepts are \(\beta_0 + \beta_3\), \(\beta_0 + \beta_4\), \(\beta_0 + \beta_5\), and \(\beta_0 + \beta_6\). Here we have \(5\) coefficients trying to specify \(4\) intercepts, so there are multiple ways to represent any given model, i.e. the least squares solution will not be unique.
Another way to understand the issue with this encoding is to form the design matrix \(X\) with this encoding, and note that the last three columns will sum to the first column. Thus \(X\) is not full rank, so the least squares solution is not unique.
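We can verify this rank deficiency directly (a sketch following the construction above, but keeping all four day columns):

```r
data(tips, package = "reshape2")
X <- as.matrix(cbind(1, tips[, c("total_bill", "size")]))
# append a one-hot column for *every* level of day (no level dropped)
for (lev in levels(tips$day)) X <- cbind(X, as.numeric(tips$day == lev))
dim(X)        # 244 x 7
qr(X)$rank    # 6: the four day columns sum to the intercept column
```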
If you have multiple categorical variables, just do the dummy encoding for each variable. This amounts to replacing each categorical variable’s column of the data table with \(k-1\) columns containing the dummy encoding, where \(k\) is the number of levels the variable has.
X <- model.matrix(tip ~ . , data=tips) # probably can't use this on homework
head(X)
## (Intercept) total_bill sexMale smokerYes daySat daySun dayThur timeLunch
## 1 1 16.99 0 0 0 1 0 0
## 2 1 10.34 1 0 0 1 0 0
## 3 1 21.01 1 0 0 1 0 0
## 4 1 23.68 1 0 0 1 0 0
## 5 1 24.59 0 0 0 1 0 0
## 6 1 25.29 1 0 0 1 0 0
## size
## 1 2
## 2 3
## 3 3
## 4 2
## 5 4
## 6 4
beta <- solve(t(X) %*% X, t(X) %*% y)
beta
## [,1]
## (Intercept) 0.80381728
## total_bill 0.09448701
## sexMale -0.03244094
## smokerYes -0.08640832
## daySat -0.12145838
## daySun -0.02548066
## dayThur -0.16225920
## timeLunch 0.06812860
## size 0.17599200
lm(tip ~ ., data=tips)
##
## Call:
## lm(formula = tip ~ ., data = tips)
##
## Coefficients:
## (Intercept) total_bill sexMale smokerYes daySat
## 0.80382 0.09449 -0.03244 -0.08641 -0.12146
## daySun dayThur timeLunch size
## -0.02548 -0.16226 0.06813 0.17599
A few of you were worried that if you have several categorical variables, then many columns will consist of \(0\)s and \(1\)s, which would make it quite possible for columns to be the same (or more generally, to be linearly dependent).
If this did happen, think about what that means. For example, maybe all male bill payers in the dataset ate on Saturday, and no female bill payers ate on Saturday. Then from the perspective of the data, the Male dummy variable is exactly the same as the Saturday dummy variable. Thus the least squares estimate is not unique (similar to Question 1 on Homework 1).
Note that the issues and suggested remedies above are not intrinsic to dummy variables or categorical variables. The same can be said for quantitative data that have perfect collinearity or near collinearity.
See the Wikipedia page on multicollinearity (in particular the “consequences” and “remedies” sections).
Dummy coding essentially only amounts to changing the intercept term for each level of the categorical variable. If we want to have different slopes for each category, we will need interaction terms between the dummy variables and the other variables. See Section 7.3 in the textbook for more detail.
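As a minimal sketch of what this looks like in R: the formula total_bill * sex expands to total_bill + sex + total_bill:sex, so each sex gets its own slope on total_bill as well as its own intercept. (See Fox §7.3 for how to interpret the resulting coefficients.)

```r
data(tips, package = "reshape2")
fit <- lm(tip ~ total_bill * sex, data = tips)
# the interaction term total_bill:sexMale is the *difference* in slopes
names(coef(fit))   # "(Intercept)" "total_bill" "sexMale" "total_bill:sexMale"
```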
Above, the categorical variables were nominal, i.e. they had no intrinsic order. An ordinal variable has a natural ordering (e.g., “unsatisfied”, “indifferent”, “satisfied”), and it may be good to incorporate this ordering into the regression. Sometimes things like “year” or “age” are encoded as factors in the dataset despite obviously being numerical, and it may make a lot of sense to just treat them as numbers. To convert factors to numerics, you can try as.numeric(), but you should be careful to check what it is doing.
For instance,
year <- factor(c("1984", "2017", "1984"))
as.numeric(year)
## [1] 1 2 1
as.numeric(as.character(year))
## [1] 1984 2017 1984
Sometimes you may want to do the opposite: convert a quantitative variable back into a categorical variable. You would need to have a good justification for doing so. For the sake of demonstration, we convert the size variable into a factor and then apply lm().
as.factor(tips$size)
## [1] 2 3 3 2 4 4 2 4 2 2 2 4 2 4 2 2 3 3 3 3 2 2 2 4 2 4 2 2 2 2 2 4 2 4 2
## [36] 3 3 3 3 3 3 2 2 2 4 2 2 4 3 2 2 2 4 2 4 2 4 2 2 4 2 2 2 4 3 3 2 1 2 2
## [71] 2 3 2 2 2 2 2 4 2 2 2 2 1 2 2 4 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2 2 3 2 2
## [106] 2 2 2 2 2 2 1 3 2 3 2 4 2 2 4 2 2 2 2 2 6 2 2 2 3 2 2 2 2 2 2 2 2 2 2
## [141] 2 6 5 6 2 2 3 2 2 2 2 2 3 4 4 5 6 4 2 4 4 2 3 2 2 3 2 4 2 2 3 2 2 2 2
## [176] 2 2 2 2 2 4 2 3 4 2 5 3 5 3 3 2 2 2 2 2 2 2 4 2 2 3 2 2 2 4 3 3 4 2 2
## [211] 3 4 4 2 3 2 5 2 2 4 2 2 1 3 2 2 2 4 2 2 4 3 2 2 2 2 2 2 3 3 2 2 2 2
## Levels: 1 2 3 4 5 6
tips$size <- as.factor(tips$size)
lm(tip ~ ., data=tips)
##
## Call:
## lm(formula = tip ~ ., data = tips)
##
## Coefficients:
## (Intercept) total_bill sexMale smokerYes daySat
## 0.86318 0.09387 -0.03250 -0.07710 -0.11261
## daySun dayThur timeLunch size2 size3
## -0.01248 -0.17351 0.08158 0.29568 0.45547
## size4 size5 size6
## 0.69172 0.44743 1.18049