Dummy variables

I highly recommend you read Chapter 7 (of the Fox textbook, not Statistics for Dummies) if you want more detail about dummy variable regression.

Recap for quantitative variables

# you may need to install the "reshape2" package
data(tips, package="reshape2")
head(tips)
##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

Let us use this concrete example to understand exactly what \(y\) and \(X\) are. Suppose we consider tip as the response variable, and total_bill and size as the explanatory variables.

\(y\) is the vector with the tip values.

y <- tips$tip
y
##   [1]  1.01  1.66  3.50  3.31  3.61  4.71  2.00  3.12  1.96  3.23  1.71
##  [12]  5.00  1.57  3.00  3.02  3.92  1.67  3.71  3.50  3.35  4.08  2.75
##  [23]  2.23  7.58  3.18  2.34  2.00  2.00  4.30  3.00  1.45  2.50  3.00
##  [34]  2.45  3.27  3.60  2.00  3.07  2.31  5.00  2.24  2.54  3.06  1.32
##  [45]  5.60  3.00  5.00  6.00  2.05  3.00  2.50  2.60  5.20  1.56  4.34
##  [56]  3.51  3.00  1.50  1.76  6.73  3.21  2.00  1.98  3.76  2.64  3.15
##  [67]  2.47  1.00  2.01  2.09  1.97  3.00  3.14  5.00  2.20  1.25  3.08
##  [78]  4.00  3.00  2.71  3.00  3.40  1.83  5.00  2.03  5.17  2.00  4.00
##  [89]  5.85  3.00  3.00  3.50  1.00  4.30  3.25  4.73  4.00  1.50  3.00
## [100]  1.50  2.50  3.00  2.50  3.48  4.08  1.64  4.06  4.29  3.76  4.00
## [111]  3.00  1.00  4.00  2.55  4.00  3.50  5.07  1.50  1.80  2.92  2.31
## [122]  1.68  2.50  2.00  2.52  4.20  1.48  2.00  2.00  2.18  1.50  2.83
## [133]  1.50  2.00  3.25  1.25  2.00  2.00  2.00  2.75  3.50  6.70  5.00
## [144]  5.00  2.30  1.50  1.36  1.63  1.73  2.00  2.50  2.00  2.74  2.00
## [155]  2.00  5.14  5.00  3.75  2.61  2.00  3.50  2.50  2.00  2.00  3.00
## [166]  3.48  2.24  4.50  1.61  2.00 10.00  3.16  5.15  3.18  4.00  3.11
## [177]  2.00  2.00  4.00  3.55  3.68  5.65  3.50  6.50  3.00  5.00  3.50
## [188]  2.00  3.50  4.00  1.50  4.19  2.56  2.02  4.00  1.44  2.00  5.00
## [199]  2.00  2.00  4.00  2.01  2.00  2.50  4.00  3.23  3.41  3.00  2.03
## [210]  2.23  2.00  5.16  9.00  2.50  6.50  1.10  3.00  1.50  1.44  3.09
## [221]  2.20  3.48  1.92  3.00  1.58  2.50  2.00  3.00  2.72  2.88  2.00
## [232]  3.00  3.39  1.47  3.00  1.25  1.00  1.17  4.67  5.92  2.00  2.00
## [243]  1.75  3.00
length(y)
## [1] 244

The design matrix \(X\) simply contains the columns of the explanatory variables, along with a column of ones for the intercept.

X <- cbind(1,tips[, c("total_bill", "size")])
head(X)
##   1 total_bill size
## 1 1      16.99    2
## 2 1      10.34    3
## 3 1      21.01    3
## 4 1      23.68    2
## 5 1      24.59    4
## 6 1      25.29    4
dim(X)
## [1] 244   3

So in general, each row of \(X\) represents one data point. Each column of \(X\) represents a variable in the model (or the intercept). The matrix multiplication \(X \beta\) will take each data point (row of \(X\)) and compute a linear combination of the variables using coefficients \(\beta_0,\ldots, \beta_p\).
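
For instance, here is a small sketch with a made-up coefficient vector (beta_demo is purely illustrative, not a fitted value), computing the linear combination for the first data point:

# hypothetical coefficients beta_0, beta_1, beta_2, purely for illustration
beta_demo <- c(0.5, 0.1, 0.2)
# row 1 of X times beta_demo: 0.5*1 + 0.1*16.99 + 0.2*2 = 2.599
drop(as.matrix(X)[1, ] %*% beta_demo)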

Now we can carry out the computations you have seen in lecture and on homework: solve the normal equation \(X^\top X \hat{\beta} = X^\top y\), etc.

X <- as.matrix(X)
beta <- solve(t(X) %*% X, t(X) %*% y)
beta
##                  [,1]
## 1          0.66894474
## total_bill 0.09271334
## size       0.19259779

Let’s just check that we get the same result as the Great lm().

lm(tip ~ total_bill + size, data=tips)
## 
## Call:
## lm(formula = tip ~ total_bill + size, data = tips)
## 
## Coefficients:
## (Intercept)   total_bill         size  
##     0.66894      0.09271      0.19260
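
As a quick sanity check (a small sketch; all.equal() compares up to floating-point tolerance), we can confirm the two computations agree:

fit <- lm(tip ~ total_bill + size, data=tips)
# should be TRUE: both methods solve the same normal equations
all.equal(as.numeric(beta), as.numeric(coef(fit)))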

Categorical variables

What about categorical variables? [Represented as factors in R.]

class(tips$tip)
## [1] "numeric"
class(tips$size)
## [1] "integer"
class(tips$sex)
## [1] "factor"

Let’s check what values the categorical variables take, called levels in R. I also refer to them as categories.

levels(tips$sex)
## [1] "Female" "Male"
levels(tips$smoker)
## [1] "No"  "Yes"
levels(tips$day)
## [1] "Fri"  "Sat"  "Sun"  "Thur"
levels(tips$time)
## [1] "Dinner" "Lunch"

How can we use these variables in a linear model? Let’s offer our full dataset (with categorical variables) to the Almighty lm() and accept its blessing blindly, in the hope that we may become enlightened.

lm(tip ~ ., data = tips) # the period tells lm() to use all other variables as explanatory variables
## 
## Call:
## lm(formula = tip ~ ., data = tips)
## 
## Coefficients:
## (Intercept)   total_bill      sexMale    smokerYes       daySat  
##     0.80382      0.09449     -0.03244     -0.08641     -0.12146  
##      daySun      dayThur    timeLunch         size  
##    -0.02548     -0.16226      0.06813      0.17599

What are all these things? Why is there a coefficient for each level of each categorical variable? Actually, on closer inspection, one level is missing for each categorical variable. Why?

Dichotomous factor (a categorical variable that has two categories)

[See Section 7.1 of Fox.]

Let’s try to form the design matrix naïvely.

X <- cbind(1, tips[,c("total_bill", "size", "sex")])
head(X)
##   1 total_bill size    sex
## 1 1      16.99    2 Female
## 2 1      10.34    3   Male
## 3 1      21.01    3   Male
## 4 1      23.68    2   Male
## 5 1      24.59    4 Female
## 6 1      25.29    4   Male

Unfortunately, despite what we learn in sex ed, we can’t actually multiply or add using sex.

Maybe we can replace Male with \(1\) and Female with \(0\). (Or vice versa.)

tips$sex
##   [1] Female Male   Male   Male   Female Male   Male   Male   Male   Male  
##  [11] Male   Female Male   Male   Female Male   Female Male   Female Male  
##  [21] Male   Female Female Male   Male   Male   Male   Male   Male   Female
##  [31] Male   Male   Female Female Male   Male   Male   Female Male   Male  
##  [41] Male   Male   Male   Male   Male   Male   Male   Male   Male   Male  
##  [51] Male   Female Female Male   Male   Male   Male   Female Male   Male  
##  [61] Male   Male   Male   Male   Male   Male   Female Female Male   Male  
##  [71] Male   Female Female Female Female Male   Male   Male   Male   Male  
##  [81] Male   Male   Female Male   Male   Female Male   Male   Male   Male  
##  [91] Male   Male   Female Female Female Male   Male   Male   Male   Male  
## [101] Female Female Female Female Female Male   Male   Male   Male   Female
## [111] Male   Female Male   Male   Female Female Male   Female Female Female
## [121] Male   Female Male   Male   Female Female Male   Female Female Male  
## [131] Male   Female Female Female Female Female Female Female Male   Female
## [141] Female Male   Male   Female Female Female Female Female Male   Male  
## [151] Male   Male   Male   Male   Male   Female Male   Female Female Male  
## [161] Male   Male   Female Male   Female Male   Male   Male   Female Female
## [171] Male   Male   Male   Male   Male   Male   Male   Male   Female Male  
## [181] Male   Male   Male   Male   Male   Male   Female Male   Female Male  
## [191] Male   Female Male   Male   Male   Male   Male   Female Female Male  
## [201] Male   Female Female Female Male   Female Male   Male   Male   Female
## [211] Male   Male   Male   Female Female Female Male   Male   Male   Female
## [221] Male   Female Male   Female Male   Female Female Male   Male   Female
## [231] Male   Male   Male   Male   Male   Male   Male   Male   Female Male  
## [241] Female Male   Male   Female
## Levels: Female Male
as.numeric(tips$sex) - 1
##   [1] 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 0 0 1
##  [36] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1
##  [71] 1 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0
## [106] 1 1 1 1 0 1 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0
## [141] 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1
## [176] 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0
## [211] 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0
as.numeric(tips$sex == "Male")
##   [1] 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 0 0 1
##  [36] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1
##  [71] 1 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0
## [106] 1 1 1 1 0 1 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0
## [141] 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1
## [176] 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0
## [211] 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0
sex <- as.numeric(tips$sex) - 1

Let’s make our design matrix again.

X <- cbind(1, tips[,c("total_bill", "size")], sex)
head(X)
##   1 total_bill size sex
## 1 1      16.99    2   0
## 2 1      10.34    3   1
## 3 1      21.01    3   1
## 4 1      23.68    2   1
## 5 1      24.59    4   0
## 6 1      25.29    4   1

Let’s do the linear regression manually and with lm().

X <- as.matrix(X)
beta <- solve(t(X) %*% X, t(X) %*% y)
beta
##                   [,1]
## 1           0.68187393
## total_bill  0.09292034
## size        0.19258767
## sex        -0.02641868
lm(tip ~ total_bill + size + sex, data=tips)
## 
## Call:
## lm(formula = tip ~ total_bill + size + sex, data = tips)
## 
## Coefficients:
## (Intercept)   total_bill         size      sexMale  
##     0.68187      0.09292      0.19259     -0.02642

Great, so this is what lm() is doing too.

Let us understand exactly what kind of model we are fitting when we encode sex in this way.

Our model looks like \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 D + \epsilon\] where \(X_1\) and \(X_2\) are total_bill and size, while \(D\) takes on values \(0\) and \(1\) for Female and Male respectively.

We can view this as two models, one for each gender. For Female, \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon,\] and for Male, \[Y = (\beta_0 + \beta_3) + \beta_1 X_1 + \beta_2 X_2 + \epsilon.\]

So, by encoding the genders as a \(0\)-\(1\) variable, we are saying that there is a constant difference of \(\beta_3\) between the average tips of the two genders, holding the other variables fixed. If you were to plot these functions (ignoring the noise), the graphs would be two parallel planes that differ only in their intercepts.
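
To see the parallelism concretely, here is a sketch that plots the two fitted lines in the total_bill direction, holding size fixed at its mean (fixing size at its mean is just a display choice):

b <- coef(lm(tip ~ total_bill + size + sex, data=tips))
plot(tips$total_bill, tips$tip, col=ifelse(tips$sex == "Male", "blue", "red"),
     xlab="total_bill", ylab="tip")
# Female line, then Male line: same slope, intercepts differ by b["sexMale"]
abline(a=b["(Intercept)"] + b["size"]*mean(tips$size), b=b["total_bill"], col="red")
abline(a=b["(Intercept)"] + b["sexMale"] + b["size"]*mean(tips$size), b=b["total_bill"], col="blue")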

For dichotomous factors, we can use other encodings (e.g., \(1\) and \(2\) instead of \(0\) and \(1\)) without serious repercussions, but the \(0\)-\(1\) coding is more natural and interpretable. However, we have to be careful when we move to polytomous factors.
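
As a quick illustration (a sketch), coding Female/Male as \(1\)/\(2\) instead:

sex12 <- as.numeric(tips$sex)  # Female = 1, Male = 2
lm(tip ~ total_bill + size + sex12, data=tips)
# the sex12 coefficient is the same -0.02642 as before; only the intercept
# shifts (by exactly +0.02642), since the new coding is the old one plus 1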

Polytomous factors (categorical variables that take more than two values)

[See Section 7.2 of Fox.]

Again let us naïvely construct our design matrix.

levels(tips$day)
## [1] "Fri"  "Sat"  "Sun"  "Thur"
X <- cbind(1, tips[,c("total_bill", "size", "day")])
X[70:100,]
##     1 total_bill size  day
## 70  1      15.01    2  Sat
## 71  1      12.02    2  Sat
## 72  1      17.07    3  Sat
## 73  1      26.86    2  Sat
## 74  1      25.28    2  Sat
## 75  1      14.73    2  Sat
## 76  1      10.51    2  Sat
## 77  1      17.92    2  Sat
## 78  1      27.20    4 Thur
## 79  1      22.76    2 Thur
## 80  1      17.29    2 Thur
## 81  1      19.44    2 Thur
## 82  1      16.66    2 Thur
## 83  1      10.07    1 Thur
## 84  1      32.68    2 Thur
## 85  1      15.98    2 Thur
## 86  1      34.83    4 Thur
## 87  1      13.03    2 Thur
## 88  1      18.28    2 Thur
## 89  1      24.71    2 Thur
## 90  1      21.16    2 Thur
## 91  1      28.97    2  Fri
## 92  1      22.49    2  Fri
## 93  1       5.75    2  Fri
## 94  1      16.32    2  Fri
## 95  1      22.75    2  Fri
## 96  1      40.17    4  Fri
## 97  1      27.28    2  Fri
## 98  1      12.03    2  Fri
## 99  1      21.01    2  Fri
## 100 1      12.46    2  Fri

For dichotomous factors, we simply replaced the categories with the numbers \(0\) and \(1\). Here, we have four levels. What if we just use \(0\), \(1\), \(2\), and \(3\)?

We need to consider what that model would be. \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 D + \epsilon.\] Then the models for the four days of the week are \[\begin{align} Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\ Y &= (\beta_0 + \beta_3) + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\ Y &= (\beta_0 + 2\beta_3) + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\ Y &= (\beta_0 + 3 \beta_3) + \beta_1 X_1 + \beta_2 X_2 + \epsilon \end{align}\]

We have thus imposed an ordering (with equal spacing) on the categories, which is undesirable for a nominal variable.
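
Here is a sketch of what the naïve \(0\)-\(3\) coding actually fits; it estimates a single coefficient for day, forcing the four intercepts to be equally spaced:

dayNum <- as.numeric(tips$day) - 1  # Fri = 0, Sat = 1, Sun = 2, Thur = 3
lm(tip ~ total_bill + size + dayNum, data=tips)
# one coefficient must now capture all the Sat/Sun/Thur vs. Fri differences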

For dummy coding, we will view each category as a binary vector of length \(4-1=3\). Fri will be represented by \((0,0,0)\), Sat by \((1,0,0)\), Sun by \((0,1,0)\), and Thur by \((0,0,1)\). In terms of the design matrix, this amounts to replacing the day column of our naïve design matrix with three columns, with entries according to the above coding scheme.

Let us try to justify this coding scheme. Having added new binary variables \(D_3, D_4, D_5\), our model is \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 D_3 + \beta_4 D_4 + \beta_5 D_5 + \epsilon,\] and the four models for Fri, Sat, Sun, and Thur respectively are \[\begin{align} Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\ Y &= (\beta_0 + \beta_3) + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\ Y &= (\beta_0 + \beta_4) + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\ Y &= (\beta_0 + \beta_5) + \beta_1 X_1 + \beta_2 X_2 + \epsilon \end{align}\]

Thus \(\beta_3\) represents the constant difference between an average tip on Sat and one on Fri, holding all other variables fixed; \(\beta_4\) and \(\beta_5\) represent constant differences relative to Fri as well. Geometrically, the graphs of these four models are parallel planes with different intercepts.

Note that the choice of which level gets encoded by the zero vector is arbitrary, but it establishes which level of the categorical variable is the “default” (baseline) against which the other levels are compared; permuting the encoding will change the meaning/interpretation of the least squares coefficients.
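
For instance (a sketch using relevel() from base R), we can make Sun the baseline; the fitted values are identical, only the parametrization changes:

day2 <- relevel(tips$day, ref="Sun")
lm(tip ~ total_bill + size + day2, data=tips)
# the day2Fri, day2Sat, day2Thur coefficients are now differences relative to Sun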

Let us make our design matrix now. [Code adapted from Michael Tong’s note on Piazza.]

X <- cbind(1, tips[,c("total_bill", "size")])

lev <- levels(tips$day)
num <- length(lev)
names <- colnames(X)

# add one 0-1 indicator column for each level except the baseline (lev[1] = "Fri")
for (i in 2:num) {
  X <- cbind(X, tips$day == lev[i])
}
colnames(X) <- c(names, lev[2:num])
X <- as.matrix(X)
X[70:100,]
##     1 total_bill size Sat Sun Thur
## 70  1      15.01    2   1   0    0
## 71  1      12.02    2   1   0    0
## 72  1      17.07    3   1   0    0
## 73  1      26.86    2   1   0    0
## 74  1      25.28    2   1   0    0
## 75  1      14.73    2   1   0    0
## 76  1      10.51    2   1   0    0
## 77  1      17.92    2   1   0    0
## 78  1      27.20    4   0   0    1
## 79  1      22.76    2   0   0    1
## 80  1      17.29    2   0   0    1
## 81  1      19.44    2   0   0    1
## 82  1      16.66    2   0   0    1
## 83  1      10.07    1   0   0    1
## 84  1      32.68    2   0   0    1
## 85  1      15.98    2   0   0    1
## 86  1      34.83    4   0   0    1
## 87  1      13.03    2   0   0    1
## 88  1      18.28    2   0   0    1
## 89  1      24.71    2   0   0    1
## 90  1      21.16    2   0   0    1
## 91  1      28.97    2   0   0    0
## 92  1      22.49    2   0   0    0
## 93  1       5.75    2   0   0    0
## 94  1      16.32    2   0   0    0
## 95  1      22.75    2   0   0    0
## 96  1      40.17    4   0   0    0
## 97  1      27.28    2   0   0    0
## 98  1      12.03    2   0   0    0
## 99  1      21.01    2   0   0    0
## 100 1      12.46    2   0   0    0

Let’s check that it matches what lm() does.

beta <- solve(t(X) %*% X, t(X) %*% y)
beta
##                   [,1]
## 1           0.74578683
## total_bill  0.09299361
## size        0.18713231
## Sat        -0.12465824
## Sun        -0.01349818
## Thur       -0.07749322
lm(tip ~ total_bill + size + day, data=tips)
## 
## Call:
## lm(formula = tip ~ total_bill + size + day, data = tips)
## 
## Coefficients:
## (Intercept)   total_bill         size       daySat       daySun  
##     0.74579      0.09299      0.18713     -0.12466     -0.01350  
##     dayThur  
##    -0.07749

Why did we not use the “one-hot” encoding of \((1,0,0,0)\), \((0,1,0,0)\), \((0,0,1,0)\), and \((0,0,0,1)\) (thus introducing four new variables to represent day)? If you go through and write down the model for each day, you will find intercepts \(\beta_0 + \beta_3\), \(\beta_0 + \beta_4\), \(\beta_0 + \beta_5\), and \(\beta_0 + \beta_6\). Here we have \(5\) coefficients trying to specify \(4\) intercepts, so there are multiple ways to represent any given model, i.e., the least squares solution will not be unique.

Another way to understand the issue with this encoding is to form the design matrix \(X\) with this encoding, and note that the four day columns will sum to the column of ones. Thus \(X\) is not full rank, so the least squares solution is not unique.
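
Here is a small check of that rank deficiency (a sketch):

# build the full one-hot day columns, one per level
D <- sapply(levels(tips$day), function(l) as.numeric(tips$day == l))
X_onehot <- cbind(1, as.matrix(tips[, c("total_bill", "size")]), D)
all(rowSums(D) == 1)  # TRUE: the four day columns sum to the column of ones
qr(X_onehot)$rank     # 6, one less than ncol(X_onehot) = 7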

Multiple categorical variables

If you have multiple categorical variables, just apply the dummy encoding to each variable. This amounts to replacing each categorical variable’s column of the data table with \(k-1\) columns containing the dummy encoding, where \(k\) is the number of levels the variable has.

X <- model.matrix(tip ~ . , data=tips) # probably can't use this on homework
head(X)
##   (Intercept) total_bill sexMale smokerYes daySat daySun dayThur timeLunch
## 1           1      16.99       0         0      0      1       0         0
## 2           1      10.34       1         0      0      1       0         0
## 3           1      21.01       1         0      0      1       0         0
## 4           1      23.68       1         0      0      1       0         0
## 5           1      24.59       0         0      0      1       0         0
## 6           1      25.29       1         0      0      1       0         0
##   size
## 1    2
## 2    3
## 3    3
## 4    2
## 5    4
## 6    4
beta <- solve(t(X) %*% X, t(X) %*% y)
beta
##                    [,1]
## (Intercept)  0.80381728
## total_bill   0.09448701
## sexMale     -0.03244094
## smokerYes   -0.08640832
## daySat      -0.12145838
## daySun      -0.02548066
## dayThur     -0.16225920
## timeLunch    0.06812860
## size         0.17599200
lm(tip ~ ., data=tips)
## 
## Call:
## lm(formula = tip ~ ., data = tips)
## 
## Coefficients:
## (Intercept)   total_bill      sexMale    smokerYes       daySat  
##     0.80382      0.09449     -0.03244     -0.08641     -0.12146  
##      daySun      dayThur    timeLunch         size  
##    -0.02548     -0.16226      0.06813      0.17599

Multicollinearity/collinearity

A few of you were worried that if you have several categorical variables, then many columns will consist of \(0\)s and \(1\)s, which would make it quite possible for columns to be the same (or more generally, to be linearly dependent).

If this did happen, think about what it means. For example, maybe all male bill payers in the dataset ate on Saturday, and no female bill payers ate on Saturday. Then, from the perspective of the data, the Male dummy variable is exactly the same as the Saturday dummy variable. Thus the least squares estimate is not unique (similar to Question 1 on Homework 1).
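
We can simulate this failure directly (a sketch with a deliberately duplicated dummy):

dup <- as.numeric(tips$sex == "Male")  # identical to the sexMale dummy
lm(tip ~ total_bill + sex + dup, data=tips)
# lm() reports NA for the aliased coefficient dup; our manual
# solve(t(X) %*% X, t(X) %*% y) would fail, since t(X) %*% X is singular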

  • If the two variables are truly the same (or more generally, if some variable is a linear combination of other variables), you should just drop one variable.
  • Typically that is not the case, and the variables are indeed different, but you just do not have enough data to discriminate between them. Getting more data, if possible, is the preferred remedy.

Note that the issues and suggested remedies above are not intrinsic to dummy variables or categorical variables. The same can be said for quantitative data that have perfect collinearity or near collinearity.

See the Wikipedia page (in particular the “consequences” and “remedies” sections).

What if I don’t want “parallel” models?

Dummy coding essentially only amounts to changing the intercept term for each level of the categorical variable. If we want to have different slopes for each category, we will need interaction terms between the dummy variables and the other variables. See Section 7.3 in the textbook for more detail.
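
In R’s formula syntax, such interactions are written with * (a sketch):

# total_bill * sex expands to total_bill + sex + total_bill:sex,
# giving each sex its own slope in total_bill as well as its own intercept
lm(tip ~ total_bill * sex + size, data=tips)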

Categorical variables: nominal or ordinal?

Above, the categorical variables were nominal, i.e., they had no intrinsic order. An ordinal variable has a natural ordering (e.g., “unsatisfied”, “indifferent”, “satisfied”), and it may be good to incorporate this ordering into the regression. Sometimes things like “year” or “age” are encoded as factors in a dataset despite obviously being numerical, and it may make a lot of sense to just treat them as numbers. To convert factors to numbers, you can try as.numeric(), but you should be careful to check what it is doing.

For instance,

year <- factor(c("1984", "2017", "1984"))
as.numeric(year)
## [1] 1 2 1
as.numeric(as.character(year))
## [1] 1984 2017 1984

Sometimes you may want to do the opposite: convert a quantitative variable into a categorical variable. You would need a good justification for doing so. For the sake of demonstration, we convert the size variable into a factor and then apply lm().

as.factor(tips$size)
##   [1] 2 3 3 2 4 4 2 4 2 2 2 4 2 4 2 2 3 3 3 3 2 2 2 4 2 4 2 2 2 2 2 4 2 4 2
##  [36] 3 3 3 3 3 3 2 2 2 4 2 2 4 3 2 2 2 4 2 4 2 4 2 2 4 2 2 2 4 3 3 2 1 2 2
##  [71] 2 3 2 2 2 2 2 4 2 2 2 2 1 2 2 4 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2 2 3 2 2
## [106] 2 2 2 2 2 2 1 3 2 3 2 4 2 2 4 2 2 2 2 2 6 2 2 2 3 2 2 2 2 2 2 2 2 2 2
## [141] 2 6 5 6 2 2 3 2 2 2 2 2 3 4 4 5 6 4 2 4 4 2 3 2 2 3 2 4 2 2 3 2 2 2 2
## [176] 2 2 2 2 2 4 2 3 4 2 5 3 5 3 3 2 2 2 2 2 2 2 4 2 2 3 2 2 2 4 3 3 4 2 2
## [211] 3 4 4 2 3 2 5 2 2 4 2 2 1 3 2 2 2 4 2 2 4 3 2 2 2 2 2 2 3 3 2 2 2 2
## Levels: 1 2 3 4 5 6
tips$size <- as.factor(tips$size)
lm(tip ~ ., data=tips)
## 
## Call:
## lm(formula = tip ~ ., data = tips)
## 
## Coefficients:
## (Intercept)   total_bill      sexMale    smokerYes       daySat  
##     0.86318      0.09387     -0.03250     -0.07710     -0.11261  
##      daySun      dayThur    timeLunch        size2        size3  
##    -0.01248     -0.17351      0.08158      0.29568      0.45547  
##       size4        size5        size6  
##     0.69172      0.44743      1.18049