Sometimes we want to include a categorical explanatory variable in the model
Example: what is the relationship between income and education level allowing for gender (male or female)?
To find this relationship, we fit a regression model that includes indicator variables representing the categories of the categorical explanatory variable
Indicator variables are also called dummy variables and take values 0 or 1
Example: one dichotomous and one quantitative explanatory variable.
Including a categorical variable with M categories in the model:
For example, consider a categorical variable with categories 1, 2, \(\ldots\), M. We consider the reference category as 1, and create the following dummy variables:
\(D_2\) = 1 if category 2 and 0 otherwise
\(D_3\) = 1 if category 3 and 0 otherwise
…
\(D_M\) = 1 if category M and 0 otherwise
\[Y = \alpha + \beta_2 D_2 + \beta_3 D_3 + \ldots + \beta_M D_M + \epsilon\]
\(\beta_j\) is the difference in the mean value of \(Y\) between category \(j\) and the reference category.
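As a minimal sketch of this coding in R, assume a hypothetical factor `group` with M = 4 categories and a simulated response `y` (both invented for illustration). `model.matrix()` shows the dummy variables R builds from a factor under its default treatment coding, where the first level is the reference category.

```r
# Hypothetical categorical variable with M = 4 categories (illustration only)
group <- factor(c("1", "2", "3", "4", "2", "3", "1", "4"),
                levels = c("1", "2", "3", "4"))

# Default treatment coding: an intercept plus M - 1 dummy columns,
# with the first level ("1") as the reference category
X <- model.matrix(~ group)
print(X)

# Simulated response, only to illustrate the meaning of the coefficients
set.seed(1)
y <- rnorm(length(group), mean = as.integer(group))
fit <- lm(y ~ group)
means <- tapply(y, group, mean)

# The coefficient of D_2 equals the mean of category 2 minus the mean of category 1
diff_2_vs_ref <- unname(means["2"] - means["1"])
print(c(coef = unname(coef(fit)["group2"]), mean_diff = diff_2_vs_ref))
```

With only the categorical variable in the model, each dummy coefficient reproduces the corresponding difference in sample means exactly.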
We want to understand the relationship between income (\(Y\)) and gender (dichotomous variable) and education (\(X\), quantitative explanatory variable).
Gender is a qualitative explanatory variable, with categories male and female. We represent the qualitative explanatory variable with a dummy or indicator variable \(D\).
Consider Women as reference and create a dummy variable for Men.
Gender | \(D\) |
Men | 1 |
Women | 0 |
We assume an additive model (the partial effect of each explanatory variable is the same regardless of the specific value at which the other explanatory variable is held constant). One way of formulating the common slope model is
\[Y_i = \alpha + \beta X_i + \gamma D_i + \epsilon_i\]
For women, \(Y_i = \alpha + \beta X_i + \gamma (0) + \epsilon_i = \alpha + \beta X_i + \epsilon_i\)
For men, \(Y_i = \alpha + \beta X_i + \gamma (1) + \epsilon_i = (\alpha + \gamma) + \beta X_i + \epsilon_i\)
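The two parallel lines can be recovered from a fitted model. The sketch below uses simulated data, so the variable names (`income`, `education`, `gender`) and the true parameter values are assumptions made for illustration, not real data.

```r
set.seed(123)
n <- 200
education <- runif(n, 8, 20)                   # simulated years of education
gender <- factor(sample(c("Women", "Men"), n, replace = TRUE),
                 levels = c("Women", "Men"))   # Women is the reference level
D <- as.numeric(gender == "Men")               # dummy variable: 1 = Men, 0 = Women
income <- 10 + 2 * education + 5 * D + rnorm(n, sd = 3)

# Additive (common slope) model: Y = alpha + beta X + gamma D + error
fit <- lm(income ~ education + gender)
b <- coef(fit)

intercept_women <- unname(b["(Intercept)"])                   # estimate of alpha
intercept_men   <- unname(b["(Intercept)"] + b["genderMen"])  # estimate of alpha + gamma
common_slope    <- unname(b["education"])                     # estimate of beta

print(c(intercept_women = intercept_women,
        intercept_men = intercept_men,
        common_slope = common_slope))
```

The coefficient `genderMen` estimates \(\gamma\), the constant vertical distance between the two lines.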
We want to understand the relationship between prestige (\(Y\)) and categorical variable occupation and continuous variables income (\(X_1\)) and education (\(X_2\)).
Occupations are classified into three categories: professional and managerial, white-collar and blue-collar. The three-category classification can be represented in the regression equation by introducing two dummy variables.
Consider Blue Collar as reference and create 2 dummy variables, one for Professional & Managerial and one for White Collar.
Category | \(D_1\) | \(D_2\) |
Professional & Managerial | 1 | 0 |
White Collar | 0 | 1 |
Blue Collar | 0 | 0 |
\[Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \gamma_1 D_{i1} + \gamma_2 D_{i2} + \epsilon_i\]
Professional: \(Y_i = (\alpha + \gamma_1) + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i\)
White-collar: \(Y_i = (\alpha + \gamma_2) + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i\)
Blue-collar: \(Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i\)
This model describes three parallel regression planes which can differ in their intercepts
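A sketch of how the reference category could be set in R, assuming simulated data and hypothetical labels `"Professional"`, `"WhiteCollar"` and `"BlueCollar"`. `relevel()` makes Blue Collar the reference, so `lm()` creates dummy variables for the other two categories.

```r
set.seed(42)
n <- 150
type <- factor(sample(c("Professional", "WhiteCollar", "BlueCollar"),
                      n, replace = TRUE))
income    <- runif(n, 1, 10)    # simulated X1
education <- runif(n, 6, 16)    # simulated X2
prestige  <- 5 + 1.5 * income + 2 * education +
  8 * (type == "Professional") + 3 * (type == "WhiteCollar") +
  rnorm(n, sd = 2)

# Make Blue Collar the reference category explicitly
type <- relevel(type, ref = "BlueCollar")

fit <- lm(prestige ~ income + education + type)
b <- coef(fit)
print(b)

# Intercepts of the three parallel regression planes
intercepts <- c(BlueCollar   = unname(b["(Intercept)"]),
                Professional = unname(b["(Intercept)"] + b["typeProfessional"]),
                WhiteCollar  = unname(b["(Intercept)"] + b["typeWhiteCollar"]))
print(intercepts)
```

Setting the reference explicitly avoids relying on the alphabetical ordering of factor levels.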
Two explanatory variables interact in determining a response variable when the partial effect of one depends on the value of the other.
In (a) gender and education are independent, since women and men have identical education distributions.
In (b) gender and education are related, since women, on average, have higher levels of education than men.
In both (a) and (b), the within-gender regressions of income on education are not parallel - the slope for men is larger than the slope for women.
Because the effect of education varies by gender, education and gender interact in affecting income.
Also, the effect of gender varies by education. Because the regressions are not parallel, the relative income advantage of men changes with education.
Interaction is a symmetric concept - the effect of education varies by gender, and the effect of gender varies by education.
Interaction refers to the manner in which explanatory variables combine to affect a response variable, not to the relationship between the explanatory variables themselves.
The following model accommodates different intercepts and slopes for women and men. Along with the dummy regressor \(D\) for gender and the quantitative regressor \(X\) for education, the interaction regressor \(XD\) is introduced.
\[Y_i = \alpha + \beta X_i + \gamma D_i + \delta (X_i D_i) + \epsilon_i\]
For women, \(Y_i = \alpha + \beta X_i + \gamma (0) + \delta (X_i \times 0) + \epsilon_i = \alpha + \beta X_i + \epsilon_i\)
For men, \(Y_i = \alpha + \beta X_i + \gamma (1) + \delta (X_i \times 1) + \epsilon_i = (\alpha + \gamma) + (\beta + \delta) X_i + \epsilon_i\)
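The interaction model can be fitted in R with the `*` operator in the formula, which expands to both main effects plus their product. The sketch below uses simulated data with assumed parameter values, and shows how the separate intercepts and slopes for women and men follow from the estimated coefficients.

```r
set.seed(7)
n <- 300
education <- runif(n, 8, 20)     # simulated X
D <- rbinom(n, 1, 0.5)           # dummy variable: 1 = men, 0 = women
income <- 10 + 1.5 * education + 2 * D + 1 * education * D + rnorm(n, sd = 3)

# education * D expands to education + D + education:D
fit <- lm(income ~ education * D)
b <- coef(fit)

intercept_women <- unname(b["(Intercept)"])                     # alpha
slope_women     <- unname(b["education"])                       # beta
intercept_men   <- unname(b["(Intercept)"] + b["D"])            # alpha + gamma
slope_men       <- unname(b["education"] + b["education:D"])    # beta + delta

print(rbind(women = c(intercept = intercept_women, slope = slope_women),
            men   = c(intercept = intercept_men,   slope = slope_men)))
```

Testing whether the coefficient of `education:D` is zero is a test of whether the two regression lines are parallel.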
Categorical variables: book Statistical Learning with R, Chapter 3.3.1
Interactions: book Statistical Learning with R, Chapter 3.3.2
In a large organization, we obtained data on yearly income for the employees. Comparing sample means, the mean income of male employees appears to be larger than the mean income of female employees. There is concern about this, and since you are the organization’s statistical expert, you are asked to check whether there is a significant difference in income between genders. How would you proceed using linear regression? Write the model and explain the procedure you would follow.
Consider the model for the response variable “income” with independent variables “gender” and “experience”, the latter categorised as “junior”, “intermediate” and “senior”. Write the additive model by considering the main effects of the covariates, and by choosing the baseline categories “male” for “gender” and “junior” for “experience”. Then write the expected income of a junior male employee. Finally, write the model for the expected income of a senior female employee.
We fit the model having covariates gender and experience. Then we obtained from R the summary of this fit and noticed that the estimated coefficients for the “intermediate” and “senior” experience levels were positive and significant at the chosen \(\alpha\) level, and that all the remaining coefficients were non-significant. You have to write a report to your boss and explain carefully what this implies. What would you report?
Gender is a categorical variable. Here I set baseline gender = male. We have the linear model \[E(income | gender) = \beta_0 + \beta_1 \times X_{female},\] with dummy variable \(X_{female} = 1\) for female subjects and 0 otherwise, \(\beta_0\) the expected income for a male employee, and \(\beta_0+\beta_1\) the expected income for a female employee.
Therefore \(\beta_1\) represents the difference in expected income between genders. It is therefore sufficient to test the hypothesis \(H_0 : \beta_1 = 0\) against the alternative hypothesis \(H_1 : \beta_1 \neq 0\). We can use a t-test for this task at some significance level \(\alpha\). If we reject \(H_0\), then we conclude that there is a significant difference (at the specified level \(\alpha\)). Notice we can also test \(H_0\) against the one-sided alternative \(H_1 : \beta_1 < 0\) if it can be assumed a priori that female employees do not earn more than male employees.
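A sketch of this procedure in R on simulated data (the sample size, income scale and the variable name `female` are assumptions for illustration). The t-test for \(\beta_1\) reported by `summary()` gives the same p-value as the classical two-sample t-test with equal variances.

```r
set.seed(2024)
n <- 100
female <- rbinom(n, 1, 0.5)            # dummy variable: 1 = female, 0 = male
income <- 50 + rnorm(n, sd = 10)       # simulated incomes with no true gender difference

fit <- lm(income ~ female)
coefs <- summary(fit)$coefficients
print(coefs)

# Two-sided p-value for H0: beta_1 = 0
p_reg <- coefs["female", "Pr(>|t|)"]

# Equivalent two-sample t-test assuming equal variances
tt <- t.test(income ~ female, var.equal = TRUE)
p_tt <- tt$p.value

alpha_level <- 0.05                    # chosen significance level
reject_H0 <- p_reg < alpha_level
print(c(p_regression = p_reg, p_ttest = p_tt))
```

The regression formulation has the advantage that further covariates, such as experience, can be added to the same model.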
Here I choose the following baselines: baseline gender = male and baseline experience = junior. The model with main effects only is \[E(income|gender, experience) = \beta_0 + \beta_1 \times X_{female} + \beta_2 \times X_{intermediate} + \beta_3 \times X_{senior}.\]
We have that \(\beta_0\) is the expected income for a junior male (all introduced dummy variables are zero in this case). \(\beta_0 + \beta_1 + \beta_3\) is the expected income for a senior female.
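These two expected incomes can be obtained either by adding the relevant coefficients or with `predict()`. The sketch below uses simulated data with assumed effect sizes; the factor levels are ordered so that “male” and “junior” are the baselines.

```r
set.seed(11)
n <- 240
gender <- factor(sample(c("male", "female"), n, replace = TRUE),
                 levels = c("male", "female"))
experience <- factor(sample(c("junior", "intermediate", "senior"), n, replace = TRUE),
                     levels = c("junior", "intermediate", "senior"))
income <- 30 + 10 * (experience == "intermediate") +
  25 * (experience == "senior") + rnorm(n, sd = 5)

fit <- lm(income ~ gender + experience)
b <- coef(fit)

# Expected income from the coefficients
junior_male   <- unname(b["(Intercept)"])                        # beta_0
senior_female <- unname(b["(Intercept)"] + b["genderfemale"] +
                          b["experiencesenior"])                 # beta_0 + beta_1 + beta_3

# The same values from predict()
newdata <- data.frame(
  gender     = factor(c("male", "female"), levels = levels(gender)),
  experience = factor(c("junior", "senior"), levels = levels(experience))
)
pred <- predict(fit, newdata)
print(cbind(by_hand = c(junior_male, senior_female), predict = pred))
```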
This means that, for a given gender, employees with intermediate experience and senior experience earn on average more than junior employees. Also, for a given experience level, there is no evidence of a difference in income between genders.
Data from a breast cancer case study collected in Germany from 1984 to 1989. There are 686 observations, including the variables tumor size (size, measured in mm), age, menopausal status (meno), number of positive lymph nodes (nodes), progesterone receptor (pgr) and oestrogen (er).
Are the variables age, menopausal status, number of positive lymph nodes, progesterone receptor and oestrogen useful to explain tumor size?
d <- read.csv("data/GermanBreastCancer.csv")
head(d[, c("size", "age", "meno", "nodes", "pgr", "er")])
Menopausal status is a categorical variable with two categories: premenopausal and postmenopausal.
We choose one category as reference and create an indicator variable for the other category. For example, we can set postmenopausal as the reference category and create an indicator variable called pre for category premenopausal
pre=1 if menopausal status is premenopausal, and
pre=0 if menopausal status is postmenopausal
d$pre <- as.numeric(as.factor(d$meno)) - 1
head(d[, c("meno", "pre")])
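The conversion `as.numeric(as.factor(...)) - 1` relies on the alphabetical ordering of the factor levels. A more explicit alternative is sketched below on a small hypothetical vector, assuming the labels are `"Postmenopausal"` and `"Premenopausal"`; the actual labels in the data should be checked first, for example with `unique(d$meno)`.

```r
# Hypothetical values of menopausal status (labels are an assumption)
meno <- c("Premenopausal", "Postmenopausal", "Postmenopausal", "Premenopausal")

# Implicit coding: levels are sorted alphabetically, so
# "Postmenopausal" -> 0 and "Premenopausal" -> 1
pre_implicit <- as.numeric(as.factor(meno)) - 1

# Explicit coding: 1 if premenopausal, 0 otherwise
pre_explicit <- as.numeric(meno == "Premenopausal")

# Cross-tabulate to verify the coding
print(table(meno, pre_explicit))
```

The explicit comparison states the coding in the code itself and does not change if the labels are renamed or reordered.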
res <- lm(size ~ age + nodes + pgr + er + pre, data = d)
summary(res)
##
## Call:
## lm(formula = size ~ age + nodes + pgr + er + pre, data = d)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -39.742  -8.684  -2.391   5.745  84.350
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.429281   4.860937   5.643 2.46e-08 ***
## age         -0.040185   0.081692  -0.492    0.623
## nodes        0.855936   0.094620   9.046  < 2e-16 ***
## pgr          0.001734   0.002792   0.621    0.535
## er          -0.006082   0.003870  -1.571    0.117
## pre          0.327211   1.642781   0.199    0.842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.51 on 680 degrees of freedom
## Multiple R-squared: 0.1138, Adjusted R-squared: 0.1073
## F-statistic: 17.47 on 5 and 680 DF, p-value: 2.691e-16
\(\widehat{size}\) = 27.43 - 0.0402 age + 0.8559 nodes + 0.00173 pgr - 0.00608 er + 0.33 pre
If meno is postmenopausal (pre = 0), then \(0.33 \times pre = 0.33 \times 0 = 0\) and the intercept is \(27.43 + 0 = 27.43\):
\(\widehat{size}\) = 27.43 - 0.0402 age + 0.8559 nodes + 0.00173 pgr - 0.00608 er
If meno is premenopausal (pre = 1), then \(0.33 \times pre = 0.33 \times 1 = 0.33\) and the intercept is \(27.43 + 0.33 = 27.76\):
\(\widehat{size}\) = 27.76 - 0.0402 age + 0.8559 nodes + 0.00173 pgr - 0.00608 er
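The arithmetic can be checked in R from the estimates printed in the summary. The coefficient values below are copied from that output; the covariate values of the example patient are hypothetical and chosen only for illustration.

```r
# Estimated coefficients copied from the summary output
b <- c("(Intercept)" = 27.429281, age = -0.040185, nodes = 0.855936,
       pgr = 0.001734, er = -0.006082, pre = 0.327211)

# Intercepts of the two fitted equations
intercept_post <- unname(b["(Intercept)"])             # pre = 0
intercept_pre  <- unname(b["(Intercept)"] + b["pre"])  # pre = 1
print(round(c(postmenopausal = intercept_post, premenopausal = intercept_pre), 2))

# Hypothetical patient (values are assumptions)
x <- c(age = 50, nodes = 3, pgr = 100, er = 100)

# Fitted sizes for the same covariates under each menopausal status
size_post <- intercept_post + sum(b[names(x)] * x)
size_pre  <- intercept_pre  + sum(b[names(x)] * x)

# The difference is the coefficient of pre, whatever the covariate values
print(size_pre - size_post)
```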
27.43 is the estimated mean size when age=nodes=pgr=er=pre=0
0.33 is the estimated difference in the mean size between pre=1 and pre=0, given that the rest of the variables remain constant. p-value=0.842, there is no evidence that pre affects the mean size allowing for the other predictors
mean size decreases by 0.0402 mm for every unit increase in age, given that the rest of the variables remain constant. p-value=0.623, there is no evidence that age affects the mean size allowing for the other predictors
mean size increases by 0.8559 mm for every unit increase in nodes, given that the rest of the variables remain constant. p-value < 2e-16, there is strong evidence that nodes affects the mean size allowing for the other predictors
mean size increases by 0.00173 mm for every unit increase in pgr, given that the rest of the variables remain constant. p-value=0.535, there is no evidence that pgr affects the mean size allowing for the other predictors
mean size decreases by 0.00608 mm for every unit increase in er, given that the rest of the variables remain constant. p-value=0.117, there is no evidence that er affects the mean size allowing for the other predictors
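The t values and p-values behind these statements can be reproduced from the estimates and standard errors in the summary output, using the 680 residual degrees of freedom. The sketch below copies those numbers by hand and also computes 95% confidence intervals. With the fitted model object available, `confint(res)` would give the intervals directly.

```r
# Estimates and standard errors copied from the summary output
est <- c(age = -0.040185, nodes = 0.855936, pgr = 0.001734,
         er = -0.006082, pre = 0.327211)
se  <- c(age = 0.081692, nodes = 0.094620, pgr = 0.002792,
         er = 0.003870, pre = 1.642781)
df  <- 680                                  # residual degrees of freedom

t_value <- est / se                         # t statistic for H0: coefficient = 0
p_value <- 2 * pt(-abs(t_value), df)        # two-sided p-value

# 95% confidence intervals: estimate +/- t quantile * standard error
q  <- qt(0.975, df)
ci <- cbind(lower = est - q * se, upper = est + q * se)

print(round(cbind(t_value, p_value, ci), 4))

# Coefficients whose interval excludes zero
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0
print(excludes_zero)
```

Up to rounding, the t values agree with the summary output, and only the interval for nodes excludes zero, consistent with the p-values.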