1 Categorical variables and interactions

Learning objectives

  • Categorical variables
  • Interactions
  • Examples

2 Categorical variables

Sometimes we want to include a categorical explanatory variable in the model

Example: what is the relationship between income and education level allowing for gender (male or female)?

To find such a relationship, we fit a regression model that includes indicator variables representing the categories of the categorical explanatory variable

Indicator variables are also called dummy variables and take values 0 or 1

Example: one dichotomous and one quantitative explanatory variable.

  • The figure shows two examples, (a) and (b), of the relationship between education and income among women and men.
  • In both cases, the within-gender regressions of income on education are parallel. Parallel regressions imply additive effects of education and gender on income.
  • In (a), gender and education are unrelated to each other: If we ignore gender and regress income on education alone, we obtain the same slope as is produced by the separate within-gender regressions; ignoring gender inflates the size of the errors, however.
  • In (b) gender and education are related (women have higher average education than men), and therefore if we regress income on education alone, we arrive at a biased assessment of the effect of education on income. The overall regression of income on education has a negative slope even though the within-gender regressions have positive slopes.
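Scenario (b) can be mimicked with a small noise-free toy dataset. The numbers below are invented for illustration and are not the data behind the figure; the variable names `education`, `male` and `income` are assumptions of this sketch.

```r
# Noise-free toy data (hypothetical values).
# Women have more education on average; at any given education level
# men earn 20 units more than women.
education <- c(12:16, 8:12)
male      <- rep(c(0, 1), each = 5)   # 0 = woman, 1 = man
income    <- 10 + 2 * education + 20 * male

# Regression that ignores gender: the education slope comes out negative
fit_marginal <- lm(income ~ education)
coef(fit_marginal)

# Regression that allows for gender: the common within-gender slope (2)
# and the gender difference (20) are recovered
fit_adjusted <- lm(income ~ education + male)
coef(fit_adjusted)
```

Because gender is related to both education and income in this toy example, leaving it out changes even the sign of the education slope.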

2.1 Including a categorical variable in the model

Including a categorical variable with M categories in the model:

  1. Choose a category as reference category
  2. Create M-1 dummy variables
  3. Include the dummy variables in the model
  4. The coefficients of the dummy variables indicate the increase or decrease in the mean value of \(Y\) with respect to the reference category.

For example, consider a categorical variable with categories 1, 2, \(\ldots\), M. We take category 1 as the reference category and create the following dummy variables:

\(D_2\) = 1 if category 2 and 0 otherwise
\(D_3\) = 1 if category 3 and 0 otherwise
\(\vdots\)
\(D_M\) = 1 if category M and 0 otherwise

\[Y = \alpha + \beta_2 D_2 + \beta_3 D_3 + \ldots + \beta_M D_M + \epsilon\]

\(\beta_j\) is the difference in the mean value of \(Y\) between category \(j\) and the reference category.
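In R, the dummy coding is done automatically when a factor enters a model formula. The sketch below uses a made-up factor `x` with M = 4 categories to show the M - 1 dummy columns and how the reference category might be changed with `relevel()`.

```r
# A categorical variable with M = 4 categories (hypothetical labels)
x <- factor(c("1", "2", "3", "4", "2", "1"))

# R uses the first level as the reference category by default
levels(x)

# Design matrix: an intercept column plus M - 1 = 3 dummy columns
X <- model.matrix(~ x)
X

# Choosing a different reference category, here category 3
x_ref3 <- relevel(x, ref = "3")
colnames(model.matrix(~ x_ref3))
```

Each dummy column is 1 for observations in its category and 0 otherwise; observations in the reference category have 0 in every dummy column.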

2.2 Categorical variables with 2 categories

We want to understand the relationship between income (\(Y\)) and gender (dichotomous variable) and education (\(X\), quantitative explanatory variable).

Gender is a qualitative explanatory variable, with categories male and female. We represent the qualitative explanatory variable with a dummy or indicator variable \(D\).

Consider Women as reference and create a dummy variable for Men.

Gender   \(D\)
Men      1
Women    0

We assume an additive model (the partial effect of each explanatory variable is the same regardless of the specific value at which the other explanatory variable is held constant). One way of formulating the common slope model is

\[Y_i = \alpha + \beta X_i + \gamma D_i + \epsilon_i\]

  • For women, \(Y_i = \alpha + \beta X_i + \gamma (0) + \epsilon_i = \alpha + \beta X_i + \epsilon_i\)

  • For men, \(Y_i = \alpha + \beta X_i + \gamma (1) + \epsilon_i = (\alpha + \gamma) + \beta X_i + \epsilon_i\)
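As a rough sketch, the common slope model can be fitted with `lm()` using a factor for gender. The data frame `d_toy` and its values are invented and noise-free, generated with \(\alpha = 10\), \(\beta = 2\) and \(\gamma = 20\), so the fitted coefficients can be compared with the model directly.

```r
# Noise-free toy data (hypothetical values)
d_toy <- data.frame(
  education = c(12:16, 8:12),
  gender    = rep(c("Women", "Men"), each = 5)
)
d_toy$income <- 10 + 2 * d_toy$education + 20 * (d_toy$gender == "Men")

# Make Women the reference category, so the dummy D equals 1 for Men
d_toy$gender <- relevel(factor(d_toy$gender), ref = "Women")

fit <- lm(income ~ education + gender, data = d_toy)
coef(fit)   # (Intercept) = alpha, education = beta, genderMen = gamma

# Men minus women at two education levels: the gap equals gamma both times
lev   <- levels(d_toy$gender)
new_w <- data.frame(education = c(10, 15), gender = factor("Women", levels = lev))
new_m <- data.frame(education = c(10, 15), gender = factor("Men", levels = lev))
predict(fit, new_m) - predict(fit, new_w)
```

The constant gap between the two predicted lines is what "parallel regressions" means in the additive model.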

2.3 Categorical variables with more than 2 categories

We want to understand the relationship between prestige (\(Y\)) and the categorical variable occupation, together with the continuous variables income (\(X_1\)) and education (\(X_2\)).

Occupations are classified into three categories: professional and managerial, white-collar and blue-collar. The three-category classification can be represented in the regression equation by introducing two dummy variables.

Consider Blue Collar as reference and create 2 dummy variables, one for Professional & Managerial and one for White Collar.

Category                    \(D_1\)   \(D_2\)
Professional & Managerial   1         0
White Collar                0         1
Blue Collar                 0         0

\[Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \gamma_1 D_{i1} + \gamma_2 D_{i2} + \epsilon_i\]

Professional: \(Y_i = (\alpha + \gamma_1) + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i\)

White-collar: \(Y_i = (\alpha + \gamma_2) + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i\)

Blue-collar: \(Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i\)

This model describes three parallel regression planes which can differ in their intercepts

  • \(\alpha\) is the intercept for blue-collar occupations
  • \(\gamma_1\) represents the constant vertical difference between the parallel regression planes for professional and blue-collar occupations (fixing the values of education and income)
  • \(\gamma_2\) represents the constant vertical distance between the regression planes for white-collar and blue-collar occupations
  • Blue-collar occupations are coded 0 for both dummy regressors, so ‘blue collar’ serves as a baseline category with which the other occupational categories are compared.
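The three parallel planes can be illustrated with a small noise-free example. The data frame `d_occ`, the level labels `"blue"`, `"prof"`, `"white"` and all numeric values are assumptions made for this sketch; they are not the prestige data.

```r
# Full factorial toy design (hypothetical values)
d_occ <- expand.grid(
  income    = c(1, 2, 3),
  education = c(1, 2),
  type      = c("blue", "prof", "white"),
  stringsAsFactors = FALSE
)
# Blue collar is listed first, so it becomes the reference category
d_occ$type <- factor(d_occ$type, levels = c("blue", "prof", "white"))

# Noise-free response with alpha = 10, beta1 = 2, beta2 = 3,
# gamma1 = 7 (professional) and gamma2 = 3 (white collar)
d_occ$prestige <- 10 + 2 * d_occ$income + 3 * d_occ$education +
  7 * (d_occ$type == "prof") + 3 * (d_occ$type == "white")

fit_occ <- lm(prestige ~ income + education + type, data = d_occ)
coef(fit_occ)   # typeprof = gamma1, typewhite = gamma2
```

The coefficients `typeprof` and `typewhite` are the vertical shifts of the professional and white-collar planes relative to the blue-collar baseline.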

2.4 Interactions

Two explanatory variables interact in determining a response variable when the partial effect of one depends on the value of the other.

  • In (a) gender and education are independent, since women and men have identical education distributions.

  • In (b) gender and education are related, since women, on average, have higher levels of education than men.

  • In both (a) and (b), the within-gender regressions of income on education are not parallel - the slope for men is larger than the slope for women.

  • Because the effect of education varies by gender, education and gender interact in affecting income.

  • Also, the effect of gender varies by education. Because the regressions are not parallel, the relative income advantage of men changes with education.

  • Interaction is a symmetric concept - the effect of education varies by gender, and the effect of gender varies by education.

  • Interaction refers to the manner in which explanatory variables combine to affect a response variable, not to the relationship between the explanatory variables themselves.

The following model accommodates different intercepts and slopes for women and men. Along with the dummy regressor \(D\) for gender and the quantitative regressor \(X\) for education, the interaction regressor \(XD\) is introduced.

\[Y_i = \alpha + \beta X_i + \gamma D_i + \delta (X_i D_i) + \epsilon_i\]

For women, \(Y_i = \alpha + \beta X_i + \gamma (0) + \delta (X_i \times 0) + \epsilon_i = \alpha + \beta X_i + \epsilon_i\)

For men, \(Y_i = \alpha + \beta X_i + \gamma (1) + \delta (X_i \times 1) + \epsilon_i = (\alpha + \gamma) + (\beta + \delta) X_i + \epsilon_i\)

  • \(\alpha\) and \(\beta\) are the intercept and slope for the regression of income on education among women
  • \(\gamma\) gives the difference in intercepts between the male and female groups
  • \(\delta\) gives the difference in slopes between the two groups
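A minimal sketch of the interaction model in R, assuming invented noise-free data in which women follow income = 10 + 2 education and men follow income = 15 + 3 education, so that \(\alpha = 10\), \(\beta = 2\), \(\gamma = 5\) and \(\delta = 1\). The names `d_int` and `fit_int` are hypothetical.

```r
d_int <- data.frame(
  education = rep(8:16, times = 2),
  gender    = factor(rep(c("Women", "Men"), each = 9),
                     levels = c("Women", "Men"))   # Women is the reference
)

# Women: income = 10 + 2 * education; Men: income = 15 + 3 * education
men <- d_int$gender == "Men"
d_int$income <- 10 + 2 * d_int$education + 5 * men + 1 * d_int$education * men

# education * gender expands to education + gender + education:gender
fit_int <- lm(income ~ education * gender, data = d_int)
coef(fit_int)
```

In the output, `genderMen` plays the role of \(\gamma\) and `education:genderMen` the role of \(\delta\); the slope for men is the sum of the `education` and `education:genderMen` coefficients.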


4 Example categorical variable. Organization’s income

  1. In a large organization, we obtained data on yearly income for the employees. Comparing sample means, the mean income of male employees appears to be larger than the mean income of female employees. There is concern about this, and since you are the organization’s statistical expert, you are asked to check whether there is a significant difference in income between genders. How would you proceed using linear regression? Write the model and explain the procedure you would follow.

  2. Consider the model for the response variable “income” with independent variables “gender” and “experience”, the latter categorised as “junior”, “intermediate” and “senior”. Write the additive model by considering the main effects of the covariates, and by choosing the baseline categories “male” for “gender” and “junior” for “experience”. Then write the expected income of a junior male employee. Finally, write the model for the expected income of a senior female employee.

  3. We fit the model having covariates gender and experience. Then we obtained from R the summary of this fit and noticed that the estimated coefficients for the “intermediate” and “senior” experience levels were positive and significant at the chosen \(\alpha\) level, and that all the remaining coefficients were non-significant. You have to write a report to your boss and explain carefully what this implies. What would you report?

4.1 Solution

Gender is a categorical variable. Here I set baseline gender = male. We have the linear model \[E(income | gender) = \beta_0 + \beta_1 \times X_{female},\] with dummy variable \(X_{female} = 1\) for female subjects and 0 otherwise, \(\beta_0\) the expected income for a male employee, and \(\beta_0+\beta_1\) the expected income for a female employee.

Therefore \(\beta_1\) represents the difference in expected income between genders. It is therefore sufficient to test the hypothesis \(H_0 : \beta_1 = 0\) against the alternative hypothesis \(H_1 : \beta_1 \neq 0\). We can use a t-test for this task at some significance level \(\alpha\). If we reject \(H_0\), we conclude that there is a significant difference (at the specified level \(\alpha\)). Notice we can also test \(H_0\) against the one-tailed alternative \(H_1 : \beta_1 < 0\) if, before looking at the data, we can rule out that female employees earn more on average than male employees.
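The test on \(\beta_1\) could be carried out as sketched below. The data are simulated, since the organization's data are not available here; the data frame `emp` and all numbers in it are assumptions. With a single two-level dummy, the regression t-test coincides with the two-sample t-test that assumes equal variances.

```r
set.seed(1)
# Simulated incomes (hypothetical numbers, not the organization's data)
emp <- data.frame(
  gender = factor(rep(c("male", "female"), each = 30),
                  levels = c("male", "female")),   # male is the baseline
  income = c(rnorm(30, mean = 52, sd = 8), rnorm(30, mean = 48, sd = 8))
)

fit_gender <- lm(income ~ gender, data = emp)
# Row for beta_1: estimate, standard error, t value and two-sided p-value
coef(summary(fit_gender))["genderfemale", ]

# Two-sample t-test with equal variances
tt <- t.test(income ~ gender, data = emp, var.equal = TRUE)
tt$statistic   # same magnitude as the t value above, opposite sign
tt$p.value     # same p-value
```

The sign differs only because `t.test()` reports male minus female while \(\beta_1\) is female minus male.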

Here I choose the following baselines: baseline gender = male and baseline experience = junior. The model with main effects only is \[E(income|gender, experience) = \beta_0 + \beta_1 \times X_{female} + \beta_2 \times X_{intermediate} + \beta_3 \times X_{senior}.\]

We have that \(\beta_0\) is the expected income for a junior male (all introduced dummy variables are zero in this case). \(\beta_0 + \beta_1 + \beta_3\) is the expected income for a senior female.
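To see which dummy variables are switched on for each type of employee, one can list the six gender and experience combinations and build the corresponding design matrix. This is only a sketch; the object names `cells` and `X` are made up.

```r
# All combinations of gender and experience, baselines listed first
cells <- expand.grid(
  gender     = factor(c("male", "female"), levels = c("male", "female")),
  experience = factor(c("junior", "intermediate", "senior"),
                      levels = c("junior", "intermediate", "senior"))
)

# Columns: intercept, X_female, X_intermediate, X_senior
X <- model.matrix(~ gender + experience, data = cells)
cbind(cells, X)
```

The row for a junior male has only the intercept equal to 1, giving \(\beta_0\); the row for a senior female has the intercept, \(X_{female}\) and \(X_{senior}\) equal to 1, giving \(\beta_0 + \beta_1 + \beta_3\).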

The positive and significant coefficients for “intermediate” and “senior” mean that, for employees of the same gender, those with intermediate or senior experience earn on average more than junior employees. The non-significant gender coefficient means that, for employees with the same experience level, there is no evidence of a difference in mean income between genders.

5 Example categorical variable. Breast cancer study

The data come from a German breast cancer study conducted from 1984 to 1989. There are 686 observations, including the variables tumor size (size, measured in mm), age, menopausal status (meno), number of positive lymph nodes (nodes), progesterone receptor (pgr) and oestrogen receptor (er).

Are the variables age, menopausal status, number of positive lymph nodes, progesterone receptor and oestrogen useful to explain tumor size?

d <- read.csv("data/GermanBreastCancer.csv")
head(d[, c("size", "age", "meno", "nodes", "pgr", "er")])
  • Menopausal status is a categorical variable
  • Age, progesterone receptor and oestrogen are continuous variables
  • Number of positive lymph nodes is a count variable, which we treat here as continuous

5.1 Create indicator variable

Menopausal status is a categorical variable with two categories: premenopausal and postmenopausal.

We choose one category as reference and create an indicator variable for the other category. For example, we can set postmenopausal as the reference category and create an indicator variable called pre for the category premenopausal:
pre=1 if menopausal status is premenopausal, and
pre=0 if menopausal status is postmenopausal.

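# as.factor() sorts the levels of meno (alphabetically for text labels) and
# as.numeric() returns the level codes 1 and 2, so subtracting 1 gives 0 for
# the first level and 1 for the second. Check the table printed below to
# confirm that pre = 1 corresponds to premenopausal.
d$pre <- as.numeric(as.factor(d$meno)) - 1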
head(d[, c("meno", "pre")])
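The coding above relies on the order in which `as.factor()` sorts the levels. The sketch below illustrates this on a short made-up vector; the labels `"premenopausal"` and `"postmenopausal"` are assumptions, and the labels actually stored in `d$meno` may differ.

```r
# Hypothetical labels for illustration only
meno <- c("premenopausal", "postmenopausal", "postmenopausal", "premenopausal")

# as.factor() sorts levels alphabetically: "postmenopausal" comes first
levels(as.factor(meno))
pre_codes <- as.numeric(as.factor(meno)) - 1

# An explicit comparison states the coding directly
pre_explicit <- as.numeric(meno == "premenopausal")

# Both approaches agree for these labels
identical(pre_codes, pre_explicit)
```

The explicit comparison does not depend on level order, which makes the intended reference category easier to read off the code.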

5.2 Interpretation of coefficients and p-values

res <- lm(size ~ age + nodes + pgr + er + pre, data = d)
summary(res)
## 
## Call:
## lm(formula = size ~ age + nodes + pgr + er + pre, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.742  -8.684  -2.391   5.745  84.350 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.429281   4.860937   5.643 2.46e-08 ***
## age         -0.040185   0.081692  -0.492    0.623    
## nodes        0.855936   0.094620   9.046  < 2e-16 ***
## pgr          0.001734   0.002792   0.621    0.535    
## er          -0.006082   0.003870  -1.571    0.117    
## pre          0.327211   1.642781   0.199    0.842    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.51 on 680 degrees of freedom
## Multiple R-squared:  0.1138, Adjusted R-squared:  0.1073 
## F-statistic: 17.47 on 5 and 680 DF,  p-value: 2.691e-16

\(\widehat{size}\) = 27.43 -0.0402 age + 0.8559 nodes + 0.00173 pgr - 0.00608 er + 0.33 pre

  • If meno is postmenopausal (pre = 0), then \(0.33 \times pre = 0.33 \times 0 = 0\) and the intercept is \((27.43+0) = 27.43\)
    \(\widehat{size}\) = 27.43 - 0.0402 age + 0.8559 nodes + 0.00173 pgr - 0.00608 er

  • If meno is premenopausal (pre = 1), then \(0.33\times pre = 0.33 \times 1 = 0.33\) and the intercept is \((27.43+0.33) = 27.76\)
    \(\widehat{size}\) = 27.76 - 0.0402 age + 0.8559 nodes + 0.00173 pgr - 0.00608 er

  • 27.43 is the estimated mean size (in mm) when age=nodes=pgr=er=pre=0. Since age=0 lies outside the range of the data, the intercept has no direct practical interpretation

  • 0.33 is the estimated difference in mean size (in mm) between pre=1 and pre=0, given that the rest of the variables remain constant. p-value=0.842, there is no evidence that pre affects the mean size allowing for the other predictors

  • mean size decreases by 0.0402 mm for every unit increase in age, given that the rest of the variables remain constant. p-value=0.623, there is no evidence that age affects the mean size allowing for the other predictors

  • mean size increases by 0.8559 mm for every unit increase in nodes, given that the rest of the variables remain constant. p-value < 2e-16, there is strong evidence that nodes affects the mean size allowing for the other predictors

  • mean size increases by 0.00173 mm for every unit increase in pgr, given that the rest of the variables remain constant. p-value=0.535, there is no evidence that pgr affects the mean size allowing for the other predictors

  • mean size decreases by 0.00608 mm for every unit increase in er, given that the rest of the variables remain constant. p-value=0.117, there is no evidence that er affects the mean size allowing for the other predictors
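The quantities in the coefficient table can be traced by hand. The sketch below copies the printed estimates and standard errors from the summary output, recomputes the t values and two-sided p-values on 680 degrees of freedom, and evaluates the fitted equation for a patient profile. The profile (age 50, 3 nodes, pgr 100, er 100) is hypothetical, and the results agree with the printed table only up to rounding of the copied values.

```r
# Estimates and standard errors copied from the printed summary output
est <- c("(Intercept)" = 27.429281, age = -0.040185, nodes = 0.855936,
         pgr = 0.001734, er = -0.006082, pre = 0.327211)
se  <- c("(Intercept)" = 4.860937, age = 0.081692, nodes = 0.094620,
         pgr = 0.002792, er = 0.003870, pre = 1.642781)

# t value = estimate / standard error; two-sided p-value on 680 df
t_val <- est / se
p_val <- 2 * pt(-abs(t_val), df = 680)
round(cbind(t_val, p_val), 3)

# Fitted mean size for a hypothetical profile, post- vs premenopausal
x_post <- c(1, age = 50, nodes = 3, pgr = 100, er = 100, pre = 0)
x_pre  <- replace(x_post, "pre", 1)
c(post = sum(est * x_post), pre = sum(est * x_pre))
```

The two fitted values differ by exactly the coefficient of pre, whatever profile is chosen, because the model is additive.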