37  LM3. Categorical and interactions

37.1 Linear model

The table below shows the first few rows of a data set from the March 2011 U.S. Current Population Survey (CPS). The data set contains weekly wages (in $1,000) for 4,952 males between the ages of 18 and 70 who worked full-time in 2011. Also recorded are age, the region of the country (with four categories: Midwest, Northeast, South, and West), the metropolitan status of the men’s employment (with three categories: Metropolitan, Not Metropolitan, and Not Identified), years of education (0 through 16), and race (White or Black).

| Region | MetropolitanStatus | Age | Educ | Race | WeeklyWages (in $1,000) |
|---|---|---|---|---|---|
| West | Not Metropolitan | 64 | 9 | White | 1.419 |
| Midwest | Metropolitan | 51 | 11 | White | 1.000 |
| South | Metropolitan | 25 | 4 | White | 0.420 |
| West | Not Metropolitan | 46 | 13 | White | 1.980 |
| Northeast | Metropolitan | 31 | 12 | Black | 1.750 |

  1. Suppose we would like to model weekly wages based on the region of the country. Write down the linear regression model for \(E(WeeklyWages | Region)\) using the usual notation: \(\beta_0\) for the intercept and \(\beta_i\), \(i\geq 1\), for the coefficients of the covariates. Let “Midwest” be the reference level and use alphabetical order for the rest of the regions.

  2. Here is a partial regression output (with some lines omitted) from fitting the model in part 1 using R.

Call:
lm(formula = WeeklyWages ~ Region, data = CPS2011)
Coefficients:
...
--
Residual standard error: 0.6615 on 4948 degrees of freedom
...
F-statistic: 15.62 on 3 and 4948 DF, p-value: 4.116e-10

At the same time, we conduct an ANOVA that compares average weekly wages across all four geographical regions. Based on the R output above, fill in the following ANOVA table:

| Sum of Squares type | Sum of Squares value | d.f. | Mean Square | F-statistic | p-value |
|---|---|---|---|---|---|
| SSB (Model) | ? | 3 | ? | 15.62 | 4.116e-10 |
| SSE (Error) | ? | 4948 | ? | - | - |
| SST (Total) | ? | 4951 | ? | - | - |

  3. Calculate the adjusted R-squared for the regression in the previous part.

  4. Next, we enrich the model by adding \(Age\) and \(Educ\), using main effects as well as the interaction between \(Region\) and \(Age\). Keep the reference level for \(Region\) the same as before. Write down the new model for \(E(WeeklyWages | Region, Age, Educ)\).

  5. Suppose \(\hat \beta_0\) and \(\hat \beta_{Age}\) are the estimates of the intercept and the coefficient for predictor \(Age\), respectively, for the model in the previous part. What is the sub-population of men for which the fitted relationship between \(WeeklyWages\) and \(Age\) is described by the following expression?

\[\hat \beta_0 + \hat \beta_{Age} Age\]

Linear regression. Solutions

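\[E(WeeklyWages | Region) = \beta_0 + \beta_1 Northeast + \beta_2 South + \beta_3 West,\]

where \(Northeast\), \(South\), and \(West\) are indicator variables equal to 1 if the man lives in that region and 0 otherwise. Midwest is the reference level, so it has no indicator of its own.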
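The indicator coding behind this model can be sketched in Python. This is only an illustration: the helper name `region_dummies` is hypothetical, and the regions fed to it are those of the five sample rows in the data table.

```python
# Hypothetical helper: treatment (dummy) coding with "Midwest" as the reference level.
LEVELS = ["Midwest", "Northeast", "South", "West"]  # reference first, rest alphabetical

def region_dummies(region):
    """Return the (Northeast, South, West) indicators for one observation."""
    if region not in LEVELS:
        raise ValueError(f"unknown region: {region}")
    # The reference level has no column of its own, so it maps to all zeros.
    return tuple(int(region == level) for level in LEVELS[1:])

# Regions of the five sample rows shown in the data table.
sample = ["West", "Midwest", "South", "West", "Northeast"]
for r in sample:
    print(r, region_dummies(r))
```

For a man in the Midwest all three indicators are 0, so \(\beta_0\) is the mean weekly wage in the Midwest and each \(\beta_i\) is the difference between a region's mean and the Midwest mean.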

The output gives the p-value, the F-statistic, and the degrees of freedom for the model (3) and for the error (4948).

The Mean Square for SSE is the square of the residual standard error: 0.6615 \(\times\) 0.6615 = 0.4375822.

The F-statistic is the ratio of the Mean Square for SSB to the Mean Square for SSE, so the Mean Square for SSB is the F-statistic times the Mean Square for SSE: 15.62 \(\times\) 0.4375822 = 6.835034.

SSB and SSE are then obtained by multiplying each Mean Square by its d.f. Adding SSB and SSE gives SST, and adding their d.f. gives the d.f. for SST. The Mean Square for SST is SST divided by its d.f.

| Sum of Squares type | Sum of Squares value | d.f. | Mean Square | F-statistic | p-value |
|---|---|---|---|---|---|
| SSB (Model) | 6.835034 \(\times\) 3 = 20.5051 | 3 | 15.62 \(\times\) 0.4375822 = 6.835034 | 15.62 | 4.116e-10 |
| SSE (Error) | 0.4375822 \(\times\) 4948 = 2165.157 | 4948 | 0.6615 \(\times\) 0.6615 = 0.4375822 | - | - |
| SST (Total) | 20.5051 + 2165.157 = 2185.662 | 4951 | 2185.662/4951 = 0.4414587 | - | - |

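The arithmetic for the ANOVA table can be checked with a short Python sketch. It uses only the four numbers read off the R output; the variable names are illustrative, and because the code does not round intermediate values, its results may differ from the table in the last printed digit.

```python
# Inputs read off the R output.
rse = 0.6615        # residual standard error
f_stat = 15.62      # F-statistic
df_model = 3        # numerator d.f.
df_error = 4948     # denominator d.f.

mse = rse ** 2              # Mean Square for SSE
msb = f_stat * mse          # Mean Square for SSB, since F = MSB / MSE
ssb = msb * df_model        # SSB = MSB * model d.f.
sse = mse * df_error        # SSE = MSE * error d.f.
sst = ssb + sse             # sums of squares add up
df_total = df_model + df_error
mst = sst / df_total        # Mean Square for SST

for name, value in [("MSB", msb), ("MSE", mse), ("MST", mst),
                    ("SSB", ssb), ("SSE", sse), ("SST", sst)]:
    print(f"{name} = {value:.4f}")
print("d.f. total =", df_total)
```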
  3. Calculate the adjusted R-squared for the regression in the previous part.

\(R^2 = \frac{SSB}{SST} = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n ( Y_i - \bar Y)^2} \ \ \ \mbox{(proportion of variability explained by the model)}\)

\(R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^n ( Y_i - \hat Y_i)^2}{\sum_{i=1}^n ( Y_i - \bar Y)^2} \ \ \ \mbox{(1 - proportion of variability not explained by the model)}\)

\(R^2(adj) = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{MSE}{MST} = 1 - \frac{0.4375822}{0.4414587} = 0.00878\)
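As a numerical check, both versions of R-squared can be computed from the quantities in the R output. The sketch below takes \(n = 4952\) men and \(p = 3\) region indicators; the variable names are illustrative.

```python
n = 4952            # number of men in the sample
p = 3               # number of region indicators in the model
rse = 0.6615        # residual standard error from the R output
f_stat = 15.62      # F-statistic from the R output

mse = rse ** 2
sse = mse * (n - p - 1)         # SSE = MSE * error d.f.
ssb = f_stat * mse * p          # SSB = F * MSE * model d.f.
sst = ssb + sse

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - p - 1)) / (sst / (n - 1))
print(f"R^2 = {r2:.5f}, adjusted R^2 = {r2_adj:.5f}")
```

Both values are below 0.01: region alone explains less than 1% of the variability in weekly wages, even though the F-test is highly significant with a sample this large.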

\[\begin{aligned} E(WeeklyWages | Region, Age, Educ) = {} & \beta_0 + \beta_1 Age + \beta_2 Educ + \beta_3 Northeast + \beta_4 South + \beta_5 West \\ & + \beta_6 Age \times Northeast + \beta_7 Age \times South + \beta_8 Age \times West \end{aligned}\]
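Setting the region indicators to 0 or 1 shows what the interaction does: each region gets its own intercept and its own slope in \(Age\), while the coefficient of \(Educ\) is shared. With the coefficient numbering used in the model above, the regression lines are:

\[\begin{aligned}
\mbox{Midwest:} \quad & \beta_0 + \beta_1 Age + \beta_2 Educ \\
\mbox{Northeast:} \quad & (\beta_0 + \beta_3) + (\beta_1 + \beta_6) Age + \beta_2 Educ \\
\mbox{South:} \quad & (\beta_0 + \beta_4) + (\beta_1 + \beta_7) Age + \beta_2 Educ \\
\mbox{West:} \quad & (\beta_0 + \beta_5) + (\beta_1 + \beta_8) Age + \beta_2 Educ
\end{aligned}\]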

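The sub-population of men who live in the Midwest and have zero years of education. For the reference region all region indicators and all \(Age \times Region\) interaction terms are zero, and \(Educ = 0\) removes the education term, leaving only \(\hat \beta_0 + \hat \beta_{Age} Age\).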