34  R

34.1 Simple linear regression

34.1.1 Murders and poverty

Conduct a regression analysis for predicting annual murders per million from the percentage living in poverty, using a random sample of 20 metropolitan areas. The data are in murders.csv; annual murders per million is the variable annual_murders_per_mil and the percentage living in poverty is the variable perc_pov. Answer the following questions.

  1. Write out the linear model.
  2. Interpret the intercept.
  3. Interpret the slope.
  4. Interpret \(R^2\).
  5. Calculate the correlation coefficient.
  6. What are the hypotheses for evaluating whether poverty percentage is a significant predictor of murder rate?
  7. State the conclusion of the hypothesis test in context of the data.
  8. Calculate a 95% confidence interval for the slope of poverty percentage, and interpret it in context of the data.
  9. Do your results from the hypothesis test and the confidence interval agree? Explain.

Solution

d <- read.csv("data/murders.csv")
head(d)
  population perc_pov perc_unemp annual_murders_per_mil
1     587000     16.5        6.2                   11.2
2     643000     20.5        6.4                   13.4
3     635000     26.3        9.3                   40.7
4     692000     16.5        5.3                    5.3
5    1248000     19.2        7.3                   24.8
6     643000     16.5        5.9                   12.7
summary(d)
   population         perc_pov       perc_unemp    annual_murders_per_mil
 Min.   : 587000   Min.   :14.30   Min.   :4.900   Min.   : 5.30         
 1st Qu.: 643000   1st Qu.:17.12   1st Qu.:6.150   1st Qu.:13.22         
 Median : 745000   Median :19.15   Median :6.600   Median :19.50         
 Mean   :1433000   Mean   :19.72   Mean   :6.935   Mean   :20.57         
 3rd Qu.:1318750   3rd Qu.:21.57   3rd Qu.:7.775   3rd Qu.:26.07         
 Max.   :7895000   Max.   :26.30   Max.   :9.300   Max.   :40.70         
plot(d$perc_pov, d$annual_murders_per_mil,
     xlab = "Percent in poverty",
     ylab = "Annual muders per million")

res <- lm(annual_murders_per_mil ~ perc_pov, data = d)
summary(res)

Call:
lm(formula = annual_murders_per_mil ~ perc_pov, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.1663 -2.5613 -0.9552  2.8887 12.3475 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -29.901      7.789  -3.839   0.0012 ** 
perc_pov       2.559      0.390   6.562 3.64e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.512 on 18 degrees of freedom
Multiple R-squared:  0.7052,    Adjusted R-squared:  0.6889 
F-statistic: 43.06 on 1 and 18 DF,  p-value: 3.638e-06

After fitting the model, it is crucial to check whether the model assumptions are valid before performing inference. The assumptions are independence, normality, linearity and constant variance. If any assumption is violated, subsequent inferential procedures may be invalid.

We would also need to investigate unusual observations and whether or not they are influential.
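
These checks can be carried out with the standard diagnostic plots of the fitted lm object (a minimal sketch using res from above):

# Residuals vs fitted values: check linearity and constant variance
plot(res, 1)
# Normal Q-Q plot of the residuals: check normality
plot(res, 2)
# Standardized residuals vs leverage (with Cook's distance): flag influential observations
plot(res, 5)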

\(\widehat{\mbox{murder}} = -29.901 + 2.559 \mbox{ poverty}\)

\(\hat \beta_0 = -29.901\).

The expected murder rate in metropolitan areas with 0% poverty is -29.901 per million. This is obviously not a meaningful value; it just serves to adjust the height of the regression line.

\(\hat \beta_1 = 2.559\).

For each additional percentage point of poverty, we expect annual murders per million to be higher on average by 2.559.

\(R^2\) = 0.7052

Poverty level explains 70.52% of the variability in murder rates in metropolitan areas.

The correlation coefficient is \(r = \sqrt{R^2} = \sqrt{0.7052} = 0.8398\), taken with a positive sign because the slope is positive.
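
As a check, the correlation can also be computed directly in R:

# Should agree with sqrt(summary(res)$r.squared) up to rounding
cor(d$perc_pov, d$annual_murders_per_mil)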

\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)

\(\alpha = 0.05\)

\(T = \frac{\hat \beta_1-0}{SE_{\hat \beta_1}} = 6.562\)

The p-value is the probability of obtaining a test statistic as extreme as or more extreme than 6.562, assuming the null hypothesis is true.

(n <- nrow(d))
[1] 20
2*(1-pt(6.562, n-2))
[1] 3.639832e-06

The p-value is approximately 0. Since the p-value \(< \alpha\), we reject the null hypothesis.

The data provide convincing evidence that poverty percentage is a significant predictor of murder rate.

95% confidence interval for the slope of poverty percentage

\(2.559 \pm t^* \times SE = 2.559 \pm 2.10 \times 0.390 = (1.740, 3.378)\)

\(t^*\) is the value that leaves a probability of 0.05/2 in each tail of the t(n-2) distribution; qt() returns the lower (negative) quantile, so we use its absolute value.

qt(0.025, 20-2)
[1] -2.100922
c(2.559 - 2.10 * 0.390, 2.559 + 2.10 * 0.390)
[1] 1.740 3.378
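
Equivalently, confint() gives the interval from the unrounded estimate and standard error, so its endpoints differ slightly from the hand calculation:

confint(res, "perc_pov", level = 0.95)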

We are 95% confident that, for each additional percentage point of poverty, the murder rate is higher on average by between 1.740 and 3.378 murders per million.

The results from the hypothesis test and the confidence interval agree. We rejected the null and the confidence interval does not include 0. In both cases we conclude there is evidence that poverty has an effect on murder rate.

34.1.2 Hospital infection

Consider the hospital infection risk dataset hospital.csv, which consists of a sample of n = 58 hospitals in the east and north-central U.S. Conduct a regression analysis for predicting infection risk (percent of patients who get an infection) from average length of stay (in days). In the data, infection risk is the variable InfctRsk and average length of stay is the variable Stay. Answer the following questions.

  1. Write out the linear model.
  2. Interpret the slope.
  3. What are the hypotheses for evaluating whether length of stay is a significant predictor of infection risk? State the conclusion of the hypothesis test in context of the data.
  4. Interpret \(R^2\).
  5. Calculate and interpret the estimated mean infection risk when the average length of stay is 10 days, and its standard error.
  6. Calculate and interpret the confidence interval for the mean infection risk when the average length of stay is 10 days.
  7. Calculate and interpret the prediction interval for the infection risk of a hospital whose average length of stay is 10 days.
  8. Comment on the difference between confidence and prediction intervals.

Solution

d <- read.csv("data/hospital.csv")
head(d)
  ID  Stay  Age InfctRsk Culture  Xray Beds MedSchool Region Census Nurses
1  5 11.20 56.5      5.7    34.5  88.9  180         2      1    134    151
2 10  8.84 56.3      6.3    29.6  82.6   85         2      1     59     66
3 11 11.07 53.2      4.9    28.5 122.0  768         1      1    591    656
4 13 12.78 56.8      7.7    46.0 116.9  322         1      1    252    349
5 18 11.62 53.9      6.4    25.5  99.2  133         2      1    113    101
6 23  9.78 52.3      5.0    17.6  95.9  270         1      1    240    198
  Facilities
1       40.0
2       40.0
3       80.0
4       57.1
5       37.1
6       57.1
summary(d)
       ID              Stay             Age           InfctRsk    
 Min.   :  2.00   Min.   : 7.390   Min.   :38.80   Min.   :1.300  
 1st Qu.: 28.25   1st Qu.: 8.905   1st Qu.:50.60   1st Qu.:3.900  
 Median : 53.00   Median : 9.930   Median :53.20   Median :4.500  
 Mean   : 52.71   Mean   :10.049   Mean   :52.78   Mean   :4.557  
 3rd Qu.: 77.75   3rd Qu.:11.060   3rd Qu.:55.60   3rd Qu.:5.300  
 Max.   :109.00   Max.   :13.950   Max.   :65.90   Max.   :7.800  
    Culture           Xray             Beds         MedSchool   
 Min.   : 2.20   Min.   : 42.60   Min.   : 52.0   Min.   :1.00  
 1st Qu.:10.65   1st Qu.: 75.95   1st Qu.:133.2   1st Qu.:2.00  
 Median :16.10   Median : 87.75   Median :195.5   Median :2.00  
 Mean   :18.28   Mean   : 86.30   Mean   :259.8   Mean   :1.81  
 3rd Qu.:23.62   3rd Qu.: 98.97   3rd Qu.:316.5   3rd Qu.:2.00  
 Max.   :60.50   Max.   :133.50   Max.   :768.0   Max.   :2.00  
     Region          Census          Nurses        Facilities   
 Min.   :1.000   Min.   : 37.0   Min.   : 14.0   Min.   : 5.70  
 1st Qu.:1.000   1st Qu.:105.5   1st Qu.: 88.0   1st Qu.:37.10  
 Median :2.000   Median :159.5   Median :152.0   Median :45.70  
 Mean   :1.552   Mean   :198.5   Mean   :184.4   Mean   :46.01  
 3rd Qu.:2.000   3rd Qu.:249.0   3rd Qu.:201.5   3rd Qu.:54.30  
 Max.   :2.000   Max.   :595.0   Max.   :656.0   Max.   :80.00  
plot(d$Stay, d$InfctRsk,
     xlab = "avg length of stay (days)",
     ylab = "infection risk (% patients get infection)")

res <- lm(InfctRsk ~ Stay, data = d)
summary(res)

Call:
lm(formula = InfctRsk ~ Stay, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6145 -0.4660  0.1388  0.4970  2.4310 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.15982    0.95580  -1.213     0.23    
Stay         0.56887    0.09416   6.041  1.3e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.024 on 56 degrees of freedom
Multiple R-squared:  0.3946,    Adjusted R-squared:  0.3838 
F-statistic:  36.5 on 1 and 56 DF,  p-value: 1.302e-07

After fitting the model, it is crucial to check whether the model assumptions are valid before performing inference. The assumptions are independence, normality, linearity and constant variance. If any assumption is violated, subsequent inferential procedures may be invalid.

We would also need to investigate unusual observations and whether or not they are influential.

\(\widehat{\mbox{risk}} = -1.15982 + 0.56887 \mbox{ stay}\)

\(\hat \beta_1 = 0.56887\).

We expect the average infection risk to increase by 0.56887 percentage points for each additional day in the average length of stay.

\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)

\(\alpha = 0.05\)

\(T = \frac{\hat \beta_1-0}{SE_{\hat \beta_1}} = 6.041\)

The p-value is the probability of obtaining a test statistic as extreme as or more extreme than 6.041, assuming the null hypothesis is true.

(n <- nrow(d))
[1] 58
2*(1-pt(6.041, n-2))
[1] 1.303602e-07

The p-value is approximately 0. Since the p-value \(< \alpha\), we reject the null hypothesis.

The data provide convincing evidence that length of stay is a significant predictor of infection risk.

\(R^2\) = 0.3946

Length of stay explains 39.46% of the variability in infection risk.

The mean infection risk when the average stay is 10 days is 4.5288.

\(\widehat{\mbox{risk}} = -1.15982 + 0.56887 \times 10 = 4.5288\)

The standard error of fit \(\widehat{\mbox{risk}}\) measures the accuracy of \(\widehat{\mbox{risk}}\) as an estimate of the mean risk when the average stay is 10 days.

new <- data.frame(Stay = 10)
predict(lm(InfctRsk ~ Stay, data = d), new, interval = "confidence", se.fit = TRUE)
$fit
       fit      lwr      upr
1 4.528846 4.259205 4.798486

$se.fit
[1] 0.1346021

$df
[1] 56

$residual.scale
[1] 1.024489

For the 95% CI: we are 95% confident that, in hospitals where the average length of stay is 10 days, the mean infection risk is between 4.259205 and 4.798486.


We can calculate the 95% CI as

\(\widehat{risk} \pm t^* \times SE(\widehat{risk}) = 4.528846 \pm 2.003241 \times 0.1346021 = (4.259206, 4.798486)\)

c(4.528846 - 2.003241 * 0.1346021, 4.528846 + 2.003241 * 0.1346021)
[1] 4.259206 4.798486

\(t^*\) is the value that leaves a probability of \(\alpha/2\) in each tail of the t(n-2) distribution; qt() returns the lower (negative) quantile, so we use its absolute value.

n <- nrow(d)
qt(0.05/2, n-2)
[1] -2.003241

For the 95% PI: we are 95% confident that, for a future hospital where the average length of stay is 10 days, the infection risk is between 2.45891 and 6.598781.

new <- data.frame(Stay = 10)
predict(lm(InfctRsk ~ Stay, data = d), new, interval = "prediction", se.fit = TRUE)
$fit
       fit     lwr      upr
1 4.528846 2.45891 6.598781

$se.fit
[1] 0.1346021

$df
[1] 56

$residual.scale
[1] 1.024489

We can calculate the 95% PI as

\(\widehat{risk} \pm t^* \times \sqrt{MSE + [SE(\widehat{risk})]^2} =\) \(4.528846 \pm 2.003241 \times \sqrt{ 1.024^2 + 0.1346021^2} = (2.459881, 6.597811)\)

c(4.528846 - 2.003241 * sqrt(1.024^2 + 0.1346021^2),
  4.528846 + 2.003241 * sqrt(1.024^2 + 0.1346021^2))
[1] 2.459881 6.597811

Residual standard error \(= \sqrt{MSE} = 1.024\), so \(MSE = 1.024^2\).

Note that the formula for the standard error of the prediction is similar to that of the standard error of the fit used to estimate the mean risk when the stay is 10 days; the standard error of the prediction just has an extra MSE term added under the square root.

Prediction intervals are wider than confidence intervals because there is more uncertainty in predicting an individual value of the response variable than in estimating the mean of the response variable.

The formula for the prediction interval has the extra MSE term. A confidence interval for the mean response will always be narrower than the corresponding prediction interval for a new observation.

Furthermore, both intervals are narrowest at the mean of the predictor values.
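
Both points can be seen by plotting the two intervals over a grid of Stay values (a minimal sketch using the fitted object res from above): the dashed confidence band sits inside the dotted prediction band, and both are narrowest near the mean of Stay.

# Confidence and prediction bands over the observed range of Stay
stay_grid <- data.frame(Stay = seq(min(d$Stay), max(d$Stay), length.out = 100))
conf_band <- predict(res, stay_grid, interval = "confidence")
pred_band <- predict(res, stay_grid, interval = "prediction")
plot(d$Stay, d$InfctRsk,
     xlab = "avg length of stay (days)",
     ylab = "infection risk (% patients get infection)")
lines(stay_grid$Stay, conf_band[, "fit"])
lines(stay_grid$Stay, conf_band[, "lwr"], lty = 2)
lines(stay_grid$Stay, conf_band[, "upr"], lty = 2)
lines(stay_grid$Stay, pred_band[, "lwr"], lty = 3)
lines(stay_grid$Stay, pred_band[, "upr"], lty = 3)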

34.2 Multiple linear regression

34.2.1 Baby weights, Part I

The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. Here, we study the relationship between smoking and weight of the baby. The variable smoke is coded 1 if the mother is a smoker, and 0 if not. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, based on the smoking status of the mother.

             Estimate Std. Error t value Pr(>|t|)
(Intercept)    123.05       0.65  189.60   0.0000
smoke           -8.94       1.03   -8.65   0.0000
  1. Write the equation of the regression model.

  2. Interpret the slope in this context, and calculate the predicted birth weight of babies born to smoker and non-smoker mothers.

  3. Is there a statistically significant relationship between the average birth weight and smoking?

Solution

\(\widehat{\mbox{weight}} = 123.05 -8.94 \times \mbox{smoke}\)

The estimated birth weight of babies born to mothers who smoke is, on average, 8.94 ounces lower than that of babies born to non-smoking mothers.

Smoker: \(\widehat{\mbox{weight}} = 123.05 -8.94 \times 1 = 114.11\)

Non-smoker: \(\widehat{\mbox{weight}} = 123.05 -8.94 \times 0 = 123.05\)

\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)

\(\alpha = 0.05\)

\(T = \frac{\hat \beta_1-0}{SE_{\hat \beta_1}} = -8.65\)

The p-value is the probability of obtaining a test statistic as extreme as or more extreme than -8.65, assuming the null hypothesis is true.

The p-value is approximately 0. Since the p-value \(< \alpha\), we reject the null hypothesis.

The data provide strong evidence that the true slope parameter is different from 0 and that there is an association between birth weight and smoking. Furthermore, having rejected the null, we can conclude that smoking is associated with lower birth weights.
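
No code accompanies the summary table above, but if it was computed from the babies.csv data used in Part III, the fit could in principle be reproduced as follows (a sketch; the estimates may differ slightly depending on how missing values were handled):

d <- read.csv("data/babies.csv")
summary(lm(bwt ~ smoke, data = d))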

34.2.2 Baby weights, Part II

The previous exercise Part I introduces a data set on birth weight of babies. Another variable we consider is parity, which is 1 if the child is the first born, and 0 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.

             Estimate Std. Error t value Pr(>|t|)
(Intercept)    120.07       0.60  199.94   0.0000
parity         -1.93       1.19   -1.62   0.1052
  1. Write the equation of the regression model.

  2. Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.

  3. Is there a statistically significant relationship between the average birth weight and parity?

Solution

\(\widehat{\mbox{weight}} = 120.07 -1.93 \times \mbox{parity}\)

The estimated birth weight of first-born babies is, on average, 1.93 ounces lower than that of other babies.

First borns: \(\widehat{\mbox{weight}} = 120.07 - 1.93 \times 1 = 118.14\)

Others: \(\widehat{\mbox{weight}} = 120.07 - 1.93 \times 0 = 120.07\)

\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)

\(\alpha = 0.05\)

\(T = \frac{\hat \beta_1-0}{SE_{\hat \beta_1}} = -1.62\)

The p-value is the probability of obtaining a test statistic as extreme as or more extreme than -1.62, assuming the null hypothesis is true.

The p-value is 0.1052. Since the p-value \(> \alpha\), we fail to reject the null hypothesis.

The data do not provide strong evidence that the true slope parameter is different from 0, i.e. there is no evidence of an association between birth weight and parity.

34.2.3 Baby weights, Part III

We considered the variables smoke and parity, one at a time, in modeling birth weights of babies in previous exercises Parts I and II. A more realistic approach to modeling infant weights is to consider all possibly related variables at once. Other variables of interest include length of pregnancy in days (gestation), mother’s age in years (age), mother’s height in inches (height), and mother’s pregnancy weight in pounds (weight).

Use the data babies.csv to answer the following questions.

  1. Write the equation of the regression model that relates birth weights of babies (bwt) to variables gestation, parity, age, height, weight, and smoke.
  2. Interpret the slopes of gestation, age and parity in this context.
  3. The coefficient for parity is different than in the linear model shown in exercise Part II. Why might there be a difference?
  4. Calculate the residual for the first observation in the data set.
  5. Interpret the adjusted \(R^2\).

Solution

d <- read.csv("data/babies.csv")
head(d)
  case bwt gestation parity age height weight smoke
1    1 120       284      0  27     62    100     0
2    2 113       282      0  33     64    135     0
3    3 128       279      0  28     64    115     1
4    4 123        NA      0  36     69    190     0
5    5 108       282      0  23     67    125     1
6    6 136       286      0  25     62     93     0
summary(d)
      case             bwt          gestation         parity      
 Min.   :   1.0   Min.   : 55.0   Min.   :148.0   Min.   :0.0000  
 1st Qu.: 309.8   1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000  
 Median : 618.5   Median :120.0   Median :280.0   Median :0.0000  
 Mean   : 618.5   Mean   :119.6   Mean   :279.3   Mean   :0.2549  
 3rd Qu.: 927.2   3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000  
 Max.   :1236.0   Max.   :176.0   Max.   :353.0   Max.   :1.0000  
                                  NA's   :13                      
      age            height          weight          smoke       
 Min.   :15.00   Min.   :53.00   Min.   : 87.0   Min.   :0.0000  
 1st Qu.:23.00   1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000  
 Median :26.00   Median :64.00   Median :125.0   Median :0.0000  
 Mean   :27.26   Mean   :64.05   Mean   :128.6   Mean   :0.3948  
 3rd Qu.:31.00   3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000  
 Max.   :45.00   Max.   :72.00   Max.   :250.0   Max.   :1.0000  
 NA's   :2       NA's   :22      NA's   :36      NA's   :10      
res <- lm(bwt ~ gestation + parity + age + height + weight + smoke, data = d)
summary(res)

Call:
lm(formula = bwt ~ gestation + parity + age + height + weight + 
    smoke, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-57.613 -10.189  -0.135   9.683  51.713 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -80.41085   14.34657  -5.605 2.60e-08 ***
gestation     0.44398    0.02910  15.258  < 2e-16 ***
parity       -3.32720    1.12895  -2.947  0.00327 ** 
age          -0.00895    0.08582  -0.104  0.91696    
height        1.15402    0.20502   5.629 2.27e-08 ***
weight        0.05017    0.02524   1.987  0.04711 *  
smoke        -8.40073    0.95382  -8.807  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.83 on 1167 degrees of freedom
  (62 observations deleted due to missingness)
Multiple R-squared:  0.258, Adjusted R-squared:  0.2541 
F-statistic: 67.61 on 6 and 1167 DF,  p-value: < 2.2e-16

\(\widehat{\mbox{weight}} = -80.41 + 0.44 \mbox{ gestation} - 3.33 \mbox{ parity} -0.01 \mbox{ age} + 1.15 \mbox{ height} + 0.05 \mbox{ weight} -8.40 \mbox{ smoke}\)

\(\hat \beta_{gestation} = 0.44\). The model predicts a 0.44 ounce increase in the birth weight of the baby for each additional day of pregnancy, holding all other covariates constant.

\(\hat \beta_{age} = -0.01\). The model predicts a 0.01 ounce decrease in the birth weight of the baby for each additional year in the mother's age, holding all other covariates constant.

\(\hat \beta_{parity} = -3.33\). The model predicts a 3.33 ounce decrease in the birth weight of first-born babies compared to others, holding all other covariates constant.

In the multiple regression the coefficient of parity is estimated holding the other variables constant, and parity may be correlated with some of them (for example, with the mother's age), so its estimate differs from the one in the simple regression of Part II.

First observation

d[1, ]
  case bwt gestation parity age height weight smoke
1    1 120       284      0  27     62    100     0

Observed weight = 120

Predicted weight:

\(\widehat{\mbox{weight}} = -80.41 + 0.44 \times 284 - 3.33 \times 0 - 0.01 \times 27 + 1.15 \times 62 + 0.05 \times 100 - 8.40 \times 0 = 120.58\)

-80.41 + 0.44*284 - 3.33*0 - 0.01*27 + 1.15*62 + 0.05*100 - 8.40*0
[1] 120.58

residual = observed weight - predicted weight = 120 - 120.58 = -0.58

The model over-predicts this baby’s birth weight.
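
The same residual can be read off the fitted object using the unrounded coefficients (a quick check; row 1 is complete, so it is included in the fit):

# Fitted value and residual for the first observation
predict(res)[1]
residuals(res)[1]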

R\(^2\)(adj) = 25.41%. The percentage of variation in birth weight explained by the model is about 25%, adjusting for the number of parameters.

34.2.4 Cigarette consumption

An insurance organization wanted to study the consumption pattern of cigarettes in all 50 states in the United States and the District of Columbia. The investigators are interested in studying how statewide cigarette consumption is associated with various socioeconomic and demographic variables, and building a parsimonious regression model for predicting the consumption of cigarettes.

The variables for the study are

  • Age: Median Age of a person living in the state
  • HS: Percentage of people \(>\) 25 years old who completed High School
  • Income: Per capita personal income (in dollars)
  • Black: Percentage of African Americans
  • Female: Percentage of Females
  • Price: Weighted average price (in cents) of a pack of cigarettes
  • Sales: Number of packs of cigarettes sold

Conduct a regression analysis that can inform investigators how cigarette consumption is associated with the various variables. As part of the analysis, write the regression equation, explain the t tests for coefficients and the F test, interpret R\(^2\)(adj), check assumptions and identify unusual observations. Data are in cigarette.txt.

Solution

d <- read.table("data/cigarette.txt", header = TRUE)
head(d)
  State  Age   HS Income Black Female Price Sales
1    AL 27.0 41.3   2948  26.2   51.7  42.7  89.8
2    AK 22.9 66.7   4644   3.0   45.7  41.8 121.3
3    AZ 26.3 58.1   3665   3.0   50.8  38.5 115.2
4    AR 29.1 39.9   2878  18.3   51.5  38.8 100.3
5    CA 28.1 62.6   4493   7.0   50.8  39.7 123.0
6    CO 26.2 63.9   3855   3.0   50.7  31.1 124.8
res <- lm(Sales ~ Age + HS + Income + Black + Female + Price, data = d)
summary(res)

Call:
lm(formula = Sales ~ Age + HS + Income + Black + Female + Price, 
    data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.398 -12.388  -5.367   6.270 133.213 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 103.34485  245.60719   0.421  0.67597   
Age           4.52045    3.21977   1.404  0.16735   
HS           -0.06159    0.81468  -0.076  0.94008   
Income        0.01895    0.01022   1.855  0.07036 . 
Black         0.35754    0.48722   0.734  0.46695   
Female       -1.05286    5.56101  -0.189  0.85071   
Price        -3.25492    1.03141  -3.156  0.00289 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.17 on 44 degrees of freedom
Multiple R-squared:  0.3208,    Adjusted R-squared:  0.2282 
F-statistic: 3.464 on 6 and 44 DF,  p-value: 0.006857

Regression equation

\(\widehat{Sales}\) = 103 + 4.52 Age - 0.062 HS + 0.0189 Income + 0.358 Black - 1.05 Female - 3.25 Price

  • the mean value of Sales increases by 4.52 for every one-year increase in Age, holding the other variables constant
  • the mean value of Sales decreases by 0.062 for every percentage-point increase in HS, holding the other variables constant
  • the mean value of Sales increases by 0.0189 for every one-dollar increase in Income, holding the other variables constant
  • the mean value of Sales increases by 0.358 for every percentage-point increase in Black, holding the other variables constant
  • the mean value of Sales decreases by 1.05 for every percentage-point increase in Female, holding the other variables constant
  • the mean value of Sales decreases by 3.25 for every one-cent increase in Price, holding the other variables constant

Tests for coefficients. t test

Test null hypothesis ‘response variable \(y\) is not linearly related to variable \(x_i\)’ versus alternative hypothesis ‘\(y\) is linearly related to \(x_i\)’, \(i=1,\ldots,p\)
\(H_0\): \(\beta_{i}\) = 0
\(H_1\): \(\beta_{i} \neq 0\)

Test statistic: t-statistic = estimate of the coefficient / SE of coefficient

Distribution: t(n-p-1). t distribution with degrees of freedom = number of observations - number of explanatory variables - 1

p-value: probability of obtaining a test statistic as extreme or more extreme than the one observed
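
For example, the p-value for Price in the output below can be reproduced from its t value and the t distribution with 51 - 6 - 1 = 44 degrees of freedom:

# Two-sided p-value for Price; should agree with Pr(>|t|) = 0.00289 up to rounding
2 * pt(-abs(-3.156), df = 44)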

summary(res)

Call:
lm(formula = Sales ~ Age + HS + Income + Black + Female + Price, 
    data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.398 -12.388  -5.367   6.270 133.213 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 103.34485  245.60719   0.421  0.67597   
Age           4.52045    3.21977   1.404  0.16735   
HS           -0.06159    0.81468  -0.076  0.94008   
Income        0.01895    0.01022   1.855  0.07036 . 
Black         0.35754    0.48722   0.734  0.46695   
Female       -1.05286    5.56101  -0.189  0.85071   
Price        -3.25492    1.03141  -3.156  0.00289 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.17 on 44 degrees of freedom
Multiple R-squared:  0.3208,    Adjusted R-squared:  0.2282 
F-statistic: 3.464 on 6 and 44 DF,  p-value: 0.006857
  • p-value=0.167: there is no evidence that age affects the mean value of sales, allowing for the other predictors.
  • p-value=0.940: there is no evidence that HS affects the mean value of sales, allowing for the other predictors.
  • p-value=0.070: there is some evidence that income affects the mean value of sales, allowing for the other predictors.
  • p-value=0.467: there is no evidence that black affects the mean value of sales, allowing for the other predictors.
  • p-value=0.851: there is no evidence that female affects the mean value of sales, allowing for the other predictors.
  • p-value=0.003: there is strong evidence that price affects the mean value of sales, allowing for the other predictors.

ANOVA. F test

Test null hypothesis ‘none of the explanatory variables are useful predictors of the response variable’ versus alternative hypothesis ‘at least one of the explanatory variables is related to the response variable’
\(H_0\): \(\beta_{1} =\beta_{2}=...=\beta_{p}\) = 0
\(H_1\): at least one of \(\beta_1\), \(\beta_2\), \(\dots\), \(\beta_p \neq\) 0

Test statistic: F-statistic = mean squares model / mean squares residual

\[F = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2/p}{\sum_{i=1}^n ( y_i - \hat y_i)^2/(n-p-1)} = \frac{SSM/DFM}{SSE/DFE} = \frac{MSM}{MSE},\]

Distribution: F(p, n-p-1). F distribution with degrees of freedom = (number of explanatory variables, number of observations - number of explanatory variables - 1)

p-value: probability that test statistic has a value as extreme or more extreme than the observed

p-value = \(P(F \geq F_{\mbox{observed}})\)

summary(res)

Call:
lm(formula = Sales ~ Age + HS + Income + Black + Female + Price, 
    data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.398 -12.388  -5.367   6.270 133.213 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 103.34485  245.60719   0.421  0.67597   
Age           4.52045    3.21977   1.404  0.16735   
HS           -0.06159    0.81468  -0.076  0.94008   
Income        0.01895    0.01022   1.855  0.07036 . 
Black         0.35754    0.48722   0.734  0.46695   
Female       -1.05286    5.56101  -0.189  0.85071   
Price        -3.25492    1.03141  -3.156  0.00289 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.17 on 44 degrees of freedom
Multiple R-squared:  0.3208,    Adjusted R-squared:  0.2282 
F-statistic: 3.464 on 6 and 44 DF,  p-value: 0.006857

p-value = 0.007. There is strong evidence that at least one of the explanatory variables in the model is related to the response.
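
This p-value can be reproduced from the F statistic as a quick check:

# Upper-tail probability of F(6, 44) beyond the observed statistic; should agree with 0.006857
1 - pf(3.464, df1 = 6, df2 = 44)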

R\(^2\)(adj)

R\(^2\)(adj) = 22.82%. The percentage of variation in sales explained by the model is about 23%, adjusting for the number of parameters.

summary(res)$adj.r.squared
[1] 0.2282303

Check assumptions

  • Independence

Random sample

  • Normality

To check normality we look at the normal Q-Q plot of the residuals. The data show departure from normality because the points do not form a straight line.

plot(res, 2)

  • Linearity and constant variance

Linearity. Residual versus variables plots indicate that the effect of the explanatory variables may not be linear.

Constant variance. Points are not randomly scattered around 0. The assumption of constant variance of \(y\), \(\sigma^2\), for all values of \(x\) does not hold.

par(mfrow = c(2, 3))
plot(d$Age, rstandard(res))
abline(h = 0, lty = 2)
plot(d$HS, rstandard(res))
abline(h = 0, lty = 2)
plot(d$Income, rstandard(res))
abline(h = 0, lty = 2)
plot(d$Black, rstandard(res))
abline(h = 0, lty = 2)
plot(d$Female, rstandard(res))
abline(h = 0, lty = 2)
plot(d$Price, rstandard(res))
abline(h = 0, lty = 2)

Unusual observations

Outliers (unusual \(y\)). Standardized residuals less than -2 or greater than 2.

# Outliers
op <- which(abs(rstandard(res)) > 2)
op
29 30 
29 30 

High leverage observations (unusual \(x\)). Leverage greater than 2 times the average of all leverage values. That is, \(h_{ii} > 2(p+1)/n\).

# High leverage observations
h <- hatvalues(res)
lp <- which(h > 2*length(res$coefficients)/nrow(d))
lp
 2  9 10 45 
 2  9 10 45 

Influential values. Cook’s distance higher than \(4/n\).

# Influential observations
ip <- which(cooks.distance(res) > 4/nrow(d))
ip
 2  9 12 29 30 45 
 2  9 12 29 30 45 
plot(res, 4)

d[unique(c(lp, op, ip)), ]
   State  Age   HS Income Black Female Price Sales
2     AK 22.9 66.7   4644   3.0   45.7  41.8 121.3
9     DC 28.4 55.2   5079  71.1   53.5  32.6 200.4
10    FL 32.3 52.6   3738  15.3   51.8  43.8 123.6
45    UT 23.1 67.3   3227   0.6   50.6  36.6  65.5
29    NV 27.8 65.2   4563   5.7   49.3  44.0 189.5
30    NH 28.0 57.6   3737   0.3   51.1  34.1 265.7
12    HI 25.0 61.9   4623   1.0   48.0  36.7  82.1
  • Outliers: observations with large residual: 29, 30
  • High leverage observations: observations with large unusual value of explanatory variables: 2, 9, 10, 45
  • Observation 9 (DC) has a very high percentage of African Americans. Is it a data entry error or a genuine value?
  • Influential observations: observations 2, 9, 12, 29, 30 and 45 may be affecting the analysis too much. We could omit these observations, rerun the analysis without them, and see how the results change (see the sketch below).
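
A sketch of such a sensitivity check, refitting without the flagged influential observations and comparing the coefficient estimates:

# Refit without the observations flagged by Cook's distance and compare coefficients
res_sub <- lm(Sales ~ Age + HS + Income + Black + Female + Price, data = d[-ip, ])
round(cbind(all_data = coef(res), without_influential = coef(res_sub)), 3)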

34.3 Linear regression

In 1929, Edwin Hubble investigated the relationship between the distance of a galaxy from the earth and the velocity with which it appears to be receding. Galaxies appear to be moving away from us no matter which direction we look. His paper records the distance (in megaparsecs) measured to 24 nebulae which lie outside our galaxy. For each of these nebulae Hubble also measured the velocity of the nebulae away from the earth (negative numbers mean the nebula is moving towards the earth).

distance <- c(0.032, 0.034, 0.214, 0.263, 0.275, 0.275, 0.45, 0.5,
0.5, 0.63, 0.8, 0.9, 0.9, 0.9, 0.9, 1,
1.1, 1.1, 1.4, 1.7, 2, 2, 2, 2)

velocity <- c(170, 290, -130, -70, -185, -220, 200, 290,
270, 200, 300, -30, 650, 150, 500, 920,
450, 500, 500, 960, 500, 850, 800, 1090)

d <- data.frame(velocity = velocity, distance = distance)
res <- lm(velocity ~ distance, data = d)
summary(res)

Call:
lm(formula = velocity ~ distance, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-397.96 -158.10  -13.16  148.09  506.63 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -40.78      83.44  -0.489     0.63    
distance      454.16      75.24   6.036 4.48e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 232.9 on 22 degrees of freedom
Multiple R-squared:  0.6235,    Adjusted R-squared:  0.6064 
F-statistic: 36.44 on 1 and 22 DF,  p-value: 4.477e-06
X <- model.matrix(res)
solve(t(X)%*%X)
            (Intercept)    distance
(Intercept)  0.12833882 -0.09510043
distance    -0.09510043  0.10434830
  1. Could the true value of the intercept term be 0? (Hubble’s model called for the intercept to be 0).
  2. Give a 95% confidence interval for the coefficient of Distance in our model. This coefficient is called the Hubble constant.
  3. The matrix \((X'X)^{-1}\) for this model and this data set is given in the R output. Use this to give a 90% confidence interval for the mean recession velocity of all nebulae whose distance from earth is 1 megaparsec.
  4. When we fit a cubic polynomial model \[V = \beta_0 + \beta_1 D + \beta_2 D^2 + \beta_3 D^3 + \epsilon\] to this data, we find the residual sum of squares is 1062087 whereas the residual sum of squares in the straight line model above is 1193442. What is the value of the F statistic for testing the hypothesis that \(\beta_2 = \beta_3 = 0\) and what are the relevant degrees of freedom?

Linear regression. Solution

We test \(H_0: \beta_0 = 0\). The relevant \(t\) statistic given in the output is -0.489. The corresponding p-value is 0.63, which is not at all small (\(>\) 0.05). We therefore fail to reject the null hypothesis at any common level of significance and conclude that the intercept might well be 0.

Since \(\hat \beta = (X'X)^{-1}X' Y\) and \(Y \sim N(X\beta, \sigma^2 I)\),
\[\hat \beta \sim N(\beta, \sigma^2(X'X)^{-1})\]

\[\frac{\hat \beta_j - \beta_j}{se(\hat \beta_j)}= \frac{\hat \beta_j - \beta_j}{\sqrt{\hat \sigma^2 {(X'X)}_{jj}^{-1}}} \sim t_{n-p-1}\]

\((1-\alpha)100\%\) confidence interval: \[\hat \beta_j \pm t_{\alpha/2, n-p-1} \sqrt{\hat \sigma^2 {(X'X)}_{jj}^{-1}}\]

The estimate of \(\beta_1\) is 454.16.

The corresponding estimated standard error is 75.24.

There are \(n-p-1 = 24-1-1 = 22\) degrees of freedom for error, so we need the critical point \(t_{0.025, 22}\); in R, qt(0.05/2, 22) = -2.07, and we use its absolute value 2.07.

95% confidence interval is \[454.16 \pm 2.07 \times 75.24 = (298.4132, 609.9068)\]
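
The same interval, up to rounding of \(t^*\) and the standard error, is returned by confint():

confint(res, "distance", level = 0.95)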

\((1-\alpha)100\%\) confidence interval:

\[\hat y_0 \pm t_{\alpha/2, n-(p+1)} \times se(\hat y_0),\] where

\(\hat y_0 = x'_0 \hat \beta\)

\(Var(\hat y_0) = Var({x'_0}{\hat \beta}) = x'_0 Var(\hat \beta) x_0 = \sigma^2 x'_0 (X'X)^{-1} x_0\)

\(se(\hat y_0) = \sqrt{MSE\ x_0'(X'X)^{-1}x_0}\)


For a distance of 1 megaparsec, \(x_0 = (1, 1)'\), so we need a confidence interval for \(\beta_0+\beta_1\).

\[\hat y_0 = x'_0 \hat \beta = (1, 1) (\hat \beta_0, \hat \beta_1)' = \hat \beta_0 + \hat \beta_1 = 454.16 - 40.78 = 413.38\]

# estimate
454.16 - 40.78
[1] 413.38

\[Var(\hat y_0) = Var({x'_0}{\hat \beta}) = {x'_0} Var({\hat \beta}) {x_0} = \sigma^2 {x'_0}(X'X)^{-1} {x_0}\]

Residual standard error

\[\hat \sigma = \sqrt{MSE}= \sqrt{\sum_{i=1}^n ( y_i - \hat y_i)^2/(n-p-1)} = 232.9\]

\(se(\hat y_0)\) = 48

# (X'X)^{-1}
X <- model.matrix(res)
solve(t(X)%*%X)
            (Intercept)    distance
(Intercept)  0.12833882 -0.09510043
distance    -0.09510043  0.10434830
# x' (X'X)^{-1} x
t(c(1, 1)) %*% solve(t(X)%*%X) %*% c(1, 1)
           [,1]
[1,] 0.04248626
# standard error
232.9 * sqrt(t(c(1, 1)) %*% solve(t(X)%*%X) %*% c(1, 1))
        [,1]
[1,] 48.0058

90% CI: \(413.38 \pm 1.717 \times 48\)

# critical value
qt(0.10/2, 22)
[1] -1.717144
# 90% CI interval
413.38 + c(-1.717, 1.717) * 48
[1] 330.964 495.796
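
Up to rounding, the same interval is returned by predict():

predict(res, data.frame(distance = 1), interval = "confidence", level = 0.90)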

Model M1 with \(p_1\) predictors nested in model M2 with \(p_2\) predictors, \(p_1 < p_2\).

\(H_0: \beta_{p_1+1} = \ldots = \beta_{p_2} = 0\) (Reduced M1 model fits as well as M2 model) vs.

\(H_1: \beta_{j} \neq 0\) for any \(p_1 < j \leq p_2\) (M2 fits better than M1)

Under the null hypothesis that the smaller model is correct,

\[F=\frac{(RSS(reduced) - RSS(full))/(p_2-p_1)}{RSS(full)/(n-p_2-1)} \sim F_{p_2-p_1,\ n-p_2-1}\]

where \(RSS(full)\) is the residual sum-of-squares for the least squares fit of the bigger model with \(p_2+1\) parameters, and \(RSS(reduced)\) the same for the nested smaller model with \(p_1 + 1\) parameters.

The F statistic measures the change in residual sum-of-squares per additional parameter in the bigger model, and it is normalized by an estimate of \(\sigma^2\) (MSE).

F statistic has a \(F_{p_2-p_1,\ n-p_2-1} = F_{3-1, 24-3-1} = F_{2, 20}\) distribution.

\[F = \frac{(1193442 - 1062087)/2}{1062087/(24-4)} = 1.236763\]

Fobs <- ((1193442 - 1062087)/2)/(1062087/(24-4))
Fobs
[1] 1.236763
1 - pf(Fobs, 2, 20)
[1] 0.311593

p-value = 0.31 > 0.05, so we fail to reject the null hypothesis; there is no evidence that the quadratic and cubic terms are needed.
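
Since the raw data are given above, the cubic fit and the partial F test can also be carried out directly (a sketch; the residual sums of squares should reproduce the values quoted in the question):

# Fit the cubic polynomial model and compare it with the straight-line fit
res_cubic <- lm(velocity ~ distance + I(distance^2) + I(distance^3), data = d)
anova(res, res_cubic)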

34.4 Linear regression

We want to assess the effect of participating in a job training program on future earnings. For 2,675 men, we observe their participation status (job.training = 1 if they received training, 0 if not) and earnings in dollars after the program completed (regardless of whether they received training) (earnings.after). We also observe their earnings before the program (earnings.before), education in years (education), age (age), and binary indicators for being black, hispanic, married and high school graduate (black, hispanic, married, and high.school.grad).

  a. First consider the output below from a simple linear regression of earnings.after on job.training.
Call:
lm(formula = earnings.after ~ job.training)

Coefficients:
                Estimate   Std. Error    t value   Pr(>|t|)
(Intercept)      21553.9        303.6      70.98     <2e-16 ***
job.trainingYes -15204.8       1154.6     -13.17     <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15150 on 2673 degrees of freedom
Multiple R-squared: 0.06092, Adjusted R-squared: 0.06057
F-statistic: 173.4 on 1 and 2673 DF, p-value: < 2.2e-16
  i. Give a precise interpretation of the coefficient estimate for job.training. Based on this output, what do you conclude about the relationship between participation in the training program and earnings afterward? To answer this question, state a formal hypothesis test, including null and alternative hypotheses, significance level, test statistic value, and conclusion.

  ii. Give a reasonable explanation for why the coefficient is negative.

  b. Now consider the following multiple regression output, which controls for earnings before the program (earnings.before) and years of education (educ).
Call:
lm(formula = earnings.after ~ job.training + earnings.before + educ)

Coefficients:
                   Estimate   Std. Error t value  Pr(>|t|)
(Intercept)      -2.138e+03    8.160e+02  -2.620   0.00883 **
job.trainingYes   4.792e+02    8.219e+02   0.583   0.55992
earnings.before   8.360e-01    1.657e-02  50.436   < 2e-16 ***
educ              6.275e+02    6.879e+01   9.121   < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10160 on 2671 degrees of freedom
Multiple R-squared: 0.5777, Adjusted R-squared: 0.5772
F-statistic: 1218 on 3 and 2671 DF, p-value: < 2.2e-16
  i. Give a precise interpretation of the coefficient estimate for job.training. Based on this output, what do you conclude about the relationship between participation in the training program and earnings afterward? To answer this question, state a formal hypothesis test, including null and alternative hypotheses, significance level, test statistic value, and conclusion.

  ii. What does this conclusion say about the training program?

  c. Now consider expanding the regression to include all the observed variables described above. We have the following regression output.
Call:
lm(formula = earnings.after ~ job.training + earnings.before +
   educ + age + black + hispanic + married + high.school.grad)

Coefficients:
                      Estimate    Std. Error   t value    Pr(>|t|)
(Intercept)          -43.50987    1697.68742    -0.026      0.9796
job.trainingYes      714.92999     920.24279     0.777      0.4373
earnings.before        0.84832       0.01743    48.658     < 2e-16 ***
educ                 597.11088     103.86503    5.749      1.0e-08 ***
age                  -88.46638      20.90856    -4.231     2.4e-05 ***
blackYes            -621.57753     497.80243    -1.249      0.2119
hispanicYes         1904.11397    1097.24813     1.735      0.0828 .
marriedYes          1184.57915     589.37459     2.010      0.0445 *
high.school.gradNo   610.11290     650.31497     0.938      0.3482
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10130 on 2666 degrees of freedom
Multiple R-squared: 0.5817, Adjusted R-squared: 0.5804
F-statistic: 463.4 on 8 and 2666 DF, p-value: < 2.2e-16

We also have the Analysis of Variance Table below, where ‘Model 2’ is the current regression and ‘Model 1’ is the regression in part (b).

Model 1: earnings.after ~ job.training + earnings.before + educ
Model 2: earnings.after ~ job.training + earnings.before + educ + age +
         black + hispanic + married + high.school.grad

   Res.Df          RSS  Df   Sum of Sq        F     Pr(>F)
1    2671   2.7597e+11
2    2666   2.7335e+11   5   2.622e+09   5.1144  0.0001133 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  i. Give a precise description of the p-value in the Analysis of Variance Table above. Include the null and alternative hypotheses of the relevant test, what these mean conceptually for comparing ‘Model 1’ and ‘Model 2’, and what you conclude based on this output.

  ii. ‘Model 1’ from part (b) has a BIC value of 49390.32, whereas ‘Model 2’ has a BIC value of 49404.24. Based on this information and the Analysis of Variance Table, which model do you prefer? Why?

Linear regression. Solution

a.i. The coefficient means that, on average, people who underwent training made $15204.80 less than those who did not receive training. This difference is significant at the 0.05 level. The null is \(H_0: \beta_1 = 0\) and the alternative is \(H_1: \beta_1 \neq 0\); the t test statistic value is -13.17 and the p-value is given as \(<\) 2e-16.

a.ii. One possibility (and the later parts verify that this is correct) is that the people who underwent training made much less to begin with. Notice that in this part we are not modeling the change in earnings, just the level afterward. If the people who did the training program had very low earnings relative to those who did not, then even if the training raised their income, it could still be lower than that of the no-training group.

b.i. The coefficient means that, on average, people who underwent training made $479.20 more than those who did not, holding education and pre-training earnings fixed. So if there are two people with the same education level and the same pre-training earnings, and only one goes through training, then on average that one will earn $479.20 more than the untrained person. This difference is not significant at the 0.05 level. The null is \(H_0: \beta_1 = 0\) and the alternative is \(H_1: \beta_1 \neq 0\); the t test statistic value is 0.583 and the p-value is given as 0.55992.

b.ii. Statistically, the training does not add significant value once we control for pre-training earnings and education level. This does not rule out the possibility that some sub-groups could benefit, such as specific earnings levels or demographic groups.

c.i. The null hypothesis is \(H_0: \beta_4 = \beta_5 = \beta_6 = \beta_7 = \beta_8 = 0\) and the alternative is \(H_1: \beta_j \neq 0\) for at least one \(j \in \{4,5,6,7,8 \}\). Conceptually, the null hypothesis is that the additional variables, as a whole, do not add anything to the model, while the alternative is that ‘Model 2’ is preferred to ‘Model 1’, i.e. the new variables add something. The p-value is less than 0.05, so at the 5% level, we reject the null hypothesis and decide that the new model is preferred.

c.ii. According to BIC, the old model is better, but according to the F test, the new model is. Which is better is largely a matter of taste at this point. Full credit would be given for any correctly justified answer. Remember that our goal is to estimate \(\beta_1\), and the estimates have different interpretations in the different models because they control for different things. If we do not want to condition on demographics, like race, then the first model is preferred.
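
For reference, the F test and BIC comparison quoted here could be obtained along these lines (a sketch only; jobs is a hypothetical name for the data frame, which is not provided with this exercise):

# 'jobs' is a hypothetical data frame containing the variables described above
fit1 <- lm(earnings.after ~ job.training + earnings.before + educ, data = jobs)
fit2 <- lm(earnings.after ~ job.training + earnings.before + educ + age +
             black + hispanic + married + high.school.grad, data = jobs)
anova(fit1, fit2)  # the Analysis of Variance Table comparing Model 1 and Model 2
BIC(fit1, fit2)    # smaller BIC indicates the preferred model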