# 1 Exercise 1

Below is a scatter plot for a dataset. Identify the outliers and high leverage observations in the dataset. Say which measure could be used to identify each of these observations.

## 1.1 Solution. Exercise 1

1. Observation at (2, 10) is high leverage. High leverage but not influential (will not show up in Cook’s D). Outlier 1: (x,y)=(1,-9). Outlier 2: (x,y)=(-1,5). Outlier 3: (x,y)=(-1,-16). Outliers will show up in a residual plot and as a high Cook’s D.

# 2 Exercise 2

The table below shows the first few rows of a data set from the 2011 March U.S. Current Population Survey (CPS). The set contains weekly wages (in $1,000) for 4,952 males between the age of 18 and 70 who worked full-time in 2011. Also recorded are the region of the country (with four categories: Midwest, Northeast, South, and West), the metropolitan status of the men’s employment (with three categories: Metropolitan, Not Metropolitan, and Not Identified), years of education (0 through 16), and race (White or Black). Region MetropolitanStatus Age Educ Race WeeklyWages, in$1,000
West Not Metropolitan 64 9 White 1.419
Midwest Metropolitan 51 11 White 1.000
South Metropolitan 25 4 White 0.420
West Not Metropolitan 46 13 White 1.980
Northeast Metropolitan 31 12 Black 1.750
1. Suppose we would like to model weekly wages based on the region of the country. Write down the linear regression model for E(WeeklyWages $$|$$ Region) using the usual notation: $$\beta_0$$ for the intercept and $$\beta_i$$, $$i\geq 1$$, for the coefficients of the covariates. Let Midwest be the reference level and use alphabetical order for the rest of the regions.

2. Here is a partial regression output (with some lines omitted) from fitting the model in part (a) using R.

Call:
lm(formula = WeeklyWages ~ Region, data = CPS2011)
Coefficients:
...
--
Residual standard error: 0.6615 on 4948 degrees of freedom
...
F-statistic: 15.62 on 3 and 4948 DF, p-value: 4.116e-10

Based on the R output above, fill in the following table to compare average weekly wages across all four geographical regions.

Sum of Squares type Sum of Squares value d.f. Mean Square F-statistic p-value
SSM ? 3 ? 15.62 4.116e-10
SSE ? 4948 ? - -
SST ? 4951 ? - -

## 2.1 Solution. Exercise 2

$\mbox{E(WeeklyWages | Region)} = \beta_0 + \beta_1 \mbox{Northeast} + \beta_2 \mbox{South} + \beta_3 \mbox{West}$

We know p-value, all d.f.s, and F-statistic from the output.

We know that $$^2$$ is Mean Square for SSE. Based on that, we can calculate SSE ($$0.6615^2$$=0.4375822).

Mean Square for SSM will be F-statistic $$\times$$ Mean Square for SSE (15.62 * 0.4375822 = 6.835034).

SSM may be calculated by multiplying Mean Square for SSM by d.f. (6.835034 * 3 = 20.5051).

SSE may be calculated by multiplying Mean Square for SSE by d.f. (0.4375822 * 4948 = 2165.157).

Then, we can sum-up d.f.s and Sum of Squares to get SST and d.f. for SST (20.5051+ 2165.157=2185.662).

Finally, we get Mean Square for SST by dividing SST by its d.f. (2185.662/4951=0.4414587)

SS type Sum of Squares value d.f. Mean Square F-statistic p-value
SSM 6.835034 $$\times$$ 3=20.5051 3 15.62 $$\times$$ 0.4375822=6.835034 15.62 4.116e-10
SSE 0.4375822 $$\times$$ 4948=2165.157 4948 $$0.6615^2$$=0.4375822 - -
SST 20.5051+ 2165.157=2185.662 4951 2185.662/4951=0.4414587 - -

## 2.2 Exercise 3

Consider the hospital infection risk dataset hospital.csv which consists of a sample of n = 58 hospitals in the east and north-central U.S. Conduct a regression analysis for predicting infection risk (percent of patients who get an infection) from average length of stay (in days). In the data, infection risk is variable InfctRsk and average length of stay Stay. Answer the following questions.

1. Write out the linear model.
2. Interpret the slope.
3. What are the hypotheses for evaluating whether lenght of stay is a significant predictor of infection risk? State the conclusion of the hypothesis test in context of the data.
4. Interpret $$R^2$$.
5. Calculate and interpret the mean risk of infection when the average length of stay is 10 days and its standard error.
6. Calculate and interpret the confidence interval for the mean risk for an average length of stay of 10 days.
7. Calculate and interpret the prediction interval for the mean risk for an average length of stay of 10 days.
8. Comment on the difference between confidence and prediction intervals.

## 2.3 Solution. Exercise 3

d <- read.csv("data/hospital.csv")
head(d)
summary(d)
##        ID              Stay             Age           InfctRsk
##  Min.   :  2.00   Min.   : 7.390   Min.   :38.80   Min.   :1.300
##  1st Qu.: 28.25   1st Qu.: 8.905   1st Qu.:50.60   1st Qu.:3.900
##  Median : 53.00   Median : 9.930   Median :53.20   Median :4.500
##  Mean   : 52.71   Mean   :10.049   Mean   :52.78   Mean   :4.557
##  3rd Qu.: 77.75   3rd Qu.:11.060   3rd Qu.:55.60   3rd Qu.:5.300
##  Max.   :109.00   Max.   :13.950   Max.   :65.90   Max.   :7.800
##     Culture           Xray             Beds         MedSchool
##  Min.   : 2.20   Min.   : 42.60   Min.   : 52.0   Min.   :1.00
##  1st Qu.:10.65   1st Qu.: 75.95   1st Qu.:133.2   1st Qu.:2.00
##  Median :16.10   Median : 87.75   Median :195.5   Median :2.00
##  Mean   :18.28   Mean   : 86.30   Mean   :259.8   Mean   :1.81
##  3rd Qu.:23.62   3rd Qu.: 98.97   3rd Qu.:316.5   3rd Qu.:2.00
##  Max.   :60.50   Max.   :133.50   Max.   :768.0   Max.   :2.00
##      Region          Census          Nurses        Facilities
##  Min.   :1.000   Min.   : 37.0   Min.   : 14.0   Min.   : 5.70
##  1st Qu.:1.000   1st Qu.:105.5   1st Qu.: 88.0   1st Qu.:37.10
##  Median :2.000   Median :159.5   Median :152.0   Median :45.70
##  Mean   :1.552   Mean   :198.5   Mean   :184.4   Mean   :46.01
##  3rd Qu.:2.000   3rd Qu.:249.0   3rd Qu.:201.5   3rd Qu.:54.30
##  Max.   :2.000   Max.   :595.0   Max.   :656.0   Max.   :80.00
plot(d$InfctRsk, d$Stay,
xlab = "avg length of stay (days)",
ylab = "infection risk (% patients get infection)")

res <- lm(InfctRsk ~ Stay, data = d)
summary(res)
##
## Call:
## lm(formula = InfctRsk ~ Stay, data = d)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.6145 -0.4660  0.1388  0.4970  2.4310
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.15982    0.95580  -1.213     0.23
## Stay         0.56887    0.09416   6.041  1.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.024 on 56 degrees of freedom
## Multiple R-squared:  0.3946, Adjusted R-squared:  0.3838
## F-statistic:  36.5 on 1 and 56 DF,  p-value: 1.302e-07

After fitting the model it is crucial to check whether the model assumptions are valid before performing inference. Assumptions are independency, normality, linearity and constant variance. If there any assumption violation, subsequent inferential procedures may be invalid.

We would also need to investigate unusual observations and whether or not they are influential.

$$\widehat{\mbox{risk}} = -1.15982 + 0.56887 \mbox{ stay}$$

$$\hat \beta_1 = 0.56887$$.

We expect the average infection risk to increase by a factor of 0.56887 for each additional day in the length of stay.

$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \neq 0$$

$$\alpha = 0.05$$

$$T = \frac{\hat \beta_1-0}{SE_{\hat \beta_1}} = 6.041$$

p-value is the probability to obtain a test statistic equal to 6.041 or more extreme assuming the null hypothesis is true.

(n <- nrow(d))
## [1] 58
2*(1-pt(6.041, n-2))
## [1] 1.303602e-07

p-value is approximately 0. p-value $$< \alpha$$, so we reject the null hypothesis.

The data provide convincing evidence that length of stay is a significant predictor of infection risk.

$$R^2$$ = 0.3946

Length of stay explains 39.46% of the variability in infection risk.

The mean infection risk when the average stay is 10 days is 4.5288.

$$\widehat{\mbox{risk}} = -1.15982 + 0.56887 \times \mbox{ 10} = -1.15982 + 0.56887 \times 10 = 4.5288$$

The standard error of fit $$\widehat{\mbox{risk}}$$ measures the accuracy of $$\widehat{\mbox{risk}}$$ as an estimate of the mean risk when the average stay is 10 days.

new <- data.frame(Stay = 10)
predict(lm(InfctRsk ~ Stay, data = d), new, interval = "confidence", se.fit = TRUE)
## $fit ## fit lwr upr ## 1 4.528846 4.259205 4.798486 ## ##$se.fit
## [1] 0.1346021
##
## $df ## [1] 56 ## ##$residual.scale
## [1] 1.024489

For the 95% CI, we say with 95% confidence we can estimate that in hospitals in which the average length of stay is 10 days, the mean infection risk is between 4.259205 and 4.798486.

new <- data.frame(Stay = 10)
predict(lm(InfctRsk ~ Stay, data = d), new, interval = "confidence", se.fit = TRUE)
## $fit ## fit lwr upr ## 1 4.528846 4.259205 4.798486 ## ##$se.fit
## [1] 0.1346021
##
## $df ## [1] 56 ## ##$residual.scale
## [1] 1.024489

We can calculate the 95% CI as

$$\widehat{risk} \pm t^* \times SE(\widehat{risk}) = 4.528846 \pm 2.003241 \times 0.1346021 = (4.259206, 4.798486)$$

c(4.528846 - 2.003241 * 0.1346021, 4.528846 + 2.003241 * 0.1346021)
## [1] 4.259206 4.798486

$$t^*$$ is the value such that $$\alpha/2$$ probability on the t(n-2) distribution is below $$t^*$$.

n <- nrow(d)
qt(0.05/2, n-2)
## [1] -2.003241

For the 95% PI, we say with 95% confidence that for any future hospital where the average length of stay is 10 days, the infection risk is between 2.45891 and 6.598781.

new <- data.frame(Stay = 10)
predict(lm(InfctRsk ~ Stay, data = d), new, interval = "prediction", se.fit = TRUE)
## $fit ## fit lwr upr ## 1 4.528846 2.45891 6.598781 ## ##$se.fit
## [1] 0.1346021
##
## $df ## [1] 56 ## ##$residual.scale
## [1] 1.024489

We can calculate the 95% PI as

$$\widehat{risk} \pm t^* \times \sqrt{MSE + [SE(\widehat{risk})]^2} =$$ $$4.528846 \pm 2.003241 \times \sqrt{ 1.024^2 + 0.1346021^2} = (2.459881, 6.597811)$$

c(4.528846 - 2.003241 * sqrt(1.024^2 + 0.1346021^2),
4.528846 + 2.003241 * sqrt(1.024^2 + 0.1346021^2))
## [1] 2.459881 6.597811

Residual standard error = $$\sqrt{MSE}$$ so MSE = 1.024^2

Note that the formula of the standard error of the prediction is similar to the standard error of the fit when estimating the average risk when stay is 10. The standard error of the prediction just has an extra MSE term added that the standard error of the fit does not.

Prediction intervals are wider than confidence intervals because there is more uncertainty in predicting an individual value of the response variable than in estimating the mean of the response variable.

The formula for the prediction interval has the extra MSE term. A confidence interval for the mean response will always be narrower than the corresponding prediction interval for a new observation.

Furthermore, both intervals are narrowest at the mean of the predictor values.