Below is a scatter plot for a dataset. Identify the outliers and high leverage observations in the dataset. Say which measure could be used to identify each of these observations.
The table below shows the first few rows of a data set from the 2011 March U.S. Current Population Survey (CPS). The set contains weekly wages (in $1,000) for 4,952 males between the age of 18 and 70 who worked full-time in 2011. Also recorded are the region of the country (with four categories: Midwest, Northeast, South, and West), the metropolitan status of the men’s employment (with three categories: Metropolitan, Not Metropolitan, and Not Identified), years of education (0 through 16), and race (White or Black).
Region | MetropolitanStatus | Age | Educ | Race | WeeklyWages, in $1,000 |
---|---|---|---|---|---|
West | Not Metropolitan | 64 | 9 | White | 1.419 |
Midwest | Metropolitan | 51 | 11 | White | 1.000 |
South | Metropolitan | 25 | 4 | White | 0.420 |
West | Not Metropolitan | 46 | 13 | White | 1.980 |
Northeast | Metropolitan | 31 | 12 | Black | 1.750 |
Suppose we would like to model weekly wages based on the region of the country. Write down the linear regression model for E(WeeklyWages \(|\) Region) using the usual notation: \(\beta_0\) for the intercept and \(\beta_i\), \(i\geq 1\), for the coefficients of the covariates. Let Midwest be the reference level and use alphabetical order for the rest of the regions.
Here is a partial regression output (with some lines omitted) from fitting the model in part (a) using R.
Call:
lm(formula = WeeklyWages ~ Region, data = CPS2011)
Coefficients:
...
--
Residual standard error: 0.6615 on 4948 degrees of freedom
...
F-statistic: 15.62 on 3 and 4948 DF, p-value: 4.116e-10
Based on the R output above, fill in the following table to compare average weekly wages across all four geographical regions.
Sum of Squares type | Sum of Squares value | d.f. | Mean Square | F-statistic | p-value |
---|---|---|---|---|---|
SSM | ? | 3 | ? | 15.62 | 4.116e-10 |
SSE | ? | 4948 | ? | - | - |
SST | ? | 4951 | ? | - | - |
\[\mbox{E(WeeklyWages $|$ Region)} = \beta_0 + \beta_1 \mbox{Northeast} + \beta_2 \mbox{South} + \beta_3 \mbox{West}\]
We know p-value, all d.f.s, and F-statistic from the output.
We know that \(^2\) is Mean Square for SSE. Based on that, we can calculate SSE (\(0.6615^2\)=0.4375822).
Mean Square for SSM will be F-statistic \(\times\) Mean Square for SSE (15.62 * 0.4375822 = 6.835034).
SSM may be calculated by multiplying Mean Square for SSM by d.f. (6.835034 * 3 = 20.5051).
SSE may be calculated by multiplying Mean Square for SSE by d.f. (0.4375822 * 4948 = 2165.157).
Then, we can sum-up d.f.s and Sum of Squares to get SST and d.f. for SST (20.5051+ 2165.157=2185.662).
Finally, we get Mean Square for SST by dividing SST by its d.f. (2185.662/4951=0.4414587)
SS type | Sum of Squares value | d.f. | Mean Square | F-statistic | p-value |
---|---|---|---|---|---|
SSM | 6.835034 \(\times\) 3=20.5051 | 3 | 15.62 \(\times\) 0.4375822=6.835034 | 15.62 | 4.116e-10 |
SSE | 0.4375822 \(\times\) 4948=2165.157 | 4948 | \(0.6615^2\)=0.4375822 | - | - |
SST | 20.5051+ 2165.157=2185.662 | 4951 | 2185.662/4951=0.4414587 | - | - |
Consider the hospital infection risk dataset hospital.csv
which consists of a sample of n = 58 hospitals in the east and north-central U.S. Conduct a regression analysis for predicting infection risk (percent of patients who get an infection) from average length of stay (in days). In the data, infection risk is variable InfctRsk
and average length of stay Stay
. Answer the following questions.
<- read.csv("data/hospital.csv")
d head(d)
summary(d)
## ID Stay Age InfctRsk
## Min. : 2.00 Min. : 7.390 Min. :38.80 Min. :1.300
## 1st Qu.: 28.25 1st Qu.: 8.905 1st Qu.:50.60 1st Qu.:3.900
## Median : 53.00 Median : 9.930 Median :53.20 Median :4.500
## Mean : 52.71 Mean :10.049 Mean :52.78 Mean :4.557
## 3rd Qu.: 77.75 3rd Qu.:11.060 3rd Qu.:55.60 3rd Qu.:5.300
## Max. :109.00 Max. :13.950 Max. :65.90 Max. :7.800
## Culture Xray Beds MedSchool
## Min. : 2.20 Min. : 42.60 Min. : 52.0 Min. :1.00
## 1st Qu.:10.65 1st Qu.: 75.95 1st Qu.:133.2 1st Qu.:2.00
## Median :16.10 Median : 87.75 Median :195.5 Median :2.00
## Mean :18.28 Mean : 86.30 Mean :259.8 Mean :1.81
## 3rd Qu.:23.62 3rd Qu.: 98.97 3rd Qu.:316.5 3rd Qu.:2.00
## Max. :60.50 Max. :133.50 Max. :768.0 Max. :2.00
## Region Census Nurses Facilities
## Min. :1.000 Min. : 37.0 Min. : 14.0 Min. : 5.70
## 1st Qu.:1.000 1st Qu.:105.5 1st Qu.: 88.0 1st Qu.:37.10
## Median :2.000 Median :159.5 Median :152.0 Median :45.70
## Mean :1.552 Mean :198.5 Mean :184.4 Mean :46.01
## 3rd Qu.:2.000 3rd Qu.:249.0 3rd Qu.:201.5 3rd Qu.:54.30
## Max. :2.000 Max. :595.0 Max. :656.0 Max. :80.00
plot(d$InfctRsk, d$Stay,
xlab = "avg length of stay (days)",
ylab = "infection risk (% patients get infection)")
<- lm(InfctRsk ~ Stay, data = d)
res summary(res)
##
## Call:
## lm(formula = InfctRsk ~ Stay, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6145 -0.4660 0.1388 0.4970 2.4310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.15982 0.95580 -1.213 0.23
## Stay 0.56887 0.09416 6.041 1.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.024 on 56 degrees of freedom
## Multiple R-squared: 0.3946, Adjusted R-squared: 0.3838
## F-statistic: 36.5 on 1 and 56 DF, p-value: 1.302e-07
After fitting the model it is crucial to check whether the model assumptions are valid before performing inference. Assumptions are independency, normality, linearity and constant variance. If there any assumption violation, subsequent inferential procedures may be invalid.
We would also need to investigate unusual observations and whether or not they are influential.
\(\widehat{\mbox{risk}} = -1.15982 + 0.56887 \mbox{ stay}\)
\(\hat \beta_1 = 0.56887\).
We expect the average infection risk to increase by a factor of 0.56887 for each additional day in the length of stay.
\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)
\(\alpha = 0.05\)
\(T = \frac{\hat \beta_1-0}{SE_{\hat \beta_1}} = 6.041\)
p-value is the probability to obtain a test statistic equal to 6.041 or more extreme assuming the null hypothesis is true.
<- nrow(d)) (n
## [1] 58
2*(1-pt(6.041, n-2))
## [1] 1.303602e-07
p-value is approximately 0. p-value \(< \alpha\), so we reject the null hypothesis.
The data provide convincing evidence that length of stay is a significant predictor of infection risk.
\(R^2\) = 0.3946
Length of stay explains 39.46% of the variability in infection risk.
The mean infection risk when the average stay is 10 days is 4.5288.
\(\widehat{\mbox{risk}} = -1.15982 + 0.56887 \times \mbox{ 10} = -1.15982 + 0.56887 \times 10 = 4.5288\)
The standard error of fit \(\widehat{\mbox{risk}}\) measures the accuracy of \(\widehat{\mbox{risk}}\) as an estimate of the mean risk when the average stay is 10 days.
<- data.frame(Stay = 10)
new predict(lm(InfctRsk ~ Stay, data = d), new, interval = "confidence", se.fit = TRUE)
## $fit
## fit lwr upr
## 1 4.528846 4.259205 4.798486
##
## $se.fit
## [1] 0.1346021
##
## $df
## [1] 56
##
## $residual.scale
## [1] 1.024489
For the 95% CI, we say with 95% confidence we can estimate that in hospitals in which the average length of stay is 10 days, the mean infection risk is between 4.259205 and 4.798486.
<- data.frame(Stay = 10)
new predict(lm(InfctRsk ~ Stay, data = d), new, interval = "confidence", se.fit = TRUE)
## $fit
## fit lwr upr
## 1 4.528846 4.259205 4.798486
##
## $se.fit
## [1] 0.1346021
##
## $df
## [1] 56
##
## $residual.scale
## [1] 1.024489
We can calculate the 95% CI as
\(\widehat{risk} \pm t^* \times SE(\widehat{risk}) = 4.528846 \pm 2.003241 \times 0.1346021 = (4.259206, 4.798486)\)
c(4.528846 - 2.003241 * 0.1346021, 4.528846 + 2.003241 * 0.1346021)
## [1] 4.259206 4.798486
\(t^*\) is the value such that \(\alpha/2\) probability on the t(n-2) distribution is below \(t^*\).
<- nrow(d)
n qt(0.05/2, n-2)
## [1] -2.003241
For the 95% PI, we say with 95% confidence that for any future hospital where the average length of stay is 10 days, the infection risk is between 2.45891 and 6.598781.
<- data.frame(Stay = 10)
new predict(lm(InfctRsk ~ Stay, data = d), new, interval = "prediction", se.fit = TRUE)
## $fit
## fit lwr upr
## 1 4.528846 2.45891 6.598781
##
## $se.fit
## [1] 0.1346021
##
## $df
## [1] 56
##
## $residual.scale
## [1] 1.024489
We can calculate the 95% PI as
\(\widehat{risk} \pm t^* \times \sqrt{MSE + [SE(\widehat{risk})]^2} =\) \(4.528846 \pm 2.003241 \times \sqrt{ 1.024^2 + 0.1346021^2} = (2.459881, 6.597811)\)
c(4.528846 - 2.003241 * sqrt(1.024^2 + 0.1346021^2),
4.528846 + 2.003241 * sqrt(1.024^2 + 0.1346021^2))
## [1] 2.459881 6.597811
Residual standard error = \(\sqrt{MSE}\) so MSE = 1.024^2
Note that the formula of the standard error of the prediction is similar to the standard error of the fit when estimating the average risk when stay is 10. The standard error of the prediction just has an extra MSE term added that the standard error of the fit does not.
Prediction intervals are wider than confidence intervals because there is more uncertainty in predicting an individual value of the response variable than in estimating the mean of the response variable.
The formula for the prediction interval has the extra MSE term. A confidence interval for the mean response will always be narrower than the corresponding prediction interval for a new observation.
Furthermore, both intervals are narrowest at the mean of the predictor values.