A regression model describes the linear relationship between a continuous response variable \(y\) and one or more explanatory variables \(x_1, x_2, \dots, x_p\).
In simple linear regression there is one explanatory variable \[\hat y = b_0 + b_1 x_1\] Example: How does body mass index relate to physical activity?
In multiple linear regression there is more than one explanatory variable \[\hat y = b_0 + b_1 x_1 + ... + b_p x_p\] Example: How does systolic blood pressure relate to age and weight?
After fitting the model it is crucial to check whether the model assumptions are valid before performing inference. If any assumption is violated, subsequent inferential procedures may be invalid.
Assumptions:
We check the normality of the errors with a normal Q-Q plot of the residuals.
Constant variance and linearity are checked using plots of the residuals (\(y\)-axis) against each explanatory variable (\(x\)-axis) to assess the linearity between \(y\) and each \(x\), and the constant variance of \(y\) across each \(x\).
Outlier (unusual \(y\)): an observation whose response value \(y\) does not follow the general trend of the rest of the data. Observations with standardized residuals less than -2 or greater than 2 (or, with a stricter cut-off, less than -3 or greater than 3) are flagged as potential outliers. The larger the data set, the more flagged points we expect by chance alone: roughly 5% of the sample falls outside \(\pm 2\).
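The 5% figure comes from the normal distribution: about 4.6% of standard normal values fall outside \(\pm 2\), which can be checked directly in R.
# Expected fraction of standardized residuals beyond +/- 2 under normality
2 * pnorm(-2)  # approximately 0.0455, i.e. roughly 5%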
High leverage observation (unusual \(x\)): observation that has extreme predictor \(x\) values. An observation is usually considered to have high leverage if it has leverage (“hat”) greater than 2 times the average of all leverage values. That is, if \(h_{ii} > 2(p+1)/n\).
Influential observation: an observation is influential if its omission causes the statistical inference to change, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. Cook’s distance combines residuals and leverages in a single measure of influence. We examine in closer detail the points whose Cook’s distance is substantially larger than that of the rest.
A hospital administrator wants to study the association between patient satisfaction and patients’ age, severity of illness and anxiety level. Data were collected on 23 randomly selected patients, where larger values of Satisfaction, Severity and Anxiety correspond to more satisfaction, more severe illness and more anxiety, respectively.
d <- read.table("data/patsat.txt", header = TRUE)
head(d)
summary(d)
## Satisfaction Age Severity Anxiety
## Min. :26.00 Min. :28.00 Min. :43.00 Min. :1.800
## 1st Qu.:50.00 1st Qu.:33.00 1st Qu.:48.00 1st Qu.:2.150
## Median :60.00 Median :40.00 Median :50.00 Median :2.300
## Mean :61.35 Mean :39.61 Mean :50.78 Mean :2.296
## 3rd Qu.:73.50 3rd Qu.:44.50 3rd Qu.:53.50 3rd Qu.:2.400
## Max. :89.00 Max. :55.00 Max. :62.00 Max. :2.900
We summarise the data graphically using a matrix of scatterplots to examine all pairwise relationships.
plot(d)
We also compute the correlations.
cor(d)
## Satisfaction Age Severity Anxiety
## Satisfaction 1.0000000 -0.7736828 -0.5874444 -0.6023105
## Age -0.7736828 1.0000000 0.4666091 0.4976945
## Severity -0.5874444 0.4666091 1.0000000 0.7945275
## Anxiety -0.6023105 0.4976945 0.7945275 1.0000000
All explanatory variables by themselves are negatively and linearly related to Satisfaction.
The variable Age shows the strongest relationship with Satisfaction, followed by Anxiety and then Severity.
We need to perform a formal analysis using multiple regression to see which combination of variables is most useful to predict Satisfaction.
Collinearity occurs when two or more explanatory variables in a multiple regression model are highly correlated.
If the explanatory variables are strongly related to each other, they may carry overlapping information about the response, which makes it difficult to distinguish their individual contributions.
If this is the case, then it is likely that we would not need to use all the explanatory variables in the model. That is, we have some redundant explanatory variables.
Collinearity causes instability in the estimation of the \(\beta\) coefficients, but it is less of a problem for prediction.
We will try to avoid having very similar predictors in the model.
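Because collinearity concerns relationships among the predictors themselves, it can help to look at their correlations separately from the response; from the correlation matrix above, Severity and Anxiety are the most strongly related pair (\(r \approx 0.79\)).
# Correlations among the explanatory variables only
cor(d[, c("Age", "Severity", "Anxiety")])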
res <- lm(Satisfaction ~ Age + Severity + Anxiety, data = d)
summary(res)
##
## Call:
## lm(formula = Satisfaction ~ Age + Severity + Anxiety, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.954 -7.154 1.550 6.599 14.888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.8759 25.7757 6.319 4.59e-06 ***
## Age -1.2103 0.3015 -4.015 0.00074 ***
## Severity -0.6659 0.8210 -0.811 0.42736
## Anxiety -8.6130 12.2413 -0.704 0.49021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.29 on 19 degrees of freedom
## Multiple R-squared: 0.6727, Adjusted R-squared: 0.621
## F-statistic: 13.01 on 3 and 19 DF, p-value: 7.482e-05
\[\widehat{\mbox{Satisfaction}} = 162.9 - 1.210\,\mbox{Age} - 0.666\,\mbox{Severity} - 8.613\,\mbox{Anxiety}\]
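The estimated coefficients can be extracted directly, and the fitted equation can be used for prediction; the patient profile below (age 40, severity 50, anxiety 2.3, close to the sample medians) is a hypothetical illustration, not one of the study patients.
coef(res)  # estimated intercept and slopes

# Predicted satisfaction for a hypothetical patient profile
predict(res, newdata = data.frame(Age = 40, Severity = 50, Anxiety = 2.3))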
Test null hypothesis ‘response variable \(y\) is not linearly related to variable \(x_i\)’ versus alternative hypothesis ‘\(y\) is linearly related to \(x_i\)’, \(i=1,\ldots,p\)
\(H_0\): \(\beta_{i}\) = 0
\(H_1\): \(\beta_{i} \neq 0\)
Test statistic: t-statistic = estimate of the coefficient / SE of coefficient
\[t = \frac{b_i - \beta_i}{SE_{b_i}} = \frac{b_i}{SE_{b_i}} \quad \mbox{under } H_0: \beta_i = 0\]
Distribution: t(n-p-1) where \(p\) is the number of \(\beta\) parameters excluding \(\beta_0\)
t distribution with degrees of freedom = number of observations - number of explanatory variables - 1
p-value: probability of obtaining a test statistic at least as extreme as the observed one, assuming the null hypothesis is true.
For \(H_1: \beta_i \neq 0\), the p-value is \(2P(T\geq |t|)\), where \(T \sim t(n-p-1)\).
summary(res)
##
## Call:
## lm(formula = Satisfaction ~ Age + Severity + Anxiety, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.954 -7.154 1.550 6.599 14.888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.8759 25.7757 6.319 4.59e-06 ***
## Age -1.2103 0.3015 -4.015 0.00074 ***
## Severity -0.6659 0.8210 -0.811 0.42736
## Anxiety -8.6130 12.2413 -0.704 0.49021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.29 on 19 degrees of freedom
## Multiple R-squared: 0.6727, Adjusted R-squared: 0.621
## F-statistic: 13.01 on 3 and 19 DF, p-value: 7.482e-05
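Each t value in the coefficients table is just the estimate divided by its standard error; as a check, here is the calculation for Age done by hand (a minimal sketch assuming the fitted object res from above).
# Reproduce the t statistic and two-sided p-value for Age
ct <- coef(summary(res))
t_age <- ct["Age", "Estimate"] / ct["Age", "Std. Error"]
t_age  # matches the -4.015 reported above
2 * pt(abs(t_age), df = df.residual(res), lower.tail = FALSE)  # two-sided p-value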
A \((1-\alpha)\times 100\)% confidence interval for the slope \(\beta_i\) is \[b_i \pm t^* \times SE_{b_i}\]
\(t^*\) is the value on the \(t(n-p-1)\) density curve with area \(1-\alpha\) between the critical points \(-t^*\) and \(t^*\). That is, \(t^*\) is the value such that \(\alpha/2\) of the probability in the \(t(n-p-1)\) distribution lies above \(t^*\).
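In R the intervals can be obtained with confint(), or built by hand from the critical value (a minimal sketch using the fitted object res from above).
# 95% confidence intervals for all coefficients
confint(res, level = 0.95)

# The interval for Age built by hand
tstar <- qt(0.975, df = df.residual(res))  # critical value for alpha = 0.05
ct <- coef(summary(res))
ct["Age", "Estimate"] + c(-1, 1) * tstar * ct["Age", "Std. Error"]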
Test null hypothesis ‘none of the explanatory variables are useful predictors of the response variable’ versus alternative hypothesis ‘at least one of the explanatory variables is related to the response variable’
\(H_0\): \(\beta_{1} =\beta_{2}=...=\beta_{p}\) = 0
\(H_1\): at least one of \(\beta_1\), \(\beta_2\), \(\dots\), \(\beta_p \neq\) 0
ANOVA table for multiple regression
Source | Sum of Squares | Degrees of freedom | Mean square | F | p-value |
---|---|---|---|---|---|
Model | SSM | p | MSM = SSM/p | MSM/MSE | Tail area above F |
Error | SSE | n-p-1 | MSE = SSE/(n-p-1) | | |
Total | SST | n-1 | | | |
The total variation in the response \(y\) is partitioned into two sources: variation due to the model, i.e., because \(y\) changes as the \(x\)’s change, and variation due to the residuals, i.e., the scatter above and below the fitted regression surface.
Test statistic: F-statistic = mean squares model / mean squares residual
\[F = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2/p}{\sum_{i=1}^n ( y_i - \hat y_i)^2/(n-p-1)} = \frac{SSM/DFM}{SSE/DFE} = \frac{MSM}{MSE},\] where \(p\) is the number of \(\beta\) parameters excluding \(\beta_0\) in \(Y= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon\).
Distribution: \(F(p, n-p-1)\), the F distribution with degrees of freedom = (number of explanatory variables, number of observations - number of explanatory variables - 1).
p-value: probability that test statistic has a value as extreme or more extreme than the observed
p-value = \(P(F \geq F_{\mbox{observed}})\)
summary(res)
##
## Call:
## lm(formula = Satisfaction ~ Age + Severity + Anxiety, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.954 -7.154 1.550 6.599 14.888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.8759 25.7757 6.319 4.59e-06 ***
## Age -1.2103 0.3015 -4.015 0.00074 ***
## Severity -0.6659 0.8210 -0.811 0.42736
## Anxiety -8.6130 12.2413 -0.704 0.49021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.29 on 19 degrees of freedom
## Multiple R-squared: 0.6727, Adjusted R-squared: 0.621
## F-statistic: 13.01 on 3 and 19 DF, p-value: 7.482e-05
p-value = 7.48e-05 < 0.001. There is strong evidence that at least one of the explanatory variables in the model is related to the response.
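The F statistic can also be reconstructed from the sums of squares (a minimal sketch assuming res and d from above).
# Reproduce the overall F statistic and its p-value
n <- nrow(d); p <- 3
SST <- sum((d$Satisfaction - mean(d$Satisfaction))^2)
SSE <- sum(residuals(res)^2)
SSM <- SST - SSE
Fstat <- (SSM / p) / (SSE / (n - p - 1))
Fstat  # matches the 13.01 reported above
pf(Fstat, p, n - p - 1, lower.tail = FALSE)  # p-value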
Root Mean Square Error (RMSE)
\[RMSE = \sqrt{\sum_{i=1}^n ( y_i - \hat y_i)^2/(n-p-1)} = \sqrt{MSE}\] RMSE appears in the output as ‘Residual standard error: 10.29 on 19 degrees of freedom’.
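In R this value can be extracted with sigma() or computed from the residuals (assuming the fitted object res from above).
# Residual standard error (RMSE)
sigma(res)  # extractor for the residual standard error
sqrt(sum(residuals(res)^2) / df.residual(res))  # same value from the formula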
R\(^2\) (R-squared or coefficient of determination) is the proportion of variation of the response variable \(y\) (vertical scatter from the regression surface) that is explained by the model (explained by the changes in the \(x\)’s).
\[R^2 = \frac{\mbox{variability explained by the model}}{\mbox{total variability}} = 1 - \frac{\mbox{variability in residuals}}{\mbox{total variability}}\]
\[R^2 = \frac{SSM}{SST} = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n ( y_i - \bar y)^2}\] or equivalently
\[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^n ( y_i - \hat y_i)^2}{\sum_{i=1}^n ( y_i - \bar y)^2}\]
For simple linear regression, R\(^2\) is equal to the square of the correlation between \(y\) and \(x\), \(R^2 = r^2\)
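The two equivalent formulas above can be verified directly from the fitted model (a minimal sketch assuming res and d from above).
# R-squared from the two equivalent formulas
SST <- sum((d$Satisfaction - mean(d$Satisfaction))^2)
SSE <- sum(residuals(res)^2)
(SST - SSE) / SST  # SSM / SST
1 - SSE / SST      # same value
summary(res)$r.squared  # value reported by summary(), 0.6727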
\(R^2\) will increase as more explanatory variables are added to a regression model. However, adding more explanatory variables does not necessarily lead to a better model. We look at R\(^2\)(adj).
R\(^2\)(adj) or Adjusted R-squared is the proportion of variation of the response variable explained by the fitted model adjusting for the number of parameters.
We use the adjusted \(R^2\) to assess goodness of fit of the model when comparing models with different numbers of parameters.
\[R^2(adj) = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{SSE}{SST} \times \frac{n-1}{n-p-1},\] where \(n\) is the number of observations and \(p\) is the number of predictor variables in the model. Because \((n-1)/(n-p-1) \geq 1\) whenever \(p \geq 1\), \(R^2(adj)\) is never larger than the unadjusted \(R^2\).
In this example, R\(^2\)(adj) = 62.1%: about 62% of the variation in Satisfaction is explained by the model, adjusting for the number of parameters.
summary(res)$adj.r.squared
## [1] 0.6209731
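The same value follows from the formula, with \(n = 23\) observations and \(p = 3\) predictors.
# Adjusted R-squared computed from the formula
n <- nrow(d); p <- 3
SSE <- sum(residuals(res)^2)
SST <- sum((d$Satisfaction - mean(d$Satisfaction))^2)
1 - (SSE / (n - p - 1)) / (SST / (n - 1))  # matches summary(res)$adj.r.squared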
Standardized residuals
A standardized residual is an ordinary residual divided by an estimate of its standard deviation.
\[e_i = y_i - \hat{y_i}\]
\[r_i = \frac{e_i}{se(e_i)} = \frac{e_i}{\hat \sigma \sqrt{1-h_{ii}}} = \frac{e_i}{ \sqrt{MSE (1-h_{ii})}}\]
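In R, rstandard() computes these; the formula can be checked by hand (a minimal sketch assuming the fitted object res from above).
# Standardized residuals: rstandard() versus the formula
e <- residuals(res)
h <- hatvalues(res)
r_manual <- e / (sigma(res) * sqrt(1 - h))
all.equal(unname(r_manual), unname(rstandard(res)))  # TRUE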
Random sample. The 23 patients were randomly selected, so this assumption is satisfied by the study design.
Normality. To check normality we look at the normal Q-Q plot of the residuals. The data show some departure from normality because the plot does not closely resemble a straight line; this could be due to the small sample size of only 23 observations.
plot(res, 2)
Linearity. The plots of standardized residuals versus each explanatory variable (below) show random scatter. This indicates that assuming the effects of age, severity and anxiety level are linear is reasonable.
Constant variance. The plots suggest that the variance of \(y\), \(\sigma^2\), is the same for all values of \(x\) since we do not see any structure in the residuals.
par(mfrow = c(2, 2))
plot(d$Age, rstandard(res), main = "Standardized residuals versus Age")
abline(h = 0, lty = 2)
plot(d$Severity, rstandard(res), main = "Standardized residuals versus Severity")
abline(h = 0, lty = 2)
plot(d$Anxiety, rstandard(res), main = "Standardized residuals versus Anxiety")
abline(h = 0, lty = 2)
Outliers (unusual \(y\)). Standardized residuals less than -2 or greater than 2.
# Outliers
op <- which(abs(rstandard(res)) > 2)
op
## named integer(0)
High leverage observations (unusual \(x\)). Leverage greater than 2 times the average of all leverage values. That is, \(h_{ii} > 2(p+1)/n\).
# High leverage observations
h <- hatvalues(res)
lp <- which(h > 2 * length(res$coefficients) / nrow(d))
lp
## named integer(0)
Influential values. Cook’s distance higher than \(4/n\).
# Influential observations
ip <- which(cooks.distance(res) > 4/nrow(d))
ip
## 11 13 15 17
## 11 13 15 17
plot(res, 4)
d[unique(c(lp, op, ip)), ]
There are no outliers or high leverage observations. Observations 11, 13, 15 and 17 have the largest Cook’s distances, but these are not substantially different from the rest.