1 Multiple linear regression

Learning objectives

  • Multiple linear regression (model, assumptions, unusual observations)

1.1 Regression

A regression model describes the linear relationship between a continuous response variable \(y\) and one or more explanatory variables \(x_1, x_2, \dots, x_p\).

In simple linear regression there is one explanatory variable \[\hat y = b_0 + b_1 x_1\] Example: How does body mass index relate to physical activity?

In multiple linear regression there is more than one explanatory variable \[\hat y = b_0 + b_1 x_1 + \dots + b_p x_p\] Example: How does systolic blood pressure relate to age and weight?

  • \(\hat y\) is the estimated mean (expected) response
  • \(b_0\) is the value of \(\hat y\) when \(x_1=x_2=\dots=x_p=0\)
  • \(b_i\) is the increase (or decrease) in \(\hat y\) when \(x_i\) increases by one unit, holding the other variables constant, \(i=1,\ldots,p\)
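
In R, such a model is fitted with lm(); a schematic sketch, where the data frame dat and the variables y, x1 and x2 are hypothetical names used only for illustration:

# Schematic multiple regression fit; dat, y, x1 and x2 are hypothetical
fit <- lm(y ~ x1 + x2, data = dat)
coef(fit)   # b0, b1, b2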

1.2 Assumptions

After fitting the model it is crucial to check whether the model assumptions are valid before performing inference. If any assumption is violated, subsequent inferential procedures may be invalid.

Assumptions:

  • Independence: the errors are independent
  • Normality: the errors are normally distributed with mean 0
  • Constant variance: the error variance \(\sigma^2\) is the same for all values of \(x\)
  • Linearity: the relationship between \(y\) and \(x\) is linear

We check normality of the errors with a normal Q-Q plot of the residuals.

Constant variance and linearity are checked using plots of the residuals (\(y\)-axis) against each explanatory variable (\(x\)-axis) to assess the linearity between \(y\) and each \(x\), and the constant variance of \(y\) across each \(x\).

1.3 Unusual observations

Outlier (unusual \(y\)): an observation whose response value \(y\) does not follow the general trend of the rest of the data. Outliers have standardized residuals less than -2 or greater than 2 (or, more strictly, less than -3 or greater than 3). The larger the data set, the more outliers we expect to flag: with the \(\pm 2\) cutoff, about 5% of the sample.

High leverage observation (unusual \(x\)): an observation that has extreme predictor \(x\) values. An observation is usually considered to have high leverage if its leverage ("hat" value) is greater than 2 times the average of all leverage values. Since the leverages \(h_{ii}\) sum to \(p+1\), their average is \((p+1)/n\), so the rule is \(h_{ii} > 2(p+1)/n\).

Influential observation: an observation is influential if its omission causes the statistical inference to change, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. Cook’s distance combines residuals and leverages in a single measure of influence. We examine in closer detail the points whose Cook’s distance is substantially larger than the rest.

1.4 Example. Patient satisfaction

A hospital administrator wants to study the association between patient satisfaction and patients’ age, severity of illness, and anxiety level. Data were collected on 23 randomly selected patients; larger values of Satisfaction, Severity, and Anxiety correspond to more satisfaction, more severe illness, and more anxiety, respectively.

d <- read.table("data/patsat.txt", header = TRUE)
head(d)
summary(d)
##   Satisfaction        Age           Severity        Anxiety     
##  Min.   :26.00   Min.   :28.00   Min.   :43.00   Min.   :1.800  
##  1st Qu.:50.00   1st Qu.:33.00   1st Qu.:48.00   1st Qu.:2.150  
##  Median :60.00   Median :40.00   Median :50.00   Median :2.300  
##  Mean   :61.35   Mean   :39.61   Mean   :50.78   Mean   :2.296  
##  3rd Qu.:73.50   3rd Qu.:44.50   3rd Qu.:53.50   3rd Qu.:2.400  
##  Max.   :89.00   Max.   :55.00   Max.   :62.00   Max.   :2.900

1.5 Relationships between pairs of variables

We summarise the data graphically using a matrix of scatterplots to examine all pairwise relationships.

plot(d)

We also compute the correlations.

cor(d)
##              Satisfaction        Age   Severity    Anxiety
## Satisfaction    1.0000000 -0.7736828 -0.5874444 -0.6023105
## Age            -0.7736828  1.0000000  0.4666091  0.4976945
## Severity       -0.5874444  0.4666091  1.0000000  0.7945275
## Anxiety        -0.6023105  0.4976945  0.7945275  1.0000000

All explanatory variables by themselves are negatively and linearly related to Satisfaction.

The variable Age shows the strongest relationship with Satisfaction, followed by Anxiety and then Severity.

We need to perform a formal analysis using multiple regression to see which combination of variables is most useful to predict Satisfaction.

1.6 Collinearity

Collinearity occurs when two or more explanatory variables in a multiple regression model are highly correlated.

If the explanatory variables are strongly related to each other, they may affect the response in the same way, and this makes it difficult to distinguish their individual contributions.

If this is the case, then it is likely that we would not need to use all the explanatory variables in the model. That is, we have some redundant explanatory variables.

Collinearity causes instability in the estimation of \(\beta\) but it is not so much of a problem for prediction.

We will try to avoid having very similar predictors in the model.
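
One common numerical diagnostic, not used elsewhere in these notes, is the variance inflation factor \(VIF_i = 1/(1-R_i^2)\), where \(R_i^2\) comes from regressing predictor \(x_i\) on the remaining predictors; values well above 5–10 are often taken to indicate problematic collinearity. A minimal base-R sketch (vif_manual is our own helper, not a library function):

vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    # R^2 from regressing predictor v on the remaining predictors
    r2 <- summary(lm(reformulate(setdiff(predictors, v), response = v),
                     data = data))$r.squared
    1 / (1 - r2)  # variance inflation factor for v
  })
}
vif_manual(d, c("Age", "Severity", "Anxiety"))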

1.7 Regression equation

res <- lm(Satisfaction ~ Age + Severity + Anxiety, data = d)
summary(res)
## 
## Call:
## lm(formula = Satisfaction ~ Age + Severity + Anxiety, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.954  -7.154   1.550   6.599  14.888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 162.8759    25.7757   6.319 4.59e-06 ***
## Age          -1.2103     0.3015  -4.015  0.00074 ***
## Severity     -0.6659     0.8210  -0.811  0.42736    
## Anxiety      -8.6130    12.2413  -0.704  0.49021    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.29 on 19 degrees of freedom
## Multiple R-squared:  0.6727, Adjusted R-squared:  0.621 
## F-statistic: 13.01 on 3 and 19 DF,  p-value: 7.482e-05

\[\widehat{\text{satisfaction}} = 162.9 - 1.210\,\text{age} - 0.666\,\text{severity} - 8.6\,\text{anxiety}\]

  • mean patient satisfaction decreases by 1.210 for every unit increase in age, holding the other variables constant
  • mean patient satisfaction decreases by 0.666 for every unit increase in severity of illness, holding the other variables constant
  • mean patient satisfaction decreases by 8.6 for every unit increase in anxiety, holding the other variables constant
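
The fitted equation can also be used to estimate the mean satisfaction of a new patient; a minimal sketch with illustrative (hypothetical) predictor values:

# Estimated mean satisfaction for a hypothetical patient
# aged 40, with severity 50 and anxiety 2.3
predict(res, newdata = data.frame(Age = 40, Severity = 50, Anxiety = 2.3))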

1.8 Tests for coefficients. t test

Test the null hypothesis ‘response variable \(y\) is not linearly related to variable \(x_i\), allowing for the other explanatory variables’ versus the alternative hypothesis ‘\(y\) is linearly related to \(x_i\)’, \(i=1,\ldots,p\).

\(H_0\): \(\beta_{i}\) = 0
\(H_1\): \(\beta_{i} \neq 0\)

Test statistic: t-statistic = estimate of the coefficient / SE of the coefficient. Under \(H_0: \beta_i = 0\),

\[t = \frac{b_i - 0}{SE_{b_i}} = \frac{b_i}{SE_{b_i}}\]

Distribution: \(t(n-p-1)\), where \(p\) is the number of \(\beta\) parameters excluding \(\beta_0\); that is, a t distribution with degrees of freedom = number of observations - number of explanatory variables - 1.

p-value: probability of obtaining a test statistic at least as extreme as the observed one, assuming the null hypothesis is true.

For \(H_1: \beta_i \neq 0\), the p-value is \(2P(T\geq |t|)\), where \(T \sim t(n-p-1)\).
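
As a check, the t statistic and two-sided p-value for Age in the output below can be reproduced from the coefficient table; a minimal base-R sketch:

ctab <- coef(summary(res))    # matrix of estimates, SEs, t values, p-values
tstat <- ctab["Age", "Estimate"] / ctab["Age", "Std. Error"]
tstat
2 * pt(abs(tstat), df = df.residual(res), lower.tail = FALSE)   # two-sided p-value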

summary(res)
## 
## Call:
## lm(formula = Satisfaction ~ Age + Severity + Anxiety, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.954  -7.154   1.550   6.599  14.888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 162.8759    25.7757   6.319 4.59e-06 ***
## Age          -1.2103     0.3015  -4.015  0.00074 ***
## Severity     -0.6659     0.8210  -0.811  0.42736    
## Anxiety      -8.6130    12.2413  -0.704  0.49021    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.29 on 19 degrees of freedom
## Multiple R-squared:  0.6727, Adjusted R-squared:  0.621 
## F-statistic: 13.01 on 3 and 19 DF,  p-value: 7.482e-05
  • p-value = 0.001, hence there is strong evidence that age affects mean patient satisfaction, allowing for severity and anxiety.
  • p-value = 0.427, hence there is no evidence that severity affects mean patient satisfaction, allowing for age and anxiety.
  • p-value = 0.490, hence there is no evidence that anxiety affects mean patient satisfaction, allowing for age and severity.

1.9 Confidence interval for the slope \(\beta_i\)

A \((1-\alpha)\times 100\)% confidence interval for the slope \(\beta_i\) is \[b_i \pm t^* \times SE_{b_i}\]

\(t^*\) is the value on the \(t(n-p-1)\) density curve with area \(1-\alpha\) between the critical points \(-t^*\) and \(t^*\). That is, \(t^*\) is the value such that \(\alpha/2\) of the probability in the \(t(n-p-1)\) distribution lies above \(t^*\).
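
In R these intervals are returned directly by confint(); a minimal sketch using the fitted model res from above:

confint(res, level = 0.95)   # 95% confidence intervals for all coefficients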

1.10 ANOVA. F test

Test null hypothesis ‘none of the explanatory variables are useful predictors of the response variable’ versus alternative hypothesis ‘at least one of the explanatory variables is related to the response variable’

\(H_0\): \(\beta_{1} = \beta_{2} = \dots = \beta_{p} = 0\)
\(H_1\): at least one of \(\beta_1\), \(\beta_2\), \(\dots\), \(\beta_p \neq\) 0

ANOVA table for multiple regression

Source   Sum of Squares   Degrees of freedom   Mean square         F         p-value
Model    SSM              p                    MSM = SSM/p         MSM/MSE   Tail area above F
Error    SSE              n-p-1                MSE = SSE/(n-p-1)
Total    SST              n-1

The total variation in the response \(y\) is partitioned into two sources: variation due to the model, i.e., because \(y\) changes as the \(x\)’s change, and variation due to the residuals, i.e., the scatter above and below the fitted regression surface.

Test statistic: F-statistic = mean squares model / mean squares residual

\[F = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2/p}{\sum_{i=1}^n ( y_i - \hat y_i)^2/(n-p-1)} = \frac{SSM/DFM}{SSE/DFE} = \frac{MSM}{MSE},\] where \(p\) is the number of \(\beta\) parameters excluding \(\beta_0\) in \(Y= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon\).

Distribution: \(F(p, n-p-1)\), the F distribution with degrees of freedom = (number of explanatory variables, number of observations - number of explanatory variables - 1).

p-value: probability that the test statistic takes a value as extreme as or more extreme than the observed one

p-value = \(P(F \geq F_{\mbox{observed}})\)
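
The F statistic in the output below can be reproduced from the sums of squares; a minimal base-R sketch:

n <- nrow(d)
p <- 3                                                # number of explanatory variables
SST <- sum((d$Satisfaction - mean(d$Satisfaction))^2)
SSE <- sum(residuals(res)^2)
SSM <- SST - SSE
Fstat <- (SSM / p) / (SSE / (n - p - 1))
Fstat                                                 # compare with 13.01 in the output
pf(Fstat, p, n - p - 1, lower.tail = FALSE)           # tail area above F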

summary(res)
## 
## Call:
## lm(formula = Satisfaction ~ Age + Severity + Anxiety, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.954  -7.154   1.550   6.599  14.888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 162.8759    25.7757   6.319 4.59e-06 ***
## Age          -1.2103     0.3015  -4.015  0.00074 ***
## Severity     -0.6659     0.8210  -0.811  0.42736    
## Anxiety      -8.6130    12.2413  -0.704  0.49021    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.29 on 19 degrees of freedom
## Multiple R-squared:  0.6727, Adjusted R-squared:  0.621 
## F-statistic: 13.01 on 3 and 19 DF,  p-value: 7.482e-05

p-value = 7.5e-05 < 0.001. There is strong evidence that at least one of the explanatory variables in the model is related to the response.

Root Mean Square Error (RMSE)

\[RMSE = \sqrt{\sum_{i=1}^n ( y_i - \hat y_i)^2/(n-p-1)} = \sqrt{MSE}\] RMSE appears as ‘Residual standard error: 10.29 on 19 degrees of freedom’ in the output.
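
This value can be extracted from the fitted model or recomputed from the formula; a minimal sketch:

summary(res)$sigma                               # residual standard error
sqrt(sum(residuals(res)^2) / df.residual(res))   # same value from the formula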

1.11 R\(^2\)(adj)

R\(^2\) (R-squared or coefficient of determination) is the proportion of variation of the response variable \(y\) (vertical scatter from the regression surface) that is explained by the model (explained by the changes in the \(x\)’s).

\[R^2 = \frac{\mbox{variability explained by the model}}{\mbox{total variability}} = 1 - \frac{\mbox{variability in residuals}}{\mbox{total variability}}\]

  • R\(^2\) measures how successfully the regression explains the response.
  • If the data lie exactly on the regression surface then R\(^2\) = 1.
  • If the data are randomly scattered then R\(^2\) = 0.

\[R^2 = \frac{SSM}{SST} = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n ( y_i - \bar y)^2}\] or equivalently

\[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^n ( y_i - \hat y_i)^2}{\sum_{i=1}^n ( y_i - \bar y)^2}\]
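
A quick numerical check of these formulas, reusing the fitted model res (a minimal sketch):

SST <- sum((d$Satisfaction - mean(d$Satisfaction))^2)   # total sum of squares
SSE <- sum(residuals(res)^2)                            # error sum of squares
1 - SSE / SST                                           # matches summary(res)$r.squared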

For simple linear regression, R\(^2\) is equal to the square of the correlation between \(y\) and \(x\), \(R^2 = r^2\)

\(R^2\) will increase as more explanatory variables are added to a regression model. However, adding more explanatory variables does not necessarily lead to a better model. We look at R\(^2\)(adj).

R\(^2\)(adj) or Adjusted R-squared is the proportion of variation of the response variable explained by the fitted model adjusting for the number of parameters.

We use the adjusted \(R^2\) to assess goodness of fit of the model when comparing models with different numbers of parameters.

\[R^2(adj) = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{SSE}{SST} \times \frac{n-1}{n-p-1},\] where \(n\) is the number of observations and \(p\) is the number of predictor variables in the model. Because \(p \geq 1\), \(R^2(adj)\) is smaller than the unadjusted \(R^2\).
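
The same check for the adjusted version, with SSE and SST as computed in the sketch above:

n <- nrow(d)
p <- 3
1 - (SSE / (n - p - 1)) / (SST / (n - 1))   # matches summary(res)$adj.r.squared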

In this example, R\(^2\)(adj) = 62.1%. The model explains about 62% of the variation in patient satisfaction, adjusting for the number of parameters.

summary(res)$adj.r.squared
## [1] 0.6209731

1.12 Check assumptions

Standardized residuals

Standardized residual is an ordinary residual divided by an estimate of its standard deviation.

\[e_i = y_i - \hat{y_i}\]

\[r_i = \frac{e_i}{se(e_i)} = \frac{e_i}{\hat \sigma \sqrt{1-h_{ii}}} = \frac{e_i}{ \sqrt{MSE (1-h_{ii})}}\]
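
In R, rstandard() computes exactly these standardized residuals; it is used in the diagnostic plots below.

head(rstandard(res))   # standardized residuals r_i for the first few observations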

1.12.1 Independence

The 23 patients were randomly selected, so it is reasonable to treat the errors as independent.

1.12.2 Normality

To check normality we look at the normal Q-Q plot of the residuals. The data show some departure from normality because the plot does not closely resemble a straight line. This departure could be due to the small sample size: there are only 23 observations.

plot(res, 2)

1.12.3 Linearity and constant variance

Linearity. The plots of the residuals versus each explanatory variable show random scatter. This indicates that assuming the effects of age, severity and anxiety level are linear is reasonable.

Constant variance. The plots suggest that the variance of \(y\), \(\sigma^2\), is the same for all values of \(x\) since we do not see any structure in the residuals.

par(mfrow = c(2, 2))
plot(d$Age, rstandard(res), main = "Standardized residuals versus Age")
abline(h = 0, lty = 2)
plot(d$Severity, rstandard(res), main = "Standardized residuals versus Severity")
abline(h = 0, lty = 2)
plot(d$Anxiety, rstandard(res), main = "Standardized residuals versus Anxiety")
abline(h = 0, lty = 2)

1.13 Unusual observations

Outliers (unusual \(y\)). Standardized residuals less than -2 or greater than 2.

# Outliers
op <- which(abs(rstandard(res)) > 2)
op
## named integer(0)

High leverage observations (unusual \(x\)). Leverage greater than 2 times the average of all leverage values. That is, \(h_{ii} > 2(p+1)/n\).

# High leverage observations: leverage > 2(p+1)/n
h <- hatvalues(res)
lp <- which(h > 2 * length(res$coefficients) / nrow(d))   # length(res$coefficients) = p+1
lp
## named integer(0)

Influential values. Cook’s distance higher than \(4/n\).

# Influential observations
ip <- which(cooks.distance(res) > 4/nrow(d))
ip
## 11 13 15 17 
## 11 13 15 17
plot(res, 4)

d[unique(c(lp, op, ip)), ]

There are no outliers or high leverage observations. Observations 11, 13, 15 and 17 have Cook’s distances above the \(4/n\) cutoff, but their values are not substantially larger than the rest, so we do not consider them a serious concern.