10 Example code: multiple linear regression (MLR)

10.1 Patient satisfaction

A hospital administrator wants to study the association between patient satisfaction and patients' age, severity of illness and anxiety level. Data were collected on 23 randomly selected patients; larger values of Satisfaction, Severity and Anxiety indicate greater satisfaction, a more severe illness and higher anxiety, respectively.

d <- read.table("data/patsat.txt", header = TRUE)
head(d)
  Satisfaction Age Severity Anxiety
1           48  50       51     2.3
2           57  36       46     2.3
3           66  40       48     2.2
4           70  41       44     1.8
5           89  28       43     1.8
6           36  49       54     2.9
summary(d)
  Satisfaction        Age           Severity        Anxiety     
 Min.   :26.00   Min.   :28.00   Min.   :43.00   Min.   :1.800  
 1st Qu.:50.00   1st Qu.:33.00   1st Qu.:48.00   1st Qu.:2.150  
 Median :60.00   Median :40.00   Median :50.00   Median :2.300  
 Mean   :61.35   Mean   :39.61   Mean   :50.78   Mean   :2.296  
 3rd Qu.:73.50   3rd Qu.:44.50   3rd Qu.:53.50   3rd Qu.:2.400  
 Max.   :89.00   Max.   :55.00   Max.   :62.00   Max.   :2.900  

10.2 Relationships between pairs of variables

We summarise the data graphically using a matrix of scatterplots to examine all pairwise relationships.

plot(d)

We also compute the correlations.

cor(d)
             Satisfaction        Age   Severity    Anxiety
Satisfaction    1.0000000 -0.7736828 -0.5874444 -0.6023105
Age            -0.7736828  1.0000000  0.4666091  0.4976945
Severity       -0.5874444  0.4666091  1.0000000  0.7945275
Anxiety        -0.6023105  0.4976945  0.7945275  1.0000000

Each explanatory variable, considered on its own, is negatively and approximately linearly related to Satisfaction.

The variable Age shows the strongest relationship with Satisfaction (r = -0.77), followed by Anxiety (r = -0.60) and then Severity (r = -0.59).

We need to perform a formal analysis using multiple regression to see which combination of variables is most useful to predict Satisfaction.

10.3 Collinearity

Collinearity occurs when two or more explanatory variables in a multiple regression model are highly correlated.

If the explanatory variables are strongly related to each other, they may affect the response in similar ways, which makes it difficult to separate their individual contributions.

If this is the case, then it is likely that we would not need to use all the explanatory variables in the model. That is, we have some redundant explanatory variables.

Collinearity causes instability in the estimation of \(\boldsymbol{\beta}\) but it is not so much of a problem for prediction.

We will try to avoid having very similar predictors in the model.
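In this example, Severity and Anxiety are fairly strongly correlated (r = 0.79). A common numerical check for collinearity is the variance inflation factor (VIF); a minimal sketch, assuming the car package is available (vif() is not in base R):

# install.packages("car")   # if not already installed
library(car)
vif(lm(Satisfaction ~ Age + Severity + Anxiety, data = d))
# VIF values well above 5-10 would suggest problematic collinearity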

10.4 Regression equation

res <- lm(Satisfaction ~ Age + Severity + Anxiety, data = d)
summary(res)

Call:
lm(formula = Satisfaction ~ Age + Severity + Anxiety, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.954  -7.154   1.550   6.599  14.888 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 162.8759    25.7757   6.319 4.59e-06 ***
Age          -1.2103     0.3015  -4.015  0.00074 ***
Severity     -0.6659     0.8210  -0.811  0.42736    
Anxiety      -8.6130    12.2413  -0.704  0.49021    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.29 on 19 degrees of freedom
Multiple R-squared:  0.6727,    Adjusted R-squared:  0.621 
F-statistic: 13.01 on 3 and 19 DF,  p-value: 7.482e-05

\[\widehat{\mbox{Satisfaction}} = 162.9 - 1.210 \, \mbox{Age} - 0.666 \, \mbox{Severity} - 8.61 \, \mbox{Anxiety}\]

  • mean patient satisfaction decreases by 1.210 for every one-unit increase in age, holding the other variables constant
  • mean patient satisfaction decreases by 0.666 for every one-unit increase in severity of illness, holding the other variables constant
  • mean patient satisfaction decreases by 8.61 for every one-unit increase in anxiety, holding the other variables constant
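To see the fitted equation in action, we could predict the mean satisfaction for a hypothetical patient; the covariate values below are made up for illustration.

# Predicted mean satisfaction for a hypothetical patient aged 40
# with severity 50 and anxiety 2.3
predict(res, newdata = data.frame(Age = 40, Severity = 50, Anxiety = 2.3))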

10.5 Tests for coefficients. t test

Test null hypothesis ‘response variable \(Y\) is not linearly related to variable \(X_i\)’ versus alternative hypothesis ‘\(Y\) is linearly related to \(X_i\)’, \(i=1,\ldots,p\)

\(H_0\): \(\beta_{i}\) = 0
\(H_1\): \(\beta_{i} \neq 0\)

Test statistic: t-statistic = estimate of the coefficient / SE of coefficient

\[t = \frac{\hat \beta_i - 0}{SE_{\hat \beta_i}} = \frac{\hat \beta_i}{SE_{\hat \beta_i}}\]

Distribution: t(n-p-1) where \(p\) is the number of \(\beta\) parameters excluding \(\beta_0\)
t distribution with degrees of freedom = number of observations - number of explanatory variables - 1

p-value: probability of obtaining a test statistic at least as extreme as the observed value, assuming the null hypothesis is true

For \(H_1: \beta_i \neq 0\), the p-value is \(2P(T\geq |t|)\), where \(T \sim t(n-p-1)\).

summary(res)

Call:
lm(formula = Satisfaction ~ Age + Severity + Anxiety, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.954  -7.154   1.550   6.599  14.888 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 162.8759    25.7757   6.319 4.59e-06 ***
Age          -1.2103     0.3015  -4.015  0.00074 ***
Severity     -0.6659     0.8210  -0.811  0.42736    
Anxiety      -8.6130    12.2413  -0.704  0.49021    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.29 on 19 degrees of freedom
Multiple R-squared:  0.6727,    Adjusted R-squared:  0.621 
F-statistic: 13.01 on 3 and 19 DF,  p-value: 7.482e-05
  • Age: p-value = 0.0007, so there is strong evidence that age affects the mean patient satisfaction, allowing for severity and anxiety.
  • Severity: p-value = 0.427, so there is no evidence that severity affects the mean patient satisfaction, allowing for age and anxiety.
  • Anxiety: p-value = 0.490, so there is no evidence that anxiety affects the mean patient satisfaction, allowing for age and severity.
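As a check, the t statistic and two-sided p-value for Age can be reproduced from the coefficient table (a sketch using base R):

est <- coef(summary(res))["Age", "Estimate"]
se  <- coef(summary(res))["Age", "Std. Error"]
tval <- est / se                                              # t statistic
2 * pt(abs(tval), df = df.residual(res), lower.tail = FALSE)  # two-sided p-value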

10.6 Confidence interval for the slope \(\beta_i\)

A \((1-\alpha)\times 100\)% confidence interval for the slope \(\beta_i\) is \[\hat \beta_i \pm t^* \times SE_{\hat \beta_i}\]

\(t^*\) is the value on the \(t(n-p-1)\) density curve with area \(1-\alpha\) between the critical points \(-t^*\) and \(t^*\). That is, \(t^*\) is the value such that \(\alpha/2\) of the probability in the \(t(n-p-1)\) distribution lies above \(t^*\).
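In R, these intervals can be obtained with confint(), and the critical value \(t^*\) with qt(); for example, for 95% confidence:

confint(res, level = 0.95)        # confidence intervals for all coefficients
qt(0.975, df = df.residual(res))  # critical value t* for alpha = 0.05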

10.7 ANOVA Table

The total variation in the response \(Y\) is partitioned between two sources: variation due to the model and variation due the residuals.

Sources of variation are:

  • Model (or Regression)
  • Error (or Residual)
  • Total

Sum of Squares

\[SSM + SSE = SST\]

\[SSM = \sum_{i=1}^n (\hat Y_i - \bar Y)^2\]

\[SSE = \sum_{i=1}^n (Y_i - \hat Y_i)^2\]

\[SST = \sum_{i=1}^n (Y_i - \bar Y)^2\]

Degrees of Freedom

\[df_M + df_E = df_T\]

\(df_M = p\)

\(df_E = n-p-1\)

\(df_T = n-1\)

Mean Squares

\[MSM = \frac{SSM}{df_M} = \frac{\sum (\hat Y_i - \bar Y)^2}{p}\]

\[MSE = \frac{SSE}{df_E} = \frac{\sum (Y_i - \hat Y_i)^2}{n-p-1}\]

\[MST = \frac{SST}{df_T} = \frac{\sum (Y_i - \bar Y)^2}{n-1}\]

ANOVA Table

Source   Sum of Squares   Degrees of freedom   Mean square         F
Model    SSM              p                    MSM = SSM/p         MSM/MSE
Error    SSE              n-p-1                MSE = SSE/(n-p-1)
Total    SST              n-1
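The sums of squares for this example can be computed directly from the fitted model (a sketch; anova(res) reports the model part split sequentially by predictor):

SST <- sum((d$Satisfaction - mean(d$Satisfaction))^2)  # total sum of squares
SSE <- sum(residuals(res)^2)                           # error (residual) sum of squares
SSM <- sum((fitted(res) - mean(d$Satisfaction))^2)     # model sum of squares
c(SSM = SSM, SSE = SSE, SST = SST)                     # SSM + SSE equals SST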

10.8 ANOVA. F test

Test null hypothesis ‘none of the explanatory variables are useful predictors of the response variable’ versus alternative hypothesis ‘at least one of the explanatory variables is related to the response variable’

\(H_0\): \(\beta_{1} = \beta_{2} = \cdots = \beta_{p} = 0\) (all coefficients are 0)

\(H_1\): at least one of \(\beta_1\), \(\beta_2\), \(\dots\), \(\beta_p \neq\) 0 (not all coefficients are 0)

ANOVA table for multiple regression

Source   Sum of Squares   Degrees of freedom   Mean square         F         p-value
Model    SSM              p                    MSM = SSM/p         MSM/MSE   Tail area above F
Error    SSE              n-p-1                MSE = SSE/(n-p-1)
Total    SST              n-1

The total variation in the response \(Y\) is partitioned between two sources: variation due to the model, i.e., because \(Y\) changes as the \(X\)’s change, and variation due the residuals, i.e., the scatter above and below the fitted regression surface.

Test statistic

F-statistic = \(\frac{MSM}{MSE} = \frac{\mbox{mean squares model}}{\mbox{mean squares error}}\)

\[F = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2/p}{\sum_{i=1}^n (Y_i - \hat Y_i)^2/(n-p-1)} = \frac{SSM/df_M}{SSE/df_E} = \frac{MSM}{MSE},\] where \(p\) is the number of \(\beta\) parameters excluding \(\beta_0\) in \(Y= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon\).

Under \(H_0\), distribution of \(F \sim F(p, n-p-1)\)

\(F\) has an F distribution with degrees of freedom = (number of explanatory variables, number of observations - number of explanatory variables - 1)

p-value: probability that test statistic has a value as extreme or more extreme than the observed

p-value = \(P(F \geq F_{\mbox{observed}})\)

  • If p-value < \(\alpha\) (or F larger than the critical value), we reject the null hypothesis.
    There is evidence that at least one of the regression coefficients \(\beta_i, i=1,\ldots,p\), is non-zero. Thus, at least one of the explanatory variables \(X_i, i=1,\ldots,p\), is potentially useful for predicting \(Y\) in the linear model (the F test does not tell us which one).

  • If p-value > \(\alpha\) (or F smaller than the critical value), we fail to reject the null hypothesis.
    There is no evidence that any of the explanatory variables helps to model the response variable with this kind of model. We cannot conclude that any of the explanatory variables is useful for predicting \(Y\).

summary(res)

Call:
lm(formula = Satisfaction ~ Age + Severity + Anxiety, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.954  -7.154   1.550   6.599  14.888 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 162.8759    25.7757   6.319 4.59e-06 ***
Age          -1.2103     0.3015  -4.015  0.00074 ***
Severity     -0.6659     0.8210  -0.811  0.42736    
Anxiety      -8.6130    12.2413  -0.704  0.49021    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.29 on 19 degrees of freedom
Multiple R-squared:  0.6727,    Adjusted R-squared:  0.621 
F-statistic: 13.01 on 3 and 19 DF,  p-value: 7.482e-05

p-value = 7.5e-05 \(< \alpha = 0.05\). There is strong evidence that at least one of the explanatory variables in the model is related to the response.
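The F statistic and its p-value in the output can also be reproduced by hand (a sketch in base R):

n <- nrow(d)   # 23 observations
p <- 3         # number of explanatory variables
SSE <- sum(residuals(res)^2)
SSM <- sum((fitted(res) - mean(d$Satisfaction))^2)
Fstat <- (SSM / p) / (SSE / (n - p - 1))
Fstat                                                    # F statistic
pf(Fstat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)  # p-value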

Root Mean Square Error (RMSE)

\[RMSE = \sqrt{\sum_{i=1}^n (Y_i - \hat Y_i)^2/(n-p-1)} = \sqrt{MSE}\]

RMSE appears in the output as ‘Residual standard error: 10.29 on 19 degrees of freedom’.
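The same value can be extracted from the fitted model or computed by hand:

summary(res)$sigma                              # residual standard error (RMSE)
sqrt(sum(residuals(res)^2) / df.residual(res))  # equivalent computation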

10.9 R\(^2\)

R\(^2\) (R-squared or coefficient of determination) is the proportion of variation of the response variable \(Y\) (vertical scatter from the regression surface) that is explained by the model (explained by the changes in the \(x\)’s).

\[R^2 = \frac{\mbox{variability explained by the model}}{\mbox{total variability}} = 1 - \frac{\mbox{variability in residuals}}{\mbox{total variability}}\]

  • R\(^2\) measures how successfully the regression explains the response.
  • If the data lie exactly on the regression surface then R\(^2\) = 1.
  • If the data are randomly scattered then R\(^2\) = 0.
  • For simple linear regression, R\(^2\) is equal to the square of the correlation between \(Y\) and \(X\), \(R^2 = r^2\)

\[R^2 = \frac{SSM}{SST} = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n ( Y_i - \bar Y)^2} \ \ \ \mbox{(proportion of variability explained by the model)}\] or equivalently

\[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^n ( Y_i - \hat Y_i)^2}{\sum_{i=1}^n ( Y_i - \bar Y)^2} \ \ \ \mbox{(1 - proportion of variability not explained by the model)}\]
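Either formula reproduces the ‘Multiple R-squared’ value in the output:

SSE <- sum(residuals(res)^2)
SST <- sum((d$Satisfaction - mean(d$Satisfaction))^2)
1 - SSE / SST             # R-squared from the sums of squares
summary(res)$r.squared    # value reported by summary()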

10.10 R\(^2\)(adj)

\(R^2\) will increase as more explanatory variables are added to a regression model. However, adding more explanatory variables does not necessarily lead to a better model. We look at R\(^2\)(adj).

R\(^2\)(adj) or Adjusted R-squared is the proportion of variation of the response variable explained by the fitted model adjusting for the number of parameters.

We use the adjusted \(R^2\) to assess goodness of fit of the model when comparing models with different numbers of parameters.

\[R^2(adj) = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{SSE}{SST} \times \frac{n-1}{n-p-1},\] where \(n\) is the number of observations and \(p\) is the number of predictor variables in the model. Because \((n-1)/(n-p-1) \geq 1\), \(R^2(adj)\) is never larger than the unadjusted \(R^2\).
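A quick sketch of this computation in R, using the formula above:

n <- nrow(d); p <- 3
SSE <- sum(residuals(res)^2)
SST <- sum((d$Satisfaction - mean(d$Satisfaction))^2)
1 - (SSE / (n - p - 1)) / (SST / (n - 1))   # adjusted R-squared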

In this example, R\(^2\)(adj) = 62.1%. Adjusting for the number of parameters, the model explains about 62% of the variation in Satisfaction.

summary(res)$adj.r.squared
[1] 0.6209731