15  Residuals and leverages

res <- lm(y ~ x, data = d)
hatvalues(res) # leverages
influence(res)$hat # leverages (same as the previous line)
rstandard(res) # standardized residuals
rstudent(res) # jackknife residuals
cooks.distance(res) # Cook's distance

15.1 Residuals

The residual is defined as

\[e_i = Y_i - \hat Y_i,\ i=1, \ldots,n\] where \(\hat Y = X \hat \beta\).

The estimate of the error variance \(\sigma^2\) computed from the sample of \(n\) residuals is the mean squared error (MSE)

\[\hat \sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^n e_i^2 = \frac{SSE}{n-p-1} = MSE\]

\(\hat Y = X \hat \beta = X(X'X)^{-1}X' Y = H Y\). Hat matrix \(H= X(X'X)^{-1}X'\).

\[e = Y - \hat Y = (I-H) Y = (I-H)(X\beta+\epsilon) = (I-H)X\beta + (I-H)\epsilon\]

\[= (I-X(X'X)^{-1}X')X\beta + (I-H)\epsilon = (X\beta-X\beta) + (I-H)\epsilon = (I-H)\epsilon\]

The expectation of the residuals is \(E(e)= E((I-H)\epsilon) = 0\)

Since \(Var(\epsilon)=\sigma^2 I\),

\(Var(e) = Var((I-H)\epsilon) = (I-H)\sigma^2(I-H)'=\sigma^2(I-H)\)

Note that \(H\) and \(I-H\) are symmetric (\(A = A'\)) and idempotent (\(A^2=A\))


The covariance matrix of the residuals is \(Var(e) = \sigma^2 (I-H)\)

The variance of the \(i\)th residual \(e_i\) is \(Var(e_i) = \sigma^2 (1-h_{ii})\)

The standard error of the \(i\)th residual \(e_i\) is \(SE(e_i) = \sigma \sqrt{1-h_{ii}}\)

\(h_{ii} = H_{ii}\) is the \(i\)th element on the diagonal of the hat matrix \(H= X(X'X)^{-1}X'\)

\(0\leq h_{ii} \leq 1\)
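
To make these quantities concrete, here is a minimal R sketch that computes the hat matrix and the residual standard errors by hand and checks them against the built-in functions. The model formula is only an illustration (an assumption for this sketch); any lm fit works.

res <- lm(mpg ~ wt + hp, data = mtcars)              # illustrative fit (any lm model works)
X <- model.matrix(res)                               # design matrix (intercept + predictors)
H <- X %*% solve(t(X) %*% X) %*% t(X)                # hat matrix H = X(X'X)^{-1}X'
all.equal(unname(diag(H)), unname(hatvalues(res)))   # leverages h_ii = diag(H)
e <- resid(res)                                      # residuals e = (I - H)Y
MSE <- sum(e^2) / df.residual(res)                   # MSE = SSE / (n - p - 1)
se.e <- sqrt(MSE * (1 - diag(H)))                    # SE(e_i) = sigma_hat * sqrt(1 - h_ii)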

15.2 Unusual observations (high leverage, outliers, influential)

An observation can be unusual in three (possibly overlapping) ways: it can have high leverage (an extreme \(x\) value), it can be an outlier (a \(y\) value far from the fitted regression), or it can be influential (deleting it changes the fitted model substantially). The following subsections give diagnostics for each.

15.3 Leverages

\[\hat Y = X(X'X)^{-1}X' Y = H Y\] \[\hat Y_i = h_{i1}Y_1 + h_{i2}Y_2 + \ldots + h_{ii}Y_i + \ldots + h_{in}Y_n,\ i=1,\ldots,n\] where \(h_{i1}, h_{i2}, \ldots, h_{ii}, \ldots, h_{in}\) depend only on the predictor values.

\(h_{ii} = H_{ii}\) is the \(i\)th element on the diagonal of the hat matrix \(H= X(X'X)^{-1}X'\).

\(h_{ii}\) are called leverages and are useful diagnostics.

The leverage \(h_{ii}\) quantifies the influence that the observed response \(Y_i\) has on its predicted value \(\hat Y_i\).

If \(h_{ii}\) is small (large), then the observed response \(Y_i\) plays a small (large) role in the value of the predicted response \(\hat Y_i\).

The leverage \(h_{ii} = X_i' (X'X)^{-1} X_i\) is a measure of the distance between the \(X\) value for the \(i\)th data point and the mean of the \(X\) values for all \(n\) data points. That is, the leverage \(h_{ii}\) quantifies how far away the \(i\)th \(x\) value is from the rest of the \(X\) values. If the \(i\)th \(X\) value is far away, the leverage \(h_{ii}\) will be large; and otherwise not.

Leverages can help us identify \(x\) values that are extreme and therefore potentially influential on our regression analysis.

Rule of thumb: flag any observation whose leverage value is more than 2 (or 3) times the mean leverage value:


The sum of the \(h_{ii}\) equals \(p+1\), the number of parameters (regression coefficients plus the intercept). \[trace(H) = trace(X(X'X)^{-1}X') = trace((X'X)^{-1}X'X) = trace(I) = p+1\] The leverage \(h_{ii}\) is a number between 0 and 1, inclusive, \(0\leq h_{ii} \leq 1\).

\[\bar h = \frac{\sum_i h_{ii}}{n} = \frac{p+1}{n}\] If \(h_{ii} > 2\left(\frac{p+1}{n}\right)\), the observation is considered unusual, with large leverage.
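
A minimal sketch of this rule in R, assuming a fitted lm object res as above:

h <- hatvalues(res)                 # leverages h_ii
n <- length(h)
p <- length(coef(res)) - 1          # number of predictors (excluding the intercept)
sum(h)                              # equals p + 1
mean(h)                             # equals (p + 1) / n
which(h > 2 * (p + 1) / n)          # flag observations with unusually large leverage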

15.4 Outliers

15.4.1 Standardized residual

Also called internally studentized residual.

Standardized residual is an ordinary residual divided by an estimate of its standard deviation.

\[e_i = Y_i - \hat{Y_i}\]

\[r_i = \frac{e_i}{se(e_i)} = \frac{e_i}{\hat \sigma \sqrt{1-h_{ii}}} = \frac{e_i}{ \sqrt{MSE (1-h_{ii})}} \] Standardized residuals quantify how large the residuals are in standard deviation units, and therefore can be easily used to identify outliers.

An observation with a standardized residual that is larger than 2 or 3 (in absolute value) is deemed by some to be an outlier.

One problem that can arise when trying to identify outliers is that a potential outlier may influence the regression model to such an extent that the estimated regression function is “pulled” towards it, so that it is not flagged as an outlier by the standardized residual criterion. The externally studentized residuals described next are designed to address this.

15.4.2 Externally studentized residual

Also called jackknife residual, cross-validated residual, or studentized deleted residual.

An externally studentized residual is a deleted residual divided by its estimated standard deviation.

\[t_i = \frac{Y_i - \hat Y_{i(-i)}}{se(Y_i-\hat Y_{i(-i)})}=\frac{e_i}{\hat \sigma_{(-i)}\sqrt{1-h_{ii}}} = \frac{e_i}{ \sqrt{MSE_{(-i)} (1-h_{ii})}} = r_i\sqrt{\frac{(n-p-1)-1}{(n-p-1)-r_i^2}}\]

  • \(\hat Y_{i(-i)}\) is the predicted value for the \(i\)th data point when the \((X_i, Y_i)\) is omitted from the data
  • \(\hat \sigma^2_{(-i)}\) is the estimated variance when \((X_i, Y_i)\) is omitted from the data

That is, a jackknife residual is just a deleted residual divided by its estimated standard deviation. This turns out to be equivalent to the ordinary residual divided by a factor that includes the mean square error based on the estimated model with the \(i\)th observation deleted. Note that the only difference between the standardized residuals considered before and the jackknife residuals considered here is that standardized residuals use the mean square error for the model based on all observations, \(MSE\), while jackknife residuals use the mean square error based on the estimated model with the \(i\)th observation deleted, \(MSE_{(-i)}\).

Another formula for jackknife residuals \(t_i\) allows them to be calculated using only the results of the model fit to all the observations: \(r_i\) (the \(i\)th standardized residual), \(n\) (the number of observations), and \(p\) (the number of predictors).

So we can compute \(t_i\) without ever having to actually delete the observation and re-fit the model.

If an observation has a jackknife residual that is larger than 3 (in absolute value), we can call it an outlier.

Using a cutoff of 2 may be a little conservative, but perhaps it is better to be safe than sorry. The key here is not to take the cutoffs of either 2 or 3 too literally. Instead, treat them simply as red warning flags to investigate the data points further.
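
A minimal sketch in R, again assuming a fitted lm object res; it also checks that the jackknife residuals can be obtained from the standardized residuals without refitting the model:

r.std <- rstandard(res)             # standardized (internally studentized) residuals
r.jack <- rstudent(res)             # jackknife (externally studentized) residuals
n <- length(r.std)
p <- length(coef(res)) - 1
all.equal(r.jack, r.std * sqrt((n - p - 2) / (n - p - 1 - r.std^2)))  # formula above
which(abs(r.std) > 3)               # potential outliers by the standardized criterion
which(abs(r.jack) > 3)              # potential outliers by the jackknife criterion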

15.5 Influential observations

The idea is to delete the observations one at a time, each time refitting the regression model on the remaining \(n-1\) observations. We then compare the results using all \(n\) observations to the results with the \(i\)th observation deleted, to see how much influence that observation has on the analysis. In this way we can assess the potential impact of each data point on the regression analysis.

15.5.1 Cook’s distance

Cook’s D measures the influence of the \(i\)th observation on all \(n\) fitted values:

\[D_i = \frac{(\hat Y - \hat Y_{(-i)})'(\hat Y -\hat Y_{(-i)})}{(p+1)MSE}\]

  • \(\hat Y\) is the vector of fitted values when all \(n\) observations are included
  • \(\hat Y_{(-i)}\) is the vector of fitted values when the \(i\)th observation is deleted

\[D_i = \frac{e_i^2}{(p+1)MSE} \frac{h_{ii}}{(1-h_{ii})^2}\] That is, \(D_i\) depends both on the size of the residual \(e_i\) and the leverage \(h_{ii}\). That is, both the \(x\) value and the \(y\) value of the data point play a role in the calculation of Cook’s distance.

\(D_i\) summarizes how much all of the fitted values change when the \(i\)th observation is deleted. A data point having a large \(D_i\) indicates that the data point strongly influences the fitted values.

An observation is considered influential if its Cook’s distance is greater than \(4/n\), where \(n\) is the sample size.
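
A minimal sketch in R, assuming a fitted lm object res; it also verifies the residual-and-leverage form of \(D_i\) given above:

D <- cooks.distance(res)                        # Cook's distance for each observation
e <- resid(res); h <- hatvalues(res)
p <- length(coef(res)) - 1
MSE <- sum(e^2) / df.residual(res)
all.equal(D, e^2 / ((p + 1) * MSE) * h / (1 - h)^2)  # matches the formula above
n <- length(D)
which(D > 4 / n)                                # flag influential observations
plot(res, which = 4)                            # built-in Cook's distance plot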

15.6 Dealing with unusual observations

First, check for obvious data errors:

  • If the error is just a data entry or data collection error, correct it.
  • If the data point is not representative of the intended study population, delete it.
  • If the data point is due to a procedural error that invalidates the measurement, delete it.

Consider the possibility that you might have just misformulated your regression model:

  • Did you leave out any important predictors?
  • Should you consider adding some interaction terms?
  • Is there any nonlinearity that needs to be modeled?

Decide whether or not deleting data points is warranted:

  • Do not delete data points just because they do not fit your preconceived regression model.
  • You must have a good, objective reason for deleting data points.
  • If you delete any data after you’ve collected it, justify and describe it in your reports.
  • If you are not sure what to do about a data point, analyze the data twice - once with and once without the data point - and report the results of both analyses.

16 Multicollinearity

Multicollinearity occurs when two or more independent variables are highly correlated with one another.

Multicollinearity can be a problem for interpretability. When multicollinearity is present, we cannot distinguish the individual effects of the independent variables on the dependent variable. Consider \(\hat Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2\). The coefficient \(\beta_1\) is the increase in \(\hat Y\) for a unit increase in \(X_1\) while keeping \(X_2\) constant. But when \(X_1\) and \(X_2\) are highly correlated, changes in \(X_1\) tend to be accompanied by changes in \(X_2\), so we cannot see their individual effects on \(\hat Y\).

\[\hat \beta = (X'X)^{-1}X' Y\]

\[Var(\hat \beta) = \sigma^2 (X'X)^{-1}\] When \(X'X\) is singular (determinant = 0), the inverse does not exist and \(\hat\beta\) cannot be computed. If \(X'X\) is not exactly singular but is close to being non-invertible, the variances become huge.

When multicollinearity exists, the standard errors of the estimated coefficients are inflated.
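
A minimal R sketch of this relationship, checking vcov() against \(\hat\sigma^2 (X'X)^{-1}\) for the mtcars model used in the example below:

m <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
X <- model.matrix(m)
MSE <- sum(resid(m)^2) / df.residual(m)          # estimate of sigma^2
all.equal(vcov(m), MSE * solve(t(X) %*% X))      # Var(beta_hat) = sigma^2 (X'X)^{-1}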

Dealing with multicollinearity

Since not all of the variables are actually contributing information, a natural way of dealing with collinearity is to drop some variables from the model. If you want to do this, you should think very carefully about which variable to delete.

We can examine pairwise correlations and delete variables with correlation close to -1 or 1.

We can also look at the variance inflation factor (VIF) for the estimated regression coefficients. The variance inflation factor for the estimated regression coefficient \(\hat \beta_j\) is the factor by which the variance of \(\hat \beta_j\) is “inflated” by the existence of correlation among the predictor variables in the model.

Specifically, \(VIF_j\) is the ratio of the variance of \(\hat \beta_j\) in the model with all the predictors to the variance \(\hat \beta_j\) would have if the \(j\)th predictor were uncorrelated with the remaining predictors. A VIF value of 1 indicates no multicollinearity, while values greater than 1 indicate increasing levels of multicollinearity.

  • Variance inflation factor for \(\hat \beta_j\)

\[VIF_j = \frac{1}{1-R^2_j},\]

where \(R^2_j\) is the \(R^2\) value obtained by regressing the \(j\)th predictor on the remaining predictors. The greater the linear dependence among the predictor and the other predictors, the larger \(R^2_j\). And the larger \(R^2_j\), the larger the variance of \(\hat \beta_j\).

A VIF of 1 means that there is no correlation among the \(j\)th predictor and the remaining predictor variables, and hence the variance of \(\hat \beta_j\) is not inflated at all. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.

Example

Calculate VIFs with the vif() function of the car package

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  • mtcars: data extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models)
  • mpg: miles/(US) gallon
  • disp: displacement (cu.in.)
  • hp: gross horsepower
  • wt: weight (1000 lbs)
  • drat: rear axle ratio
model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
summary(model)

Call:
lm(formula = mpg ~ disp + hp + wt + drat, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5077 -1.9052 -0.5057  0.9821  5.6883 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.148738   6.293588   4.631  8.2e-05 ***
disp         0.003815   0.010805   0.353  0.72675    
hp          -0.034784   0.011597  -2.999  0.00576 ** 
wt          -3.479668   1.078371  -3.227  0.00327 ** 
drat         1.768049   1.319779   1.340  0.19153    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.602 on 27 degrees of freedom
Multiple R-squared:  0.8376,    Adjusted R-squared:  0.8136 
F-statistic: 34.82 on 4 and 27 DF,  p-value: 2.704e-10

We can see from the output that the R-squared value for the model is 0.8376. We can also see that the overall F-statistic is 34.82 and the corresponding p-value is 2.704e-10, which indicates that the overall regression model is significant. Also, the predictor variables hp and wt are statistically significant at the 0.05 significance level while disp and drat are not.

# VIF for each predictor variable in the model
library(car)
Loading required package: carData
vif(model)
    disp       hp       wt     drat 
8.209402 2.894373 5.096601 2.279547 
# correlation matrix
cor(mtcars[ , c("disp", "hp", "wt", "drat")])
           disp         hp         wt       drat
disp  1.0000000  0.7909486  0.8879799 -0.7102139
hp    0.7909486  1.0000000  0.6587479 -0.4487591
wt    0.8879799  0.6587479  1.0000000 -0.7124406
drat -0.7102139 -0.4487591 -0.7124406  1.0000000

We can see that the VIFs for disp and wt are both greater than 4, which is potentially concerning.

The variable disp had a VIF value over 8, which was the largest VIF value among all of the predictor variables in the model. From the correlation matrix we can see that disp is strongly correlated with all three of the other predictor variables, which explains why it has such a high VIF value.

In this case, you may want to remove disp from the model because it has a high VIF value and it was not statistically significant at the 0.05 significance level.
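
As a cross-check of the definition \(VIF_j = 1/(1-R^2_j)\), the VIF for disp can also be computed by hand, by regressing disp on the other predictors (a minimal sketch, not part of the output above):

r2.disp <- summary(lm(disp ~ hp + wt + drat, data = mtcars))$r.squared
1 / (1 - r2.disp)                                # matches vif(model)["disp"], about 8.21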