A regression model describes the linear relationship between a response variable and one or more explanatory variables
In simple linear regression there is one explanatory variable
Example: How does body mass index relate to physical activity?
In multiple linear regression there is more than one explanatory variable
Example: How does systolic blood pressure relate to age and weight?
In correlation analysis we assess the linear association between two variables when there is no clear response-explanatory relationship
Example: What is the correlation between arm length and leg length?
The Pearson correlation \(r\) gives a measure of the linear relationship between two variables \(x\) and \(y\).
Given a set of observations \((x_1, y_1), (x_2,y_2), \ldots, (x_n,y_n)\), the correlation between \(x\) and \(y\) is given by
\[r = \frac{1}{n-1} \sum_{i=1}^n \left(\frac{x_i-\bar x}{s_x}\right) \left(\frac{y_i-\bar y}{s_y}\right),\] where \(\bar x\), \(\bar y\), \(s_x\) and \(s_y\) are the sample means and standard deviations for each variable.
\[-1\leq r \leq 1\]
Correlation describes only the linear part of a relationship. It does not describe curved relationships.
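As a small check on this formula, the sketch below computes \(r\) from the definition and compares it with R's built-in cor(); the vectors x and y are illustrative values, not part of any example data.
# Illustrative vectors (not the example data)
x <- c(1.2, 2.4, 3.1, 4.8, 5.0)
y <- c(2.0, 2.9, 3.8, 5.1, 5.4)
n <- length(x)
# Pearson correlation computed from the definition (sd() uses the n-1 divisor)
r_manual <- sum((x - mean(x))/sd(x) * (y - mean(y))/sd(y))/(n - 1)
r_manual
cor(x, y)  # built-in version; should agree with r_manual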
Example
What is the correlation coefficient for these data?
Solution
Read data
d <- read.table("data/pabmi.txt", header = TRUE)
First rows of data
head(d)
Summary of the data
summary(d)
## SUBJECT PA BMI
## Min. : 1.00 Min. : 3.186 Min. :14.20
## 1st Qu.: 25.75 1st Qu.: 6.803 1st Qu.:21.10
## Median : 50.50 Median : 8.409 Median :24.45
## Mean : 50.50 Mean : 8.614 Mean :23.94
## 3rd Qu.: 75.25 3rd Qu.:10.274 3rd Qu.:26.75
## Max. :100.00 Max. :14.209 Max. :35.10
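The correlation asked for in the example can be computed directly with cor() (output not shown here; given the R-squared of 0.1485 reported for the fitted model below and the negative slope, it is roughly -0.39).
cor(d$PA, d$BMI)  # sample correlation between physical activity and BMI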
Scatterplot of BMI versus PA
plot(d$PA, d$BMI)
There is a lot of variation, but does it look like there is a linear trend? Could the average BMI be a linear function of PA?
\[y \approx \beta_0 + \beta_1 x\]
The equation of the least-squares regression line of \(y\) on \(x\) is \[\hat y = \hat \beta_0 + \hat \beta_1 x,\] which we also write as \[\hat y = b_0 + b_1 x.\]
\(\hat y\) is the estimated mean or expected response. \(\hat y\) estimates the mean response for any value of \(x\) that we substitute in the fitted regression line
\(b_0\) is the intercept of the fitted line. It is the value of \(\hat y\) when \(x=0\).
\(b_1\) is the slope of the fitted line. It tells how much increase (or decrease) there is for \(\hat y\) as \(x\) increases by one unit
If \(b_1>0\) \(\hat y\) increases as \(x\) increases
If \(b_1<0\) \(\hat y\) decreases as \(x\) increases
The slope \(b_1=r\frac{s_y}{s_x}\) and the intercept \(b_0 = \bar y - b_1 \bar x\)
Here \(r\) is the correlation which gives a measure of the linear relationship between \(x\) and \(y\)
\[-1\leq r \leq 1\]
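A quick R check of these two identities on the PA/BMI data, compared with the coefficients returned by lm() (the full model fit is shown later):
r  <- cor(d$PA, d$BMI)
b1 <- r * sd(d$BMI)/sd(d$PA)         # slope from r, s_y and s_x
b0 <- mean(d$BMI) - b1 * mean(d$PA)  # intercept from the sample means
c(b0, b1)
coef(lm(BMI ~ PA, data = d))         # should match b0 and b1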
After fitting the model it is crucial to check whether the model assumptions are valid before performing inference. If any assumption is violated, subsequent inferential procedures may be invalid. In particular, we need to look at the residuals: are they randomly scattered, so that a straight line is a good model to fit?
Example
The Coast Starlight Amtrak train runs from Seattle to Los Angeles. The scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes).
The mean travel time from one stop to the next on the Coast Starlight is 129 mins, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles with a standard deviation of 99 miles. The correlation between travel time and distance is 0.636.
Solution
Travel time: \(\bar y= 129\) minutes, \(s_y = 113\) minutes.
Distance: \(\bar x = 108\) miles, \(s_x= 99\) miles.
Correlation between travel time and distance \(r = 0.636\)
First calculate the slope: \(b_1 = r \times s_y/s_x = 0.636 \times 113/99 = 0.726\). Next, make use of the fact that the regression line passes through the point \((\bar x, \bar y)\): \(\bar y = b_0+b_1 \times \bar x\). Then, \(b_0 = \bar y - b_1 \bar x = 129 - 0.726 \times 108 = 51\).
\(\widehat{\mbox{travel time}} = 51 + 0.726 \times \mbox{distance}\).
\(b_1\): For each additional mile in distance, the model predicts an additional 0.726 minutes in travel time.
\(b_0\): When the distance traveled is 0 miles, the travel time is expected to be 51 minutes. It does not make sense to have a travel distance of 0 miles in this context. Here, the y-intercept serves only to adjust the height of the line and is meaningless by itself.
\(\widehat{\mbox{travel time}} = 51 + 0.726 \times \mbox{distance} = 51+0.726 \times 103 \approx 126\) minutes.
(Note: we should be cautious in our predictions with this model since we have not yet evaluated whether it is a well-fit model).
\(e_i = y_i - \hat y_i = 168-126=42\) minutes. A positive residual means that the model underestimates the travel time.
It would not be appropriate since this calculation would require extrapolation.
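The calculations in this example can be reproduced in R from the summary statistics given above; this is only a sketch of the arithmetic, using the 103-mile distance and the 168-minute observed time from the solution.
xbar <- 108; sx <- 99    # distance (miles)
ybar <- 129; sy <- 113   # travel time (minutes)
r <- 0.636
b1 <- r * sy/sx          # slope, about 0.726
b0 <- ybar - b1 * xbar   # intercept, about 51
b0 + b1 * 103            # predicted travel time for a 103-mile segment
                         # (close to the 126 minutes obtained above with rounded coefficients)
168 - (b0 + b1 * 103)    # residual for an observed time of 168 minutes (positive: underestimate)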
res <- lm(BMI ~ PA, data = d)
summary(res)
##
## Call:
## lm(formula = BMI ~ PA, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3819 -2.5636 0.2062 1.9820 8.5078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.5782 1.4120 20.948 < 2e-16 ***
## PA -0.6547 0.1583 -4.135 7.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.655 on 98 degrees of freedom
## Multiple R-squared: 0.1485, Adjusted R-squared: 0.1399
## F-statistic: 17.1 on 1 and 98 DF, p-value: 7.503e-05
The most important test is whether the slope is 0 or not. If the slope is 0, then there is no linear relationship. Test null hypothesis ‘response variable \(y\) is not linearly related to variable \(x\)’ versus alternative hypothesis ‘\(y\) is linearly related to \(x\)’.
\(H_0: \beta_1 =0\)
\(H_1: \beta_1 \neq 0\)
Test statistic: t-value = estimate of the coefficient / SE of coefficient
\[T = \frac{b_1 - 0}{SE_{b_1}}\]
\[T \sim t(n-2)\]
p-value: probability of obtaining a test statistic as extreme or more extreme than the observed, assuming the null is true.
\(H_1: \beta_1 \neq 0\). The p-value is \(P(|T|\geq |t|) = 2P(T\geq |t|)\)
summary(res)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.5782471 1.4119783 20.948089 5.706905e-38
## PA -0.6546858 0.1583361 -4.134784 7.503032e-05
In the example the p-value is 7.503032e-05. There is strong evidence to suggest the slope is not zero, i.e., that PA and BMI are linearly related.
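The t value and two-sided p-value for the slope can be recomputed from the coefficient table; a sketch assuming the fitted object res from above.
est  <- summary(res)$coefficients["PA", "Estimate"]
se   <- summary(res)$coefficients["PA", "Std. Error"]
tval <- est/se                                           # t value, about -4.13
2 * pt(abs(tval), df = nrow(d) - 2, lower.tail = FALSE)  # two-sided p-value, about 7.5e-05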
Output also gives test for intercept. This tests the null hypothesis the intercept is 0 versus the alternative hypothesis the intercept is not 0. Usually we are not interested in this test.
A \(c\times 100\)% (or \((1-\alpha)\times 100\)%) confidence interval for the slope \(\beta_1\) is \[b_1 \pm t^* \times SE_{b_1}\]
\(t^*\) is the value on the \(t(n-2)\) density curve with area \(c\) between the critical points \(-t^*\) and \(t^*\). Equivalently, \(t^*\) is the value such that \(1-\alpha/2\) of the probability in the \(t(n-2)\) distribution is below \(t^*\).
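In R the interval can be obtained directly with confint(), or computed by hand from the coefficient table; a sketch for a 95% interval using the fitted object res.
confint(res, "PA", level = 0.95)  # 95% CI for the slope
# By hand: estimate +/- t* x SE, with t* = qt(1 - 0.05/2, n - 2)
est <- summary(res)$coefficients["PA", "Estimate"]
se  <- summary(res)$coefficients["PA", "Std. Error"]
est + c(-1, 1) * qt(0.975, nrow(d) - 2) * se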
summary(res)
##
## Call:
## lm(formula = BMI ~ PA, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3819 -2.5636 0.2062 1.9820 8.5078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.5782 1.4120 20.948 < 2e-16 ***
## PA -0.6547 0.1583 -4.135 7.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.655 on 98 degrees of freedom
## Multiple R-squared: 0.1485, Adjusted R-squared: 0.1399
## F-statistic: 17.1 on 1 and 98 DF, p-value: 7.503e-05
\[R^2 = \frac{\mbox{variability explained by the model}}{\mbox{total variability}} = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n ( y_i - \bar y)^2}\]
or equivalently
\[R^2 = 1 - \frac{\mbox{variability in residuals}}{\mbox{total variability}} = 1 - \frac{\sum_{i=1}^n ( y_i - \hat y_i)^2}{\sum_{i=1}^n ( y_i - \bar y)^2}\]
R\(^2\) measures how successfully the regression explains the response.
If the data lie exactly on the regression line then R\(^2\) = 1.
If the data are randomly scattered or in a horizontal line then R\(^2\) = 0.
For simple linear regression, R\(^2\) is equal to the square of the correlation between response variable \(y\) and explanatory variable \(x\). \[R^2 = r^2\]
Remember the correlation \(r\) gives a measure of the linear relationship between \(x\) and \(y\)
\[-1\leq r \leq 1\]
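A quick check of this identity for the PA/BMI fit:
summary(res)$r.squared  # R-squared from the fitted model (0.1485)
cor(d$PA, d$BMI)^2      # squared correlation; should be the same value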
\(R^2\) will increase as more explanatory variables are added to a regression model, so for models with more than one explanatory variable we look at R\(^2\)(adj) instead.
R\(^2\)(adj) or Adjusted R-squared is the proportion of variation of the response variable explained by the fitted model adjusting for the number of parameters.
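With \(n\) observations and \(p\) explanatory variables, a standard form of the adjustment (consistent with the value reported by summary() below) is \[R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p-1}.\] As a check, with \(n=100\) and \(p=1\):
1 - (1 - summary(res)$r.squared) * (100 - 1)/(100 - 1 - 1)  # should match summary(res)$adj.r.squared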
summary(res)$adj.r.squared
## [1] 0.1398518
The regression equation can be used to estimate the mean response or to predict a new response for particular values of the explanatory variable.
When PA is 10, the estimated mean of BMI is \(\widehat{BMI} = b_0 + b_1 \times 10 = 29.5782 - 0.6547 \times 10 = 23.03\).
When PA is 10, the predicted BMI of a new individual is also \(\widehat{BMI} = b_0 + b_1 \times 10 = 29.5782 - 0.6547 \times 10 = 23.03\).
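Both values can be obtained from the fitted model object, either by hand from coef() or with predict(); the interval versions of predict() are shown below.
coef(res)[1] + coef(res)[2] * 10   # by hand from the estimated coefficients
predict(res, data.frame(PA = 10))  # same value, about 23.03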
Extrapolation occurs when we estimate the mean response or predict a new response for values of the explanatory variables which are not in the range of the data used to determine the estimated regression equation. In general, this is not a good thing to do.
When PA is 50, the estimated mean of BMI is \(\widehat{BMI} = b_0 + b_1 \times 50 = 29.5782 - 0.6547 \times 50 = -3.15\), a negative (and therefore impossible) BMI. PA = 50 is far outside the observed range of the data, so this estimate should not be trusted.
A confidence interval is an interval estimate for the mean response for particular values of the explanatory variables
We can calculate the 95% CI as \(\widehat{y} \pm t^* \times SE(\widehat{y})\) where \(t^*\) is the value such that \(1-\alpha/2\) of the probability in the \(t(n-2)\) distribution is below \(t^*\).
In R, we can calculate the 95% CI with the predict() function and the option interval = "confidence".
new <- data.frame(PA = 10)
predict(lm(BMI ~ PA, data = d), new, interval = "confidence", se.fit = TRUE)
## $fit
## fit lwr upr
## 1 23.03139 22.18533 23.87744
##
## $se.fit
## [1] 0.4263387
##
## $df
## [1] 98
##
## $residual.scale
## [1] 3.654883
c(23.03139 + qt(0.05/2, 98)*0.4263387,
23.03139 - qt(0.05/2, 98)*0.4263387)
## [1] 22.18533 23.87745
A prediction interval is an interval estimate for an individual value of the response for particular values of the explanatory variables
We can calculate the 95% PI as \(\widehat{y}_{new} \pm t^* \times \sqrt{MSE + [SE(\widehat{y})]^2}\), where \(t^*\) is the value such that \(1-\alpha/2\) of the probability in the \(t(n-2)\) distribution is below \(t^*\). The Residual standard error and $residual.scale in the R output are equal to \(\sqrt{MSE}\).
In R, we can calculate the 95% PI with the predict() function and the option interval = "prediction".
predict(lm(BMI ~ PA, data = d), new, interval = "prediction", se.fit = TRUE)
## $fit
## fit lwr upr
## 1 23.03139 15.72921 30.33357
##
## $se.fit
## [1] 0.4263387
##
## $df
## [1] 98
##
## $residual.scale
## [1] 3.654883
c(23.03139 + qt(0.05/2, 98)*sqrt(3.654883^2 + 0.4263387^2),
23.03139 - qt(0.05/2, 98)*sqrt(3.654883^2 + 0.4263387^2))
## [1] 15.72921 30.33357
After fitting the model it is crucial to check whether the model assumptions are valid before performing inference. If there is any assumption violation, subsequent inferential procedures may be invalid.
Assumptions: linearity, independence of the observations, normality of the errors, and constant variance of the errors.
We check normality of errors with a normal Q-Q plot of the residuals
A normal Q-Q plot compares the data with what we would expect to see if the theoretical distribution from which the data come is normal
The Q-Q plot displays the value of observed quantiles in the standardized residual distribution on the y-axis versus the quantiles of the theoretical normal distribution on the x-axis
If data are normally distributed, we should see the plotted points lie close to the straight line
If we see a concave normal probability plot, log transforming the response variable may remove the problem
Check constant variance using a residual plot, i.e., a plot of the residuals (y-axis) against the values fitted/predicted from the model (x-axis).
In multiple regression models, constant variance should also be checked using plots of the residuals versus individual independent variables.
Points should be randomly scattered. The spread of the points should be the same across the x-axis.
If residuals are more spread out for larger fitted values than smaller ones, there is non-constant variance. Sometimes a log transformation of the response variable may also help.
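The residual plot can also be drawn directly from the fitted object, without the plot(res, ...) shortcuts used below; a minimal sketch:
# Residuals versus fitted values, with a horizontal reference line at zero
plot(fitted(res), resid(res), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)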
Check linearity using a residual plot, i.e., a plot of the residuals (y-axis) against the values fitted/predicted from the model (x-axis).
In multiple regression models, linearity should also be checked using plots of the residuals versus individual independent variables.
Points should be randomly scattered, i.e., an unstructured horizontal band of points centred at zero.
In particular, points should not form a pattern, e.g., a quadratic, or other type of curve, should not be apparent.
If the relationship is curved, the residuals versus fits will be curved.
If one or more of the assumptions are violated, we can sometimes remedy this by transforming one or both of the variables.
However, transforming a variable to satisfy one assumption, e.g., to make the relationship linear, may harm the normality or constant variance assumptions. In this case, more sophisticated modelling is needed, such as non-linear regression.
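As an illustration of the syntax only (not a recommendation for this data set), a log transformation of the response would be fitted and re-checked like this:
res_log <- lm(log(BMI) ~ PA, data = d)  # refit with log-transformed response
plot(res_log, 1)                        # re-check residuals versus fitted values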
Independence: data are randomly collected and observations are independent
Normality: in the normal Q-Q plot points lie close to the straight line
plot(res, 2)  # normal Q-Q plot of the standardized residuals
plot(res, 1)  # residuals versus fitted values (constant variance and linearity)
Outlier (unusual \(y\)): an observation whose response value \(y\) does not follow the general trend of the rest of the data. Outliers have a standardized residual less than -2 or greater than 2 (or less than -3 or greater than 3). The bigger the data set, the more flagged points we expect: with the \(\pm 2\) cut-off, about 5% of the sample even when the model is correct.
High leverage observation (unusual \(x\)): observation that has extreme predictor \(x\) values. With a single predictor, an extreme \(x\) value is simply one that is particularly high or low compared with all the other data observations. With multiple predictors, extreme \(x\) values may be particularly high or low for one or more predictors, or may be “unusual” combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor). An observation is usually considered to have high leverage if it has leverage (“hat”) greater than 2 times the average of all leverage values. That is, if \(h_{ii} > 2(p+1)/n\).
Influential observation: an observation is influential if its omission causes statistical inference to change, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. We usually say an observation is influential if its Cook’s distance is greater than \(4/n\), where \(n\) is the sample size. We should examine in closer detail the points whose Cook’s distance is substantially larger than the rest.
It is possible for an observation to be both an outlier and have high leverage.
Influential observations can be outliers, leverage observations or both. But an outlier or leverage observation may not be influential.
Outliers and high leverage data points have the potential to be influential, but we generally have to investigate further to determine whether or not they are actually influential.
Example 1
All of the data points follow the general trend of the rest of the data, so there are no outliers (unusual \(y\)). And, none of the data points are extreme with respect to \(x\), so there are no high leverage points. Overall, none of the data points appear to be influential with respect to the location of the best fitting line.
Example 2
Because the red data point does not follow the general trend of the rest of the data, it would be considered an outlier. However, this point does not have an extreme x value, so it does not have high leverage.
Is the red data point influential? An easy way to determine if the data point is influential is to find the best fitting line twice - once with the red data point included and once with the red data point excluded.
The predicted responses, estimated slope coefficients, and hypothesis test results are not affected by the inclusion of the red data point. Therefore, the data point is not deemed influential. In summary, the red data point is not influential and does not have high leverage, but it is an outlier.
Example 3
In this case, the red data point does follow the general trend of the rest of the data. Therefore, it is not deemed an outlier. However, this point does have an extreme \(x\) value, so it does have high leverage.
Is the red data point influential? It certainly appears to be far removed from the rest of the data (in the \(x\) direction), but is that sufficient to make the data point influential in this case?
The predicted responses, estimated slope coefficients, and hypothesis test results are not affected by the inclusion of the red data point. Therefore, the data point is not deemed influential. In summary, the red data point is not influential, nor is it an outlier, but it does have high leverage.
Example 4
The red data point is certainly an outlier and has high leverage. The red data point does not follow the general trend of the rest of the data and it also has an extreme \(x\) value.
In this case the red data point is influential. The two best fitting lines - one obtained with the red data point included and one obtained with it excluded - are substantially different. The solid line represents the estimated regression equation with the red data point included, while the dashed line represents the estimated regression equation with the red data point excluded.
Outliers (unusual \(y\)). Standardized residuals less than -2 or greater than 2.
# Outliers
op <- which(abs(rstandard(res)) > 2)
op
## 1 17 37 65 89
## 1 17 37 65 89
plot(d$PA, d$BMI)
text(d$PA[op], d$BMI[op]-.5, paste("outlier", op))
High leverage observations (unusual \(x\)). Leverage greater than 2 times the average of all leverage values. That is, \(h_{ii} > 2(p+1)/n\).
# High leverage observations
h <- hatvalues(res)
lp <- which(h > 2*length(res$coefficients)/nrow(d))
lp
## 6 17 37 59 79 83 85
## 6 17 37 59 79 83 85
plot(d$PA, d$BMI)
text(d$PA[lp], d$BMI[lp]-.5, paste("lev", lp))
Influential values. Cook’s distance higher than \(4/n\).
Cook’s distance combines residuals and leverages in a single measure of influence. We should examine in closer detail the points whose Cook’s distance is substantially larger than the rest (17 and 37).
# Influential observations
ip <- which(cooks.distance(res) > 4/nrow(d))
ip
## 1 3 17 37 65 83
## 1 3 17 37 65 83
plot(d$PA, d$BMI)
text(d$PA[ip], d$BMI[ip]-.5, paste("infl", ip))
plot(res, 4)  # Cook's distance for each observation
Standardized residuals versus leverage values
plot(res, 5)
d[unique(c(lp, op, ip)), ]  # rows flagged as outliers, high leverage, or influential