In a general linear model
\[y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \epsilon_i\]
the response \(y_i\), \(i = 1, \ldots, n\), is modelled by a linear function of explanatory variables \(x_{ji}\), \(j = 1, \ldots, p\), plus an error term.
Here general refers to the dependence on potentially more than one explanatory variable, versus the simple linear model:
\[y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i\]
The model must be linear in the parameters. For example,
\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^2 + \epsilon_i\]
is linear in the parameters (although quadratic in the explanatory variable), whereas
\[y_i = \beta_0 + \beta_1 x_{1i}^{\beta_2} + \epsilon_i\]
is not.
We assume that the errors \(\epsilon_i\) are independent and identically distributed such that
\[E[\epsilon_i] = 0\] \[Var[\epsilon_i] = \sigma^2\]
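As a quick illustration (a minimal sketch with simulated data; all variable names here are hypothetical), such a model can be fitted in R with lm():

# Minimal sketch: simulate data from a general linear model and fit it
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)  # true coefficients: 1, 2, -0.5
summary(lm(y ~ x1 + x2))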
Although this is a very useful framework, there are some situations where general linear models are not appropriate. In simple and multiple regression the response variable is continuous (e.g., BMI, systolic blood pressure):
\[y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \epsilon_i\]
In other situations the response variable may be binary or a count. Generalized linear models extend the general linear model framework to address these issues.
A generalized linear model is made up of a linear predictor
\[\eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}\]
and two functions: a link function that describes how the mean, \(E(Y_i) = \mu_i\), depends on the linear predictor, \(g(\mu_i) = \eta_i\), and a variance function that describes how the variance, \(Var(Y_i)\), depends on the mean, \(Var(Y_i) = \phi V(\mu_i)\), where \(\phi\) is the dispersion parameter.
Binary data may occur in two forms: as individual 0/1 responses, or grouped as the number of successes out of a given number of trials. The natural distribution for such data is the \(\mbox{Binomial}(n, p)\) distribution, where in the first case \(n = 1\).
Suppose \(Y_i \sim \mbox{Binomial}(n_i, p_i)\) and we wish to model the proportions \(Y_i/n_i\). Then
\[E(Y_i/n_i) = p_i\] \[Var(Y_i/n_i)= \frac{1}{n_i}p_i(1-p_i)\]
The link function must map from \((0, 1)\) to the real line \((-\infty, \infty)\). A common choice is the logit:
\[\eta_i = g(p_i)=\mbox{logit}(p_i)=\log\left(\frac{p_i}{1-p_i}\right)\]
logit <- function(p){log(p/(1-p))}
p <- seq(0.01, 0.99, 0.01)  # avoid p = 0 and p = 1, where the logit is infinite
plot(p, logit(p), type = "l", xlab = "probability", ylab = "logit")
The inverse of the logit function (sometimes called the sigmoid function) maps real values back to \((0, 1)\), so that invlogit(logit(p)) = p:
\[invlogit(x) = \frac{exp(x)}{1+exp(x)} = \frac{1}{1+exp(-x)}\]
invlogit <- function(x){exp(x)/(1 + exp(x))}
x <- seq(-4, 4, 0.1)
plot(x, invlogit(x), type = "l", xlab = "logit", ylab = "probability")
abline(h = 0.5)
A binomial GLM with the logit link is known as the logistic regression model.
\[Y \sim Binomial(n, p)\]
The logit is the canonical link function for binomial data.
\[logit(p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j x_j + \ldots + \beta_p x_p\]
If we increase \(x_j\) by one unit
\[logit(p^*) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j (x_j+1) + \ldots + \beta_p x_p\]
\[logit(p^*) - logit(p) = \beta_j\]
\[\log\left(\frac{p^*}{1-p^*}\right) - \log\left(\frac{p}{1-p}\right) = \beta_j\]
\[\log\left(\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} \right) = \beta_j\] \[\mbox{Odds ratio (OR)}=\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} = exp(\beta_j)\]
Therefore,
\[\log\left(\frac{p}{1-p}\ \ |_{x_j+1}\right)-\log\left(\frac{p}{1-p}\ \ |_{x_j}\right) = \beta_j\]
\[\frac{p}{1-p} |_{x_j+1} = \frac{p}{1-p} |_{x_j} \times exp(\beta_j)\]
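A quick numerical check of this identity, using the invlogit function defined above and hypothetical coefficient values:

# Numerical check of the odds-ratio identity (hypothetical coefficients)
b0 <- -1; bj <- 0.7; xj <- 2
p1 <- invlogit(b0 + bj * xj)        # probability at x_j
p2 <- invlogit(b0 + bj * (xj + 1))  # probability at x_j + 1
(p2 / (1 - p2)) / (p1 / (1 - p1))   # odds ratio
exp(bj)                             # same value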
Example: the response variable is disease or no disease.
Consider the logistic model,
\[Y \sim Bernoulli(p)\]
\[\mbox{logit}(p)=\log(\mbox{odds})=\log\left(\frac{p}{1-p}\right)=\beta_0 + \beta_1 x_{1}\]
If we increase \(x_1\) by one unit,
\[\log\left(\frac{p^*}{1-p^*}\right)=\beta_0 + \beta_1 (x_{1}+1)\]
\[\log\left(\frac{p^*}{1-p^*}\right) - \log\left(\frac{p}{1-p}\right) = \beta_1\]
\[\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} = exp(\beta_1)\]
Therefore
\[\log\left(\frac{p}{1-p}\ \ |_{x_1+1}\right)-\log\left(\frac{p}{1-p}\ \ |_{x_1}\right) = \beta_1\]
\[\frac{p}{1-p} |_{x_1+1} = \frac{p}{1-p} |_{x_1} \times exp(\beta_1)\]
In a Childhood Cancer case-control study, cases were ascertained as all children under ten years of age in England and Wales who died of cancer during the period 1954-65.
Exposure variable was whether or not the children received in utero irradiation (XRAY), as reported by the mother.
The year of birth (YOB \(< 1954\), YOB \(\geq 1954\)) was considered as a confounder of the association between XRAY and cancer.
Let \(p\) = the probability of being a case in the case-control study
Model:
\[Y \sim Bernoulli(p)\]
\[logit(p)=log(odds)=\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 XRAY + \beta_2 BINYOB\] \[\log\left(\frac{p}{1-p}\right)= -0.061 + 0.501 XRAY + (-0.005)BINYOB\]
\[\log\left(\frac{p}{1-p}\ \ |_{x_j+1}\right)-\log\left(\frac{p}{1-p}\ \ |_{x_j}\right) = \beta_j\]
\[\frac{p}{1-p} |_{x_j+1} = \frac{p}{1-p} |_{x_j} \times exp(\beta_j)\]
Thus the estimated odds ratio for in utero irradiation, adjusted for year of birth, is \(exp(0.501) \approx 1.65\): exposed children have about 65% higher odds of being a case.
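The raw data are not shown in these notes, but assuming a data frame cancer with 0/1 columns case, XRAY and BINYOB (hypothetical names), the model and odds ratios could be obtained as sketched below:

# Hypothetical sketch: the data themselves are not shown in these notes
# mod <- glm(case ~ XRAY + BINYOB, family = binomial, data = cancer)
# exp(coef(mod))
exp(0.501)  # odds ratio for XRAY, approximately 1.65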
Collett (1991) describes an experiment on the toxicity to the tobacco budworm of the pyrethroid trans-cypermethrin. Batches of 20 moths of each sex were exposed to varying doses of the pyrethroid for three days and the number knocked out in each batch was recorded.
ldose <- rep(0:5, 2)
dead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
cbind(dead, ldose, sex)
##       dead ldose sex
##  [1,]    1     0   2
##  [2,]    4     1   2
##  [3,]    9     2   2
##  [4,]   13     3   2
##  [5,]   18     4   2
##  [6,]   20     5   2
##  [7,]    0     0   1
##  [8,]    2     1   1
##  [9,]    6     2   1
## [10,]   10     3   1
## [11,]   12     4   1
## [12,]   16     5   1
Generalized linear models can be fitted in R using the glm() function, which is similar to the lm() function for fitting linear models. Binomial responses can be specified in several ways, for example as a two-column matrix giving the numbers of successes and failures (as below), or as a 0/1 vector for individual binary data. The logit link is the default for the binomial family, so it does not need to be specified.
y <- cbind(dead, 20 - dead)
parr <- glm(y ~ ldose + sex, family = binomial)
summary(parr)
##
## Call:
## glm(formula = y ~ ldose + sex, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.10540 -0.65343 -0.02225 0.48471 1.42944
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4732 0.4685 -7.413 1.23e-13 ***
## ldose 1.0642 0.1311 8.119 4.70e-16 ***
## sexM 1.1007 0.3558 3.093 0.00198 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 124.8756 on 11 degrees of freedom
## Residual deviance: 6.7571 on 9 degrees of freedom
## AIC: 42.867
##
## Number of Fisher Scoring iterations: 4
\[Y_i \sim Binomial(n_i, p_i)\]
\[\log\left(\frac{p_i}{1-p_i}\right)= -3.47 + 1.06 \times \mbox{ldose}_i + 1.10 \times \mbox{sexM}_i\]
Therefore each unit increase in ldose multiplies the odds of death by \(exp(1.06) \approx 2.9\), and the odds for males are \(exp(1.10) \approx 3.0\) times those for females.
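Fitted probabilities at chosen covariate values can then be obtained with predict(); the doses below are illustrative:

# Predicted probabilities of death at illustrative doses
newdat <- data.frame(ldose = c(2, 4), sex = c("M", "F"))
predict(parr, newdata = newdat, type = "response")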
Count data, in which there is no upper limit to the number of counts, usually fall into two types: counts that can be modelled directly, and counts observed over varying exposures, which are modelled as rates (discussed below). Examples include the number of damage incidents to ships and the number of cases of a disease in a region. In such situations, the counts can be assumed to follow a Poisson distribution
\[Y_i \sim Po(\mu_i)\] Then \[E(Y_i)=\mu_i\] \[Var(Y_i)=\mu_i\]
The link function must map from \((0, \infty)\) to \((-\infty, \infty)\). A natural choice is \[\eta_i = g(\mu_i)=\log(\mu_i)\]
x <- seq(0.1, 100, 1)
plot(x, log(x), type = "l", xlab = "x", ylab = "log")
x <- seq(-2, 5, 0.1)
plot(x, exp(x), type = "l", xlab = "log", ylab = "x")
A Poisson GLM with the log link is known as the log-linear model.
\[Y \sim Po(\mu)\]
The log is the canonical link function for Poisson data.
\[\log(\mu) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j x_j + \ldots + \beta_p x_p\]
If we increase \(x_j\) by one unit,
\[\log(\mu^*) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j (x_j+1) + \ldots + \beta_p x_p\] \[\log(\mu^*) - \log(\mu) = \beta_j\]
\[\log\left(\frac{\mu^*}{\mu}\right) = \beta_j\]
\[\frac{\mu^*}{\mu}= exp(\beta_j)\]
\[\mu^* = exp(\beta_j) \mu\]
Therefore,
\[\log\left(\mu\ \ |_{x_j+1}\right)-\log\left(\mu\ \ |_{x_j}\right) = \beta_j\]
\[\mu |_{x_j+1} = \mu |_{x_j} \times exp(\beta_j)\]
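Again, a quick numerical check of this multiplicative effect, with hypothetical coefficient values:

# Numerical check of the rate-ratio identity (hypothetical coefficients)
b0 <- 0.5; bj <- 0.3; xj <- 1
mu1 <- exp(b0 + bj * xj)        # mean at x_j
mu2 <- exp(b0 + bj * (xj + 1))  # mean at x_j + 1
mu2 / mu1                       # rate ratio
exp(bj)                         # same value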
In many cases we are making comparisons across observation units \(i = 1, \ldots, n\) with different levels of exposure to the event, and hence the measure of interest is the rate of occurrence, e.g., the number of damage incidents per aggregate months of service or the number of disease cases per person-years at risk.
Since the counts are Poisson distributed, we would like to use a GLM to model the expected rate, \(\lambda_i=\mu_i/t_i\), where \(t_i\) is the exposure for unit \(i\).
\[Y_i \sim Po(\mu_i)\]
\[\mu_i = \lambda_i t_i\]
\[\log(\mu_i/t_i)= \log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{ip}\]
\[\log(\mu_i)= \log(t_i) + \log(\lambda_i) = \log(t_i) + \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{ip}\]
The standardizing term \(\log(t_i)\) is an example of an offset: a term with a fixed coefficient of 1.
Offsets are easily specified to glm(), either using the offset argument or using the offset function in the formula, e.g., offset(log(time)).
If all the observations have the same exposure, the model does not need an offset term and we can model \(\log(\mu_i)\) directly.
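As a sketch (simulated data, hypothetical variable names), the two ways of specifying the offset give identical fits:

# The two offset specifications are equivalent (simulated data)
set.seed(2)
dat <- data.frame(x = rnorm(50), time = runif(50, 1, 10))
dat$y <- rpois(50, lambda = dat$time * exp(0.2 + 0.5 * dat$x))
m1 <- glm(y ~ x, family = poisson, offset = log(time), data = dat)
m2 <- glm(y ~ x + offset(log(time)), family = poisson, data = dat)
all.equal(coef(m1), coef(m2))  # TRUE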
\[\log\left(\lambda\ \ |_{x_j+1}\right)-\log\left(\lambda\ \ |_{x_j}\right) = \beta_j\]
\[\lambda |_{x_j+1} = \lambda |_{x_j} \times exp(\beta_j)\]
The ships data from the MASS package concern a type of damage caused by waves to the forward section of cargo-carrying vessels. The variables are:

- incidents: number of damage incidents
- service: aggregate months of service
- period: period of operation: 1960-74, 75-79
- year: year of construction: 1960-64, 65-69, 70-74, 75-79
- type: ship type, A to E

Here it makes sense to model the expected number of incidents per aggregate months of service.
Let us consider a log-linear model including all the variables. We convert the period and year variables to factors.
library(MASS)
data(ships)
# Model non-zero counts. We exclude ships with 0 months of service
ships2 <- subset(ships, service > 0)
ships2$year <- as.factor(ships2$year)
ships2$period <- as.factor(ships2$period)
res <- glm(incidents ~ type + year + period, family = "poisson", data = ships2, offset = log(service))
summary(res)
##
## Call:
## glm(formula = incidents ~ type + year + period, family = "poisson",
## data = ships2, offset = log(service))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6768 -0.8293 -0.4370 0.5058 2.7912
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.40590 0.21744 -29.460 < 2e-16 ***
## typeB -0.54334 0.17759 -3.060 0.00222 **
## typeC -0.68740 0.32904 -2.089 0.03670 *
## typeD -0.07596 0.29058 -0.261 0.79377
## typeE 0.32558 0.23588 1.380 0.16750
## year65 0.69714 0.14964 4.659 3.18e-06 ***
## year70 0.81843 0.16977 4.821 1.43e-06 ***
## year75 0.45343 0.23317 1.945 0.05182 .
## period75 0.38447 0.11827 3.251 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 146.328 on 33 degrees of freedom
## Residual deviance: 38.695 on 25 degrees of freedom
## AIC: 154.56
##
## Number of Fisher Scoring iterations: 5
An increase of one unit in the \(x_j\) variable is associated with a multiplicative change of \(exp(\beta_j)\) in the rate (holding the rest of the variables constant).
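The estimated rate ratios, and if desired profile-likelihood confidence intervals, can be obtained by exponentiating the fitted coefficients:

# Rate ratios for the ships model fitted above
exp(coef(res))
exp(confint(res))  # profile-likelihood confidence intervals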
The normal linear model is a special case of a GLM, with normal observations
\[Y_i \sim N(\mu_i, \sigma^2)\]
linear predictor
\[\eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}\]
and the identity link function
\[g(\mu_i) = \mu_i\]
Griffiths, Hill and Judge (1993) present a dataset on food expenditure for households that have three family members. We consider two variables, the expenditure on food and the household income:
d <- read.table("data/GHJ_food_income.txt", header = TRUE)
plot(d$Income, d$Food, xlab = "Weekly Household Income ($)",
     ylab = "Weekly Household Expenditure on Food ($)")
We first fit the model using lm(), then compare to the results using glm(). The default family for glm() is "gaussian", so the arguments of the call are unchanged.
foodLM <- lm(Food ~ Income, data = d)
summary(foodLM)
##
## Call:
## lm(formula = Food ~ Income, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50837 -0.15781 -0.00536 0.18789 0.49142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.409418 0.161976 14.875 < 2e-16 ***
## Income 0.009976 0.002234 4.465 6.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2766 on 38 degrees of freedom
## Multiple R-squared: 0.3441, Adjusted R-squared: 0.3268
## F-statistic: 19.94 on 1 and 38 DF, p-value: 6.951e-05
foodGLM <- glm(Food ~ Income, data = d)
summary(foodGLM)
##
## Call:
## glm(formula = Food ~ Income, data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.50837 -0.15781 -0.00536 0.18789 0.49142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.409418 0.161976 14.875 < 2e-16 ***
## Income 0.009976 0.002234 4.465 6.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.07650739)
##
## Null deviance: 4.4325 on 39 degrees of freedom
## Residual deviance: 2.9073 on 38 degrees of freedom
## AIC: 14.649
##
## Number of Fisher Scoring iterations: 2
The estimated coefficients are unchanged. The t-tests assess the significance of each coefficient in the presence of the others.
The dispersion parameter \(\phi (= \sigma^2)\) for the Gaussian family is equal to the residual variance \(\hat \sigma^2 = \sum (y_i -\hat y_i)^2/(n-p-1)\).
summary(foodLM)$sigma
## [1] 0.2765997
summary(foodLM)$sigma^2
## [1] 0.07650739
summary(foodGLM)$dispersion
## [1] 0.07650739
The deviance of a model is defined through the difference of the log-likelihoods between the saturated model (number of parameters equal to the number of observations) and the fitted model:
\[D = 2 \left( l({\boldsymbol{\hat\beta_{sat}}}) - l(\boldsymbol{\hat \beta}) \right) \phi,\]
where \(l({\boldsymbol{\hat\beta_{sat}}})\) is the log-likelihood evaluated at the MLE of the saturated model, and \(l(\boldsymbol{\hat \beta})\) is the log-likelihood evaluated at the MLE of the fitted model.
A related quantity is the scaled deviance
\[D^* = D/\phi = 2 \left( l({\boldsymbol{\hat\beta_{sat}}})- l(\boldsymbol{\hat \beta}) \right)\]
If \(\phi=1\) such as in the binomial or Poisson regression models, then both the deviance and the scaled deviance are the same.
The scaled deviance has an asymptotic chi-squared distribution, with degrees of freedom equal to the difference in the number of parameters between the two models.
\[D^* \sim \chi^2_{n-p-1}\]
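This gives an approximate goodness-of-fit test: for the Poisson ships model fitted earlier, the residual deviance can be compared to its asymptotic chi-squared distribution (a rough check; the approximation can be poor for small counts):

# Approximate goodness-of-fit p-value for the ships model
pchisq(deviance(res), df.residual(res), lower.tail = FALSE)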
Example:
For linear regression with Gaussian data, the deviance is equal to the residual sum of squares RSS: \(\sum (y_i - \hat y_i)^2\).
In the case of Gaussian model:
\[D^* = \frac{RSS}{\sigma^2} = \frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{\sigma^2} \mbox{ is distributed as a } \chi^2_{n-p-1}\]
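This can be verified for the Gaussian food model fitted above:

# For the Gaussian model, the deviance equals the residual sum of squares
deviance(foodGLM)
sum(residuals(foodLM)^2)  # same value, 2.9073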
Deviance residuals

The deviance can be written as a sum of contributions from the individual observations,
\[D = \sum_{i=1}^n d_i\]
where \(d_i\) is the component of the deviance contributed by the \(i\)th datum; the corresponding deviance residual is \(r_i = \mbox{sign}(y_i - \hat y_i)\sqrt{d_i}\).
In the R output, a five-number summary of the deviance residuals is given. Since the response is assumed to be normally distributed, these are the same as the residuals returned from lm().
Null deviance and Residual deviance
The R output shows the deviance of two models:
Null deviance
The null deviance is the deviance between the saturated model and the null model (model with just an intercept). Degrees of freedom are the number of data points \(n\) minus 1 for the intercept.
Residual Deviance
The residual deviance is the deviance between the saturated model and the fitted model. It has \(n - p - 1\) degrees of freedom, where \(p\) is the number of explanatory variables and 1 comes from the intercept.
The AIC is a measure of fit that penalizes for the number of parameters \(p\)
\[AIC = -2 l + 2(p+1)\] where \(l\) is the maximized log likelihood of the model and \(p+1\) the number of model parameters including the intercept.
Smaller values indicate better fit and thus the AIC can be used to compare models (not necessarily nested).
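For example, AIC() can be used to compare the ships model with a reduced model (a sketch; dropping the period term is purely illustrative):

# Compare the full ships model with one omitting the period term
res2 <- update(res, . ~ . - period)
AIC(res, res2)  # the model with the smaller AIC is preferred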