- Generalized linear models (GLMs)

- Binary data
- Count data
- Normal data

In a general linear model

\[y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \epsilon_i\]

The response \(y_i\), \(i = 1, \ldots, n\), is modelled by a linear function of explanatory variables \(x_{ji}\), \(j = 1, \ldots, p\), plus an error term.

Here **general** refers to the dependence on potentially more than one explanatory variable, versus the **simple linear model**:

\[y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i\]

The model is **linear** in the parameters, e.g.,

\[y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon_i\]

but not, e.g.,

\[y_i = \beta_0 + \beta_1 x_1^{\beta_2} + \epsilon_i\]
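The first, quadratic model can still be fitted by ordinary least squares because it is linear in the \(\beta\)s. A minimal sketch with simulated data (all values illustrative):

```
# linear in the parameters: a quadratic curve fitted with lm()
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(50, sd = 0.5)
fit <- lm(y ~ x + I(x^2))   # I() protects the squared term in the formula
coef(fit)                   # estimates of beta0, beta1, beta2
```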

We assume that the **errors** \(\epsilon_i\) are independent and identically distributed such that

\[E[\epsilon_i] = 0\] \[Var[\epsilon_i] = \sigma^2\]

Although a very useful framework, there are some situations where general linear models are not appropriate.

In simple and multiple regression the response variable is a continuous variable (e.g., BMI, systolic blood pressure)

\[y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \epsilon_i\]

We can have other situations where the response variable is binary or a count.

**Generalized linear models** extend the general linear model framework to address these issues.

A generalized linear model is made up of a linear predictor

\[\eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}\]

and two functions

- a **link** function that describes how the mean, \(E[Y_i]=\mu_i\), depends on the linear predictor: \(g(\mu_i) = \eta_i\)
- a **variance** function that describes how the variance depends on the mean: \(Var(Y_i) = V(\mu_i)\phi\), where the dispersion parameter \(\phi\) is a constant

Binary data may occur in two forms

- **ungrouped**, in which the response variable is a binary variable (a categorical variable with two categories) that can take one of two values, say success/failure
- **grouped**, in which the response variable is the number of successes in a given number of trials

The natural distribution for such data is the \(\mbox{Binomial}(n, p)\) distribution, where in the first case \(n = 1\)

Suppose

\[Y_i \sim Binomial(n_i, p_i)\] and we wish to model the proportions \(Y_i/n_i\). Then

\[E(Y_i/n_i) = p_i\] \[Var(Y_i/n_i)= \frac{1}{n_i}p_i(1-p_i)\]

The link function must map from \((0, 1)\) to the real line \((-\infty, \infty)\). A common choice is the logit

\[\eta_i = g(p_i)=\mbox{logit}(p_i)=\log\left(\frac{p_i}{1-p_i}\right)\]

```
logit <- function(p){log(p/(1-p))}
p <- seq(0, 1, 0.01)
plot(p, logit(p), type = "l", xlab = "probability", ylab = "logit")
```

The inverse of the logit function is the sigmoid function: invlogit(logit(p)) = p

The inverse of the logit function maps real values to (0, 1).

\[invlogit(x) = \frac{exp(x)}{1+exp(x)} = \frac{1}{1+exp(-x)}\]

```
invlogit <- function(x){exp(x)/(1 + exp(x))}
x <- seq(-4, 4, 0.1)
plot(x, invlogit(x), type = "l", xlab = "logit", ylab = "probability")
abline(h = 0.5)
```

A binomial GLM with the logit link is known as the **logistic regression model**.

\[Y \sim Binomial(n, p)\]

The canonical link function for binomial data is the logit link.

\[logit(p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j x_j + \ldots + \beta_p x_p\]

If we increase \(x_j\) by one unit

\[logit(p^*) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j (x_j+1) + \ldots + \beta_p x_p\]

\[logit(p^*) - logit(p) = \beta_j\]

\[\log\left(\frac{p^*}{1-p^*}\right) - \log\left(\frac{p}{1-p}\right) = \beta_j\]

\[\log\left(\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} \right) = \beta_j\] \[\mbox{Odds ratio (OR)}=\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} = exp(\beta_j)\]

Therefore,

- \(\beta_j\) is the increase or decrease in the log-odds associated with one-unit increase in \(x_j\), holding all other variables constant

\[\log\left(\frac{p}{1-p}\ \ |_{x_j+1}\right)-\log\left(\frac{p}{1-p}\ \ |_{x_j}\right) = \beta_j\]

- \(exp(\beta_j)\) is the odds ratio \[OR= \frac{\frac{p}{1-p}\ |_{x_j+1}} { \frac{p}{1-p} \ |_{x_j}} = exp(\beta_j)\]
- When \(x_j\) increases one unit, the odds are multiplied by \(exp(\beta_j)\)

\[\frac{p}{1-p} |_{x_j+1} = \frac{p}{1-p} |_{x_j} \times exp(\beta_j)\]
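A quick numerical check of these identities, using hypothetical coefficient values:

```
# verify that a one-unit increase in x multiplies the odds by exp(beta1)
invlogit <- function(eta) 1 / (1 + exp(-eta))
beta0 <- -1; beta1 <- 0.7; x <- 2          # illustrative values
p  <- invlogit(beta0 + beta1 * x)          # probability at x
ps <- invlogit(beta0 + beta1 * (x + 1))    # probability at x + 1
(ps / (1 - ps)) / (p / (1 - p))            # odds ratio: equals exp(0.7)
```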

Example: response variable disease or no disease

- We use variable \(Y = 1\) if disease, \(Y = 0\) if no disease
- \(p\) is the probability response variable \(Y = 1\) (disease)
- \(\frac{p}{1-p}\) is the odds of disease

Consider the logistic model,

\[Y \sim Bernoulli(p)\]

\[\mbox{logit}(p)=\log(\mbox{odds})=\log\left(\frac{p}{1-p}\right)=\beta_0 + \beta_1 x_{1}\]

If we increase \(x_1\) by one unit,

\[\log\left(\frac{p^*}{1-p^*}\right)=\beta_0 + \beta_1 (x_{1}+1)\]

\[\log\left(\frac{p^*}{1-p^*}\right) - \log\left(\frac{p}{1-p}\right) = \beta_1\]

\[\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} = exp(\beta_1)\]

Therefore

- \(\beta_1\) is the increase or decrease in the log-odds associated with one-unit increase in \(x_1\)

\[\log\left(\frac{p}{1-p}\ \ |_{x_1+1}\right)-\log\left(\frac{p}{1-p}\ \ |_{x_1}\right) = \beta_1\]

- \(exp(\beta_1)\) is the odds ratio \[OR= \frac{\frac{p}{1-p}\ |_{x_1+1}} { \frac{p}{1-p} \ |_{x_1}} = exp(\beta_1)\]
- When \(x_1\) increases one unit, the odds are multiplied by \(exp(\beta_1)\)

\[\frac{p}{1-p} |_{x_1+1} = \frac{p}{1-p} |_{x_1} \times exp(\beta_1)\]

In a Childhood Cancer case-control study, cases were ascertained as all children under ten years of age in England and Wales who died of cancer during the period 1954-65.

Exposure variable was whether or not the children received in utero irradiation (XRAY), as reported by the mother.

The year of birth (YOB \(< 1954\), YOB \(\geq 1954\)) was considered as a confounder of the association between XRAY and cancer.

Let \(p\) = the probability of being a case in the case-control study

- XRAY=1 if the child received in utero irradiation and 0 if not
- BINYOB=1 if YOB \(\geq 1954\) and 0 if YOB \(< 1954\)

Model:

\[Y \sim Bernoulli(p)\]

\[logit(p)=log(odds)=\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 XRAY + \beta_2 BINYOB\] \[\log\left(\frac{p}{1-p}\right)= -0.061 + 0.501 XRAY + (-0.005)BINYOB\]

- 0.501 is the log odds ratio of cancer comparing a person with XRAY=1 to another person with XRAY=0 and the same BINYOB.
- -0.005 is the log odds ratio of cancer comparing a person with BINYOB=1 to another person with BINYOB=0 and the same XRAY.
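Exponentiating the reported coefficients gives the adjusted odds ratios:

```
# adjusted odds ratios from the reported log-odds coefficients
exp(0.501)   # OR for XRAY: about 1.65
exp(-0.005)  # OR for BINYOB: about 0.995, essentially no effect
```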


Collett (1991) describes an experiment on the toxicity of the pyrethroid trans-cypermethrin to the tobacco budworm. Batches of 20 moths of each sex were exposed to varying doses of the pyrethroid for three days and the number knocked out in each batch was recorded.

```
ldose <- rep(0:5, 2)
dead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
cbind(dead, ldose, sex)
```

```
## dead ldose sex
## [1,] 1 0 2
## [2,] 4 1 2
## [3,] 9 2 2
## [4,] 13 3 2
## [5,] 18 4 2
## [6,] 20 5 2
## [7,] 0 0 1
## [8,] 2 1 1
## [9,] 6 2 1
## [10,] 10 3 1
## [11,] 12 4 1
## [12,] 16 5 1
```

Generalized linear models can be fitted in R using the `glm()` function, which is similar to the `lm()` function for fitting linear models.

Binomial responses can be specified as follows:

- a numeric 0/1 vector (0 = failure), a logical vector (FALSE = failure), or a factor (first level = failure)
- a two-column matrix with the number of successes and the number of failures

The logit link is the default for the binomial family so does not need to be specified.

```
y <- cbind(dead, 20 - dead)
parr <- glm(y ~ ldose + sex, family = binomial)
summary(parr)
```

```
##
## Call:
## glm(formula = y ~ ldose + sex, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.10540 -0.65343 -0.02225 0.48471 1.42944
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4732 0.4685 -7.413 1.23e-13 ***
## ldose 1.0642 0.1311 8.119 4.70e-16 ***
## sexM 1.1007 0.3558 3.093 0.00198 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 124.8756 on 11 degrees of freedom
## Residual deviance: 6.7571 on 9 degrees of freedom
## AIC: 42.867
##
## Number of Fisher Scoring iterations: 4
```

\[Y_i \sim Binomial(n_i, p_i)\]

\[\log\left(\frac{p_i}{1-p_i}\right)= -3.47 + 1.06 \times \mbox{ldose}_i + 1.10 \times \mbox{sexM}_i\] Therefore

- the odds of death for a male moth are exp(1.10) = 3.01 times those for a female moth, at a fixed dose of the pyrethroid.
- the odds of death increase by a factor of exp(1.06) = 2.90 for each one-unit increase in ldose, for male or female moths.
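These odds ratios can be obtained directly from the fitted model; the snippet below re-creates the `parr` fit so it stands alone:

```
# odds ratios for the budworm model
ldose <- rep(0:5, 2)
dead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
parr <- glm(cbind(dead, 20 - dead) ~ ldose + sex, family = binomial)
exp(coef(parr))   # OR per unit of ldose ~2.90, OR for males ~3.01
```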

Count data, in which there is no upper limit to the number of counts, usually fall into two types:

- **rates**: counts per unit of time/area/distance, etc.
- **contingency tables**: counts cross-classified by categorical variables

Examples include

- number of household burglaries in a city in a given year
- number of customers served by a salesperson in a given month
- number of train accidents in a given year

In such situations, the counts can be assumed to follow a Poisson distribution

\[Y_i \sim Po(\mu_i)\] Then \[E(Y_i)=\mu_i\] \[Var(Y_i)=\mu_i\]

The link function must map from \((0, \infty)\) to \((-\infty, \infty)\). A natural choice is \[\eta_i = g(\mu_i)=\log(\mu_i)\]

```
x <- seq(0.1, 100, 1)
plot(x, log(x), type = "l", xlab = "x", ylab = "log")
```

```
x <- seq(-2, 5, 0.1)
plot(x, exp(x), type = "l", xlab = "log", ylab = "x")
```

A Poisson GLM with the log link is known as the **log-linear model**.

\[Y \sim Po(\mu)\]

The canonical link function for Poisson data is the log link.

\[\log(\mu) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j x_j + \ldots + \beta_p x_p\]

If we increase \(x_j\) by one unit

\[\log(\mu^*) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j (x_j+1) + \ldots + \beta_p x_p\] \[\log(\mu^*) - \log(\mu) = \beta_j\]

\[\log\left(\frac{\mu^*}{\mu}\right) = \beta_j\]

\[\frac{\mu^*}{\mu}= exp(\beta_j)\]

\[\mu^* = exp(\beta_j) \mu\]

Therefore,

- \(\beta_j\) is the increase or decrease in the log of the mean associated with one-unit increase in \(x_j\), holding all other variables constant

\[\log\left(\mu\ \ |_{x_j+1}\right)-\log\left(\mu\ \ |_{x_j}\right) = \beta_j\]

- \(exp(\beta_j)\) is the ratio of means \[\frac{\mu\ |_{x_j+1}} {\mu \ |_{x_j}} = exp(\beta_j)\]
- When \(x_j\) increases one unit, the mean is multiplied by \(exp(\beta_j)\)

\[\mu |_{x_j+1} = \mu |_{x_j} \times exp(\beta_j)\]
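A quick numerical check with hypothetical coefficient values:

```
# a one-unit increase in x multiplies the Poisson mean by exp(beta1)
beta0 <- 0.3; beta1 <- 0.2; x <- 4      # illustrative values
mu  <- exp(beta0 + beta1 * x)           # mean at x
mus <- exp(beta0 + beta1 * (x + 1))     # mean at x + 1
mus / mu                                # equals exp(0.2)
```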

In many cases we are making comparisons across observation units \(i = 1, \ldots, n\) with different levels of exposure to the event and hence the measure of interest is the rate of occurrence, e.g.

- number of household burglaries per 10,000 households in city \(i\) in a given year
- number of customers served per hour by salesperson \(i\) in a given month
- number of train accidents per billion train-kilometers in year \(i\)

Since the counts are Poisson distributed, we would like to use a GLM to model the expected rate, \(\lambda_i=\mu_i/t_i\), where \(t_i\) is the exposure for unit \(i\).

\[Y_i \sim Po(\mu_i)\]

\[\mu_i = \lambda_i t_i\]

\[\log(\mu_i/t_i)= \log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{ip}\]

\[\log(\mu_i)= \log(t_i) + \log(\lambda_i) = \log(t_i) + \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{ip}\]

The standardizing term \(\log(t_i)\) is an example of an **offset**: a term with a fixed coefficient of 1.

Offsets are easily specified to `glm()`, either using the `offset` argument or using the `offset()` function in the formula, e.g. `offset(log(time))`.

If all the observations have the same exposure, the model does not need an offset term and we can model \(\log(\mu_i)\) directly.

- \(\beta_j\) is the increase or decrease in the log of the rate associated with one-unit increase in \(x_j\), holding all other variables constant

\[\log\left(\lambda\ \ |_{x_j+1}\right)-\log\left(\lambda\ \ |_{x_j}\right) = \beta_j\]

- \(exp(\beta_j)\) is the rate ratio \[\frac{\lambda\ |_{x_j+1}} {\lambda \ |_{x_j}} = exp(\beta_j)\]
- When \(x_j\) increases one unit, the rate is multiplied by \(exp(\beta_j)\)

\[\lambda |_{x_j+1} = \lambda |_{x_j} \times exp(\beta_j)\]

The `ships` data from the `MASS` package concern a type of damage caused by waves to the forward section of cargo-carrying vessels. The variables are

- `incidents`: number of damage incidents
- `service`: aggregate months of service
- `period`: period of operation: 1960-74, 1975-79
- `year`: year of construction: 1960-64, 65-69, 70-74, 75-79
- `type`: ship type, A to E

Here it makes sense to model the expected number of incidents per aggregate months of service.

Let us consider a log-linear model including all the variables. We convert the `period` and `year` variables to factors.

```
library(MASS)
data(ships)
# Model non-zero counts: we exclude ships with 0 months of service
ships2 <- subset(ships, service > 0)
ships2$year <- as.factor(ships2$year)
ships2$period <- as.factor(ships2$period)
res <- glm(incidents ~ type + year + period, family = "poisson",
           data = ships2, offset = log(service))
summary(res)
```

```
##
## Call:
## glm(formula = incidents ~ type + year + period, family = "poisson",
## data = ships2, offset = log(service))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6768 -0.8293 -0.4370 0.5058 2.7912
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.40590 0.21744 -29.460 < 2e-16 ***
## typeB -0.54334 0.17759 -3.060 0.00222 **
## typeC -0.68740 0.32904 -2.089 0.03670 *
## typeD -0.07596 0.29058 -0.261 0.79377
## typeE 0.32558 0.23588 1.380 0.16750
## year65 0.69714 0.14964 4.659 3.18e-06 ***
## year70 0.81843 0.16977 4.821 1.43e-06 ***
## year75 0.45343 0.23317 1.945 0.05182 .
## period75 0.38447 0.11827 3.251 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 146.328 on 33 degrees of freedom
## Residual deviance: 38.695 on 25 degrees of freedom
## AIC: 154.56
##
## Number of Fisher Scoring iterations: 5
```

An increase of one unit in the \(x_j\) variable is associated with a multiplicative change in the rate by \(exp(\beta_j)\), holding the rest of the variables constant.
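For example, exponentiating some of the reported `ships` coefficients gives the adjusted rate ratios:

```
# rate ratios implied by the reported coefficients
exp(-0.54334)  # typeB: rate about 0.58 times that of typeA
exp(0.69714)   # year65: rate about 2.01 times that of year60
exp(0.38447)   # period75: rate about 1.47 times that of period60
```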

The normal linear model is a special case of a GLM

Normal observations

\[Y_i \sim N(\mu_i, \sigma^2)\] Linear predictor

\[\eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}\]

Link function (identity)

\[g(\mu_i) = \mu_i\]

Griffiths, Hill and Judge (1993) present a dataset on food expenditure for households that have three family members. We consider two variables, the expenditure on food and the household income:

```
d <- read.table("data/GHJ_food_income.txt", header = TRUE)
plot(d$Income, d$Food, xlab = "Weekly Household Income ($)",
     ylab = "Weekly Household Expenditure on Food ($)")
```

We first fit the model using `lm()`, then compare to the results using `glm()`. The default family for `glm()` is `"gaussian"`, so the arguments of the call are unchanged.

```
foodLM <- lm(Food ~ Income, data = d)
summary(foodLM)
```

```
##
## Call:
## lm(formula = Food ~ Income, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50837 -0.15781 -0.00536 0.18789 0.49142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.409418 0.161976 14.875 < 2e-16 ***
## Income 0.009976 0.002234 4.465 6.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2766 on 38 degrees of freedom
## Multiple R-squared: 0.3441, Adjusted R-squared: 0.3268
## F-statistic: 19.94 on 1 and 38 DF, p-value: 6.951e-05
```

```
foodGLM <- glm(Food ~ Income, data = d)
summary(foodGLM)
```

```
##
## Call:
## glm(formula = Food ~ Income, data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.50837 -0.15781 -0.00536 0.18789 0.49142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.409418 0.161976 14.875 < 2e-16 ***
## Income 0.009976 0.002234 4.465 6.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.07650739)
##
## Null deviance: 4.4325 on 39 degrees of freedom
## Residual deviance: 2.9073 on 38 degrees of freedom
## AIC: 14.649
##
## Number of Fisher Scoring iterations: 2
```

The estimated coefficients are unchanged. t-tests test the significance of each coefficient in the presence of the others.

The dispersion parameter \(\phi (= \sigma^2)\) for the Gaussian family is equal to the residual variance \(\hat \sigma^2 = \sum (y_i -\hat y_i)^2/(n-p-1)\).

`summary(foodLM)$sigma`

`## [1] 0.2765997`

`summary(foodLM)$sigma^2`

`## [1] 0.07650739`

`summary(foodGLM)$dispersion`

`## [1] 0.07650739`

The **deviance** of a model is defined through the difference of the log-likelihoods between the saturated model (number of parameters equal to the number of observations) and the fitted model:

\[D = 2 \left( l({\boldsymbol{\hat\beta_{sat}}}) - l(\boldsymbol{\hat \beta}) \right) \phi,\]

\(l({\boldsymbol{\hat\beta_{sat}}})\) is the log-likelihood evaluated at the MLE of the **saturated model**, and \(l(\boldsymbol{\hat \beta})\) is the log-likelihood evaluated at the MLE of the fitted model.

A related quantity is the **scaled deviance**

\[D^* = D/\phi = 2 \left( l({\boldsymbol{\hat\beta_{sat}}})- l(\boldsymbol{\hat \beta}) \right)\]

If \(\phi=1\) such as in the binomial or Poisson regression models, then both the deviance and the scaled deviance are the same.

The scaled deviance has an asymptotic chi-squared distribution, with degrees of freedom equal to the difference in the number of parameters between the two models.

\[D^* \sim \chi^2_{n-p-1}\]
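As a sketch, the residual deviance from the budworm fit above (6.7571 on 9 degrees of freedom) can be compared to its asymptotic chi-squared distribution:

```
# approximate goodness-of-fit p-value for the budworm model
1 - pchisq(6.7571, df = 9)   # large p-value: no evidence of lack of fit
```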

Example:

For linear regression with Gaussian data, the deviance is equal to the residual sum of squares RSS: \(\sum (y_i - \hat y_i)^2\).

In the case of Gaussian model:

\[D^* = \frac{RSS}{\sigma^2} = \frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{\sigma^2} \mbox{ is distributed as a } \chi^2_{n-p-1}\]
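This can be checked numerically: for a Gaussian GLM, `deviance()` returns exactly the residual sum of squares (simulated data for illustration):

```
# deviance of a gaussian GLM equals the residual sum of squares
set.seed(42)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- glm(y ~ x)   # gaussian family is the default
all.equal(deviance(fit), sum(residuals(fit)^2))   # TRUE
```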

**Deviance residuals**

The deviance can be written as

\[D = \sum_{i=1}^n d_i\]

where \(d_i\) is the component of the deviance contributed by the \(i\)th datum.

In the R output, a five-number summary of the **deviance residuals** is given. Since the response is assumed to be normally distributed, these are the same as the residuals returned from `lm()`.

**Null deviance and Residual deviance**

The R output shows the **deviance** of two models:

`Null deviance`

The null deviance is the deviance between the saturated model and the null model (model with just an intercept). Degrees of freedom are the number of data points \(n\) minus 1 for the intercept.

`Residual Deviance`

The residual deviance is the deviance between the saturated model and the fitted model. \(n − p - 1\) degrees of freedom, where \(p\) is the number of explanatory variables and 1 comes from the intercept.

The AIC is a measure of fit that penalizes for the number of parameters \(p\)

\[AIC = -2 l + 2(p+1)\] where \(l\) is the maximized log-likelihood of the model and \(p+1\) is the number of model parameters, including the intercept.

Smaller values indicate better fit and thus the AIC can be used to compare models (not necessarily nested).
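The formula can be verified on the budworm fit (re-created here so the snippet stands alone; \(p + 1 = 3\) parameters):

```
# AIC = -2*l + 2*(p + 1) for the binomial budworm model
ldose <- rep(0:5, 2)
dead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
parr <- glm(cbind(dead, 20 - dead) ~ ldose + sex, family = binomial)
c(manual = -2 * as.numeric(logLik(parr)) + 2 * 3, AIC = AIC(parr))
```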