# Learning objectives

• Generalized linear models (GLMs)
• Binary data
• Count data
• Normal data

# 2 The general linear model

In a general linear model

$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \epsilon_i$

The response $$y_i$$, $$i = 1, \ldots, n$$, is modelled by a linear function of explanatory variables $$x_{ji}$$, $$j = 1, \ldots, p$$, plus an error term.

Here general refers to the dependence on potentially more than one explanatory variable, versus the simple linear model:

$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$

The model is linear in the parameters, e.g.,

$y_i = \beta_0 + \beta_1 x_{i} + \beta_2 x_{i}^2 + \epsilon_i$

but not, e.g.,

$y_i = \beta_0 + \beta_1 x_{i}^{\beta_2} + \epsilon_i$

We assume that the errors $$\epsilon_i$$ are independent and identically distributed such that

$E[\epsilon_i] = 0$

$Var[\epsilon_i] = \sigma^2$

Although a very useful framework, there are some situations where general linear models are not appropriate.

In simple and multiple regression the response variable is continuous (e.g., BMI, systolic blood pressure)

$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \epsilon_i$

We can also have situations where the response variable is binary or a count.

Generalized linear models extend the general linear model framework to address these issues.

# 3 Generalized linear models (GLMs)

A generalized linear model is made up of a linear predictor

$\eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}$

and two functions

• a link function that describes how the mean, $$E[Y_i]=\mu_i$$ depends on the linear predictor: $$g(\mu_i) = \eta_i$$
• a variance function that describes how the variance depends on the mean, $$Var(Y_i) = V(\mu_i)\phi$$, where the dispersion parameter $$\phi$$ is a constant
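These two ingredients can be inspected directly in R: each family object returned by binomial(), poisson(), gaussian(), etc. carries the name of its link and its variance function $$V(\mu)$$. A short sketch:

```r
# Each GLM family in R bundles a link and a variance function V(mu)
fam <- binomial()       # default link for the binomial family
fam$link                # "logit"
fam$variance(0.5)       # V(mu) = mu * (1 - mu) = 0.25
poisson()$variance(3)   # V(mu) = mu = 3
gaussian()$variance(3)  # V(mu) = 1 (constant variance; here phi = sigma^2)
```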

# 4 Modeling Binomial data

Binary data may occur in two forms

• ungrouped, in which the response is a binary variable that takes one of two values, say success/failure
• grouped, in which the response is the number of successes in a given number of trials

The natural distribution for such data is the $$\mbox{Binomial}(n, p)$$ distribution, where in the ungrouped case $$n = 1$$.

Suppose

$Y_i \sim Binomial(n_i, p_i)$

and we wish to model the proportions $$Y_i/n_i$$. Then

$E(Y_i/n_i) = p_i$

$Var(Y_i/n_i)= \frac{1}{n_i}p_i(1-p_i)$

The link function must map from $$(0, 1)$$ to the real line $$(-\infty, \infty)$$. A common choice is the logit

$\eta_i = g(p_i)=\mbox{logit}(p_i)=\log\left(\frac{p_i}{1-p_i}\right)$

logit <- function(p){log(p/(1-p))}
p <- seq(0, 1, 0.01)
plot(p, logit(p), type = "l", xlab = "probability", ylab = "logit")

The inverse of the logit function is the sigmoid function: invlogit(logit(p)) = p.

The inverse of the logit function maps real values to (0, 1).

$invlogit(x) = \frac{exp(x)}{1+exp(x)} = \frac{1}{1+exp(-x)}$

invlogit <- function(x){exp(x)/(1 + exp(x))}
x <- seq(-4, 4, 0.1)
plot(x, invlogit(x), type = "l", xlab = "logit", ylab = "probability")
abline(h = 0.5)

A binomial GLM with the logit link is known as the logistic regression model.

$Y \sim Binomial(n, p)$

$logit(p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j x_j + \ldots + \beta_p x_p$

If we increase $$x_j$$ by one unit

$logit(p^*) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j (x_j+1) + \ldots + \beta_p x_p$

$logit(p^*) - logit(p) = \beta_j$

$\log\left(\frac{p^*}{1-p^*}\right) - \log\left(\frac{p}{1-p}\right) = \beta_j$

$\log\left(\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} \right) = \beta_j$

$\mbox{Odds ratio (OR)}=\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} = exp(\beta_j)$

Therefore,

• $$\beta_j$$ is the increase or decrease in the log-odds associated with one-unit increase in $$x_j$$, holding all other variables constant

$\log\left(\frac{p}{1-p}\ \ |_{x_j+1}\right)-\log\left(\frac{p}{1-p}\ \ |_{x_j}\right) = \beta_j$

• $$exp(\beta_j)$$ is the odds ratio $OR= \frac{\frac{p}{1-p}\ |_{x_j+1}} { \frac{p}{1-p} \ |_{x_j}} = exp(\beta_j)$
• When $$x_j$$ increases one unit, the odds are multiplied by $$exp(\beta_j)$$

$\frac{p}{1-p} |_{x_j+1} = \frac{p}{1-p} |_{x_j} \times exp(\beta_j)$
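A quick numerical check of this identity, using arbitrary illustrative values for the coefficients and $$x_j$$:

```r
# Check numerically that a one-unit increase in x multiplies the odds by exp(beta1)
invlogit <- function(x){1/(1 + exp(-x))}
beta0 <- -0.5; beta1 <- 0.8; x <- 1.3        # arbitrary illustrative values
p     <- invlogit(beta0 + beta1 * x)         # p at x
pstar <- invlogit(beta0 + beta1 * (x + 1))   # p* at x + 1
(pstar/(1 - pstar)) / (p/(1 - p))            # odds ratio
exp(beta1)                                   # the same number
```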

## 4.1 Example binomial (1/0)

Example: response variable disease or no disease.

• We use variable $$Y = 1$$ if disease, $$Y = 0$$ if no disease
• $$p$$ is the probability that the response variable $$Y = 1$$ (disease)
• $$\frac{p}{1-p}$$ is the odds of disease

Consider the logistic model,

$Y \sim Bernoulli(p)$

$\mbox{logit}(p)=\log(\mbox{odds})=\log\left(\frac{p}{1-p}\right)=\beta_0 + \beta_1 x_{1}$

If we increase $$x_1$$ by one unit,

$\log\left(\frac{p^*}{1-p^*}\right)=\beta_0 + \beta_1 (x_{1}+1)$

$\log\left(\frac{p^*}{1-p^*}\right) - \log\left(\frac{p}{1-p}\right) = \beta_1$

$\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} = exp(\beta_1)$

Therefore

• $$\beta_1$$ is the increase or decrease in the log-odds associated with one-unit increase in $$x_1$$

$\log\left(\frac{p}{1-p}\ \ |_{x_1+1}\right)-\log\left(\frac{p}{1-p}\ \ |_{x_1}\right) = \beta_1$

• $$exp(\beta_1)$$ is the odds ratio $OR= \frac{\frac{p}{1-p}\ |_{x_1+1}} { \frac{p}{1-p} \ |_{x_1}} = exp(\beta_1)$
• When $$x_1$$ increases one unit, the odds are multiplied by $$exp(\beta_1)$$

$\frac{p}{1-p} |_{x_1+1} = \frac{p}{1-p} |_{x_1} \times exp(\beta_1)$

## 4.2 Example binomial (1/0)

In a Childhood Cancer case-control study, cases were ascertained as all children under ten years of age in England and Wales who died of cancer during the period 1954-65.

Exposure variable was whether or not the children received in utero irradiation (XRAY), as reported by the mother.

The year of birth (YOB $$< 1954$$, YOB $$\geq 1954$$) was considered as a confounder of the association between XRAY and cancer.

Let $$p$$ = the probability of being a case in the case-control study

• XRAY=1 if the child received in utero irradiation and 0 if not
• BINYOB=1 if YOB $$\geq 1954$$ and 0 if YOB $$< 1954$$

Model:

$Y \sim Bernoulli(p)$

$logit(p)=log(odds)=\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 XRAY + \beta_2 BINYOB$

$\log\left(\frac{p}{1-p}\right)= -0.061 + 0.501 XRAY + (-0.005)BINYOB$

• 0.501 is the difference in log-odds of cancer between a child with XRAY=1 and a child with XRAY=0, with the same BINYOB.
• -0.005 is the difference in log-odds of cancer between a child with BINYOB=1 and a child with BINYOB=0, with the same XRAY.
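On the odds-ratio scale these coefficients translate as follows (exponentiating the fitted values reported above):

```r
# Exponentiate the reported log-odds coefficients to obtain odds ratios
beta <- c(XRAY = 0.501, BINYOB = -0.005)
round(exp(beta), 2)
##   XRAY BINYOB
##   1.65   1.00
```

So in utero irradiation is associated with roughly a 65% increase in the odds of being a case, while the year of birth has essentially no effect.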


## 4.3 Example binomial (grouped)

Collett (1991) describes an experiment on the toxicity of the pyrethroid trans-cypermethrin to the tobacco budworm. Batches of 20 moths of each sex were exposed to varying doses of the pyrethroid for three days and the number knocked out in each batch was recorded.

ldose <- rep(0:5, 2)
dead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))

cbind(dead, ldose, sex)
##       dead ldose sex
##  [1,]    1     0   2
##  [2,]    4     1   2
##  [3,]    9     2   2
##  [4,]   13     3   2
##  [5,]   18     4   2
##  [6,]   20     5   2
##  [7,]    0     0   1
##  [8,]    2     1   1
##  [9,]    6     2   1
## [10,]   10     3   1
## [11,]   12     4   1
## [12,]   16     5   1

Generalized linear models can be fitted in R using the glm() function, which is similar to the lm() function for fitting linear models.

Binomial responses can be specified as follows:

• a numeric 0/1 vector (0 = failure), a logical vector (FALSE = failure), or a factor (first level = failure)
• a two-column matrix with the number of successes and the number of failures

The logit link is the default for the binomial family so does not need to be specified.

y <- cbind(dead, 20-dead)

parr <- glm(y ~ ldose + sex, family = binomial)
summary(parr)
##
## Call:
## glm(formula = y ~ ldose + sex, family = binomial)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -1.10540  -0.65343  -0.02225   0.48471   1.42944
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -3.4732     0.4685  -7.413 1.23e-13 ***
## ldose         1.0642     0.1311   8.119 4.70e-16 ***
## sexM          1.1007     0.3558   3.093  0.00198 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 124.8756  on 11  degrees of freedom
## Residual deviance:   6.7571  on  9  degrees of freedom
## AIC: 42.867
##
## Number of Fisher Scoring iterations: 4

$Y_i \sim Binomial(n_i, p_i)$

$\log\left(\frac{p_i}{1-p_i}\right)= -3.47 + 1.06 \times \mbox{ldose}_i + 1.10 \times \mbox{sexM}_i$

Therefore

• the odds of death for a male moth are exp(1.10) = 3.01 times that for a female moth, given a fixed dose of the pyrethroid.
• the odds of death are multiplied by exp(1.06) = 2.90 for each one-unit increase in ldose, for male or female moths.
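These odds ratios can be read off the fit directly by exponentiating the coefficients (refitting the budworm model from above, so the block is self-contained):

```r
# Refit the budworm model and convert log-odds coefficients to odds ratios
ldose <- rep(0:5, 2)
dead  <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex   <- factor(rep(c("M", "F"), c(6, 6)))
parr  <- glm(cbind(dead, 20 - dead) ~ ldose + sex, family = binomial)
round(exp(coef(parr)), 2)   # ldose ~ 2.90, sexM ~ 3.01
```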

# 5 Modeling Poisson data (count data)

Count data, in which there is no upper limit to the number of counts, usually fall into two types:

• rates: counts per unit of time/area/distance, etc.
• contingency tables: counts cross-classified by categorical variables

Examples include

• number of household burglaries in a city in a given year
• number of customers served by a salesperson in a given month
• number of train accidents in a given year

In such situations, the counts can be assumed to follow a Poisson distribution

$Y_i \sim Po(\mu_i)$

Then

$E(Y_i)=\mu_i$

$Var(Y_i)=\mu_i$

The link function must map from $$(0, \infty)$$ to $$(-\infty, \infty)$$. A natural choice is the log link

$\eta_i = g(\mu_i)=\log(\mu_i)$

x <- seq(0.1, 100, 1)
plot(x, log(x), type = "l", xlab = "x", ylab = "log")

x <- seq(-2, 5, 0.1)
plot(x, exp(x), type = "l", xlab = "log", ylab = "x")

A Poisson GLM with the log link is known as a log-linear model.

$Y \sim Po(\mu)$

$\log(\mu) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j x_j + \ldots + \beta_p x_p$

If we increase $$x_j$$ by one unit

$\log(\mu^*) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j (x_j+1) + \ldots + \beta_p x_p$

$\log(\mu^*) - \log(\mu) = \beta_j$

$\log\left(\frac{\mu^*}{\mu}\right) = \beta_j$

$\frac{\mu^*}{\mu}= exp(\beta_j)$

$\mu^* = exp(\beta_j) \mu$

Therefore,

• $$\beta_j$$ is the increase or decrease in the log of the mean associated with one-unit increase in $$x_j$$, holding all other variables constant

$\log\left(\mu\ \ |_{x_j+1}\right)-\log\left(\mu\ \ |_{x_j}\right) = \beta_j$

• $$exp(\beta_j)$$ is the ratio of means $\frac{\mu\ |_{x_j+1}} {\mu \ |_{x_j}} = exp(\beta_j)$
• When $$x_j$$ increases one unit, the mean is multiplied by $$exp(\beta_j)$$

$\mu |_{x_j+1} = \mu |_{x_j} \times exp(\beta_j)$
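The multiplicative effect on the mean can be checked numerically, just as for the logistic model (arbitrary illustrative coefficients):

```r
# Check: a one-unit increase in x multiplies the Poisson mean by exp(beta1)
beta0 <- 0.2; beta1 <- 0.35; x <- 2      # arbitrary illustrative values
mu     <- exp(beta0 + beta1 * x)         # mean at x
mustar <- exp(beta0 + beta1 * (x + 1))   # mean at x + 1
mustar / mu                              # equals exp(beta1)
```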

## 5.1 Modeling rates

In many cases we are making comparisons across observation units $$i = 1, \ldots, n$$ with different levels of exposure to the event and hence the measure of interest is the rate of occurrence, e.g.

• number of household burglaries per 10,000 households in city $$i$$ in a given year
• number of customers served per hour by salesperson $$i$$ in a given month
• number of train accidents per billion train-kilometers in year $$i$$

Since the counts are Poisson distributed, we would like to use a GLM to model the expected rate, $$\lambda_i=\mu_i/t_i$$, where $$t_i$$ is the exposure for unit $$i$$.

$Y_i \sim Po(\mu_i)$

$\mu_i = \lambda_i t_i$

$\log(\mu_i/t_i)= \log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}$

$\log(\mu_i)= \log(t_i) + \log(\lambda_i) = \log(t_i) + \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}$

The standardizing term $$\log(t_i)$$ is an example of an offset: a term with a fixed coefficient of 1.

Offsets are easily specified to glm(), either using the offset argument or using the offset function in the formula, e.g. offset(log(time)).

If all the observations have the same exposure, the model does not need an offset term and we can model $$\log(\mu_i)$$ directly.
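The two ways of specifying the offset give identical fits. A small sketch on simulated rate data (the variables y, x, and exposure t are invented for illustration):

```r
# Simulate counts whose rate depends on x and whose mean scales with exposure t
set.seed(1)
n <- 200
x <- runif(n)
t <- runif(n, 1, 10)                            # exposures
y <- rpois(n, lambda = exp(0.3 + 0.7 * x) * t)  # mu_i = lambda_i * t_i
fit1 <- glm(y ~ x, family = poisson, offset = log(t))
fit2 <- glm(y ~ x + offset(log(t)), family = poisson)
all.equal(coef(fit1), coef(fit2))               # TRUE: identical fits
```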

• $$\beta_j$$ is the increase or decrease in the log of the rate associated with one-unit increase in $$x_j$$, holding all other variables constant

$\log\left(\lambda\ \ |_{x_j+1}\right)-\log\left(\lambda\ \ |_{x_j}\right) = \beta_j$

• $$exp(\beta_j)$$ is the rate ratio $\frac{\lambda\ |_{x_j+1}} {\lambda \ |_{x_j}} = exp(\beta_j)$
• When $$x_j$$ increases one unit, the rate is multiplied by $$exp(\beta_j)$$

$\lambda |_{x_j+1} = \lambda |_{x_j} \times exp(\beta_j)$

## 5.2 Example Poisson

The ships data from the MASS package concern a type of damage caused by waves to the forward section of cargo-carrying vessels. The variables are

• incidents: number of damage incidents
• service: aggregate months of service
• period: period of operation, 1960-74 or 1975-79
• year: year of construction, 1960-64, 1965-69, 1970-74, or 1975-79
• type: ship type, A to E

Here it makes sense to model the expected number of incidents per aggregate months of service.

Let us consider a log-linear model including all the variables. We convert the period and year variables to factors.

library(MASS)
data(ships)
# Model non-zero counts. We exclude ships with 0 months of service
ships2 <- subset(ships, service > 0)
ships2$year <- as.factor(ships2$year)
ships2$period <- as.factor(ships2$period)

res <- glm(incidents ~ type + year + period, family = "poisson", data = ships2, offset = log(service))
summary(res)
##
## Call:
## glm(formula = incidents ~ type + year + period, family = "poisson",
##     data = ships2, offset = log(service))
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.6768  -0.8293  -0.4370   0.5058   2.7912
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.40590    0.21744 -29.460  < 2e-16 ***
## typeB       -0.54334    0.17759  -3.060  0.00222 **
## typeC       -0.68740    0.32904  -2.089  0.03670 *
## typeD       -0.07596    0.29058  -0.261  0.79377
## typeE        0.32558    0.23588   1.380  0.16750
## year65       0.69714    0.14964   4.659 3.18e-06 ***
## year70       0.81843    0.16977   4.821 1.43e-06 ***
## year75       0.45343    0.23317   1.945  0.05182 .
## period75     0.38447    0.11827   3.251  0.00115 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 146.328  on 33  degrees of freedom
## Residual deviance:  38.695  on 25  degrees of freedom
## AIC: 154.56
##
## Number of Fisher Scoring iterations: 5

An increase of one unit in the $$x_j$$ variable is associated with a multiplicative change in the rate by $$exp(\beta_j)$$ (holding the rest of the variables constant).
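For the ships model above, these rate ratios are obtained by exponentiating the fitted coefficients (refitting for completeness):

```r
# Refit the ships model and convert coefficients to rate ratios
library(MASS)
ships2 <- subset(ships, service > 0)
ships2$year   <- as.factor(ships2$year)
ships2$period <- as.factor(ships2$period)
res <- glm(incidents ~ type + year + period, family = "poisson",
           data = ships2, offset = log(service))
round(exp(coef(res)), 2)   # e.g. period75 ~ 1.47: ~47% higher rate in 1975-79
```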

# 6 Modeling Normal data

The normal linear model is a special case of a GLM, with

Normal observations

$Y_i \sim N(\mu_i, \sigma^2)$

linear predictor

$\eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}$

and identity link function

$g(\mu_i) = \mu_i$

## 6.1 Example Normal data

Griffiths, Hill and Judge (1993) present a dataset on food expenditure for households that have three family members. We consider two variables, the expenditure on food and the household income:

d <- read.table("data/GHJ_food_income.txt", header = TRUE)
plot(d$Income, d$Food, xlab = "Weekly Household Income ($)", ylab = "Weekly Household Expenditure on Food ($)") We first fit the model using lm(), then compare to the results using glm(). The default family for glm() is "gaussian" so the arguments of the call are unchanged.

foodLM <- lm(Food ~ Income, data = d)
summary(foodLM)
##
## Call:
## lm(formula = Food ~ Income, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.50837 -0.15781 -0.00536  0.18789  0.49142
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.409418   0.161976  14.875  < 2e-16 ***
## Income      0.009976   0.002234   4.465 6.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2766 on 38 degrees of freedom
## Multiple R-squared:  0.3441, Adjusted R-squared:  0.3268
## F-statistic: 19.94 on 1 and 38 DF,  p-value: 6.951e-05
foodGLM <- glm(Food ~ Income, data = d)
summary(foodGLM)
##
## Call:
## glm(formula = Food ~ Income, data = d)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -0.50837  -0.15781  -0.00536   0.18789   0.49142
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.409418   0.161976  14.875  < 2e-16 ***
## Income      0.009976   0.002234   4.465 6.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.07650739)
##
##     Null deviance: 4.4325  on 39  degrees of freedom
## Residual deviance: 2.9073  on 38  degrees of freedom
## AIC: 14.649
##
## Number of Fisher Scoring iterations: 2

### 6.1.1 Coefficients

The estimated coefficients are unchanged. t-tests test the significance of each coefficient in the presence of the others.

### 6.1.2 Dispersion parameter

The dispersion parameter $$\phi (= \sigma^2)$$ for the Gaussian family is equal to the residual variance $$\hat \sigma^2 = \sum (y_i -\hat y_i)^2/(n-p-1)$$.

summary(foodLM)$sigma
##  0.2765997
summary(foodLM)$sigma^2
##  0.07650739
summary(foodGLM)$dispersion
##  0.07650739

### 6.1.3 Deviance

The deviance of a model is defined through the difference of the log-likelihoods between the saturated model (number of parameters equal to the number of observations) and the fitted model:

$D = 2 \left( l({\boldsymbol{\hat\beta_{sat}}}) - l(\boldsymbol{\hat \beta}) \right) \phi,$

where $$l({\boldsymbol{\hat\beta_{sat}}})$$ is the log-likelihood evaluated at the MLE of the saturated model, and $$l(\boldsymbol{\hat \beta})$$ is the log-likelihood evaluated at the MLE of the fitted model.

A related quantity is the scaled deviance

$D^* = D/\phi = 2 \left( l({\boldsymbol{\hat\beta_{sat}}})- l(\boldsymbol{\hat \beta}) \right)$

If $$\phi=1$$ such as in the binomial or Poisson regression models, then both the deviance and the scaled deviance are the same.

The scaled deviance has an asymptotic chi-squared distribution, with degrees of freedom equal to the difference in the number of parameters between the two models:

$D^* \sim \chi^2_{n-p-1}$

Example:

For linear regression with Gaussian data, the deviance is equal to the residual sum of squares RSS: $$\sum (y_i - \hat y_i)^2$$.

In the case of Gaussian model:

$D^* = \frac{RSS}{\sigma^2} = \frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{\sigma^2} \mbox{ is distributed as a } \chi^2_{n-p-1}$
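This equality is easy to verify in R on simulated data:

```r
# For a Gaussian GLM the deviance is exactly the residual sum of squares
set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- glm(y ~ x)                                 # gaussian family, identity link
all.equal(deviance(fit), sum(residuals(fit)^2))   # TRUE
```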

Deviance residuals

The deviance can be written as a sum of contributions from each observation

$D = \sum_{i=1}^n d_i$

where $$d_i$$ is the component of the deviance contributed by the $$i$$th datum.

In the R output, a five-number summary of the deviance residuals is given. Since the response is assumed to be normally distributed these are the same as the residuals returned from lm().

Null deviance and Residual deviance

The R output shows the deviance of two models:

• Null deviance

The null deviance is the deviance between the saturated model and the null model (model with just an intercept). Degrees of freedom are the number of data points $$n$$ minus 1 for the intercept.

• Residual Deviance

The residual deviance is the deviance between the saturated model and the fitted model. It has $$n - p - 1$$ degrees of freedom, where $$p$$ is the number of explanatory variables and 1 comes from the intercept.
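The drop from the null deviance to the residual deviance measures what the explanatory variables contribute, and can be tested against a chi-squared distribution. A sketch on simulated Poisson data:

```r
# Compare null and residual deviance with a chi-squared test
set.seed(3)
x <- rnorm(100)
y <- rpois(100, exp(0.5 + 0.4 * x))
fit <- glm(y ~ x, family = poisson)
fit$null.deviance - fit$deviance   # drop in deviance (1 df, for x)
anova(fit, test = "Chisq")         # the same comparison as a formal test
```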

### 6.1.4 Akaike Information Criterion (AIC)

The AIC is a measure of fit that penalizes for the number of parameters

$AIC = -2 l + 2(p+1)$

where $$l$$ is the maximized log-likelihood of the model and $$p+1$$ is the number of model parameters including the intercept.

Smaller values indicate better fit and thus the AIC can be used to compare models (not necessarily nested).
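For example, comparing two non-nested Poisson models on simulated data where only one of the predictors matters:

```r
# AIC comparison of two (non-nested) candidate models
set.seed(4)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- rpois(100, exp(0.2 + 0.6 * x1))   # the true model uses x1 only
m1 <- glm(y ~ x1, family = poisson)
m2 <- glm(y ~ x2, family = poisson)
AIC(m1, m2)                             # m1 has the smaller AIC
```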