x <- seq(0.1, 100, 1)
plot(x, log(x), type = "l", xlab = "x", ylab = "log")
20 Count, binary and normal data
- Generalized linear models (GLMs)
- Binary data
- Count data
- Normal data
21 The general linear model
In a general linear model, the response \(Y_i\), \(i = 1, \ldots, n\), is modeled as a linear function of explanatory variables \(x_{ji}\), \(j = 1, \ldots, p\), plus an error term:
\[Y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \epsilon_i,\]
where we assume the errors \(\epsilon_i\) are independent and identically distributed Gaussian with \(E[\epsilon_i] = 0,\ \ Var[\epsilon_i] = \sigma^2\).
Here general refers to the dependence on potentially more than one explanatory variable, versus the simple linear model:
\[Y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i\]
The model is linear in the parameters, e.g.,
\[Y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon_i\]
but not, e.g.,
\[Y_i = \beta_0 + \beta_1 x_1^{\beta_2} + \epsilon_i\]
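Linearity in the parameters is what matters computationally: the quadratic model above can still be fitted by ordinary least squares because the design matrix (columns \(1, x, x^2\)) is fixed. A minimal sketch in Python (illustrative only; the notes use R):

```python
import numpy as np

# Simulated data from y = 2 + 3*x + 0.5*x^2 (noise-free, for an exact check)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + 0.5 * x**2

# Linear in the parameters: the design matrix [1, x, x^2] is fixed,
# so ordinary least squares recovers the coefficients
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 6))  # approximately [2, 3, 0.5]
```

A model like \(\beta_0 + \beta_1 x^{\beta_2}\) has no such fixed design matrix and would require nonlinear optimization.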
Although the general linear model is a very useful framework, there are situations where it is not appropriate.
In a general linear model the response variable is continuous (e.g., BMI, systolic blood pressure), but in other situations the response variable may be binary or a count.
Generalized linear models (GLMs) extend the general linear model framework to address these issues.
22 Generalized linear models (GLMs)
A generalized linear model (GLM) makes the distributional assumptions that the \(Y_i\) are independent and
\(Y_i \sim\) distribution from exponential family (e.g., normal, Poisson, Binomial).
The link function describes how the mean of the response and a linear combination of the predictors are related
\[\eta_i = g(\mu_i) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}\]
- The link function describes how the mean of the response, \(E[Y_i]=\mu_i\) depends on the linear predictor: \(g(\mu_i) = \eta_i\)
- The variance function describes how the variance depends on the mean: \(Var(Y_i) = \phi V(\mu_i)\), where the dispersion parameter \(\phi\) is a constant
Here we will see how to model the following data:
- Poisson data
- Binomial data
- Gaussian data
23 Modeling Poisson data
Count data, for which there is no upper limit to the number of counts, may represent counts themselves or rates, which are counts per unit of time, area, distance, etc.
For example,
- number of household burglaries in a city in a given year
- number of customers served by a salesperson in a given month
- number of train accidents in a given year
In such situations, the counts can be assumed to follow a Poisson distribution
\[Y_i \sim Po(\mu_i)\] Then \[E(Y_i)=\mu_i\] \[Var(Y_i)=\mu_i\]
The link function must map from \((0, \infty)\) to \((-\infty, \infty)\). A natural choice is \[\eta_i = g(\mu_i)=\log(\mu_i)\]
The inverse of the log function is the exponential function, \(\exp(\log(\mu)) = \mu\), which maps real values to \((0, \infty)\).
x <- seq(-2, 5, 0.1)
plot(x, exp(x), type = "l", xlab = "log", ylab = "x")
A Poisson GLM with the log link is known as a log-linear model.
The canonical link function for Poisson data is the log link.
\[\log(\mu) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j x_j + \ldots + \beta_p x_p\]
If we increase \(x_j\) by one unit
\[\log(\mu^*) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j (x_j+1) + \ldots + \beta_p x_p\] \[\log(\mu^*) - \log(\mu) = \beta_j\]
\[\log\left(\frac{\mu^*}{\mu}\right) = \beta_j\]
\[\frac{\mu^*}{\mu}= exp(\beta_j)\]
\[\mu^* = exp(\beta_j) \mu\]
\(exp(\beta_j)\) is called the relative risk (risk relative to some baseline).
Therefore,
- \(\beta_j\) is the increase or decrease in the log of the mean associated with one-unit increase in \(x_j\), holding all other variables constant
\[\log\left(\mu\ \ |_{x_j+1}\right)-\log\left(\mu\ \ |_{x_j}\right) = \beta_j\]
- \(exp(\beta_j)\) is the ratio of the means \[\frac{\mu\ |_{x_j+1}} {\mu \ |_{x_j}} = exp(\beta_j)\]
- When \(x_j\) increases one unit, the mean is multiplied by \(exp(\beta_j)\)
\[\mu |_{x_j+1} = \mu |_{x_j} \times exp(\beta_j)\]
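These identities are straightforward to verify numerically. A small Python check with hypothetical coefficient values (not taken from any fitted model):

```python
import math

# Hypothetical log-linear model: log(mu) = b0 + b1 * x
b0, b1 = 1.5, -0.2

def mu(x):
    return math.exp(b0 + b1 * x)

# A one-unit increase in x multiplies the mean by exp(b1),
# regardless of the starting value of x
assert math.isclose(mu(4) / mu(3), math.exp(b1))
assert math.isclose(mu(11) / mu(10), math.exp(b1))
print(round(math.exp(b1), 4))  # 0.8187
```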
23.1 Example Poisson (counts)
Consider an experiment designed to evaluate the effectiveness of an anti-fungal chemical on plants. A total of 60 plant leaves infected with a fungus were randomly assigned to treatment with 0, 5, 10, 15, 20, or 25 units of the anti-fungal chemical, with 10 plant leaves for each amount of anti-fungal chemical. Following a two-week period, the leaves were studied under a microscope, and the number of infected cells was counted and recorded for each leaf.
d <- read.csv("data/glms-LeafInfectionData.txt")
head(d)
x y
1 0 30
2 0 57
3 0 23
4 0 118
5 0 212
6 0 132
Plot observed number of infected cells for each amount of anti-fungal treatment.
plot(d$x, d$y, xlab = "Amount of anti-fungal treatment", ylab = "Number of infected cells")
res <- glm(y ~ x, family = "poisson", data = d)
summary(res)
Call:
glm(formula = y ~ x, family = "poisson", data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.373003 0.032434 134.83 <2e-16 ***
x -0.162011 0.004933 -32.84 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2314.87 on 59 degrees of freedom
Residual deviance: 614.65 on 58 degrees of freedom
AIC: 849.53
Number of Fisher Scoring iterations: 5
\[\log(\mu) = 4.37 -0.16 \times x\]
For one unit increase in the amount of anti-fungal chemical, the log of the mean number of infected cells decreases by 0.16.
When the amount of anti-fungal chemical increases by one unit, the mean number of infected cells is multiplied by exp(-0.162011) = 0.85.
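The arithmetic behind these two statements can be reproduced from the coefficients in the summary output (a Python check of the numbers only):

```python
import math

# Estimated coefficients from the summary output above
b0, b1 = 4.373003, -0.162011

# Multiplicative effect on the mean of one extra unit of chemical
print(round(math.exp(b1), 2))  # 0.85

# Fitted mean number of infected cells for untreated leaves (x = 0)
print(round(math.exp(b0), 1))  # about 79.3
```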
Predictions
Now we predict the mean number of infected cells per leaf as a function of the amount of anti-fungal chemical.
We have \(\log(\mu) = \beta_0 + \beta_1 x\). That is, \(\mu = exp(\beta_0 + \beta_1 x)\).
Then, the mean number of infected cells for a leaf treated with \(x\) units of the anti-fungal chemical can be estimated by \(exp(\hat \beta_0 + \hat \beta_1 x)\).
# Predict at these chemical amounts
xamount <- seq(0, 25, by = 0.1)
# Coefficients
b <- coef(res)
# Predictions
predictions <- exp(b[1] + b[2]*xamount)
# Plot
plot(d$x, d$y, xlab = "Amount of anti-fungal treatment", ylab = "Number of infected cells")
lines(xamount, predictions, col = "red")
We can also use the function predict(object, newdata, type, ...) to obtain predictions (?predict.glm):
- object is the fitted glm object
- newdata is a data frame with the variables with which to predict. If omitted, the fitted linear predictors are used.
- type is the type of prediction. When type = "link", predictions are on the scale of the linear predictors (log in the case of Poisson data, log-odds in the case of Binomial data). When type = "response", predictions are on the scale of the response variable (mean in the case of Poisson data, probabilities in the case of Binomial data).
Type "response" (mean):
# Predict at these chemical amounts
xamount <- seq(0, 25, by = 0.1)
# Predictions
predictions <- predict(res, data.frame(x = xamount), type = "response")
# Plot
plot(d$x, d$y, xlab = "Amount of anti-fungal treatment", ylab = "Number of infected cells")
lines(xamount, predictions, col = "red")
Type "link"
(log(mean))
# Predict at these chemical amounts
xamount <- seq(0, 25, by = 0.1)
# Predictions
predictions <- predict(res, data.frame(x = xamount), type = "link")
# Plot
plot(xamount, predictions, col = "red", type = "l",
     xlab = "Amount of anti-fungal treatment", ylab = "log(number of infected cells)")
23.2 Exercise Poisson (counts)
Some female crabs attract many males while others do not attract any. The males that cluster around a female are called satellites. In order to understand what influences the presence of satellite crabs, researchers selected female crabs and collected data on the following characteristics:
- number of male satellites
- width of her carapace shell (in centimeters)
- condition of her spine (bad, good, middle)
- color of her shell (dark, darker, light, medium)
- weight of the female (in grams)
Interpret coefficients of the following model:
library("glmbb")
data("crabs")
# In the data, variable y denotes as.numeric(satell > 0)
dim(crabs)
[1] 173 6
head(crabs)
color spine width satell weight y
1 medium bad 28.3 8 3050 1
2 dark bad 22.5 0 1550 0
3 light good 26.0 9 2300 1
4 dark bad 24.8 0 2100 0
5 dark bad 26.0 4 2600 1
6 medium bad 23.8 0 2100 0
summary(crabs)
color spine width satell weight
dark :44 bad :121 Min. :21.0 Min. : 0.000 Min. :1200
darker:22 good : 37 1st Qu.:24.9 1st Qu.: 0.000 1st Qu.:2000
light :12 middle: 15 Median :26.1 Median : 2.000 Median :2350
medium:95 Mean :26.3 Mean : 2.919 Mean :2437
3rd Qu.:27.7 3rd Qu.: 5.000 3rd Qu.:2850
Max. :33.5 Max. :15.000 Max. :5200
y
Min. :0.0000
1st Qu.:0.0000
Median :1.0000
Mean :0.6416
3rd Qu.:1.0000
Max. :1.0000
res <- glm(satell ~ width + spine, family = "poisson", data = crabs)
summary(res)
Call:
glm(formula = satell ~ width + spine, family = "poisson", data = crabs)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.12469 0.56497 -5.531 3.19e-08 ***
width 0.15668 0.02096 7.476 7.69e-14 ***
spinegood 0.09899 0.10490 0.944 0.345
spinemiddle -0.10033 0.19309 -0.520 0.603
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 632.79 on 172 degrees of freedom
Residual deviance: 566.60 on 169 degrees of freedom
AIC: 929.9
Number of Fisher Scoring iterations: 6
23.3 Modeling rates
In many cases we are making comparisons across observation units \(i = 1, \ldots, n\) with different levels of exposure to the event and hence the measure of interest is the rate of occurrence. For example,
- number of household burglaries per 10,000 households in city \(i\) in a given year
- number of customers served per hour by salesperson \(i\) in a given month
- number of train accidents per billion train-kilometers in year \(i\)
Since the counts are Poisson distributed, we would like to use a GLM to model the expected rate, \(\lambda_i=\mu_i/t_i\), where \(t_i\) is the exposure for unit \(i\).
Typically, explanatory variables have a multiplicative rather than an additive effect on the expected rate; therefore, a suitable model is
\[Y_i \sim Po(\mu_i)\]
\[\mu_i = \lambda_i t_i\]
\[\log(\mu_i/t_i)= \log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{ip}\]
\[\log(\mu_i)= \log(t_i) + \log(\lambda_i) = \log(t_i) + \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{ip}\]
\(\log(t_i)\) is an offset: a term with a fixed coefficient of 1.
Offsets are easily specified to glm(), either using the offset argument (offset = log(time)) or using the offset function in the formula (+ offset(log(time))).
Examples of offsets. When modeling the number of disease cases in regions, the offset is the population of each region. When modeling the number of accidents in periods of time, the offset is the length of the period of time for each observation.
If all the observations have the same exposure, the model does not need an offset term and we can model \(\log(\mu_i)\) directly.
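The role of the offset can be checked with simple arithmetic: because \(\log(t_i)\) enters with a coefficient fixed at 1, doubling the exposure doubles the expected count but leaves the expected rate unchanged. A Python sketch with hypothetical coefficients:

```python
import math

# Hypothetical rate model: log(lambda) = b0 + b1 * x, exposure t as offset
b0, b1 = -2.0, 0.3

def expected_count(x, t):
    # log(mu) = log(t) + b0 + b1*x, i.e. mu = t * exp(b0 + b1*x)
    return t * math.exp(b0 + b1 * x)

# Doubling the exposure doubles the expected count...
assert math.isclose(expected_count(1.0, 2000), 2 * expected_count(1.0, 1000))
# ...while the expected rate mu/t is unchanged
assert math.isclose(expected_count(1.0, 2000) / 2000,
                    expected_count(1.0, 1000) / 1000)
print("offset checks pass")
```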
- \(\beta_j\) is the increase or decrease in the log of the expected rate associated with one-unit increase in \(x_j\), holding all other variables constant
\[\log\left(\lambda\ \ |_{x_j+1}\right)-\log\left(\lambda\ \ |_{x_j}\right) = \beta_j\]
- \(exp(\beta_j)\) \[\frac{\lambda\ |_{x_j+1}} {\lambda \ |_{x_j}} = exp(\beta_j)\]
- When \(x_j\) increases one unit, the expected rate is multiplied by \(exp(\beta_j)\)
\[\lambda |_{x_j+1} = \lambda |_{x_j} \times exp(\beta_j)\]
23.4 Example Poisson (rates)
The ships data from the MASS package contains the number of ship damage incidents caused by waves and the aggregate months of service, by ship type, year of construction, and period of operation. The variables are
- incidents: number of damage incidents
- service: aggregate months of service
- type: ship type, A to E
- year: year of construction: 1960-64, 65-69, 70-74, 75-79
- period: period of operation: 1960-74, 75-79
Here it makes sense to model the expected number of incidents per aggregate months of service. Let us consider a log-linear model including all the variables.
library(MASS)
Attaching package: 'MASS'
The following object is masked _by_ '.GlobalEnv':
crabs
data(ships)
# We exclude ships with 0 months of service
ships2 <- subset(ships, service > 0)
# Convert period and year to factor variables
ships2$year <- as.factor(ships2$year)
ships2$period <- as.factor(ships2$period)
summary(ships2)
type year period service incidents
A:7 60: 9 60:15 Min. : 45 Min. : 0.00
B:7 65:10 75:19 1st Qu.: 371 1st Qu.: 1.00
C:7 70:10 Median : 1095 Median : 4.00
D:7 75: 5 Mean : 4811 Mean :10.47
E:6 3rd Qu.: 2223 3rd Qu.:11.75
Max. :44882 Max. :58.00
res <- glm(incidents ~ type + year + period, family = "poisson", data = ships2, offset = log(service))
summary(res)
Call:
glm(formula = incidents ~ type + year + period, family = "poisson",
data = ships2, offset = log(service))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.40590 0.21744 -29.460 < 2e-16 ***
typeB -0.54334 0.17759 -3.060 0.00222 **
typeC -0.68740 0.32904 -2.089 0.03670 *
typeD -0.07596 0.29058 -0.261 0.79377
typeE 0.32558 0.23588 1.380 0.16750
year65 0.69714 0.14964 4.659 3.18e-06 ***
year70 0.81843 0.16977 4.821 1.43e-06 ***
year75 0.45343 0.23317 1.945 0.05182 .
period75 0.38447 0.11827 3.251 0.00115 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 146.328 on 33 degrees of freedom
Residual deviance: 38.695 on 25 degrees of freedom
AIC: 154.56
Number of Fisher Scoring iterations: 5
If \(x_j\) is continuous, an increase of one unit in \(x_j\) is associated with a multiplicative change in the expected rate by \(exp(\beta_j)\), holding the other variables constant. If we are interpreting the coefficient of a dummy variable created for a categorical variable, the interpretation is with respect to the reference category.
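As an illustration with the output above, exponentiating the period75 and typeB coefficients gives the corresponding rate ratios (Python used only to check the arithmetic):

```python
import math

# Coefficients from the ships summary output above
b_period75 = 0.38447
b_typeB = -0.54334

# Ships operated in 1975-79 have about 1.47 times the accident rate of
# those operated in 1960-74, holding type and year of construction fixed
print(f"{math.exp(b_period75):.2f}")  # 1.47

# Type B ships have about 0.58 times the rate of type A (the reference)
print(f"{math.exp(b_typeB):.2f}")  # 0.58
```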
24 Modeling Binomial data
Data may represent the number of successes in a given number of trials.
For example,
- Number of defective items in a set of 100 items
- Number of spam e-mails in 20 e-mails received
- Number of heads observed when tossing a coin, where the outcome can be heads or tails
We model such data with a \(\mbox{Binomial}(n, p)\) distribution. If \(n=1\), the response variable is a binary variable that can take one of two values, say success/failure.
Suppose
\[Y_i \sim Binomial(n_i, p_i),\ \ E(Y_i) = n_i p_i,\ \ Var(Y_i)= n_i p_i(1-p_i)\] and we wish to model the proportions \[Y_i/n_i\] Then
\[E(Y_i/n_i) = p_i\] \[Var(Y_i/n_i)= \frac{1}{n_i}p_i(1-p_i)\]
Odds are \(\frac{p}{1-p}\).
\(\mbox{logit}(p)\) is the log-odds, \(\log\left(\frac{p}{1-p}\right)\).
For example, if the probability of an event is \(p=0.2\),
the odds of an event occurring are
\[\mbox{odds} = \frac{p}{1-p}=\frac{0.2}{0.8}=0.25\] the log-odds of the event are
\[\log(\mbox{odds})=\log\left(\frac{p}{1-p} \right) = \log\left(\frac{0.2}{0.8}\right) = -1.39\]
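A quick numeric check of the odds and log-odds computed above:

```python
import math

p = 0.2
odds = p / (1 - p)
log_odds = math.log(odds)

print(round(odds, 2))      # 0.25
print(round(log_odds, 2))  # -1.39
```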
The link function must map from \((0, 1)\) to the real line \((-\infty, \infty)\). A common choice is the logit
\[\eta_i = g(p_i)=\mbox{logit}(p_i)=\log\left(\frac{p_i}{1-p_i}\right)\]
logit <- function(p){log(p/(1-p))}
p <- seq(0, 1, 0.01)
plot(p, logit(p), type = "l", xlab = "probability", ylab = "logit")
The inverse of the logit function is the sigmoid function, which maps real values to \((0, 1)\) and satisfies \(\mbox{invlogit}(\mbox{logit}(p)) = p\):
\[invlogit(x) = \frac{exp(x)}{1+exp(x)} = \frac{1}{1+exp(-x)}\]
invlogit <- function(x){exp(x)/(1 + exp(x))}
x <- seq(-4, 4, 0.1)
plot(x, invlogit(x), type = "l", xlab = "logit", ylab = "probability")
abline(h = 0.5)
A Binomial GLM with the logit link is known as a logistic regression model.
The canonical link function for Binomial data is the logit link.
\[logit(p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j x_j + \ldots + \beta_p x_p\]
If we increase \(x_j\) by one unit
\[logit(p^*) = \beta_0 + \beta_1 x_1 + \ldots + \beta_j (x_j+1) + \ldots + \beta_p x_p\]
\[logit(p^*) - logit(p) = \beta_j\]
\[\log\left(\frac{p^*}{1-p^*}\right) - \log\left(\frac{p}{1-p}\right) = \beta_j\]
\[\log\left(\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} \right) = \beta_j\] \[\mbox{Odds ratio (OR)}=\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} = exp(\beta_j)\]
Therefore,
- \(\beta_j\) is the increase or decrease in the log-odds associated with one-unit increase in \(x_j\), holding all other variables constant
\[\log\left(\frac{p}{1-p}\ \ |_{x_j+1}\right)-\log\left(\frac{p}{1-p}\ \ |_{x_j}\right) = \beta_j\]
- \(exp(\beta_j)\) is the odds ratio \[OR= \frac{\frac{p}{1-p}\ |_{x_j+1}} { \frac{p}{1-p} \ |_{x_j}} = exp(\beta_j)\]
- When \(x_j\) increases one unit, the odds are multiplied by \(exp(\beta_j)\)
\[\frac{p}{1-p} |_{x_j+1} = \frac{p}{1-p} |_{x_j} \times exp(\beta_j)\]
Link function
The logit link function \(g(p)=\log(p/(1-p))\) is the canonical link function (logit makes the linear predictor \(\eta\) the same as the canonical parameter \(\theta\)).
The canonical link is not the only choice of link function. In the Binomial setting another common choice is the probit link function: \(g(p) = \Phi^{-1}(p)\) where \(\Phi\) is the standard normal cumulative distribution function.
24.1 Example binomial (1/0)
Example: the response variable is disease or no disease.
- We use variable \(Y = 1\) if disease, \(Y = 0\) if no disease
- \(p\) is the probability that the response variable \(Y = 1\) (disease)
- \(\frac{p}{1-p}\) is the odds of disease
Consider the logistic model,
\[Y \sim Bernoulli(p)\]
\[\mbox{logit}(p)=\log(\mbox{odds})=\log\left(\frac{p}{1-p}\right)=\beta_0 + \beta_1 x_{1}\]
If we increase \(x_1\) by one unit,
\[\log\left(\frac{p^*}{1-p^*}\right)=\beta_0 + \beta_1 (x_{1}+1)\]
\[\log\left(\frac{p^*}{1-p^*}\right) - \log\left(\frac{p}{1-p}\right) = \beta_1\]
\[\frac{\frac{p^*}{1-p^*}}{\frac{p}{1-p}} = exp(\beta_1)\]
Therefore
- \(\beta_1\) is the increase or decrease in the log-odds associated with one-unit increase in \(x_1\)
\[\log\left(\frac{p}{1-p}\ \ |_{x_1+1}\right)-\log\left(\frac{p}{1-p}\ \ |_{x_1}\right) = \beta_1\]
- \(exp(\beta_1)\) is the odds ratio
\[OR= \frac{\frac{p}{1-p}\ |_{x_1+1}} { \frac{p}{1-p} \ |_{x_1}} = exp(\beta_1)\]
- When \(x_1\) increases one unit, the odds are multiplied by \(exp(\beta_1)\)
\[\frac{p}{1-p} |_{x_1+1} = \frac{p}{1-p} |_{x_1} \times exp(\beta_1)\]
24.2 Example binomial (1/0)
In a Childhood Cancer case-control study, cases were ascertained as all children under ten years in England and Wales who died of cancer during 1954-1965. Exposure variable was whether or not the children received in utero irradiation. The year of birth was considered as a confounder of the association between irradiation and cancer.
Let \(p\) be the probability of being a case in the case-control study, and define
- XRAY = 1 if the child received in utero irradiation and 0 if not
- BINYOB = 1 if YOB \(\geq 1954\) and 0 if YOB \(< 1954\)
\[logit(p) = \log(odds)=\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 XRAY + \beta_2 BINYOB\] \[\log(odds) = -0.061 + 0.501 XRAY + (-0.005)BINYOB\]
XRAY
0.501 is the increase in log(odds) of cancer for a person with XRAY = 1 versus another person with XRAY = 0, and with the same BINYOB.
The odds of cancer for a person with XRAY = 1 are exp(0.501) = 1.65 times the odds of cancer of a person with XRAY = 0, and with the same BINYOB.
BINYOB
0.005 is the decrease in log(odds) of cancer for a person with BINYOB = 1 versus another person with BINYOB = 0, and with the same XRAY.
The odds of cancer for a person with BINYOB = 1 are exp(-0.005) = 0.995 times the odds of cancer of a person with BINYOB = 0, and with the same XRAY.
24.3 Example binomial (grouped)
An experiment measures the toxicity of a pesticide to the tobacco budworm. Batches of 20 budworms of each sex were exposed to varying doses of the pesticide for three days and the number of budworms killed in each batch was recorded.
ldose <- rep(0:5, 2)
dead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))

data.frame(dead, ldose, sex)
dead ldose sex
1 1 0 M
2 4 1 M
3 9 2 M
4 13 3 M
5 18 4 M
6 20 5 M
7 0 0 F
8 2 1 F
9 6 2 F
10 10 3 F
11 12 4 F
12 16 5 F
Logistic regression models can be fitted with glm() specifying family = "binomial". The logit link is the default for the binomial family, so it does not need to be specified.
Binomial responses can be specified as follows:
- a numeric 0/1 vector (0 = failure), a logical vector (FALSE = failure), or a factor (first level = failure)
- a two-column matrix with the number of successes and the number of failures
y <- cbind(dead, 20-dead)
y
dead
[1,] 1 19
[2,] 4 16
[3,] 9 11
[4,] 13 7
[5,] 18 2
[6,] 20 0
[7,] 0 20
[8,] 2 18
[9,] 6 14
[10,] 10 10
[11,] 12 8
[12,] 16 4
y <- cbind(dead, 20-dead)

res <- glm(y ~ ldose + sex, family = "binomial")
summary(res)
Call:
glm(formula = y ~ ldose + sex, family = "binomial")
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4732 0.4685 -7.413 1.23e-13 ***
ldose 1.0642 0.1311 8.119 4.70e-16 ***
sexM 1.1007 0.3558 3.093 0.00198 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.8756 on 11 degrees of freedom
Residual deviance: 6.7571 on 9 degrees of freedom
AIC: 42.867
Number of Fisher Scoring iterations: 4
\[\log\left(\frac{p}{1-p}\right)= -3.47 + 1.06 \times \mbox{ldose} + 1.10 \times \mbox{sexM}\]
- the odds of death for a male budworm are exp(1.10) = 3.01 times that for a female budworm, given a fixed dose of the pesticide.
- the odds of death increase by a factor of exp(1.06) = 2.90 for each unit increase in ldose, for male or female budworms.
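Both exponentiated coefficients can be verified from the summary output (Python, arithmetic only):

```python
import math

# Coefficients from the budworm summary output above
b_ldose, b_sexM = 1.0642, 1.1007

print(f"{math.exp(b_sexM):.2f}")   # 3.01: male vs female odds ratio
print(f"{math.exp(b_ldose):.2f}")  # 2.90: odds multiplier per unit of ldose
```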
Predictions
We can use the function predict(object, newdata, type, ...) to obtain predictions (?predict.glm):
- object is the fitted glm object
- newdata is a data frame with the variables with which to predict. If omitted, the fitted linear predictors are used.
- type is the type of prediction. When type = "link", predictions are on the scale of the linear predictors (log in the case of Poisson data, log-odds in the case of Binomial data). When type = "response", predictions are on the scale of the response variable (mean in the case of Poisson data, probabilities in the case of Binomial data).
Type "response"
(probabilities):
# Predict at these dose levels
ld <- seq(0, 5, by = 0.1)
# Predict for males and dose levels seq(0, 5, by = 0.1)
newdataM <- data.frame(ldose = ld,
                       sex = factor(rep("M", length(ld)), levels = levels(sex)))
# Predict for females and dose levels seq(0, 5, by = 0.1)
newdataF <- data.frame(ldose = ld,
                       sex = factor(rep("F", length(ld)), levels = levels(sex)))
# Predictions for male and female
predictionsM <- predict(res, newdataM, type = "response")
predictionsF <- predict(res, newdataF, type = "response")
# Plot
plot(c(0, 5), c(0, 1), type = "n", xlab = "dose", ylab = "probability")
text(ldose, dead/20, as.character(sex))
lines(ld, predictionsM)
lines(ld, predictionsF)
Type "link"
(log-odds):
# Predict at these dose levels
ld <- seq(0, 5, by = 0.1)
# Predict for males and dose levels seq(0, 5, by = 0.1)
newdataM <- data.frame(ldose = ld,
                       sex = factor(rep("M", length(ld)), levels = levels(sex)))
# Predict for females and dose levels seq(0, 5, by = 0.1)
newdataF <- data.frame(ldose = ld,
                       sex = factor(rep("F", length(ld)), levels = levels(sex)))
# Predictions for male and female
predictionsM <- predict(res, newdataM, type = "link")
predictionsF <- predict(res, newdataF, type = "link")
# Plot
plot(c(0, 5), c(-4, 3), type = "n", xlab = "dose", ylab = "log(p/(1-p))")
lines(ld, predictionsM)
lines(ld, predictionsF)
25 Modeling Normal data
The general linear model is a special case of the GLM.
Normal observations
\[Y_i \sim N(\mu_i, \sigma^2)\] Linear predictor
\[\eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}\]
Link function is the identity
\[g(\mu_i) = \mu_i\]
25.1 Example Normal data
Data on weekly food expenditure and household income for households with three family members.
d <- read.table("data/GHJ_food_income.txt", header = TRUE)
dim(d)
[1] 40 2
plot(d$Income, d$Food, xlab = "Weekly Household Income ($)",
ylab = "Weekly Household Food Expenditure ($)")
We first fit the model using lm(), then compare to the results using glm(). The default family for glm() is "gaussian", so the arguments of the call are unchanged.
foodLM <- lm(Food ~ Income, data = d)
summary(foodLM)
Call:
lm(formula = Food ~ Income, data = d)
Residuals:
Min 1Q Median 3Q Max
-0.50837 -0.15781 -0.00536 0.18789 0.49142
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.409418 0.161976 14.875 < 2e-16 ***
Income 0.009976 0.002234 4.465 6.95e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2766 on 38 degrees of freedom
Multiple R-squared: 0.3441, Adjusted R-squared: 0.3268
F-statistic: 19.94 on 1 and 38 DF, p-value: 6.951e-05
foodGLM <- glm(Food ~ Income, data = d)
summary(foodGLM)
Call:
glm(formula = Food ~ Income, data = d)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.409418 0.161976 14.875 < 2e-16 ***
Income 0.009976 0.002234 4.465 6.95e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.07650739)
Null deviance: 4.4325 on 39 degrees of freedom
Residual deviance: 2.9073 on 38 degrees of freedom
AIC: 14.649
Number of Fisher Scoring iterations: 2
25.1.1 Coefficients
The estimated coefficients are the same. The t-tests assess the significance of each coefficient in the presence of the others.
25.1.2 Dispersion parameter
The dispersion parameter \(\phi \ (= \sigma^2)\) for the Gaussian family is estimated with the MSE \(\hat \sigma^2 = \sum (Y_i -\hat Y_i)^2/(n-p-1)\).
summary(foodLM)$sigma
[1] 0.2765997
summary(foodLM)$sigma^2
[1] 0.07650739
summary(foodGLM)$dispersion
[1] 0.07650739
25.1.3 Deviance
The deviance of a model is defined through the difference of the log-likelihoods between the saturated model (number of parameters equal to the number of observations) and the fitted model:
\[D = 2 \left( l({\boldsymbol{\hat\beta_{sat}}}) - l(\boldsymbol{\hat \beta}) \right) \phi,\]
where \(l({\boldsymbol{\hat\beta_{sat}}})\) is the log-likelihood evaluated at the MLE of the saturated model, and \(l(\boldsymbol{\hat \beta})\) is the log-likelihood evaluated at the MLE of the fitted model.
The scaled deviance is obtained by dividing the deviance by the dispersion parameter
\[D^* = D/\phi = 2 \left( l({\boldsymbol{\hat\beta_{sat}}})- l(\boldsymbol{\hat \beta}) \right)\]
If \(\phi=1\) such as in the binomial or Poisson regression models, then both the deviance and the scaled deviance are the same.
The scaled deviance has an asymptotic chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the saturated and fitted models.
\[D^* \sim \chi^2_{n-p-1}\]
Example
For Gaussian data, the deviance is equal to the residual sum of squares RSS.
\[D = RSS = \sum (Y_i - \hat Y_i)^2\]
In the Gaussian model:
\[D^* = \frac{D}{\phi} = \frac{RSS}{\sigma^2} = \frac{\sum_{i=1}^n(Y_i - \hat Y_i)^2}{\sigma^2} \mbox{ is distributed as a } \chi^2_{n-p-1}\]
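The identity between the deviance and the RSS can be confirmed numerically: the saturated Gaussian model has fitted values equal to the observations themselves, so the log-likelihood difference reduces to the residual sum of squares. A Python sketch (simulated data; any fixed \(\sigma^2\) gives the same deviance):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=x.size)

# Ordinary least squares fit of y on x
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
rss = float(np.sum((y - fitted) ** 2))

# Gaussian log-likelihood for given fitted means mu and variance sigma2
def loglik(mu, sigma2):
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                        - (y - mu) ** 2 / (2 * sigma2)))

# Saturated model: mu_i = y_i exactly
sigma2 = 0.09  # arbitrary fixed value; it cancels in the deviance
D = 2 * (loglik(y, sigma2) - loglik(fitted, sigma2)) * sigma2

assert np.isclose(D, rss)
print("deviance equals RSS:", np.isclose(D, rss))
```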
Null deviance and Residual deviance
The R output shows the deviance of two models:
Null deviance
The null deviance is defined through the difference of the log-likelihoods between the saturated model and the null model (the model with just an intercept). Degrees of freedom: \(n-1\), where \(n\) is the number of observations and 1 comes from the intercept.
Residual Deviance
The residual deviance is defined through the difference of the log-likelihoods between the saturated model and the fitted model. Degrees of freedom: \(n - p - 1\), where \(p\) is the number of explanatory variables and 1 comes from the intercept.
25.1.4 Akaike Information Criterion (AIC)
The AIC is a measure of fit that penalizes for the number of parameters \(p\)
\[AIC = -2 \mbox{log-likelihood} + 2(p+1)\] where \(\mbox{log-likelihood}\) is the maximized log likelihood of the model and \(p+1\) the number of model parameters including the intercept.
Smaller values indicate better fit and thus the AIC can be used to compare models (not necessarily nested).