23  Inference

Learning objectives
  • Inference for GLMs
  • Fitting generalized linear models
  • Estimation using iteratively re-weighted least squares (IRLS)

24 Inference for GLMs

In general, for Generalized Linear Models (GLMs) there is no closed form solution for the maximum likelihood estimator \(\hat{\boldsymbol{\beta}}\).

Fisher’s scoring method can be used to obtain the MLE \(\hat{\boldsymbol{\beta}}\): it maximizes the log-likelihood \(l(\boldsymbol{\beta}|\boldsymbol{y})\) over \(\boldsymbol{\beta}\) and leads to an iteratively re-weighted least squares algorithm. For sufficiently large samples, \(\hat{\boldsymbol{\beta}}\) is approximately normal with mean \(\boldsymbol{\beta}\) and a variance-covariance matrix that can be approximated by the inverse of the Fisher information matrix evaluated at the estimate, \({I}^{-1}(\hat{\boldsymbol{\beta}})\).

Large sample distribution of \(\boldsymbol{\hat \beta}\):

\[\boldsymbol{\hat \beta} \sim N(\boldsymbol{\beta}, (\boldsymbol{X' W X})^{-1} \phi)\]

For distributions with known scale parameter \(\phi\), this result can be used directly to find confidence intervals for the parameters.

If the scale parameter is unknown (e.g., for the normal distribution), then it must be estimated and intervals must be based on an appropriate \(t\) distribution.
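For instance, for a Poisson response \(\phi = 1\) is known, and 95% Wald intervals are \(\hat\beta_j \pm 1.96\,\widehat{se}(\hat\beta_j)\). A minimal R sketch using simulated data (the data and object names here are illustrative only):

# simulate a Poisson GLM with log link (phi = 1 known)
set.seed(1)
x <- runif(50)
y <- rpois(50, lambda = exp(1 + 2*x))
fit <- glm(y ~ x, family = poisson)
# 95% Wald intervals: estimate +/- z * standard error
est <- coef(fit)
se <- sqrt(diag(vcov(fit)))
cbind(lower = est - qnorm(0.975)*se, upper = est + qnorm(0.975)*se)
# confint.default() gives the same normal-approximation intervals
confint.default(fit)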

25 Iteratively re-weighted least squares (IRLS)

Iteratively re-weighted least squares (IRLS) algorithm

  • Wood (2006) Generalized Additive Models: An Introduction with R (Chapman & Hall/CRC Texts in Statistical Science)

  • Section 3.1.2, page 105: Generalized Additive Models, Chapter 3 (PDF)

  • Note: the last formula on page 106 is missing the transpose on \(X\); it should read \((X'WX)^{-1}X'Wz\)

The log-likelihood function for \(Y\) from the exponential family is

\[l(\boldsymbol{\beta}) = \sum_{i=1}^n \left(\frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right),\]

where the dependence of the right hand side on \(\beta\) is through the dependence of \(\theta_i\) on \(\beta\). \[E[Y] = \mu = b'(\theta),\ \ \ b'^{-1}(\mu)=\theta,\ \ \ g(\mu) = \eta = X \beta\]
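As a worked instance (not part of the general derivation above), take the binomial count response used later, \(Y \sim Binomial(n, \pi)\) with \(\mu = n\pi\):

\[\log f(y) = y\,\theta - n\log(1+e^{\theta}) + \log\binom{n}{y}, \qquad \theta = \log\frac{\pi}{1-\pi},\ \ a(\phi)=1,\]

so \(b(\theta) = n\log(1+e^{\theta})\), \(b'(\theta) = n\,e^{\theta}/(1+e^{\theta}) = n\pi = \mu\), and the canonical link is \(\theta = b'^{-1}(\mu) = \log\frac{\mu}{n-\mu}\), the logit.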

We estimate \(\beta\) by maximizing the log-likelihood.

The Fisher scoring technique is used to maximize the GLM log-likelihood over \(\beta\). The basic step is

\[\beta^{(k+1)} = \beta^{(k)} + I(\beta^{(k)})^{-1} \frac{\partial l(\beta^{(k)})}{\partial \beta}\]

The Fisher information is minus the expected Hessian (the matrix of second derivatives of the log-likelihood): \(I(\beta) = - E(H(\beta)) = - E\left(\frac{\partial^2 l}{\partial \beta \partial \beta'}\right) = X'WX/\phi\)

Fisher scoring simplifies to

\[\beta^{(k+1)} = (X' W X)^{-1} X' W z\]

where \(W\) is a diagonal matrix with

\[W_{ii} = 1/ (g'(\hat \mu_i)^2 V(\hat \mu_i))\]

and

\[z_i = g'(\hat \mu_i)(y_i - \hat \mu_i) + \hat \eta_i\]

Both \(W_{ii}\) and \(z_i\) are computed using \(\beta^{(k)}\) and the corresponding values of \(\eta_i^{(k)}\) and \(\mu_i^{(k)}\).
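To see why the scoring step reduces to weighted least squares (a short intermediate derivation, assuming \(a_i(\phi) = \phi\), with \(W\) and \(z\) evaluated at \(\beta^{(k)}\)): the \(i\)-th component of \(z - X\beta^{(k)}\) is \(g'(\mu_i^{(k)})(y_i - \mu_i^{(k)})\), so the score can be written as

\[\frac{\partial l(\beta^{(k)})}{\partial \beta} = \frac{1}{\phi}\, X' W (z - X\beta^{(k)}).\]

Substituting this and \(I(\beta^{(k)}) = X'WX/\phi\) into the basic step gives

\[\beta^{(k+1)} = \beta^{(k)} + (X'WX)^{-1} X'W(z - X\beta^{(k)}) = (X'WX)^{-1} X'W z.\]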

26 IRLS

GLMs can be estimated by the iteratively re-weighted least squares (IRLS) algorithm as follows.

  • Initialize \(\hat \mu_i = y_i + \delta_i\) and \(\hat \eta_i = g(\hat \mu_i)\) where \(\delta_i\) is usually zero, but may be a small constant ensuring that \(\hat \eta_i\) is finite (for example to avoid taking the logarithm of zero).

  • Iterate the following two steps until convergence:

  • Compute pseudodata \[z_i = g'(\hat \mu_i) (y_i - \hat \mu_i) + \hat \eta_i\] and iterative weights \[w_i = 1/ (g'(\hat \mu_i)^2 V(\hat \mu_i))\]

  • Find \(\hat{\boldsymbol{\beta}}\), the minimizer of the weighted least squares objective

\[\sum_{i=1}^n w_i (z_i - \boldsymbol{X}_i \boldsymbol{\beta})^2,\] that is, the weighted least-squares estimate \[\hat{\boldsymbol{\beta}} = (\boldsymbol{X' W X})^{-1} \boldsymbol{X' W z},\] and then update \(\hat{\boldsymbol{\eta}}=\boldsymbol{X}\hat{\boldsymbol{\beta}}\) and \(\hat \mu_i = g^{-1}(\hat \eta_i)\).

Here, \(\boldsymbol{W}\) is the diagonal matrix with the weights \(w_i\) on its diagonal.

The procedure is repeated until successive estimates change by less than a specified small amount.
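A minimal sketch of such a loop with an explicit convergence check, written for a binomial count response with the logit link as in the example below (the function name, initialization shift, and tolerance are illustrative):

irls_binomial <- function(X, y, n, tol = 1e-8, maxit = 25) {
  # initialize pi and eta; the small shift keeps the logit finite if y = 0 or y = n
  pi <- (y + 0.5) / (n + 1)
  eta <- log(pi / (1 - pi))
  beta <- rep(0, ncol(X))
  for (k in 1:maxit) {
    # pseudodata and iterative weights (phi = 1 for the binomial)
    z <- eta + (y - n*pi) / (n*pi*(1 - pi))
    w <- n*pi*(1 - pi)
    beta_new <- coef(lm(z ~ X - 1, weights = w))  # weighted least squares step
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
    # update eta and pi from the new estimate
    eta <- drop(X %*% beta)
    pi <- exp(eta) / (1 + exp(eta))
  }
  beta
}
# e.g., with the bliss data introduced below:
# irls_binomial(cbind(1, bliss$conc), bliss$dead, n = 30)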

26.1 Example: binomial response with R

An experiment measuring death rates for insects, with 30 insects at each of five treatment levels.

  • dead: number dead
  • alive: number alive
  • conc: concentration of insecticide
library(faraway)
data(bliss, package = "faraway")
head(bliss)
  dead alive conc
1    2    28    0
2    8    22    1
3   15    15    2
4   23     7    3
5   27     3    4

\[Y \sim Binomial(n, \pi)\] \[logit(\pi) = \beta_0 + \beta_1 \times \mbox{concentration}\]

glm()

res <- glm(cbind(dead, alive) ~ conc, family = binomial, bliss)
summary(res)$coef
             Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.323790  0.4178878 -5.560798 2.685438e-08
conc         1.161895  0.1814158  6.404598 1.507665e-10
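Using the large-sample normality of \(\hat{\boldsymbol{\beta}}\) (with \(\phi = 1\) for the binomial), a 95% Wald interval for the conc coefficient is \(1.162 \pm 1.96 \times 0.181\), roughly \((0.81, 1.52)\). confint.default() computes these normal-approximation intervals directly:

# Wald intervals: estimate +/- 1.96 * standard error
confint.default(res)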

IRLS

For a binomial response we have:

\(\eta = g(\mu) = \log(\frac{\pi}{1-\pi})=\log(\frac{\mu}{n-\mu}) = \log(\mu)-\log(n-\mu)\)

\(\frac{d \eta}{d \mu} = g'(\mu) =\frac{1}{\mu}+\frac{1}{n-\mu}=\frac{n}{\mu(n-\mu)}=\frac{1}{n \pi (1-\pi)}\)

The working dependent variable, which in general is

\[z = \eta + (y-\mu)\frac{d\eta}{d\mu}\] turns out to be

\[z = \eta + \frac{y-n\pi}{n\pi(1-\pi)}\]

The iterative weight turns out to be (\(\phi=1\))

\[w= 1/ (g'(\hat\mu)^2 V(\hat \mu)) = \frac{1}{n\pi(1-\pi)}(n\pi(1-\pi))^2 =n\pi(1-\pi)\]

# data
n <- 30
y <- bliss$dead
# initialize pi (= mu/n) and eta
pi <- y/n
eta <- logit(pi)  # logit() from the faraway package
# working data and iterative weights
z <- eta + (y-n*pi)/(n*pi*(1-pi))
w <- n*pi*(1-pi)
# estimate betas with WLS
res <- lm(z ~ conc, weights = w, bliss)
coef(res)
(Intercept)        conc 
  -2.302462    1.153587 
for(i in 1:5){
# update eta and pi (= mu/n)
eta <- res$fit
pi <- exp(eta)/(1+exp(eta))
# working data and iterative weights
z <- eta + (y-n*pi)/(n*pi*(1-pi))
w <- n*pi*(1-pi)
# estimate betas with WLS
res <- lm(z ~ bliss$conc, weights = w)
print(coef(res))
}
(Intercept)  bliss$conc 
  -2.323672    1.161847 
(Intercept)  bliss$conc 
  -2.323790    1.161895 
(Intercept)  bliss$conc 
  -2.323790    1.161895 
(Intercept)  bliss$conc 
  -2.323790    1.161895 
(Intercept)  bliss$conc 
  -2.323790    1.161895 

Estimates of variance may be obtained using standard likelihood theory from \[\hat Var(\hat{\boldsymbol{\beta}})= (\boldsymbol{X'W X})^{-1} \hat \phi\] where \(W\) is the diagonal matrix formed from the weights \(w\). For the binomial response \(\phi = 1\), so no scale estimate is needed here.

# standard errors
xm <- model.matrix(res)
wm <- diag(w)
sqrt(diag(solve(t(xm) %*% wm %*% xm)))
(Intercept)  bliss$conc 
  0.4178878   0.1814158
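As a check (a small illustrative snippet; fit is a new object name), the same standard errors can be read off the variance-covariance matrix of the glm() fit:

# standard errors from the glm() fit for comparison
fit <- glm(cbind(dead, alive) ~ conc, family = binomial, bliss)
sqrt(diag(vcov(fit)))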