21  Exponential family

21.1 Generalized linear models (GLMs)

Generalized linear models allow for response distributions other than normal, and for a degree of non-linearity in the model structure.

A GLM is defined by specifying two components:

  • The response distribution should be a member of the exponential family of distributions.
  • The link function describes how the mean of the response is related to a linear combination of the predictors.

Specifically, a GLM makes the distributional assumptions that the \(Y_i\) are independent and

\(Y_i \sim\) some exponential family distribution (e.g., normal, Poisson, Binomial, Gamma).

\[g(\mu_i) = \boldsymbol{X}_i' \boldsymbol{\beta}\]

  • \(\mu_i = E(Y_i)\)
  • \(g\) is a smooth monotonic link function
  • \(\boldsymbol{\beta}\) is a vector of unknown parameters
  • \(\boldsymbol{X}_i'=(X_{1i}, \ldots, X_{pi})\) is the \(i\)th row of a model matrix \(\boldsymbol{X}\).

Because GLMs are specified in terms of the linear predictor \(\boldsymbol{\eta} \equiv \boldsymbol{X \beta}\), many of the general ideas and concepts of linear modeling carry over to generalized linear modeling.
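
To make this concrete, here is a minimal sketch in Python (an assumption; the notes do not prescribe any software) that fits a Poisson GLM with the log link using statsmodels, so that \(g(\mu_i) = \log(\mu_i) = \boldsymbol{X}_i' \boldsymbol{\beta}\). The data and coefficients are purely illustrative.

```python
# A sketch, not part of the notes: Poisson GLM with log link via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_obs = 500
x = rng.uniform(0, 2, size=n_obs)
X = sm.add_constant(x)                 # model matrix: intercept + one predictor
beta_true = np.array([0.3, 0.8])       # illustrative coefficients (assumed)
mu = np.exp(X @ beta_true)             # inverse link applied to the linear predictor
y = rng.poisson(mu)                    # Poisson responses

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)                      # estimates close to beta_true
```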

21.2 Exponential family of distributions

In a GLM the distribution of the response variable \(Y_i\) comes from a distribution in the exponential family with probability density function

\[f(y_i;\theta_i,\phi)=\exp\left\{{y_i\theta_i -b(\theta_i)\over a_i(\phi)}+c(y_i,\phi)\right\}.\]

Here \(\theta_i\) is an unknown parameter that is a function of the mean, \(\phi\) is a dispersion parameter that may or may not be known, and \(a_i(\phi)\), \(b(\theta_i)\) and \(c(y_i, \phi)\) are known functions.

  • \(\theta_i\) is called the canonical parameter and represents the location
  • \(\phi\) is called the dispersion parameter and represents the scale

The exponential family includes many common distributions as special cases; three examples follow.

Note: in the Binomial distribution, \(\mu = n \pi\).

Normal or Gaussian

\[f(y) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y-\mu)^2}{2 \sigma^2}\right)= \exp\left( \frac{y \mu-\mu^2/2}{\sigma^2} - \frac{1}{2}\left( \frac{y^2}{\sigma^2} + \log(2 \pi \sigma^2) \right) \right)\]

So we can write

  • \(\theta = \mu\)
  • \(\phi = \sigma^2\)
  • \(a(\phi)= \phi\)
  • \(b(\theta) = \theta^2/2\)
  • \(c(y, \phi) = -(y^2/\phi+ \log(2\pi \phi))/2\)
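
As a quick numerical sanity check (a sketch in Python with numpy and scipy, both assumptions; \(\mu\) and \(\sigma^2\) are arbitrary), the exponential-family form with these choices of \(\theta\), \(a\), \(b\) and \(c\) reproduces the usual normal density:

```python
# A sketch: the exponential-family form of the normal density, checked
# against scipy.stats.norm.
import numpy as np
from scipy.stats import norm

mu, phi = 1.5, 0.7                     # theta = mu, phi = sigma^2
theta = mu
y = np.linspace(-3.0, 5.0, 9)

exp_family = np.exp((y * theta - theta**2 / 2) / phi
                    - 0.5 * (y**2 / phi + np.log(2 * np.pi * phi)))
print(np.allclose(exp_family, norm.pdf(y, loc=mu, scale=np.sqrt(phi))))  # True
```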

Poisson

\[f(y) = e^{-\mu}\mu^y/y! = \exp(y \log(\mu)-\mu-\log(y!))\] So we can write

  • \(\theta = \log(\mu)\)
  • \(\phi = 1\)
  • \(a(\phi)= 1\)
  • \(b(\theta) = \exp(\theta)\)
  • \(c(y, \phi) = -\log(y!)\)
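
The same kind of check for the Poisson (again a Python/scipy sketch with an arbitrary \(\mu\)); \(\log(y!)\) is computed via the log-gamma function:

```python
# A sketch: the exponential-family form of the Poisson pmf, checked against
# scipy.stats.poisson; log(y!) = gammaln(y + 1).
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln

mu = 3.2
theta = np.log(mu)                      # canonical parameter
y = np.arange(0, 12)

exp_family = np.exp(y * theta - np.exp(theta) - gammaln(y + 1))
print(np.allclose(exp_family, poisson.pmf(y, mu)))   # True
```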

Binomial

\[f(y) ={n \choose y} \pi^y (1-\pi)^{n-y} = \exp \left( y \log \pi + (n-y) \log(1-\pi) + \log {n \choose y} \right) = \exp \left( y \log \frac{\pi}{1-\pi} + n \log(1-\pi) + \log {n \choose y} \right)\]

So we can write

  • \(\theta = \log \frac{\pi}{1-\pi}\) (then \(\pi = \frac{e^\theta}{1+e^\theta}\) so \(1-\pi = \frac{1}{1+e^\theta}\))
  • \(b(\theta) = - n \log(1-\pi) = n \log(1+ \exp(\theta))\)
  • \(c(y, \phi) = \log{n \choose y}\)
  • \(\phi = 1\)
  • \(a(\phi)= 1\)
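
And for the Binomial (Python/scipy sketch; \(n\) and \(\pi\) are arbitrary):

```python
# A sketch: the exponential-family form of the Binomial pmf, checked against
# scipy.stats.binom.
import numpy as np
from scipy.stats import binom
from scipy.special import comb

n, pi = 8, 0.35
theta = np.log(pi / (1 - pi))           # canonical (logit) parameter
y = np.arange(0, n + 1)

exp_family = np.exp(y * theta - n * np.log1p(np.exp(theta)) + np.log(comb(n, y)))
print(np.allclose(exp_family, binom.pmf(y, n, pi)))   # True
```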

21.2.1 Mean and variance of the exponential family distributions

If \(Y_i\) has a distribution in the exponential family then it has mean and variance

\[E(Y_i) = b'(\theta_i)\] \[Var(Y_i) = b''(\theta_i) a_i(\phi),\]

where \(b'(\theta_i)\) and \(b''(\theta_i)\) are the first and second derivatives of \(b(\theta_i)\).

That is, the mean is a function of the location only, while the variance is a product of functions of the location and the scale.

Normal

  • \(\theta = \mu\)
  • \(\phi = \sigma^2\)
  • \(a(\phi)= \phi\)
  • \(b(\theta) = \theta^2/2\)

\[E(Y) = b'(\theta) = (\theta^2/2)' = \theta = \mu\] \[Var(Y) = b''(\theta) a(\phi) = (\theta)' a(\phi) = a(\phi) = \sigma^2\]

Poisson

  • \(\theta = \log(\mu)\)
  • \(\phi = 1\)
  • \(a(\phi)= 1\)
  • \(b(\theta) = \exp(\theta)\)

\[E(Y) = b'(\theta) = (\exp(\theta))' = \exp(\theta) = \mu\]

\[Var(Y) = b''(\theta) a(\phi) = (\exp(\theta))' a(\phi) = \exp(\theta) = \mu\]

Binomial

  • \(\theta = \log \frac{\pi}{1-\pi}\) (then \(\pi = \frac{e^\theta}{1+e^\theta}\) so \(1-\pi = \frac{1}{1+e^\theta}\))
  • \(b(\theta) = - n \log(1-\pi) = n \log(1+ \exp(\theta))\)
  • \(\phi = 1\)
  • \(a(\phi)= 1\)

\[E(Y) = b'(\theta) = ( n \log(1+ \exp(\theta)))' = n \frac{\exp(\theta)}{1+\exp(\theta)} = n \pi = \mu\] Differentiating again, using the quotient rule \(\left( (g/h)' = (g'h-gh')/h^2 \right)\),

\[Var(Y) = b''(\theta) a(\phi) = \left(n \frac{e^\theta}{1+e^\theta }\right)' = n \frac{e^{\theta}}{(1+e^{\theta})^2} = n \frac{e^{\theta}}{(1+e^{\theta})} \frac{1}{(1+e^{\theta})} = n \pi(1-\pi) = \mu (1 - \mu/n)\]
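
These results can also be checked numerically; the sketch below (Python/numpy, an assumption) differentiates \(b(\theta) = n \log(1+e^\theta)\) by finite differences and recovers \(n\pi\) and \(n\pi(1-\pi)\) (here \(a(\phi) = 1\)):

```python
# A sketch: recover E(Y) = b'(theta) and Var(Y) = b''(theta) a(phi) for the
# Binomial by numerical differentiation of b(theta) = n*log(1 + exp(theta)).
import numpy as np

n, pi = 8, 0.35
theta = np.log(pi / (1 - pi))

def b(t):
    return n * np.log1p(np.exp(t))

h = 1e-4
b1 = (b(theta + h) - b(theta - h)) / (2 * h)              # approximates b'(theta)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2  # approximates b''(theta)

print(np.isclose(b1, n * pi))             # E(Y)   = n*pi           -> True
print(np.isclose(b2, n * pi * (1 - pi)))  # Var(Y) = n*pi*(1-pi)    -> True
```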

Derivation of the mean and variance of exponential family distributions

\[E(Y) = b'(\theta)\] \[Var(Y) = b''(\theta) a(\phi)\]

The log-likelihood for a single \(y\) is given by

\[l(\theta) = (y \theta - b(\theta))/a(\phi) + c(y, \phi)\] Taking derivatives with respect to \(\theta\) gives

\[l'(\theta) = (y - b'(\theta))/a(\phi)\]

Taking expectation over \(y\) gives

\[E[l'(\theta)] = (E[Y]-b'(\theta))/a(\phi)\] From likelihood theory, we know \(E[l'(\theta)]=0\) at the true value of \(\theta\) so

\[E[Y]=\mu=b'(\theta)\]

Taking second derivatives:

\[l''(\theta) = - b''(\theta)/a(\phi)\]

From likelihood theory, we also know that \(E[l''(\theta)]=-E[(l'(\theta))^2]\). Since \(l''(\theta) = - b''(\theta)/a(\phi)\) does not depend on \(y\), we have \(E[l''(\theta)] = - b''(\theta)/a(\phi)\); and since \(l'(\theta) = (Y - b'(\theta))/a(\phi)\), we have \(E[(l'(\theta))^2] = E[(Y - b'(\theta))^2]/a(\phi)^2\). Evaluating both sides at the true value of \(\theta\), we obtain

\[b''(\theta)/a(\phi) = E[(Y-b'(\theta))^2]/a(\phi)^2,\]

which gives

\[Var[Y]= E[(Y-b'(\theta))^2] = b''(\theta) a(\phi)\]
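
The two likelihood identities used above, \(E[l'(\theta)]=0\) and \(E[l''(\theta)]=-E[(l'(\theta))^2]\), can be illustrated by simulation; the sketch below (Python/numpy, an assumption) does this for a Poisson sample with \(\theta = \log\mu\), \(a(\phi)=1\) and \(b(\theta)=\exp(\theta)\):

```python
# A sketch: simulate a Poisson sample and check E[l'(theta)] ~ 0 and
# E[(l'(theta))^2] ~ -E[l''(theta)] = b''(theta) at the true theta = log(mu).
import numpy as np

rng = np.random.default_rng(0)
mu = 4.0
theta = np.log(mu)
y = rng.poisson(mu, size=200_000)

score = y - np.exp(theta)                 # l'(theta) = (y - b'(theta))/a(phi)
print(score.mean())                       # close to 0
print(np.mean(score**2), np.exp(theta))   # both close to mu = 4
```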

21.2.2 Variance function

\[E(Y) = b'(\theta)\] \[Var(Y) = b''(\theta) a(\phi),\]

\(a(\phi)\) could in principle be any function of \(\phi\), and when working with GLMs there is no difficulty in handling any form of \(a(\phi)\) if \(\phi\) is known. However, when \(\phi\) is unknown, matters become awkward unless we can write \(a(\phi) = \phi/w\), where the \(w\) are known weights that may vary between observations. For example, \(a(\phi) = \phi/w\) allows for unequal variances in models based on the normal distribution, but in most cases \(w\) is simply 1.

Then we have

\[Var(Y) = b''(\theta) \frac{\phi}{w}\]

Subsequently it is convenient to write \(Var(Y)\) as a function of \(\mu = E(Y)\)

\[Var(Y)=V(\mu)\phi\]

\[V(\mu) = \frac{b''(\theta)}{w} = \frac{b''(b'^{-1}(\mu))}{w}\]

The variance function \(V(\mu)\) describes how the variance relates to the mean (\(Var(Y)=V(\mu)\phi\)), using the known relationship between \(\theta\) and \(\mu\): since \(E[Y] = \mu = b'(\theta)\), we have \(\theta = b'^{-1}(\mu)\).

Example: in the normal case, \(b(\theta)= \theta^2/2\), so \(b''(\theta)=1\) and the variance is independent of the mean: \(Var(Y)=V(\mu)\phi = 1 \times \sigma^2\).
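
For reference, a small sketch (Python/numpy, an assumption) that evaluates the variance functions \(V(\mu)=b''(b'^{-1}(\mu))\) implied by the three examples above, taking \(w = 1\) (and \(n = 1\) for the Binomial, so that \(\mu\) lies in \((0,1)\)):

```python
# A sketch: variance functions V(mu) = b''(b'^{-1}(mu)) for the three examples.
import numpy as np

def V_normal(mu):
    # b(theta) = theta^2/2: b'(theta) = theta, b''(theta) = 1, so V(mu) = 1
    return np.ones_like(np.asarray(mu, dtype=float))

def V_poisson(mu):
    # b(theta) = exp(theta), theta = log(mu): b''(theta) = exp(theta) = mu
    return np.asarray(mu, dtype=float)

def V_binomial(mu):
    # b(theta) = log(1 + exp(theta)), theta = log(mu/(1-mu)):
    # b''(theta) = exp(theta)/(1 + exp(theta))^2 = mu*(1 - mu)
    mu = np.asarray(mu, dtype=float)
    return mu * (1.0 - mu)

mu = np.array([0.2, 0.5, 0.8])
print(V_normal(mu))     # [1. 1. 1.]        -> Var(Y) = phi = sigma^2
print(V_poisson(mu))    # [0.2 0.5 0.8]     -> Var(Y) = mu
print(V_binomial(mu))   # [0.16 0.25 0.16]  -> Var(Y) = mu*(1 - mu)
```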