22 Deviance
- Deviance
- Comparing models
- Deviance residuals
- AIC
22.1 Deviance
When working with GLMs in practice, it is useful to have a quantity that can be interpreted in a similar way to the residual sum of squares in ordinary linear modeling.
The deviance of a model is defined through the difference of the log-likelihoods between the saturated model and the fitted model, scaled by the dispersion parameter \(\phi\):
\[D = 2 \left(l({\boldsymbol{\hat\beta_{sat}}}) - l(\hat{\boldsymbol{\beta}})\right)\phi\]
\(l(\boldsymbol{\hat\beta_{sat}})\) is the maximized log-likelihood of the saturated model, the model with one parameter per data point. For exponential family distributions it is computed by simply setting \(\hat\mu_i = y_i\) and evaluating the log-likelihood.
The deviance measures the difference of the fitted model with respect to a perfect model for the sample \(\{(x_i, y_i)\}_{i=1}^n\). This perfect model, known as the saturated model, is the model that perfectly fits the data, in the sense that the fitted responses \(\hat y_i\) equal the observed responses \(y_i\).
- The saturated model is the model in which each observation has its own mean \(\mu_i\); then \(\hat \mu_i = y_i\).
- The null model is the model with just an intercept; then \(\hat \mu_i = \hat \mu = \bar y\).
- In the fitted model, \(\hat \mu_i = g^{-1}(\boldsymbol{x}_i' \hat{\boldsymbol\beta})\).
Log-likelihoods
\(l(\boldsymbol{\hat\beta}_{sat})\) is the log-likelihood evaluated at the MLE of the saturated model, and \(l(\boldsymbol{\hat \beta})\) is the log-likelihood evaluated at the MLE of the fitted model:
- \(l(\boldsymbol{\hat \beta}_{sat})\): perfect fit (interpolation)
- \(l(\boldsymbol{\hat \beta})\): the model analyzed
- \(l(\hat \beta_0)\): worst fit (intercept only)
Null deviance
The null deviance is defined through the difference of the log-likelihoods between the saturated model and the null model:
\(D_0 = 2 \left( l(\boldsymbol{\hat \beta}_{sat}) - l(\hat \beta_0) \right) \phi\)
Residual deviance
The residual deviance is defined through the difference of the log-likelihoods between the saturated model and the fitted model:
\(D = 2 \left( l(\boldsymbol{\hat \beta}_{sat}) - l(\boldsymbol{\hat \beta}) \right) \phi\)
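As a concrete sketch of these definitions in the Poisson case (\(\phi = 1\)), the snippet below checks that the closed-form deviance \(2\sum_i [y_i \log(y_i/\hat\mu_i) - (y_i - \hat\mu_i)]\) agrees with \(2(l(\boldsymbol{\hat\beta}_{sat}) - l(\boldsymbol{\hat\beta}))\). It assumes NumPy and SciPy are available; the data vector is illustrative and the function names are my own.

```python
import numpy as np
from scipy.special import gammaln, xlogy   # xlogy(0, x) = 0, the needed convention

def poisson_loglik(y, mu):
    # Poisson log-likelihood at fitted means mu
    return np.sum(xlogy(y, mu) - mu - gammaln(y + 1))

def poisson_deviance(y, mu):
    # closed form: D = 2 * sum(y * log(y / mu) - (y - mu))
    return 2.0 * np.sum(xlogy(y, y) - xlogy(y, mu) - (y - mu))

y = np.array([2.0, 0.0, 5.0, 3.0, 1.0, 4.0])   # illustrative counts
mu_null = np.full_like(y, y.mean())            # null model: mu_hat_i = y_bar

D0 = poisson_deviance(y, mu_null)              # null deviance, closed form
D0_def = 2.0 * (poisson_loglik(y, y) - poisson_loglik(y, mu_null))  # definition
```

Since the saturated model sets \(\hat\mu_i = y_i\), `poisson_deviance(y, y)` returns exactly 0, and the \(\log(y_i!)\) terms cancel in the difference of log-likelihoods.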
22.2 Scaled deviance
A related quantity is the scaled deviance:
\[D^* = D/\phi = 2 \left( l(\boldsymbol{\hat \beta}_{sat}) - l(\boldsymbol{\hat \beta}) \right)\]
If \(\phi = 1\), as in the binomial and Poisson regression models, then the deviance and the scaled deviance coincide.
The scaled deviance has asymptotic distribution
\[D^* \sim \chi^2_{n-p-1}\]
Example: in the case of the Gaussian model,
\[D^* = \frac{D}{\phi} = \frac{RSS}{\sigma^2} = \frac{\sum_{i=1}^n(Y_i - \hat Y_i)^2}{\sigma^2} \mbox{ is distributed as a } \chi^2_{n-p-1}\]
22.3 Estimating \(\phi\)
This result provides a way of estimating \(\phi\) when it is unknown: match \(D^* = \frac{D}{\phi}\) with the expectation of a \(\chi^2_{n-p-1}\) variable, which is \(n-p-1\):
\[\hat \phi = \frac{D}{n-p-1}\]
Example: in the case of the Gaussian model,
\[\hat \sigma^2= \frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{n-p-1}\]
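A minimal simulation sketch of the Gaussian case (NumPy assumed; the design matrix, coefficients, and seed are all illustrative): the deviance is the RSS, and \(\hat\phi = D/(n-p-1)\) estimates \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta = np.array([1.0, 2.0, -0.5])
sigma = 1.5                                  # true sqrt(phi)
y = X @ beta + rng.normal(scale=sigma, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS = Gaussian MLE
fitted = X @ beta_hat
D = np.sum((y - fitted) ** 2)                # Gaussian deviance = RSS
phi_hat = D / (n - p - 1)                    # unbiased estimate of sigma^2 = 2.25
```

With this sample size, `phi_hat` lands near \(\sigma^2 = 2.25\); the \(\chi^2_{n-p-1}\) result above describes its sampling variability.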
22.4 Model comparison with known \(\phi\)
Consider a model M1 with \(p_1\) predictors nested in a model M2 with \(p_2\) predictors, \(p_1 < p_2\). We test
\(H_0: \beta_{p_1+1} = \ldots = \beta_{p_2} = 0\) (M1 fits as well as M2) vs.
\(H_1: \beta_{j} \neq 0\) for some \(p_1 < j \leq p_2\) (M2 fits better than M1).
Under \(H_0\), the difference in scaled deviances is asymptotically chi-squared, with degrees of freedom equal to the difference in the number of parameters between the models.
\[D^*_{p_1} - D^*_{p_2} \sim \chi^2_{p_2-p_1}\] \[2 (l(\hat{\boldsymbol{\beta_2}}) - l(\hat{\boldsymbol{\beta_1}})) \sim \chi^2_{p_2-p_1}\]
If M1 fits the data as well as M2 (\(H_0\)), then \(D^*_{p_1} - D^*_{p_2}\) is expected to be small.
Thus, we would reject \(H_0\) if the test statistic is \(>\) quantile \(\chi^2_{1-\alpha,p_2-p_1}\).
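The test can be sketched in Python as follows (SciPy assumed; the deviances and predictor counts below are hypothetical numbers, not from a real fit):

```python
from scipy import stats

def nested_deviance_test(D1, D2, p1, p2, alpha=0.05):
    # Chi-square test for nested GLMs with known phi.
    # D1, D2 are the scaled deviances of M1 (p1 predictors) and M2 (p2 predictors).
    stat = D1 - D2                         # D*_{p1} - D*_{p2}
    df = p2 - p1
    p_value = stats.chi2.sf(stat, df)      # P(chi2_df > stat)
    reject = stat > stats.chi2.ppf(1 - alpha, df)
    return stat, p_value, reject

# hypothetical scaled deviances from two nested fits (e.g. Poisson, phi = 1)
stat, p_value, reject = nested_deviance_test(D1=35.2, D2=24.8, p1=2, p2=5)
```

Here the statistic is \(35.2 - 24.8 = 10.4\) on \(5 - 2 = 3\) degrees of freedom, which exceeds the \(\chi^2_{0.95,3}\) quantile, so \(H_0\) is rejected.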
22.5 Model comparison with unknown \(\phi\)
When \(\phi\) is unknown, \(D^*\) cannot be computed. However, the following \(F\) statistic can be calculated without knowing \(\phi\), since \(\phi\) cancels from the numerator and denominator of the ratio. Under \(H_0\), and assuming the numerator and denominator are asymptotically independent:
\[F=\frac{(D_{p_1}^*-D_{p_2}^*)/(p_2-p_1)}{D_{p_2}^*/(n-p_2-1)} \sim F_{p_2-p_1, n-p_2-1}\] Thus, we would reject \(H_0\) if the test statistic is \(>\) quantile \(F_{1-\alpha,p_2-p_1, n-p_2-1}\).
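A sketch of the \(F\)-test in Python (SciPy assumed; the deviances below are hypothetical). Note that it takes the unscaled deviances, since \(\phi\) cancels in the ratio:

```python
from scipy import stats

def deviance_f_test(D1, D2, p1, p2, n, alpha=0.05):
    # F-test for nested GLMs when phi is unknown; D1, D2 are unscaled
    # deviances of M1 (p1 predictors) and M2 (p2 predictors).
    df1, df2 = p2 - p1, n - p2 - 1
    F = ((D1 - D2) / df1) / (D2 / df2)
    p_value = stats.f.sf(F, df1, df2)
    reject = F > stats.f.ppf(1 - alpha, df1, df2)
    return F, p_value, reject

# hypothetical unscaled deviances from two nested Gaussian fits
F, p_value, reject = deviance_f_test(D1=120.0, D2=80.0, p1=1, p2=3, n=30)
```

In the Gaussian case this reduces to the classical \(F\)-test for nested linear models, with \(D\) equal to the RSS.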
22.6 AIC
Akaike’s information criterion (AIC) can be used for model selection.
\[AIC = -2\,\mbox{log-likelihood} + 2(p+1),\] where the log-likelihood is the maximized log-likelihood of the model and \(p+1\) is the number of model parameters, including the intercept. The model with the lowest AIC is selected. Models under comparison do not need to be nested.
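An illustrative Python sketch of AIC-based selection (the log-likelihood values and predictor counts are hypothetical):

```python
def aic(loglik, p):
    # AIC = -2 * log-likelihood + 2 * (p + 1): p predictors plus an intercept
    return -2.0 * loglik + 2.0 * (p + 1)

# hypothetical maximized log-likelihoods and predictor counts of two models
models = {"M1": (-120.4, 2), "M2": (-118.9, 5)}
aics = {name: aic(ll, p) for name, (ll, p) in models.items()}
best = min(aics, key=aics.get)   # lowest AIC wins
# M2 has the higher likelihood, but M1 wins after the complexity penalty
```

This shows the trade-off AIC encodes: the extra predictors of M2 improve the fit, but not by enough to offset the \(2(p+1)\) penalty.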