22 Deviance
- Deviance
- Comparing models
- Deviance residuals
- AIC
22.1 Deviance
When working with GLMs in practice, it is useful to have a quantity that can be interpreted in a similar way to the residual sum of squares in ordinary linear modeling.
The deviance of a model is defined through the difference of the log-likelihoods between the saturated model and the fitted model, scaled by the dispersion parameter \(\phi\):
\[D = 2 \left(l({\boldsymbol{\hat\beta_{sat}}}) - l(\hat{\boldsymbol{\beta}})\right)\phi\]
\(l(\boldsymbol{\hat\beta_{sat}})\) is the maximized log-likelihood of the saturated model, the model with one parameter per data point. For exponential family distributions it is computed by simply setting \(\hat\mu_i = y_i\) and evaluating the log-likelihood.
The deviance measures the difference of the fitted model with respect to a perfect model for the sample \(\{(x_i, y_i)\}_{i=1}^n\). This perfect model, known as the saturated model, is the model that perfectly fits the data, in the sense that the fitted responses \(\hat y_i\) equal the observed responses \(y_i\).
- The saturated model is the model in which each observation has its own mean \(\mu_i\); then \(\hat \mu_i = y_i\).
- The null model is the model with just an intercept; then \(\hat \mu_i = \hat \mu = \bar y\).
- In the fitted model, \(\hat \mu_i = g^{-1}(\boldsymbol{x}_i' \hat{\boldsymbol\beta})\).
Log-likelihoods
\(l(\boldsymbol{\hat\beta}_{sat})\) is the log-likelihood evaluated at the MLE of the saturated model, and \(l(\boldsymbol{\hat \beta})\) is the log-likelihood evaluated at the MLE of the fitted model:
- \(l(\boldsymbol{\hat \beta}_{sat})\): perfect fit (interpolation)
- \(l(\boldsymbol{\hat \beta})\): the model analyzed
- \(l(\hat \beta_0)\): worst fit (intercept only)
Null deviance
The null deviance is defined through the difference of the log-likelihoods between the saturated model and the null model:
\(D_0 = 2 \left( l(\boldsymbol{\hat \beta}_{sat}) - l(\hat \beta_0) \right) \phi\)
Residual deviance
The residual deviance is defined through the difference of the log-likelihoods between the saturated model and the fitted model:
\(D = 2 \left( l(\boldsymbol{\hat \beta}_{sat}) - l(\boldsymbol{\hat \beta}) \right) \phi\)
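As a concrete sketch of these definitions in the Poisson case (\(\phi = 1\)), the snippet below checks that the closed-form deviance \(2\sum_i [y_i \log(y_i/\hat\mu_i) - (y_i - \hat\mu_i)]\) agrees with \(2(l(\boldsymbol{\hat\beta}_{sat}) - l(\boldsymbol{\hat\beta}))\). It assumes NumPy and SciPy are available; the data vector is illustrative and the function names are my own.

```python
import numpy as np
from scipy.special import gammaln, xlogy   # xlogy(0, x) = 0, the needed convention

def poisson_loglik(y, mu):
    # Poisson log-likelihood at fitted means mu
    return np.sum(xlogy(y, mu) - mu - gammaln(y + 1))

def poisson_deviance(y, mu):
    # closed form: D = 2 * sum(y * log(y / mu) - (y - mu))
    return 2.0 * np.sum(xlogy(y, y) - xlogy(y, mu) - (y - mu))

y = np.array([2.0, 0.0, 5.0, 3.0, 1.0, 4.0])   # illustrative counts
mu_null = np.full_like(y, y.mean())            # null model: mu_hat_i = y_bar

D0 = poisson_deviance(y, mu_null)              # null deviance, closed form
D0_def = 2.0 * (poisson_loglik(y, y) - poisson_loglik(y, mu_null))  # definition
```

Since the saturated model sets \(\hat\mu_i = y_i\), `poisson_deviance(y, y)` returns exactly 0, and the \(\log(y_i!)\) terms cancel in the difference of log-likelihoods.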
22.2 Scaled deviance
A related quantity is the scaled deviance:
\[D^* = D/\phi = 2 \left( l(\boldsymbol{\hat \beta}_{sat}) - l(\boldsymbol{\hat \beta}) \right)\]
If \(\phi = 1\), as in the binomial and Poisson regression models, then the deviance and the scaled deviance coincide.
The scaled deviance has asymptotic distribution
\[D^* \sim \chi^2_{n-p-1}\]
Example: in the case of the Gaussian model,
\[D^* = \frac{D}{\phi} = \frac{RSS}{\sigma^2} = \frac{\sum_{i=1}^n(Y_i - \hat Y_i)^2}{\sigma^2} \mbox{ is distributed as a } \chi^2_{n-p-1}\]
22.3 Estimating \(\phi\)
This result provides a way of estimating \(\phi\) when it is unknown: match \(D^* = \frac{D}{\phi}\) with the expectation of a \(\chi^2_{n-p-1}\) variable, which is \(n-p-1\):
\[\hat \phi = \frac{D}{n-p-1}\]
Example: in the case of the Gaussian model,
\[\hat \sigma^2= \frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{n-p-1}\]
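A minimal simulation sketch of the Gaussian case (NumPy assumed; the design matrix, coefficients, and seed are all illustrative): the deviance is the RSS, and \(\hat\phi = D/(n-p-1)\) estimates \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta = np.array([1.0, 2.0, -0.5])
sigma = 1.5                                  # true sqrt(phi)
y = X @ beta + rng.normal(scale=sigma, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS = Gaussian MLE
fitted = X @ beta_hat
D = np.sum((y - fitted) ** 2)                # Gaussian deviance = RSS
phi_hat = D / (n - p - 1)                    # unbiased estimate of sigma^2 = 2.25
```

With this sample size, `phi_hat` lands near \(\sigma^2 = 2.25\); the \(\chi^2_{n-p-1}\) result above describes its sampling variability.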
22.4 Model comparison with known \(\phi\)
Consider a model M1 with \(p_1\) predictors nested in a model M2 with \(p_2\) predictors, \(p_1 < p_2\). We test
\(H_0: \beta_{p_1+1} = \ldots = \beta_{p_2} = 0\) (M1 fits as well as M2) vs.
\(H_1: \beta_{j} \neq 0\) for some \(p_1 < j \leq p_2\) (M2 fits better than M1).
Under \(H_0\), the difference in scaled deviances is asymptotically chi-squared, with degrees of freedom equal to the difference in the number of parameters between the models.
\[D^*_{p_1} - D^*_{p_2} \sim \chi^2_{p_2-p_1}\] \[2 (l(\hat{\boldsymbol{\beta_2}}) - l(\hat{\boldsymbol{\beta_1}})) \sim \chi^2_{p_2-p_1}\]
If M1 fits the data as well as M2 (\(H_0\)), then \(D^*_{p_1} - D^*_{p_2}\) is expected to be small.
Thus, we would reject \(H_0\) if the test statistic is \(>\) quantile \(\chi^2_{1-\alpha,p_2-p_1}\).
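The test can be sketched in Python as follows (SciPy assumed; the deviances and predictor counts below are hypothetical numbers, not from a real fit):

```python
from scipy import stats

def nested_deviance_test(D1, D2, p1, p2, alpha=0.05):
    # Chi-square test for nested GLMs with known phi.
    # D1, D2 are the scaled deviances of M1 (p1 predictors) and M2 (p2 predictors).
    stat = D1 - D2                         # D*_{p1} - D*_{p2}
    df = p2 - p1
    p_value = stats.chi2.sf(stat, df)      # P(chi2_df > stat)
    reject = stat > stats.chi2.ppf(1 - alpha, df)
    return stat, p_value, reject

# hypothetical scaled deviances from two nested fits (e.g. Poisson, phi = 1)
stat, p_value, reject = nested_deviance_test(D1=35.2, D2=24.8, p1=2, p2=5)
```

Here the statistic is \(35.2 - 24.8 = 10.4\) on \(5 - 2 = 3\) degrees of freedom, which exceeds the \(\chi^2_{0.95,3}\) quantile, so \(H_0\) is rejected.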
22.5 Model comparison with unknown \(\phi\)
When \(\phi\) is unknown, \(D^*\) cannot be computed. However, the following \(F\) statistic can be calculated without knowing \(\phi\), since \(\phi\) cancels from the numerator and denominator of the ratio. Under \(H_0\), and assuming the numerator and denominator are asymptotically independent:
\[F=\frac{(D_{p_1}^*-D_{p_2}^*)/(p_2-p_1)}{D_{p_2}^*/(n-p_2-1)} \sim F_{p_2-p_1, n-p_2-1}\] Thus, we would reject \(H_0\) if the test statistic is \(>\) quantile \(F_{1-\alpha,p_2-p_1, n-p_2-1}\).
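A sketch of the \(F\)-test in Python (SciPy assumed; the deviances below are hypothetical). Note that it takes the unscaled deviances, since \(\phi\) cancels in the ratio:

```python
from scipy import stats

def deviance_f_test(D1, D2, p1, p2, n, alpha=0.05):
    # F-test for nested GLMs when phi is unknown; D1, D2 are unscaled
    # deviances of M1 (p1 predictors) and M2 (p2 predictors).
    df1, df2 = p2 - p1, n - p2 - 1
    F = ((D1 - D2) / df1) / (D2 / df2)
    p_value = stats.f.sf(F, df1, df2)
    reject = F > stats.f.ppf(1 - alpha, df1, df2)
    return F, p_value, reject

# hypothetical unscaled deviances from two nested Gaussian fits
F, p_value, reject = deviance_f_test(D1=120.0, D2=80.0, p1=1, p2=3, n=30)
```

In the Gaussian case this reduces to the classical \(F\)-test for nested linear models, with \(D\) equal to the RSS.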
22.6 AIC
Akaike’s information criterion (AIC) can be used for model selection.
\[AIC = -2\,\mbox{log-likelihood} + 2(p+1),\] where the log-likelihood is the maximized log-likelihood of the model and \(p+1\) is the number of model parameters, including the intercept. The model with the lowest AIC is selected. Models under comparison do not need to be nested.
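An illustrative Python sketch of AIC-based selection (the log-likelihood values and predictor counts are hypothetical):

```python
def aic(loglik, p):
    # AIC = -2 * log-likelihood + 2 * (p + 1): p predictors plus an intercept
    return -2.0 * loglik + 2.0 * (p + 1)

# hypothetical maximized log-likelihoods and predictor counts of two models
models = {"M1": (-120.4, 2), "M2": (-118.9, 5)}
aics = {name: aic(ll, p) for name, (ll, p) in models.items()}
best = min(aics, key=aics.get)   # lowest AIC wins
# M2 has the higher likelihood, but M1 wins after the complexity penalty
```

This shows the trade-off AIC encodes: the extra predictors of M2 improve the fit, but not by enough to offset the \(2(p+1)\) penalty.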