28 Cox proportional hazards model

Model selection and interpretation: Chapter 6 PDF Applied Survival Analysis using R

28.1 Cox proportional hazards (PH) model

The general form of the Cox PH model gives an expression for the hazard at time \(t\) for an individual with a given specification of a set of explanatory variables denoted by \(\boldsymbol{x}=(x_1, x_2, \ldots, x_p)\).

The hazard at time \(t\) is the product of two quantities:

\(h_0(t)\): baseline hazard function
exponential of the linear sum of \(\beta_i x_i\)

\[h(t, \boldsymbol{x}) = h_0(t) exp(\sum_{i=1}^p \beta_i x_i)\]

Here the baseline hazard is a function of \(t\), but does not involve the \(x\)’s, whereas the exponential expression involves the \(x\)’s, but does not involve \(t\). The \(x\)’s here are called time-independent \(x\)’s (e.g., age, weight if their values do not change much over time, etc.).

It is possible, nevertheless, to consider \(x\)’s that involve \(t\). Such \(x\)’s are called time-dependent variables. If time-dependent variables are considered, the Cox model form may still be used, but such a model no longer satisfies the PH assumption, and is called the extended Cox model.

The baseline hazard \(h_0(t)\) is an unspecified function. This property makes the Cox model a semiparametric model.

28.1.1 Maximum likelihood estimation of the Cox model

The estimates of the Cox model parameters are derived by maximizing a likelihood function.

The formula for the Cox model likelihood function is actually called a “partial” likelihood function rather than a (complete) likelihood function. The term “partial” likelihood is used because the likelihood formula considers probabilities only for those subjects who fail, and does not explicitly consider probabilities for those subjects who are censored. Thus, the likelihood for the Cox model does not consider probabilities for all subjects, and so it is called a “partial” likelihood.

28.1.2 Assumptions

The Cox PH model assumes that the hazard ratio comparing any two specifications of predictors is constant over time

\[\frac{\hat h(t, \boldsymbol{x^*})}{\hat h(t, \boldsymbol{x})} = \hat \theta \mbox{ constant over } t\] Equivalently, this means that the hazard for one individual is proportional to the hazard for any other individual, where the proportionality constant is independent of time

\[\hat h(t, \boldsymbol{x^*}) = \hat \theta \hat h({t, \boldsymbol{x})}\]

General approaches for assessing the PH assumption include graphical and other testing approaches.

28.1.3 Hazard ratio (HR)

From the Cox PH model, we can obtain a general formula to estimate a hazard ratio that compares two specifications of the covariates

Hazard for individual with covariate vector \(\boldsymbol{x} = (x_1, x_2, \ldots, x_p)\)

\[h(t, \boldsymbol{x}) = h_0(t) exp(\beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p),\]

where baseline hazard \(h_0(t)\) is hazard function for individual with all \(x_1 = x_2 = \ldots = x_p = 0\)

Hazard ratio (HR)

Consider a subject with \(\boldsymbol{x} = (x_1, x_2, \ldots, x_p)\) where \(x_1=0\)

Consider another subject with \(\boldsymbol{x}' = (x_1', x_2', \ldots, x_p')\) where \(x_1'=1\) and \(x_j'=x_j\), \(j > 1\) \[HR = \frac{h(t, \boldsymbol{x'})}{h(t, \boldsymbol{x})} = exp(\beta_1)\]

\[HR = \frac{h(t, \boldsymbol{x'})}{h(t, \boldsymbol{x})} = exp(\beta_1)\]

Example: Melanoma data

\[h(t, \boldsymbol{x}) = h_0(t) exp(\beta_1 x_1 + \beta_2 x_2 + \beta_2 x_3 + \beta_4 x_4)\]

\(x_1=\) sex (M=1, F=0)
\(x_2=\) indicator of ulceration
\(x_3=\) age
\(x_4=\) thickness (mm)

Hazard ratio between those with ulceration (\(\boldsymbol{x'}=(x_1, 1, x_3, x_4)\)) and without ulceration (\(\boldsymbol{x}=(x_1, 0, x_3, x_4)\)) adjusted for sex, age and thickness

\[HR = \frac{h(t, \boldsymbol{x'})}{h(t, \boldsymbol{x})} = exp(\beta_2)\]

28.2 Model selection and interpretation

28.2.1 Example 1

Effect of treatment on survival in the presence of a genetic confounder.

Simulated data from example in page 51 of the book Applied survival analysis using R by Moore.

We shall set up a simulated data set from a clinical trial comparing a standard therapy (control) to an experimental therapy (treated). For simplicity, we suppose that the survival times are exponentially distributed, and that the disease is rapidly fatal, so that there is no censoring. We also suppose that there is a confounding variable, “genotype”, which can either be wild type (i.e. normal) or mutant, and that patients carrying the mutant genotype have a considerably poorer prognosis.

Specifically, we set the hazard rate for a mutant patient in the control group at 0.03 per day, and we assume that the effect of treatment is to reduce the hazard by a factor of 0.55. We also assume that the hazard rate for wild type patients is reduced by a factor of 0.2 as compared to mutant patients, and that the multiplicative effect of treatment on the wild type patients is the same as for the mutant patients. In R, we set up the four hazard rates as follows:

# Simulate data
lambda.mutant.0 <- 0.03 # mutant control
lambda.mutant.1 <- 0.03*0.55 # mutant treatment
lambda.wt.0 <- 0.03*0.2 # wildtype control
lambda.wt.1 <- 0.03*0.2*0.55 # wildtype treatment

set.seed(4321)
# survival times
tt.control.mutant <- rexp(25, rate = lambda.mutant.0)
tt.treat.mutant <- rexp(125, rate = lambda.mutant.1)
tt.control.wt <- rexp(125, rate = lambda.wt.0)
tt.treat.wt <- rexp(25, rate = lambda.wt.1)
ttAll <- c(tt.control.mutant, tt.treat.mutant, tt.control.wt, tt.treat.wt)
# no censoring
status <- rep(1, length(ttAll))
# genotype
genotype <- c(rep("mutant", 150), rep("wt", 150))
# treatment
trt <- c(rep(0, 25), rep(1, 125), rep(0, 125), rep(1, 25))

Cox PH model without adjusting by genotype

Cox proportional hazards model of the effect of treatment on survival unadjusted for the genetic mutation status of the patients:

\[h(t; trt) = h_0(t) exp(\beta_1 \times trt)\]

library(survival)
coxph(Surv(ttAll, status) ~ trt)

Call:
coxph(formula = Surv(ttAll, status) ~ trt)

      coef exp(coef) se(coef)     z        p
trt 0.4639    1.5902   0.1173 3.955 7.64e-05

Likelihood ratio test=15.51  on 1 df, p=8.198e-05
n= 300, number of events= 300

Estimate of the log hazard ratio treatment effect \(\hat \beta\) is 0.464. Since this is positive, higher hazards are associated with the treatment than with the control. That is, treatment appears to reduce survival. \(exp(\hat \beta)= 1.59\) suggests that treatment is associated with a 59% additional risk of death over the control.

Cox PH model adjusting by genotype

We can adjust for genotype and now the coefficient is negative indicating that the treatment is effective. We also see that the wild type genotype has lower hazard than the reference (mutant) genotype, and thus that the mutant genotype incurs additional risk of death.

\[h(t; trt, genotype) = h_0(t) exp(\beta_1 \times trt + \beta_2 \times genotype)\]

coxph(Surv(ttAll, status) ~ trt + genotype)

Call:
coxph(formula = Surv(ttAll, status) ~ trt + genotype)

              coef exp(coef) se(coef)      z       p
trt        -0.4523    0.6361   0.1634 -2.768 0.00563
genotypewt -1.5676    0.2085   0.1825 -8.589 < 2e-16

Likelihood ratio test=93.39  on 2 df, p=< 2.2e-16
n= 300, number of events= 300

28.2.2 Example 2

Effect of race and age on survival using simulated data of 60 patients.

Simulated data from example in page 77 of the book Applied survival analysis using R by Moore.

age <- runif(n = 60, min = 40, max = 80)
race <- factor(c(rep("white", 20), rep("black", 20), rep("other", 20)))
# white reference category
race <- relevel(race, ref = "white")

Create a matrix of dummy variables for race and also a column for age. We remove the first column of 1s for the intercept because Cox PH has no intercept. If there were one, it would appear in both the numerator and denominator of the partial likelihood, and cancel out just as the baseline hazard canceled out. Another way to think of it is that any intercept term would be absorbed into the baseline hazard.

model.matrix(~ race + age)[, -1]

   raceblack raceother      age
1          0         0 76.25483
2          0         0 66.76425
3          0         0 59.18786
4          0         0 41.46953
5          0         0 78.98840
6          0         0 60.77425
7          0         0 72.31459
8          0         0 43.01703
9          0         0 67.67498
10         0         0 41.88624
11         0         0 46.76925
12         0         0 60.94440
13         0         0 46.23898
14         0         0 73.57508
15         0         0 71.48633
16         0         0 66.88909
17         0         0 70.53877
18         0         0 75.33032
19         0         0 59.62092
20         0         0 79.63690
21         1         0 75.75531
22         1         0 68.85418
23         1         0 41.31479
24         1         0 54.27050
25         1         0 55.58030
26         1         0 64.97783
27         1         0 57.24783
28         1         0 45.04763
29         1         0 79.69357
30         1         0 65.89334
31         1         0 54.90009
32         1         0 74.28672
33         1         0 70.34818
34         1         0 50.20797
35         1         0 73.76730
36         1         0 48.70321
37         1         0 76.58092
38         1         0 75.19160
39         1         0 72.53127
40         1         0 61.84747
41         0         1 44.57445
42         0         1 60.94711
43         0         1 46.22843
44         0         1 48.31371
45         0         1 53.03069
46         0         1 79.40954
47         0         1 43.31025
48         0         1 78.38583
49         0         1 40.91891
50         0         1 48.01046
51         0         1 73.97797
52         0         1 53.97391
53         0         1 66.34509
54         0         1 57.37401
55         0         1 63.78242
56         0         1 62.09238
57         0         1 62.79754
58         0         1 54.98288
59         0         1 73.46790
60         0         1 64.49439

The survival variables in our simulated data will be exponentially distributed with a particular rate parameter that depends on the covariates. Specifically, we set the log rate parameter to have baseline -4.5, and the race variable to take the values 1 and 2 for “black” and “other” respectively, when compared to “white”. Finally, we let “age” increase the log rate by 0.05 per year:

log.rate.vec <- -4.5 + c(rep(0, 20), rep(1, 20), rep(2, 20)) + age*0.05

Exponential survival variables, with no censoring:

# survival times
tt <- rexp(n = 60, rate = exp(log.rate.vec))
# no censoring
status <- rep(1, 60)

Fit Cox PH model:

\[h(t; race, age) = h_0(t) exp(\beta_1 \times raceblack + \beta_2 \times raceother + \beta_3 \times age)\]

library(survival)
coxph(Surv(tt, status) ~ race + age)

Call:
coxph(formula = Surv(tt, status) ~ race + age)

             coef exp(coef) se(coef)     z        p
raceblack 1.03603   2.81799  0.35622 2.908  0.00363
raceother 1.88924   6.61437  0.41042 4.603 4.16e-06
age       0.05451   1.05603  0.01384 3.939 8.18e-05

Likelihood ratio test=30.57  on 3 df, p=1.049e-06
n= 60, number of events= 60

We see that the coefficient estimates, 1.04, 1.89, and 0.05, are close to the true values from the simulation, (1, 2, and 0.05). These estimates are log hazard ratios. To describe the estimated effect of, say, “black race” compared to “white race”, we can look at the “exp(coef)” column, and conclude that blacks have exp(1.04) = 2.83 times the risk of death as do whites.