Simulating Common, Censored, Outcome Variables as Dependent Variables in R [part 2]

Laure Ngouanfo Fopa
4 min readAug 13, 2021

[Under Review]

Some types of “times-to-event” are classified as “failure times”. Their log-transformation has the classical linear form, for each individual i:

The error terms ei, are considered independent and identically distributed (iid), with a specified distribution. When unspecified, the model is considered semi-parametric. Ti is independent of the censored time Ci conditional on a set of covariates Xi; ei is also assumed independent of Xi.

We simulate failure times from this class of accelerated failure times (AFT) models mentioned earlier here, and show how they can be fitted in R. We consider the linear predictor to be the same as the one simlated in our first article on this series of 3 articles about simulations:

linear.pred <- b[1]+b[2]*(Xage-mean(Xage))+b[3]*(Xbmi-mean(Xbmi))+b[4]*Xfemale+b[5]*Xexsmoker+b[6]*Xnonsmoker+b[7]*XB+b[8]*XC+b[9]*XD+b[10]*XE

We apply the same process of simulating right-censored times: the observed times Yi are the minimum between Ti and Ci. An indicator binary variable (status) is needed to distinguish the failure times from the censored times, in Yi.

Scenario 1: The error term e is normally distributed — Log-normal AFT

set.seed(10082021)
C <- runif(1000, 0, 37)
e.norm <- rnorm(1000, mean=0, sd=2)
T.norm <- exp(linear.pred + e.norm)
Y.norm <- pmin(T.norm, C)
status.norm <- as.numeric(T.norm < C)
ATFlnorm.survdat<-data.frame(Y = Y.norm, status = status.norm)

The failure times are said to be log-normally distributed in this case.

Scenario 2: The error term e is logistic distributed — Log-logistic AFT

set.seed(10082021)
C <- runif(1000, 0, 37)
e.logis <- rlogis(1000, location = 3, scale = 1)
T.logis <- exp(linear.pred + e.logis)
Y.logis <- pmin(T.logis, C)
status.logis <- as.numeric(T.logis < C)
ATFllogis.survdat<-data.frame(Y = Y.logis, status = status.logis)

The failure times are said to be log-logistic distributed in this case.

Scenario 3: The error term e is weibull distributed — Log-weibull AFT

set.seed(10082021)
C <- runif(1000, 0, 37)
e.weib <- rweibull(1000, shape=1.5, scale=2.7)
T.weib <- exp(linear.pred + e.weib)
Y.weib <- pmin(T.weib, C)
status.weib <- as.numeric(T.weib < C)
ATFlweib.survdat<-data.frame(Y = Y.weib, status = status.weib)

A shape of 1 gives a log-exponential AFT model. In effect, there is an exhaustive list of AFT models.

Fitting failure times

It is often recommended to verify the PH assumption is not satisfied before considering an AFT model. We illustrate this with the log-weibull AFT data because its gobal cox.zph() test is in fact statistically significant.

fitAFT.coxPH<-coxph(formula = Surv(Y, status) ~ Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+
XB+XC+XD+XE, data = ATFlnorm.survdat)

The PH assumption is violated, so we go ahead and fit an AFT model with the survreg() R function. fitAFT.survreg<-survreg(formula = Surv(Y, status) ~ Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+XB+XC+XD+XE, data = ATFlweib.survdat). Note, the R function survreg() is not built on a proportional hazards modelling assumption. We compare both models with the Akaike Information Criteria (AIC), computed manually with the following code:

AIC.coxPH <- -2*(fitAFT.coxPH$loglik[2]) + 2*length(fitAFT.coxPH$coefficients)
AIC.survreg <- -2*(fitAFT.survreg$loglik[2]) + 2*length(fitAFT.survreg$coefficients)

The AIC for the PH modelling assumption is 10598.24, and 3213,71 from the survreg model, making the survreg() model estimation preferable, as expected.

Lets make a similar comparison with simulated failure times where the PH is not violated, say the loglogistic data for example. We run the following 2 codes:

fitAFT.coxPHlogis<-coxph(formula = Surv(Y, status) ~ Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+ XB+XC+XD+XE, data = ATFllogis.survdat)
fitAFT.survreglogis<-survreg(formula = Surv(Y, status) ~ Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+ XB+XC+XD+XE, data = ATFllogis.survdat)

And the corresponding AIC are computed as

AIClogis.coxPH <- -2*(fitAFT.coxPHlogis$loglik[2]) + 2*length(fitAFT.coxPHlogis$coefficients)
AIClogis.survreg <- -2*(fitAFT.survreglogis$loglik[2]) + 2*length(fitAFT.survreglogis$coefficients)

The values are 5379.39 and 3662,7 respectively, still making survreg() preferable. So assuming a logistic distribution on the baseline hazard function h0(), provides a better estimation of the coefficients in the model than ignoring that baseline distribution.

Alternatives

The AFT data are also usually fitted with the aftreg() and the phreg() R functions from the eha R package. One component in the aftreg() function, is param which can take two entries “lifeAcc” and “lifeExp”. The later uses similar parametrization with the survreg() R function. when param is specified as “lifeAcc”, the aftreg() model function estimates true acceleration of time.

The parameterization in phreg() is similar to the one in coxph(), and naturally different from the one used by survreg().

There seems to be little literature and reliable computing methods about AFT. For example, authors of the phreg() function, highlight the following warning about the function: “The lognormal and loglogistic distributions are included on an experimental basis for the moment. Use with care, results may be unreliable!”. These journals may guide you on understanding more on this and how to interpret parameter estimates from the R outputs.

--

--

Laure Ngouanfo Fopa

Teaching/Research Assistant in Mathematics/(Bayesian) Statistics, Writer