Indirect comparison with Propensity Score Weighting (PSW) technique and the Bucher approach, in synthesizing evidence from two Individual Level Data— A simulation study.

[Under review]

[Under review]


Clinical trials often compare distinct treatments with little in common. In this short note, we are interested in comparing two treatments A and B, from two independent trials (the AC trial and the BC trial), that have a single treatment C in common and, patients characteristics of the same nature (e.g. age, gender, race, comorbidities, etc). Such characteristics, are considered to be recorded before treatment assignment. Thus, A and B are said to be indirectly compared through the anchor node C using the direct estimates of the relative effects of A vs C and B vs C.

Direct evidence from Randomized Controlled Trials (RCTs), are highly recommended by the national and international healthcare bodies such as the WHO, NICE in the UK or the FDA in the USA[1]. Indirect comparison maybe susceptible to bias [1], but regularly, is the only resort to evaluate the relative effectiveness of two treatments. This often occurs: when a treatment is newly manufactured and requires regulatory approval and the manufacturer has to convince the regulators that his/her treatment is better than the existing one already on the market place, but for some reasons, can not conduct a randomized trial to show that; or the treatment manufacturer needs subsidies from the government for his/her product to be affordable by any patients with any social status.

Several methods have been developed to find appropriate ways to indirectly compare treatments under specific circumstances. It started with Bucher et al.[2], then later on extended to propensity scores weighting (psw) techniques and outcome regression approaches[3]. Here we apply the psw on survival outcomes from the AC and the BC trials, and compare the results with the Bucher (naive) approach. The randomization technique employed in a RCT is to aim for balanced treatment groups in the trial, so that the estimated relative effects, reflect the true difference between the groups and so the estimates are not affected by any other factor than the difference in the nature of the treatments themselves. However, this is often far from what happens, in non randomized studies as well as in randomized studies, and so adjustment methods should be applied when factors that may affect a treatment outcome are recorded and made available to the analyst.

Background on the outcome of interest: Survival data

How long will a patient survive after a surgery? He may die at some fixed point time or may not die by that fixed point time. Survival data are collected from asking such questions?. The scenario sounds like in a binary data case, but the difference here is that, the outcome variable consists of two components and “no event” is recorded as “censored”. More clarifications about this follows.

In a survival settings, patients are followed up until, they either experience the event of interest or they fail in doing so. The length of time in the first scenario is recorded as survival time or time-to-event (tte), while in the second scenario it is recorded as censored time. For less confusion, we refer to both times as follow up times.

Censored times, cover a lot of situations: something may happen and the investigator looses track of a patient, or the patient may willingly withdraw from the study, or may dropout due to adverse events or severity of the disease. A patient may still be followed up until the end of the study without the event taking place in the patient. Such “follow up time” is recorded as censored, but marked with an indication to differentiate them from tte (see the analysis section).

Some examples of event of interest are: death, remission, first pregnancy after miscarriage, HIV infection from mother to infant, first occurrence of seizure, etc.

We denote the random variable T, the survival time. The quantity S(t) = P(T>t), is the probability of “surviving” beyond time t. The term “survival” should be used with respect to your event of interest. It’s important to specify the time scale: t could be in seconds, days, weeks, months, years, etc.

Background on the model of interest: The cox proportional hazard (ph) model

The Kaplan Maeir(km) analysis, is a basic way of making inference about the event rate from a survival data. It’s estimated in terms of the median. The median survival time t, is the time at which S(t) = 0.5. In other words, the probability to “survive” beyond that time point t, is 50%.

However, km is only limited to categorical variables (a treatment variable with 2, 3 or more random treatment groups/arms, or a variable about gender with a male group and a female group, etc). It can not handle other types of variables and probably does not support more than one variable. On the other hand, it does tell us things about the significant difference between groups through the Log rank test that produces a p-value. However, we need to know more. Here we are interested in the effect size (relative effects) between treatments. A cox proportional model is more insightful. Recall, a basic linear model looks like this:

In contrast, a cox ph model has the following from:

The hazard function is an important function in drawing inference from survival data. Hazard function and survival function, are related the following way:

The hazard function helps to find the event rate at t among those at risk for an event, the incidence rate per unit of time, or in other words, the instantaneous risk. The increment unit should then be define. E.g. h(t)=1% at t=12 weeks, means “at 12 weeks, patients show the event of interest at a rate of
1% per week”.
The increment unit here is in terms of “weeks”. That rate may indeed change depending on the scale. More clarifications to understand the interpretation of the coefficients in a cox ph model, are below:

In other words, bk estimates, in log hazard ratio, the change in the median response, per unit increase in Xk, holding other predictors constant. If Xk is a categorical variable with two categories such as a treatment variable with two groups A and B, bk is that effect size we are looking for, the effect of treatment A on the survival outcome of interest, compared to the control treatment B. A could be coded as 1 and B-0, or A-2 and B-1.

A required assumption in cox ph model, is a constant hazard ratio over time, or proportional hazards.

Background on the Bucher approach:

Given X = {Xi} i in I, a finite set of observed covariates in each arm, Bucher et al.[1], define the effect size of A vs B in subgroup as:

Recall, the A vs B effect size is not obtained directly in this case, and so we are using the AC effect and the BC direct estimates, and the Mantel-Haenszel formula, to derive that, the overall A vs B effect is:

wi and wi’ are weights reflecting the variances of HR_CA(i) and HR_CB(i) respectively in the corresponding trials. Both weights generally differ, as the distributions of the subgroups generally differ across trials as well, on average. HR_CA and HR_CB are referred to as the direct effects because derived directly from (randomized, observational) conducted trials.

When Bucher et al.[2], first introduced the Indirect Comparison idea, they made quite a strong assumption that,

Meaning, the influence the subgroups/covariates may have on the relative effects, is completely ignored. From this, the A vs B effect is simply reduced to:

Competing methods such as the PSW, the Matching Adjusted Indirect Comparison (MAIC), and the Simulated treatment comparison (STC), have reasonably suggested that, for the Indirect estimate of A vs B to be consistent with its true value, treatment by covariate interactions, must be taken into account in the estimation process, contrasting the above assumption set by Bucher et al.[2].

Background on the PSW approach:

Most often, patients are not randomly assigned to intervention groups.
They are rather observed and their reaction to treatment, measured and assessed on how it affects the outcome of interest. A patient may respond favorably to a treatment because the treatment is new, or because he/she was not that at risk (was not that sick). The age of the patient may matter, its geographical or cultural situation may be favorable to his/her recovery or maybe rather a cause for resistance to treatment, etc. These are some of the characteristics, recorded before treatment, that may alter results. Propensity score weighting, suggests a way to adjust for these pre-treatment characteristics.

There are four common ways to derive weights in this setting, and this is with respect to the target population[9]:

The weights are based on the propensity score ei (a value between 0 and 1), which is the probability of patient i being assigned to a treatment, conditional on some covariates[8]. Zi is a binary vector of 1's for one group, usually termed as the treated/exposed group, and of 0's in another group, usually termed as the control/unexposed group.

Background setting for simulation:

A patient is followed up until its last time is recorded, and the status of its “follow up time” is specified as well. We use status as a variable which takes the value 1 if the observation time is T, and 0 if the observation time is < T.

Example: An Exponential distribution well describes the amount of time until something happens. We could use this to generate both times “fuTime1” and “fuTime2” for 7 individuals and then produce a survival data set. The response variable y is formed by two components, the resulting fuTime variable and the status variable.

In general, survival times, are not sampled directly from a standard distribution such as the exponential distribution. In this article, following Bender et al. [4] suggestion, we simulate survival times build on a Weibull pdf for T. In fact, the exponential distribution, with its single scale parameter, is limited in applicability. The weibull, with an additional shape parameter, is more flexible to fit various types of data.

Recall, the cox ph model assumption stipulates that, the hazard function is a constant multiple of a baseline hazard, conditional on a set of covariates. Using other equations, we derive the following:

By choosing the weibull baseline hazard, we have explicitly:


We simulate survival individual level data (ild) for the AC and the BC comparisons. We set a common shape and scale parameters for both trials. Follow up times assumed to be censored times, are sampled from exponential distributions with different scale parameters to emphasize on different rate of censoring in the trials. Then we employ random right censoring as in the example above, to generate a vector of follow up times, in which survival times and censored times are explicitly incorporated. We generate data for three covariates, in each trial: 1 binomial, 1 negative binomial and 1 uniform. Two of the covariates are considered effect modifiers, meaning their interaction with treatments affect the outcome of interest. So we have interaction terms in our cox ph model to account for these effect modifiers.

Overall, parameters for the simulation, are chosen in such a way that, we have a balanced A-C and B-C groups, to reflect randomization within the AC and the BC trials. However, we make sure that, covariates differ (on average) between the two trials. If they are similar (on average), the estimate of the A vs B effect, would reflect the true A vs B effect. With simulations, we know this true value, which is not the case with a real world data, we never know in real practice the true value of what we are estimating.

The parameter values we set, are relatively low. This reflects empirical research in the medical field.

The number of patients at risk, at time t = 0, is varied from this set N = {60, 90, 120, 150, 180, 350, 800}, in each trial.

Some basic analysis within the trials:

We generate individual patient data in the AC trial, the following way. A couple of R packages required are: “Hmisc”, “tidyverse”, “sandwich”, etc.

As you can read from the linear part of the cox ph model, the true A vs C effect is -2. The true effects of the prognostic variables are respectively: .2, 1, .3. The negative binomial and the uniform variables, are the effect modifiers. The true effects of their interactions with treatment are: -1.3, 1.5 respectively.

Randomization check: Are treatment groups balanced within the trial?

We can confidently say so by looking at the output below:

What about balance with respect to the survival outcomes?

The kaplan Maier curve, tell us that, both groups are statistically significantly different. The p-value is based on a Logrank test with the Chisq statistic. Censored times are marked with “+” on the graph. This output requires the “survminer” package in R.

And in fact, patients in the A arm have a better survival prospect than patients in the C arm, based on their median survival times (time t at which S(t) = .5).

Difference in outcomes is not really our concern, and it is expected somehow. We generate the BC trial similarly, but with slightly different parameter values: the true B vs C effect is 3, cens_rate = 2*cens_rateAC, weibull parameters and prognostic true effects remain the same, probability of success for xbinBC is .65, xunif~U[-5,12], and xnbin~NegBinom(size=1, mu=3). We cross check as above whether we have a good randomization in the BC trial.

Next we put ACild and BCild in the same data set, for better manipulation. After randomization check within trials, the next thing we are interested in is, differences of patients characteristics across trials.

xnbin and xunif look quite unbalanced across trials, on average, and in terms of variability as well. We employ the ATT technique described earlier, to derive weights in an attempt to reduce these differences and produce reliable results. The Zi values are considered here to be 0 for the AC trial, and 1 for the BC trial. The weights are subsequently derived as 1 for the BC trial, and pscore/1-pscore, where p.scores are the fitted values from the logistic model Z~effect.modifiers. These weights are stored in the data frame “data4apsw”. Let’s see how these EMs are balanced across trials after implementing the weights:

Reasonably balanced than before. The row for BC in the table is unchanged, because with the weights of 1, the BC observations are practically the same, the BC population remains the same, BC is our treated and target population. [pooling 1: att1psw]

Another less interesting scenario [pooling 2: att2psw], still applying the ATT, would be to weight the population within trials, A and B are the treated arms and C is the control arm. It will be less insightlful mostly because randomization looks perfect within the trials, so the groups are fairly balanced within the trials. Again, our concern is about reducing the impact of differences between the trials, in the indirect estimation of A vs B.

We compare these two pooling methods of estimating the A vs B effect, with the Bucher approach[pooling 3: bucher].

Analysis for Indirect Comparison:

The bucher A vs C estimates is simply derived from the cox ph model: coxph(Surv(fuTime, status) ~ trt, data = ACild). Similarly the bucher B vs C estimate is derived with the BCild. This results to the estimate of A vs B as A vs C estimate minus B vs C estimate as we explained earlier.

The A vs C estimates, based on the propensity score weighting by ATT, is derived from a weighted cox ph model: coxph(Surv(fuTime, status) ~ trt,
data = ACild, weights = weight, robust = TRUE)
. The B vs C is derived similarly and the resulting A vs B obtained as explained above. Weighting observations may violate some model assumptions and results in greater variance in the estimates, “robust = TRUE” is included here to fix this in order to get a robust variance of the estimates.

The three estimates, att1psw, att2psw and bucher, and their corresponding 95% confidence intervals (CIs), are summarized in a ggplot graph below, for each sample sizes.

The line of true effect is not seen because the true A vs B effect is -5. The estimates are all over-biased. The bucher and att2psw synchronize in each one of the scenario. Naturally, uncertainty around the estimates reduces as sample sizes increase.

The bias plot above, confirms how biased the estimates are. The zero horizontal line of no bias is failed to be seen. “att1psw” is way more biased. The next plot, tells the negative impact of these bias estimates, as all the impact bias values are above the 50% threshold.

The situation gets worst with sample size.

The empirical standard error(empse), measures the precision of an estimate and so it is comparable to the CIs-plot above. As the sample size increases, each one of the intervals shrinks towards the corresponding estimate. A low empse indicates high/good efficiency/precision of the estimate.

The root mean square error(rmse), offers a balance between bias and efficiency as both bias and variance define it. Low bias and low empse imply accurate estimate. rmse indicates particularly how consistent an estimator is, the lower the better as well.

All the three methods have a good power to produce statistically significant estimates. “att1psw” shows some exceptional results when n=90 and 180.

Discussion and conclusion

We did not carry out these simulations following some strict recommendations from the literature such as choosing the parameters in such a way that results can be generalized to any other population, or choosing covariates that are clinically valid in the real world setting. This article is simply like a tutorial and your questions and comments are welcome.

The loss of effective sample size (ESS =sum(weight)²/sum(weight²)) in the AC trial, when reweighting the population across trials (att1psw), may explain the greater difference in the estimates. The BC population remains unchanged in this case, in each of the scenario.

The ESS values are similar in both trials with the att2psw technique, this explains why its performance is similar to the (unweighted) bucher approach.

So results may be so different and biased because, we did not obtain good balance in reweigthing, or some covariates are actually non-linearly associated with the outcome, or the unbalance is due to some unmeasured factors, etc. We may also be wrong about the choice of the effect modifiers we included in our propensity score estimation. This kind of misspecification, may results in invalid and bias inferences. Appropriate variable selection is necessary. Doubly robust models may offer safer measures against this mispecification. Researchers also recommend the use of non parametric models such as the generalized boosted models, which offers more flexibility in estimating the necessary propensity score weights[10].

Propensity score matching approach, is only applied when individual level data are available, which is often not the case for several reasons, sometimes for ethical reasons. The MAIC is an extension to psw when ild data are available in some treatment comparisons, and other comparisons only have aggregated data.

Survival analysis is the oldest sub-field in statistics and as such, the programming software R has an extensive list of functions for its applications.
Censoring information is an important feature of survival data. Right-censoring is the most common type of censoring. Type 1 and type 2 censoring are particular types of right censoring. Left censoring is tricky to handle.

The cox ph is a richer framework for making inference about treatment and covariate effects, in contrast to the kaplan-maier analysis. When the baseline hazard is not specified, the cox ph model is referred to as semi parametric. In any case, the proportional hazard assumption must be tested. It is recommended to model the Hazard Ratio, in case it shows not to be constant over time. Cox ph model can support multivariate analysis and accommodate continuous and categorical variables simultaneously.

The Weibull Accelerated Failure Time (WAFT) regression, tends to be more
reliable, more valid, and offer interesting benefits since the distribution for the survival time is specified and the PH assumption is not required [5].

Overall, based on this base case analysis, we would recommend bucher and the att1psw approaches, as they provide better accuracy and their bias is less dangerous.

You can find formulas, for the performance measures employed in this article, and many more other relevant measures such as the power, the coverage probability, the monte carlo error estimation, in [6], or [7].


  1. Dias S, Ades AE, Welton NJ, Jansen JP, Sutton AJ. Network meta-analysis for decision-making. John Wiley & Sons . 2018.

2. Bucher HC, Guyatt GH, Griffith LE, Walter SD.: The results of direct and indirect treatment comparisons in meta-analysis of randomized controlled trials. Journal of Clinical Epidemiology 1997;50:683–91

3. Phillippo DM, Ades AE, Dias S, Palmer S, Abrams KR, Welton NJ. Methods for population-adjusted indirect comparisons in health technology appraisal. Medical Decision Making 2018; 38(2): 200–211

4. Bender, Augustin et al.: Generating Survival Times to Simulate Cox Proportional Hazards Models

5. Ali ZARE, Mostafa HOSSEINI et al.:A Comparison between Accelerated Failure-time and Cox Proportional Hazard Models in Analyzing the Survival of Gastric Cancer Patients. Iran J Public Health. 2015 Aug; 44(8): 1095–1102.

6. Tim P. Morris et al.: Using simulation studies to evaluate statistical methods. Statistics in medicine 2019; 10.1002

7. Analyses of simulation studies including Monte Carlo error

8. Rosenbaum PR, Rubin DB. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika. 1983;70:41–55.

9. Understanding propensity score weighting

10. A Practical Guide for Using Propensity Score Weighting in R

11. Generating survival times to simulate Cox proportional hazards models with time-varying covariates

Teaching/Research Assistant in Mathematics/(Bayesian) Statistics, Writer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store