Generalized Linear (Logistic) Model (glm) vs Generalized Boosted (Logistic) Model (gbm) in Estimating Propensity Scores: An Application to Indirect Comparison, in Comparison to the Naive Bucher Estimate

Laure Ngouanfo Fopa
Nov 2, 2020

[Under Review]

Introduction

Following what we did here, we apply one of the recommendations from that analysis: using a boosted logistic regression, as implemented in the generalized boosted modeling (gbm) package in R [7]. The goal is to obtain better propensity scores, and hence a fairer balance of pretreatment covariate distributions across the two trials. As previously set, the AC population is considered the control population and the BC population is considered the treated population.

We are interested in what the anchored A vs B effect would be if:

The AC population and the BC population were all part of the BC trial, versus the AC population and the BC population all being part of the AC trial? The A vs B estimand derived here is called the average treatment effect (ATE).

Only the BC population were part of the BC trial, versus the BC population being part of the AC trial? The A vs B estimand derived here is called the average treatment effect for the treated (ATT).

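Similar interpretations apply to the estimands ATC (average treatment effect for the controls), ATM (for the evenly matchable population) and ATO (for the overlap population). Their formulas are given below: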
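The formula figure is not reproduced in this text. As a hedged sketch, the balancing weights commonly attached to these estimands in the propensity score weighting literature can be written as follows, where e is the propensity score and z the trial indicator (1 = BC, 0 = AC). The function name ps_weights is hypothetical, and the expressions may differ in detail from the formulas originally shown.

```r
# Balancing weights commonly used for each estimand (illustrative sketch).
# e: estimated propensity score P(z = 1 | covariates)
# z: trial indicator, 1 = BC trial (treated), 0 = AC trial (control)
ps_weights <- function(e, z, estimand = c("ate", "att", "atc", "atm", "ato")) {
  estimand <- match.arg(estimand)
  switch(estimand,
    ate = z / e + (1 - z) / (1 - e),                     # whole population
    att = z + (1 - z) * e / (1 - e),                     # treated (BC) population
    atc = z * (1 - e) / e + (1 - z),                     # control (AC) population
    atm = pmin(e, 1 - e) / (z * e + (1 - z) * (1 - e)),  # matching weights
    ato = z * (1 - e) + (1 - z) * e                      # overlap weights
  )
}

# Toy values, for illustration only
e <- c(0.2, 0.5, 0.8)
z <- c(0, 1, 1)
sapply(c("ate", "att", "atc", "atm", "ato"), function(k) ps_weights(e, z, k))
```

Each column of the resulting matrix holds the weights of the three toy patients under one estimand, which makes it easy to see how the same score is used differently.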

Model coefficients are slightly modified from those we used previously. Survival data for the AC trial are generated following Bender et al., such that the linear part looks like this:

The same coefficients bi, and those involved in the interaction terms, are used to generate survival data in the BC trial.
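For readers unfamiliar with the Bender et al. approach, here is a minimal sketch of inverse-transform simulation of survival times, assuming a Weibull baseline hazard. The baseline parameters (lambda, nu), the coefficients and the function name sim_weibull_times are all hypothetical placeholders; they are not the values used in this analysis.

```r
# Inverse-transform simulation of survival times under a Weibull baseline
# hazard h0(t) = lambda * nu * t^(nu - 1): T = (-log(U) / (lambda * exp(lp)))^(1/nu).
# lambda, nu and every coefficient below are hypothetical placeholders.
sim_weibull_times <- function(lp, lambda = 0.1, nu = 1.5) {
  u <- runif(length(lp))                       # U ~ Uniform(0, 1)
  (-log(u) / (lambda * exp(lp)))^(1 / nu)
}

set.seed(123)
n     <- 500
trt   <- rbinom(n, 1, 0.5)                     # randomised treatment indicator
xnbin <- rnbinom(n, size = 5, mu = 4)          # negative binomial covariate
xunif <- runif(n)                              # uniform covariate

# Linear part: main effects plus a treatment-by-covariate interaction
lp   <- 0.5 * trt + 0.1 * xnbin + 0.3 * xunif + 0.2 * trt * xunif
time <- sim_weibull_times(lp)
summary(time)
```

A larger linear predictor means a higher hazard and therefore shorter simulated times; censoring would still have to be added on top of this.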

simple connected network of three treatments

In this way, results are more generalisable: the effect of A on the outcome, compared to C, is the same in the AC population as in the BC population (the true A vs C effect value is 3, as specified in the model). Additionally, this makes the true A vs B effect equal to CA minus CB = 3 - 3 = 0 this time.
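The subtraction above is the core of the Bucher anchored indirect comparison. A minimal sketch follows, assuming both relative effects are on the same (e.g. log hazard ratio) scale and come from independent trials; the function name bucher and the standard errors are hypothetical, chosen only for illustration.

```r
# Bucher anchored indirect comparison: A vs B through the common comparator C.
# Assumes independent trials, so the variances of the two effects add up.
bucher <- function(d_ac, se_ac, d_bc, se_bc, level = 0.95) {
  est <- d_ac - d_bc
  se  <- sqrt(se_ac^2 + se_bc^2)
  zq  <- qnorm(1 - (1 - level) / 2)
  c(estimate = est, se = se, lower = est - zq * se, upper = est + zq * se)
}

# True effects of 3 in both trials; the standard errors are made up
res <- bucher(d_ac = 3, se_ac = 0.2, d_bc = 3, se_bc = 0.25)
res
```

The point estimate is 0 here by construction; the interval is wider than either trial's own interval because the two variances are summed.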

Let’s review how unbalanced the covariate distributions are across the trials, xnbin and xunif in particular:

Only the xnbin and xunif variables are included in the propensity score models. There are different ways of estimating propensity scores in the literature, e.g. logistic regression and generalized boosted regression.
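One common numerical summary of such imbalance is the standardized mean difference (SMD) of each covariate between trials. The sketch below is an illustration only: the helper name smd and the simulated covariate values are assumptions, not the data behind the plots.

```r
# Standardized mean difference of covariate x between trials
# (z = 1 for BC, z = 0 for AC), using the pooled standard deviation.
smd <- function(x, z) {
  m1 <- mean(x[z == 1])
  m0 <- mean(x[z == 0])
  s  <- sqrt((var(x[z == 1]) + var(x[z == 0])) / 2)
  (m1 - m0) / s
}

set.seed(42)
z     <- rep(c(0, 1), each = 100)
xnbin <- c(rnbinom(100, size = 5, mu = 4), rnbinom(100, size = 5, mu = 6))
xunif <- c(runif(100, 0, 1), runif(100, 0.1, 1.1))
c(xnbin = smd(xnbin, z), xunif = smd(xunif, z))
```

Absolute values above roughly 0.1 are often read as a sign of meaningful imbalance, although that cut-off is a convention rather than a rule.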

a/ The logistic regression: glm(trial.ID~effect.modifiers, data = ., family = binomial)

b/ The generalized boosted regression: gbm(trial.ID~effect.modifiers, data = ., distribution = "bernoulli", n.trees=100, interaction.depth=4, train.fraction=.8, shrinkage=.0005)

Of the argument values passed to gbm() here, n.trees = 100 matches the package default, while interaction.depth, train.fraction and shrinkage differ from the defaults documented for recent gbm releases (1, 1.0 and 0.1 respectively). Kindly read [7] in the reference section for more details about the “gbm” package in R.

We employ these two propensity score estimation mechanisms and compare the results.
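As a hedged illustration of the two calls above, the sketch below fits both models on simulated data. The data-generating values are assumptions, and the gbm part runs only if the gbm package is installed. Rows are shuffled because train.fraction uses the first rows of the data for training. One thing worth checking is the spread of the predicted scores: with a shrinkage this small and only 100 trees, a boosted fit can stay very close to the overall trial proportion.

```r
# Simulated two-trial data (hypothetical values); trial.ID: 0 = AC, 1 = BC
set.seed(2020)
n_per_trial <- 150
dat <- data.frame(
  trial.ID = rep(c(0, 1), each = n_per_trial),
  xnbin    = c(rnbinom(n_per_trial, size = 5, mu = 4),
               rnbinom(n_per_trial, size = 5, mu = 6)),
  xunif    = c(runif(n_per_trial, 0, 1), runif(n_per_trial, 0.1, 1.1))
)
dat <- dat[sample(nrow(dat)), ]  # shuffle: train.fraction takes the first rows

# a/ Logistic regression propensity scores
fit_glm <- glm(trial.ID ~ xnbin + xunif, data = dat, family = binomial)
ps_glm  <- predict(fit_glm, type = "response")
print(sd(ps_glm))

# b/ Generalized boosted regression, with the argument values quoted above
if (requireNamespace("gbm", quietly = TRUE)) {
  fit_gbm <- gbm::gbm(trial.ID ~ xnbin + xunif, data = dat,
                      distribution = "bernoulli", n.trees = 100,
                      interaction.depth = 4, train.fraction = 0.8,
                      shrinkage = 0.0005)
  ps_gbm <- predict(fit_gbm, newdata = dat, n.trees = 100, type = "response")
  print(sd(ps_gbm))  # compare the spread with that of the glm scores
}
```

Comparing the two standard deviations gives a quick first impression of how much each model separates the trials before any weights are computed.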

Confidence intervals from logistic model vs gbm model

results from glm scores
results from gbm scores

Bias plots from logistic model vs gbm model

results from glm scores
results from gbm scores

Impact bias plots from logistic model vs gbm model

results from glm scores
results from gbm scores

The empirical standard error plots from logistic model vs gbm model

results from glm scores
results from gbm scores

The root mean squared error plots from logistic model vs gbm model

results from glm scores
results from gbm scores

The type 1 error plots from logistic model vs gbm model

results from glm scores
results from gbm scores

In summary, the two propensity score processes yield quite different results in estimating the effect size of A vs B. The gbm technique does not balance the factors “xnbin” and “xunif” across trials in this base case analysis; this is probably why the four estimands coincide with the Bucher estimate, which does not account for patient characteristics. Looking at the RMSE plot alone, which offers a balance between bias and efficiency, we can say that estimands based on gbm() perform very poorly. The type 1 error plots tell us that the estimands with the gbm scores are prone to finding a statistically significant difference between A and B when there is actually none, except when n=60, where the type 1 error is close to the nominal threshold of 5%.
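For reference, the performance measures discussed above can be computed from a set of simulation replicates roughly as follows. This is a sketch under assumptions: the function name perf, the Wald-type test and the simulated replicates are illustrative, not the code used to produce the plots.

```r
# Performance measures over simulation replicates.
# est: A vs B estimates, se: their standard errors, truth: true A vs B effect.
perf <- function(est, se, truth = 0, alpha = 0.05) {
  zq     <- qnorm(1 - alpha / 2)
  reject <- abs(est) / se > zq            # Wald test of H0: no A vs B difference
  c(bias   = mean(est) - truth,
    emp_se = sd(est),                     # empirical standard error
    rmse   = sqrt(mean((est - truth)^2)), # combines bias and variability
    type1  = mean(reject))                # a type 1 error rate only if truth = 0
}

# Hypothetical replicates centred on the true effect of 0
set.seed(7)
est <- rnorm(1000, mean = 0, sd = 0.3)
se  <- rep(0.3, 1000)
perf(est, se)
```

Since the squared RMSE equals the squared bias plus the variance of the estimates, an estimator can score badly on RMSE through either component.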

The effective sample sizes are at their maximum values in each trial under the gbm() process. This indicates that the gbm weights are nearly constant and therefore do not improve balance between trials: the distributions of the effect modifiers remain unchanged on average across trials.

Estimands from the logistic scores are quite unstable; this may be due to the small effective sample sizes yielded by this estimation process. The distribution of weights in particular should be examined for other issues, such as lack of population overlap and overly influential individuals[*].
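The effective sample size is usually computed as the squared sum of the weights divided by the sum of the squared weights. A minimal sketch, with a hypothetical helper name ess and made-up weights:

```r
# Effective sample size of a weighted sample: (sum of w)^2 / sum of w^2
ess <- function(w) sum(w)^2 / sum(w^2)

ess(rep(1, 50))           # equal weights: the ESS equals the sample size, 50
ess(c(10, rep(0.1, 49)))  # one dominant weight: the ESS drops sharply
```

The ESS reaches its maximum, the actual sample size, only when all weights are equal, which is why a maximal ESS signals that the weights are doing no rebalancing.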

Effective Sample Sizes under the Generalized Boosted Model

Conclusion

We recall that a propensity score is the probability that a patient would be assigned or exposed to an intervention, given a set of covariates. Rosenbaum & Rubin (1983) showed that knowing the propensity score is sufficient to separate the effect of a treatment on an outcome from observed confounding factors that influence both treatment assignment and outcomes, provided the necessary conditions hold.

The five estimands measure the treatment effect differently, though all rest on the same requirement: balancing out differences across trials[2] in our context of indirect comparison. Results based on gbm suggest that these weighting methods may be worthless and interchangeable, in contrast to what the results from the logistic propensity scores suggest.

That said, this base case analysis is too simple to support any firm conclusion; it is only a simple guide to carrying out such an analysis. There is indeed a lot of work, done and ongoing, on understanding the theoretical and statistical basis for these methods[1]. Nevertheless, results from the logistic propensity scores suggest that “att”, followed by “atm” and “ato”, compare favourably with “ate” and the Bucher estimates.

Other recommendations for achieving a good balance of measured and unmeasured factors include entropy balancing and generalized linear modeling with difference-in-differences regression estimation[6].

References

  1. Borah BJ, Moriarty JP, Crown WH, Doshi JA. Applications of propensity score methods in observational comparative effectiveness and safety research: where have we come and where should we go? J. Comp. Eff. Res. 3(1), 63–78 (2014).
  2. Glynn RJ, Schneeweiss S, Sturmer T. Indications for propensity scores and review of their use in pharmacoepidemiology. Basic Clin. Pharmacol. Toxicol. 98(3), 253–259 (2006).
  3. Elze MC, Gregson J, Baber U et al. Comparison of propensity score methods and covariate adjustment: evaluation in 4 cardiovascular studies. J. Am. Coll. Cardiol. 69(3), 345–357 (2017).
  4. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav. Res. 46(3), 399–424 (2011).
  5. Austin PC. Propensity-score matching in the cardiovascular surgery literature from 2004 to 2006: a systematic review and suggestions for improvement. J. Thoracic Cardiovasc. Surg. 134(5), 1128–1135 (2007).
  6. Huanxue Zhou et al. Difference-in-Differences Method in Comparative Effectiveness Research: Utility with Unbalanced Groups. 2016 Aug;14(4):419–429.
  7. Ridgeway, G. (2005). GBM 1.5 package manual. http://cran.r-project.org/doc/packages/gbm.pdf

Laure Ngouanfo Fopa

Teaching/Research Assistant in Mathematics/(Bayesian) Statistics, Writer