Python: PLM and IRM for Multiple Treatments#

In this example, we show the behavior of the PLM and IRM estimators when the treatment effect is heterogeneous across different treatments. We show that since the PLM estimates a convex-weighted average of treatment effects, it is not able to recover the true treatment effects when they are heterogeneous. On the other hand, the IRM estimator is able to recover the true treatment effects when they are heterogeneous. This is shown to matter substantially when ranking treatments by their estimated treatment effects.

We assume basic knowledge of the potential outcomes framework and the assumptions of the PLM and IRM estimators.

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from doubleml import DoubleMLData, DoubleMLPLR, DoubleMLIRM
from xgboost import XGBRegressor, XGBClassifier

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Ranking Treatment Effects under Treatment Propensity and Treatment Effect Heterogeneity#

We consider a setting with two treatments \(D_1\) and \(D_2\), a single binary covariate \(X\) (with \(p=0.5\)), and a continuous outcome \(Y\). The propensity scores for the two treatments are given by:

X  D

\(D_1\) = 1

\(D_2\) = 1

X = 0

0.01

0.5

X = 1

0.5

0.01

And the underlying heterogeneous treatment effects are given by:

X  \(\tau\)

\(\tau_1\)

\(\tau_2\)

X = 0

-3

-2

X = 1

3

3

ATE

0

0.5

The researcher seeks to rank the two treatments based on their average treatment effects. We compare the behavior of the PLM and IRM estimators in this setting.

We implement the DGP below.

[2]:

def dgp( n = 100_000, treat1_params = { "p_x0": 0.01, "p_x1": 0.50, "τ_x0": -3, "τ_x1": 3, }, treat2_params = { "p_x0": 0.5, "p_x1": 0.01, "τ_x0": -2, "τ_x1": 3, }, ): # utility functions to transform between the probability scale and the logit scale sigmoid = lambda x: 1 / (1 + np.exp(-x)) inv_sigmoid = lambda p: np.log(p / (1 - p)) df = pd.DataFrame() # noise terms eta1 = np.random.normal(0, 0.1, size=n) eta2 = np.random.normal(0, 0.1, size=n) eps = np.random.normal(0, 1, size=n) # binary covariate df["x"] = np.random.randint(2, size=n) # half and half df[["x0", "x1"]] = pd.get_dummies(df.x) # treatment 1 propensity score pscore1 = df.x0 * (inv_sigmoid(treat1_params["p_x0"]) + eta1) + df.x1 * ( inv_sigmoid(treat1_params["p_x1"]) + eta2 ) # treatment 2 propensity score pscore2= df.x0 * (inv_sigmoid(treat2_params["p_x0"]) + eta2) + df.x1 * ( inv_sigmoid(treat2_params["p_x1"]) + eta2 ) df['d1'] = np.random.binomial(1, sigmoid(pscore1)) df['d2'] = np.random.binomial(1, sigmoid(pscore2)) # outcome 1 df["y"] = ( 0 + 1 * df.x0 # arbitrary covariate coefs - 2 * df.x1 + df.x0 * df.d1 * (treat1_params["τ_x0"]) + df.x1 * df.d1 * (treat1_params["τ_x1"]) + df.x0 * df.d2 * (treat2_params["τ_x0"]) + df.x1 * df.d2 * (treat2_params["τ_x1"]) + eps ) return df.drop(["x0", "x1"], axis=1) df = dgp() df.head()
[2]:
x d1 d2 y
0 0 0 0 2.011310
1 1 1 0 0.154774
2 0 0 0 1.344913
3 1 1 0 0.010224
4 0 0 1 -0.208417
[3]:
print(pd.crosstab(df.x, df.d1, normalize='index'))
print(pd.crosstab(df.x, df.d2, normalize='index'))
d1         0         1
x
0   0.990322  0.009678
1   0.502160  0.497840
d2         0         1
x
0   0.502900  0.497100
1   0.990659  0.009341

The Partially Linear Model performs overlap weighting#

The partially linear model inherits the properties of linear regression under treatment effect heterogeneity (Angrist (1998), Angrist and Krueger(1999), Aronow and Samii (2016)). The probability limit of the treatment coefficient in the partially linear model \(Y_i = \tau D_i + g(X_i) + \epsilon_i\) is given by:

\[p\!\!-\!\!\lim \hat{\tau} = \mathbb{E} [\gamma(X) \tau(X)]\]

where

\[\gamma(X) = \frac{\mathbb{V}(D \mid X)}{\mathbb{E} [\mathbb{V}(D \mid X)]}\]

This simplifies to

\[\gamma(X) = \frac{p(X)(1-p(X))}{\mathbb{E}[p(X)(1-p(X))]}\]

when \(D\) is binary.

We have deliberately generated a dataset with varying propensity scores across treatments and covariates. Since we have a single binary covariate, we can analytically estimate the plim of the above estimator. The regression coefficients are equal to:

\[\begin{split}\begin{align*} \tilde{\tau}_1 & = \frac{-3 \cdot 0.01 \cdot 0.99 + 3 \cdot 0.5 \cdot 0.5}{ 0.01 \cdot 0.99 + 0.5 \cdot 0.5 } = 2.7714 \\ \tilde{\tau}_2 & = \frac{-2 \cdot 0.5 \cdot 0.5 + 3 \cdot 0.01 \cdot 0.99}{0.01 \cdot 0.99 + 0.5 \cdot 0.5} = -1.8095 \end{align*}\end{split}\]

So, we see that the PLM gets the ranking wrong; even though the treatment effect of \(D_1\) (0) is lower than that of \(D_2\) (0.5), the PLM estimates the treatment effect of \(D_1\) to be higher than that of \(D_2\) because it upweights the stratum with large positive effects when estimating the effect for \(D_1\) and vice versa for \(D_2\).

We verify this numerically with the DoubleML.

[4]:
df = dgp()
dml_data = DoubleMLData(df, 'y', ['d1', 'd2'], 'x')
plr_obj = DoubleMLPLR(dml_data,
                      ml_l =  XGBRegressor(),
                      ml_m = XGBClassifier(),
                      n_folds=5)
plr_obj.fit()
plr_est = plr_obj.summary
print(plr_est)
        coef   std err           t  P>|t|     2.5 %    97.5 %
d1  2.778485  0.013089  212.276716    0.0  2.752831  2.804139
d2 -1.807853  0.012012 -150.505729    0.0 -1.831396 -1.784310

The Augmented Inverse Propensity Weighting Model estimates the ATE under arbitrary effect and propensity score heterogeneity#

The AIPW estimator is given by:

\[\tau_{\text{AIPW}} = \frac{1}{n} \sum_i \left( \left[ \hat{g}(1, X_i) - \hat{g}(0, X_i) \right] + \frac{D_i(Y_i - \hat{g}(1, X_i))}{\hat{m}(X_i)} - \frac{(1-D_i)(Y_i - \hat{g}(0, X_i))}{1 - \hat{m}(X_i)} \right)\]

and estimates the ATE consistently under arbitrary treatment effect and propensity score heterogeneity, and therefore ranks the two treatments correctly (i.e. \(D_2\) has a higher treatment effect than \(D_1\)).

[5]:
df = dgp()
aipw_obj_1 = DoubleMLIRM(DoubleMLData(df, 'y', ['d1'], 'x'),
                      ml_g = XGBRegressor(),
                      ml_m = XGBClassifier(),
                      n_folds=5)
aipw_obj_1.fit()
aipw_obj_2 = DoubleMLIRM(DoubleMLData(df, 'y', ['d2'], 'x'),
                      ml_g = XGBRegressor(),
                      ml_m = XGBClassifier(),
                      n_folds=5)
aipw_obj_2.fit()
aipw_est_1, aipw_est_2 = aipw_obj_1.summary, aipw_obj_2.summary
print(aipw_est_1)
print(aipw_est_2)
        coef   std err         t     P>|t|     2.5 %    97.5 %
d1 -0.011431  0.032401 -0.352805  0.724235 -0.074935  0.052073
        coef   std err          t         P>|t|    2.5 %    97.5 %
d2  0.586475  0.040881  14.346043  1.127922e-46  0.50635  0.666599

Summary Figure#

We summarize the true treatment effects, the PLM estimates, and the IRM estimates in a figure.

[6]:

colors = ['#da1440', '#e45228', '#ec973f', '#a4a147', '#008e80', '#008dbd', '#f6ebce', '#838235', '#3f5d93', '#50768b', '#133f5a', '#afd9e4'] # Plot the data x = np.array([1, 2]) y = [(-3+3)/2, (-2+3)/2] fig, ax = plt.subplots(dpi = 100, figsize = (6, 4)) # plm plt.errorbar(x, plr_est.coef, yerr=1.96 * plr_est['std err'], markersize = 1, fmt='o', color=colors[0], ecolor=colors[0], capsize=5, label = 'PLM Estimate') plt.errorbar(x+.05, np.array([aipw_est_1.coef, aipw_est_2.coef]).flatten(), yerr = 1.96 * np.array([aipw_est_1.iloc[:,1].values[0], aipw_est_2.iloc[:,1].values[0]]), markersize = 3, fmt='o', color=colors[8], ecolor=colors[8], capsize=5, label = 'AIPW Estimate' ) # true values plt.scatter(x-.05, y, color = colors[2], label = 'ATE') plt.scatter(x-.1, [-3, -2], color = colors[5], label=r"CATE: $x=0$") plt.scatter(x-.12, [3, 3], color = colors[4], label=r"CATE: $x=1$") plt.legend(loc = 0) plt.title("PLMs do not estimate ATEs \n This is bad for ranking treatments") plt.axis((.5, 3, -3.5, 4)) ax.set_xticks([]) string_labels = ['Treatment 1', 'Treatment 2'] ax.set_xticks(x) ax.set_xticklabels(string_labels) ax.grid(True) ax.set_ylabel("Treatment Effect") plt.show()
../_images/examples_py_double_ml_plm_irm_hetfx_10_0.png

Additional Results: CATE estimates#

As an additional comparison, we can add the CATE estimates based on the previous PLR and IRM examples.

[7]:
# pandas data frame with one-hot encoded values for x
groups = pd.get_dummies(df.x)
[8]:
# Repeat estimation for each treatment var separately
# d1
dml_data = DoubleMLData(df, 'y', 'd1', ['d2', 'x'])
plr_obj_1 = DoubleMLPLR(dml_data,
                      ml_l =  XGBRegressor(),
                      ml_m = XGBClassifier(),
                      n_folds=5)
plr_obj_1.fit()
plr_est1 = plr_obj_1.summary
plr_obj_1.gate(groups=groups).summary.round(3)

[8]:
coef std err t P>|t| [0.025 0.975]
0 -3.072 0.044 -70.540 0.0 -3.157 -2.987
1 2.986 0.009 333.833 0.0 2.969 3.004
[9]:
# d2
dml_data = DoubleMLData(df, 'y', 'd2', ['d1', 'x'])
plr_obj_2 = DoubleMLPLR(dml_data,
                      ml_l =  XGBRegressor(),
                      ml_m = XGBClassifier(),
                      n_folds=5)
plr_obj_2.fit()
plr_est2 = plr_obj_2.summary
plr_obj_2.gate(groups=groups).summary.round(3)
[9]:
coef std err t P>|t| [0.025 0.975]
0 -1.994 0.009 -223.107 0.0 -2.011 -1.976
1 3.058 0.046 66.954 0.0 2.969 3.148
[10]:
# IRM models
# d1
aipw_obj_1.gate(groups=groups).summary.round(3)
[10]:
coef std err t P>|t| [0.025 0.975]
0 -3.015 0.061 -49.222 0.0 -3.135 -2.895
1 2.991 0.009 318.933 0.0 2.972 3.009
[11]:
# IRM models
# d2
aipw_obj_2.gate(groups=groups).summary.round(3)
[11]:
coef std err t P>|t| [0.025 0.975]
0 -1.990 0.009 -212.937 0.0 -2.009 -1.972
1 3.162 0.080 39.746 0.0 3.006 3.318

References#

  • Lal, A., Chou, W., & Schaefer, J. (2024). Using Double Machine Learning to Rank Treatments, Working Paper. Poster available at https://apoorvalal.github.io/files/slides/acic_2024_poster.pdf. Thread available at https://x.com/Apoorva__Lal/status/1798913180930109556.

  • Angrist, J. D. (1998). Estimating the labor market impact of voluntary military service using social security data on military applicants. Econometrica, 66(2), 249-288.

  • Angrist, J. D., & Krueger, A. B. (1999). Empirical strategies in labor economics. Handbook of labor economics, 3, 1277-1366.

  • Aronow, P. M., & Samii, C. (2016). Does regression produce representative estimates of causal effects?. American Journal of Political Science, 60(1), 250-267.