Note

Download Jupyter notebook:
https://docs.doubleml.org/stable/examples/py_double_ml_gate.ipynb.
Python: Group Average Treatment Effects (GATEs)
In this simple example, we illustrate how the DoubleML package can be used to estimate group average treatment effects.
Data
The data is generated with a simple data generating process, so that the true group effects are known.
[1]:
import numpy as np
import pandas as pd
import doubleml as dml
Here, we consider three different groups defined by the first covariate; these could correspond, e.g., to three different age groups in the sample. For simplicity, the treatment effect within each group is generated to be constant, so that it coincides with the group average treatment effect.
[2]:
def group_effect(x):
    if x[0] <= -0.7:
        te = 3
    elif (x[0] > -0.7) & (x[0] <= 0.2):
        te = 1
    else:
        te = 0
    return te
The data is generated as
\[
\begin{aligned}
Y_i &= g(W_i)T_i + \langle W_i,\gamma_0\rangle + \epsilon_i, \\
T_i &\sim \mathrm{Bernoulli}\big(\sigma(\langle W_i,\beta_0\rangle + \eta_i)\big),
\end{aligned}
\]

where \(W_i\sim\mathcal{N}(0,I_{d_w})\), \(\epsilon_i,\eta_i\sim\mathcal{U}[-1,1]\) and \(\sigma(\cdot)\) denotes the logistic function. The coefficient vectors \(\gamma_0\) and \(\beta_0\) both have small random support, whose values are drawn independently from \(\mathcal{U}[0,1]\). Further, \(g(\cdot)\) defines the conditional treatment effect, which depends only on the first covariate \(w_1\) as specified in group_effect above.
[3]:
def create_synthetic_group_data(n_samples=200, n_w=10, support_size=5):
    """
    Creates a simple synthetic example for group effects.

    Parameters
    ----------
    n_samples : int
        Number of samples.
        Default is ``200``.
    n_w : int
        Dimension of covariates.
        Default is ``10``.
    support_size : int
        Number of relevant covariates.
        Default is ``5``.

    Returns
    -------
    data : pd.DataFrame
        A data frame containing the outcome, the treatment and the covariates.
    covariates : list
        The names of the covariate columns.
    """
    # Outcome support
    support_w = np.random.choice(np.arange(n_w), size=support_size, replace=False)
    coefs_w = np.random.uniform(0, 1, size=support_size)
    # Define the function to generate the outcome noise
    epsilon_sample = lambda n: np.random.uniform(-1, 1, size=n)
    # Treatment support
    # Assuming the vectors gamma and beta have the same number of non-zero components
    support_t = np.random.choice(np.arange(n_w), size=support_size, replace=False)
    coefs_t = np.random.uniform(0, 1, size=support_size)
    # Define the function to generate the treatment noise
    eta_sample = lambda n: np.random.uniform(-1, 1, size=n)
    # Generate covariates
    w = np.random.normal(0, 1, size=(n_samples, n_w))
    # Group treatment effect
    te = np.apply_along_axis(group_effect, axis=1, arr=w)
    # Define the (binary) treatment
    log_odds = np.dot(w[:, support_t], coefs_t) + eta_sample(n_samples)
    t_sigmoid = 1 / (1 + np.exp(-log_odds))
    t = np.array([np.random.binomial(1, p) for p in t_sigmoid])
    # Define the outcome
    y = te * t + np.dot(w[:, support_w], coefs_w) + epsilon_sample(n_samples)
    # Now we build the dataset
    y_df = pd.DataFrame({'y': y})
    t_df = pd.DataFrame({'t': t})
    w_df = pd.DataFrame(data=w, index=np.arange(w.shape[0]),
                        columns=[f'w_{i}' for i in range(w.shape[1])])
    data = pd.concat([y_df, t_df, w_df], axis=1)
    covariates = list(w_df.columns.values)
    return data, covariates
We consider a fairly small number of covariates to ensure fast computation.
[4]:
# DGP constants
np.random.seed(42)
n_samples = 500
n_w = 10
support_size = 5
# Create data
data, covariates = create_synthetic_group_data(n_samples=n_samples, n_w=n_w, support_size=support_size)
data_dml_base = dml.DoubleMLData(data,
                                 y_col='y',
                                 d_cols='t',
                                 x_cols=covariates)
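Printing the data backend summarizes the assigned variable roles, which is a convenient way to verify the setup (an optional check, not part of the original notebook):

print(data_dml_base)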
Interactive Regression Model (IRM)
The first step is to fit a DoubleML IRM Model to the data.
[5]:
# First stage estimation
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
randomForest_reg = RandomForestRegressor(n_estimators=500)
randomForest_class = RandomForestClassifier(n_estimators=500)
np.random.seed(42)
dml_irm = dml.DoubleMLIRM(data_dml_base,
                          ml_g=randomForest_reg,
                          ml_m=randomForest_class,
                          trimming_threshold=0.01,
                          n_folds=5)
print("Training IRM Model")
dml_irm.fit(store_predictions=True)
Training IRM Model
[5]:
<doubleml.double_ml_irm.DoubleMLIRM at 0x7f45cab88340>
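Since the predictions were stored, we can take a quick look at the estimated propensity scores before moving on, e.g. to judge overlap relative to the trimming threshold of 0.01. A minimal sketch, assuming the cross-fitted nuisance predictions are exposed via the predictions attribute (a dictionary keyed by learner name with arrays of shape (n_obs, n_rep, n_treat)):

# Optional overlap check on the stored propensity score estimates.
m_hat = dml_irm.predictions['ml_m'][:, 0, 0]
print(f"Estimated propensity scores: min={m_hat.min():.3f}, max={m_hat.max():.3f}")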
Group Average Treatment Effects (GATEs)
Next, we can specify the groups as a DataFrame with boolean columns.
[6]:
groups = pd.DataFrame(np.column_stack([data['w_0'] <= -0.7,
                                       (data['w_0'] > -0.7) & (data['w_0'] <= 0.2),
                                       data['w_0'] > 0.2]),
                      columns=['Group 1', 'Group 2', 'Group 3'])
[7]:
groups.head()
[7]:
   Group 1  Group 2  Group 3
0    False     True    False
1    False     True    False
2     True    False    False
3     True    False    False
4     True    False    False
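In this example the three boolean columns partition the sample, i.e. every observation belongs to exactly one group. A quick sanity check (optional, not part of the original notebook):

# Every observation should fall into exactly one group, since the
# groups are defined by disjoint intervals of w_0.
assert (groups.sum(axis=1) == 1).all()
print(groups.sum(axis=0))  # group sizes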
To calculate GATEs, just call the gate() method and supply the DataFrame with the group definitions. Confidence intervals at a chosen significance level (default 0.95) are then obtained via the confint() method.
[8]:
gate = dml_irm.gate(groups=groups)
print(gate.confint(level=0.95))
2.5 % effect 97.5 %
Group 1 2.795964 3.126939 3.457914
Group 2 0.936490 1.234062 1.531634
Group 3 -0.168960 0.026815 0.222590
The confidence intervals above are pointwise. Uniformly valid, joint confidence intervals can be obtained by setting the option joint=True and providing a number of bootstrap repetitions n_rep_boot.
[9]:
ci = gate.confint(level=0.95, joint=True, n_rep_boot=1000)
print(ci)
2.5 % effect 97.5 %
Group 1 2.501505 3.126939 3.752373
Group 2 0.671748 1.234062 1.796376
Group 3 -0.343135 0.026815 0.396765
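As expected, the joint intervals are wider than the pointwise ones. This can be quantified directly (an optional sketch; the joint intervals vary slightly with the bootstrap draw):

# Ratio of joint to pointwise confidence interval widths per group.
ci_pointwise = gate.confint(level=0.95)
ci_joint = gate.confint(level=0.95, joint=True, n_rep_boot=1000)
width_ratio = ((ci_joint['97.5 %'] - ci_joint['2.5 %'])
               / (ci_pointwise['97.5 %'] - ci_pointwise['2.5 %']))
print(width_ratio)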
Finally, let us plot the estimates together with the true effect within each group.
[10]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 10., 7.5
errors = np.full((2, ci.shape[0]), np.nan)
errors[0, :] = ci['effect'] - ci['2.5 %']
errors[1, :] = ci['97.5 %'] - ci['effect']
plt.errorbar(ci.index, ci.effect, fmt='o', yerr=errors, label='Estimated Effect (with joint CI)')
#add true effect
ax = plt.subplot(1, 1, 1)
ax.scatter(x=['Group 1', 'Group 2', 'Group 3'], y=[3, 1, 0], c='red', label='True Effect')
plt.title('GATEs')
plt.xlabel('Groups')
plt.legend()
_ = plt.ylabel('Effect and 95%-CI')
[Figure: GATEs. Estimated group effects with joint 95% confidence intervals and the true effect per group.]
It is also possible to supply disjoint groups as a single column (still as a data frame). Note the slightly different group names in the resulting output (e.g. Group_1 instead of Group 1).
[11]:
groups = pd.DataFrame(columns=['Group'], index=range(data['w_0'].shape[0]), dtype=str)
for i, w_i in enumerate(data['w_0']):
    if w_i <= -0.7:
        groups.loc[i, 'Group'] = '1'
    elif (w_i > -0.7) & (w_i <= 0.2):
        groups.loc[i, 'Group'] = '2'
    else:
        groups.loc[i, 'Group'] = '3'
groups.head()
[11]:
  Group
0     2
1     2
2     1
3     1
4     1
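The loop above can also be written more compactly. A sketch of an equivalent construction with pd.cut, using the same cut points (the name groups_alt is ours, purely for illustration):

# Equivalent construction: bin w_0 at the group boundaries and label the bins.
groups_alt = pd.DataFrame({'Group': pd.cut(data['w_0'],
                                           bins=[-np.inf, -0.7, 0.2, np.inf],
                                           labels=['1', '2', '3']).astype(str)})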
[12]:
gate = dml_irm.gate(groups=groups)
ci = gate.confint()
print(ci)
2.5 % effect 97.5 %
Group_1 2.795964 3.126939 3.457914
Group_2 0.936490 1.234062 1.531634
Group_3 -0.168960 0.026815 0.222590
The coefficients of the best linear predictor can be seen via the summary (the underlying fitted model can be accessed through the .blp_model attribute).
[13]:
gate.summary
[13]:
             coef   std err          t         P>|t|    [0.025    0.975]
Group_1  3.126939  0.149758  20.879973  5.621183e-70  2.832703  3.421176
Group_2  1.234062  0.142020   8.689369  5.300034e-17  0.955029  1.513095
Group_3  0.026815  0.117683   0.227859  8.198497e-01 -0.204402  0.258032
Remark that these confidence intervals are slightly smaller, since they are not based on White's heteroskedasticity-robust standard errors.
[14]:
print(gate)
================== DoubleMLBLP Object ==================
------------------ Fit summary ------------------
coef std err t P>|t| [0.025 0.975]
Group_1 3.126939 0.149758 20.879973 5.621183e-70 2.832703 3.421176
Group_2 1.234062 0.142020 8.689369 5.300034e-17 0.955029 1.513095
Group_3 0.026815 0.117683 0.227859 8.198497e-01 -0.204402 0.258032
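For further diagnostics, the underlying fit can also be inspected directly; a minimal sketch, assuming gate.blp_model exposes the usual statsmodels results interface:

# Inspect the underlying best-linear-predictor fit directly.
print(gate.blp_model.summary())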
[15]:
errors = np.full((2, ci.shape[0]), np.nan)
errors[0, :] = ci['effect'] - ci['2.5 %']
errors[1, :] = ci['97.5 %'] - ci['effect']
#add true effect
ax = plt.subplot(1, 1, 1)
ax.scatter(x=['Group_1', 'Group_2', 'Group_3'], y=[3, 1, 0], c='red', label='True Effect')
plt.errorbar(ci.index, ci.effect, fmt='o', yerr=errors)
plt.title('GATEs')
plt.xlabel('Groups')
_ = plt.ylabel('Effect and 95%-CI')
[Figure: GATEs. Estimated group effects with pointwise 95% confidence intervals and the true effect per group.]