Python: Group Average Treatment Effects (GATEs)#

In this simple example, we illustrate how the DoubleML package can be used to estimate group average treatment effects.

Data#

The data is generated with a simple data generating process so that the true group effects are known.

[1]:
import numpy as np
import pandas as pd
import doubleml as dml

Here, we consider three different groups that depend on the first covariate; these could correspond, e.g., to three different age groups in the sample. For simplicity, the treatment effect within each group is generated to be constant, so that it coincides with the group average treatment effect.

[2]:
def group_effect(x):
    if x[0] <= -0.7:
        te = 3
    elif (x[0] > -0.7) & (x[0] <= 0.2):
        te = 1
    else:
        te = 0
    return te

The data is generated as

\[\begin{split}Y_i & = g(W_{i,1})T_i + \langle W_i,\gamma_0\rangle + \epsilon_i, \\ T_i & \sim \mathrm{Bernoulli}\big(\sigma(\langle W_i,\beta_0\rangle + \eta_i)\big),\end{split}\]

where \(W_i\sim\mathcal{N}(0,I_{d_w})\), \(\epsilon_i,\eta_i\sim\mathcal{U}[-1,1]\), and \(\sigma(\cdot)\) denotes the sigmoid function, so that the treatment \(T_i\) is binary. The coefficient vectors \(\gamma_0\) and \(\beta_0\) both have small random support, whose entries are drawn independently from \(\mathcal{U}[0,1]\). Further, \(g(w_1)\) defines the conditional treatment effect, which depends only on the first covariate \(w_1\):

\[\begin{split}g(w_1) = \begin{cases}3, & \text{for } w_1\le -0.7,\\ 1, & \text{for } -0.7 < w_1\le 0.2,\\ 0, & \text{for } 0.2 < w_1.\end{cases}\end{split}\]
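As a quick spot check, the group_effect helper defined above reproduces these three cases (a minimal sketch with arbitrarily chosen evaluation points):

# Spot-check the piecewise effect at one point per group
assert group_effect([-1.0]) == 3   # w_1 <= -0.7
assert group_effect([0.0]) == 1    # -0.7 < w_1 <= 0.2
assert group_effect([0.5]) == 0    # 0.2 < w_1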
[3]:
def create_synthetic_group_data(n_samples=200, n_w=10, support_size=5):
    """
    Creates a simple synthetic example for group effects.

    Parameters
    ----------
    n_samples : int
        Number of samples.
        Default is ``200``.

    n_w : int
        Dimension of covariates.
        Default is ``10``.

    support_size : int
        Number of relevant covariates.
        Default is ``5``.

    Returns
    -------
    data : pd.DataFrame
        A data frame containing the outcome, treatment and covariates.

    covariates : list
        Names of the covariate columns.

    """
    # Outcome support
    support_w = np.random.choice(np.arange(n_w), size=support_size, replace=False)
    coefs_w = np.random.uniform(0, 1, size=support_size)
    # Noise for the outcome equation
    epsilon_sample = lambda n: np.random.uniform(-1, 1, size=n)
    # Treatment support
    # Assuming the coefficient vectors gamma and beta have the same number of non-zero components
    support_t = np.random.choice(np.arange(n_w), size=support_size, replace=False)
    coefs_t = np.random.uniform(0, 1, size=support_size)
    # Noise for the treatment equation
    eta_sample = lambda n: np.random.uniform(-1, 1, size=n)

    # Generate controls, covariates, treatments and outcomes
    w = np.random.normal(0, 1, size=(n_samples, n_w))
    # Group treatment effect
    te = np.apply_along_axis(group_effect, axis=1, arr=w)
    # Define treatment
    log_odds = np.dot(w[:, support_t], coefs_t) + eta_sample(n_samples)
    t_sigmoid = 1 / (1 + np.exp(-log_odds))
    t = np.array([np.random.binomial(1, p) for p in t_sigmoid])
    # Define the outcome
    y = te * t + np.dot(w[:, support_w], coefs_w) + epsilon_sample(n_samples)

    # Now we build the dataset
    y_df = pd.DataFrame({'y': y})
    t_df = pd.DataFrame({'t': t})
    w_df = pd.DataFrame(data=w, index=np.arange(w.shape[0]), columns=[f'w_{i}' for i in range(w.shape[1])])

    data = pd.concat([y_df, t_df, w_df], axis=1)
    covariates = list(w_df.columns.values)

    return data, covariates

We consider a rather small number of covariates to ensure fast calculation.

[4]:
# DGP constants
np.random.seed(42)
n_samples = 500
n_w = 10
support_size = 5

# Create data
data, covariates = create_synthetic_group_data(n_samples=n_samples, n_w=n_w, support_size=support_size)
data_dml_base = dml.DoubleMLData(data,
                                 y_col='y',
                                 d_cols='t',
                                 x_cols=covariates)
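
Optionally, the data backend can be printed to verify how the columns were assigned (output omitted here):

# Inspect the DoubleMLData backend: outcome, treatment and covariate columns
print(data_dml_base)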

Interactive Regression Model (IRM)#

The first step is to fit a DoubleML IRM model to the data.

[5]:
# First stage estimation
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
randomForest_reg = RandomForestRegressor(n_estimators=500)
randomForest_class = RandomForestClassifier(n_estimators=500)

np.random.seed(42)

dml_irm = dml.DoubleMLIRM(data_dml_base,
                          ml_g=randomForest_reg,
                          ml_m=randomForest_class,
                          trimming_threshold=0.01,
                          n_folds=5)
print("Training IRM Model")
dml_irm.fit(store_predictions=True)
Training IRM Model
[5]:
<doubleml.double_ml_irm.DoubleMLIRM at 0x7f45cab88340>
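
Before turning to group-specific effects, the overall average treatment effect can be inspected via the model's summary attribute (a quick check; the exact numbers depend on the seed and package version):

# Overall ATE estimate with standard error and confidence interval
print(dml_irm.summary)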

Group Average Treatment Effects (GATEs)#

Next, we specify the groups as a DataFrame with boolean columns.

[6]:
groups = pd.DataFrame(np.column_stack([data['w_0'] <= -0.7,
                                       (data['w_0'] > -0.7) & (data['w_0'] <= 0.2),
                                       data['w_0'] > 0.2]),
                      columns=['Group 1', 'Group 2', 'Group 3'])
[7]:
groups.head()
[7]:
   Group 1  Group 2  Group 3
0    False     True    False
1    False     True    False
2     True    False    False
3     True    False    False
4     True    False    False
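
Since the columns are boolean indicators, the number of observations per group can be checked by summing each column (a small sketch):

# Count the observations falling into each group
print(groups.sum(axis=0))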

To calculate GATEs, call the gate() method and supply the DataFrame with the group definitions; confidence intervals at a given level (default 0.95) are then obtained via confint().

[8]:
gate = dml_irm.gate(groups=groups)
print(gate.confint(level=0.95))
            2.5 %    effect    97.5 %
Group 1  2.795964  3.126939  3.457914
Group 2  0.936490  1.234062  1.531634
Group 3 -0.168960  0.026815  0.222590

The confidence intervals above are pointwise. Joint confidence intervals, which hold simultaneously over all groups, can be obtained by setting the option joint=True and providing a number of bootstrap repetitions n_rep_boot.

[9]:
ci = gate.confint(level=0.95, joint=True, n_rep_boot=1000)
print(ci)
            2.5 %    effect    97.5 %
Group 1  2.501505  3.126939  3.752373
Group 2  0.671748  1.234062  1.796376
Group 3 -0.343135  0.026815  0.396765
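
As expected, the joint intervals are wider than the pointwise ones; a quick sketch to compare the widths:

# Ratio of joint to pointwise interval widths (values > 1 expected)
ci_pointwise = gate.confint(level=0.95)
print((ci['97.5 %'] - ci['2.5 %']) / (ci_pointwise['97.5 %'] - ci_pointwise['2.5 %']))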

Finally, let us plot the estimates together with the true effect within each group.

[10]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 10., 7.5

# Asymmetric error bars: distance from the point estimate to each CI bound
errors = np.full((2, ci.shape[0]), np.nan)
errors[0, :] = ci['effect'] - ci['2.5 %']
errors[1, :] = ci['97.5 %'] - ci['effect']

plt.errorbar(ci.index, ci.effect, fmt='o', yerr=errors, label='Estimated Effect (with joint CI)')

# Add the true effect within each group
ax = plt.subplot(1, 1, 1)
ax.scatter(x=['Group 1', 'Group 2', 'Group 3'], y=[3, 1, 0], c='red', label='True Effect')

plt.title('GATEs')
plt.xlabel('Groups')
plt.legend()
_ = plt.ylabel('Effect and 95%-CI')
../_images/examples_py_double_ml_gate_19_0.png

It is also possible to supply disjoint groups as a single column (still as a data frame). Note the slightly different group names in the output below.

[11]:
groups = pd.DataFrame(columns=['Group'], index=range(data['w_0'].shape[0]), dtype=str)
for i, w_i in enumerate(data['w_0']):
    if w_i <= -0.7:
        groups.loc[i, 'Group'] = '1'
    elif (w_i > -0.7) & (w_i <= 0.2):
        groups.loc[i, 'Group'] = '2'
    else:
        groups.loc[i, 'Group'] = '3'

groups.head()
[11]:
  Group
0     2
1     2
2     1
3     1
4     1
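
Equivalently, the group labels can be constructed in a vectorized way with pandas.cut (a compact sketch using the same cutoffs; the hypothetical name groups_alt is ours):

# Vectorized construction of the same labels via binning
groups_alt = pd.DataFrame({'Group': pd.cut(data['w_0'],
                                           bins=[-np.inf, -0.7, 0.2, np.inf],
                                           labels=['1', '2', '3']).astype(str)})
# Both constructions agree
assert (groups_alt['Group'] == groups['Group']).all()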
[12]:
gate = dml_irm.gate(groups=groups)
ci = gate.confint()
print(ci)
            2.5 %    effect    97.5 %
Group_1  2.795964  3.126939  3.457914
Group_2  0.936490  1.234062  1.531634
Group_3 -0.168960  0.026815  0.222590

The coefficients of the best linear predictor can be seen via the summary attribute; the underlying fitted model can be accessed through .blp_model (see the sketch after the summary output below).

[13]:
gate.summary
[13]:
             coef   std err          t         P>|t|    [0.025    0.975]
Group_1  3.126939  0.149758  20.879973  5.621183e-70  2.832703  3.421176
Group_2  1.234062  0.142020   8.689369  5.300034e-17  0.955029  1.513095
Group_3  0.026815  0.117683   0.227859  8.198497e-01 -0.204402  0.258032
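
Assuming that .blp_model exposes the underlying statsmodels OLS fit (an assumption about the library internals, which holds in recent DoubleML versions), the full regression output can be printed as follows (a sketch):

# Assumption: blp_model is a fitted statsmodels OLS results object
print(gate.blp_model.summary())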

Remark that these confidence intervals are slightly narrower, since they are not based on White's heteroskedasticity-robust standard errors.

[14]:
print(gate)
================== DoubleMLBLP Object ==================

------------------ Fit summary ------------------
             coef   std err          t         P>|t|    [0.025    0.975]
Group_1  3.126939  0.149758  20.879973  5.621183e-70  2.832703  3.421176
Group_2  1.234062  0.142020   8.689369  5.300034e-17  0.955029  1.513095
Group_3  0.026815  0.117683   0.227859  8.198497e-01 -0.204402  0.258032
[15]:
errors = np.full((2, ci.shape[0]), np.nan)
errors[0, :] = ci['effect'] - ci['2.5 %']
errors[1, :] = ci['97.5 %'] - ci['effect']

# Add the true effect within each group
ax = plt.subplot(1, 1, 1)
ax.scatter(x=['Group_1', 'Group_2', 'Group_3'], y=[3, 1, 0], c='red', label='True Effect')

plt.errorbar(ci.index, ci.effect, fmt='o', yerr=errors, label='Estimated Effect (with CI)')
plt.title('GATEs')
plt.xlabel('Groups')
plt.legend()
_ = plt.ylabel('Effect and 95%-CI')
../_images/examples_py_double_ml_gate_27_0.png