4. Heterogeneous treatment effects#

All implemented solutions focus on the IRM and IIVM models, since for the PLR and PLIV models heterogeneous treatment effects can usually be modelled via feature construction.

4.1. Group average treatment effects (GATEs)#

Group Average Treatment Effects (GATEs) consider the target parameters

\[\theta_{0,k} = \mathbb{E}[Y(1) - Y(0)| G_k],\quad k=1,\dots,K,\]

where \(G_k\) denotes a group indicator and \(Y(d)\) the potential outcome with \(d \in \{0, 1\}\).

The DoubleMLIRM class contains the gate() method, which enables the estimation and construction of confidence intervals for GATEs after fitting the DoubleMLIRM object. To estimate GATEs, the user has to specify a pandas DataFrame containing the groups (either dummy-coded or as a single column of strings). This will construct and fit a DoubleMLBLP object. Confidence intervals can then be constructed via the confint() method. Jointly valid confidence intervals are based on a Gaussian multiplier bootstrap.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import doubleml as dml

In [4]: from doubleml.datasets import make_irm_data

In [5]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [6]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [7]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [8]: np.random.seed(3333)

In [9]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In [10]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [11]: dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)

In [12]: _ = dml_irm_obj.fit()

# define groups
In [13]: np.random.seed(42)

In [14]: groups = pd.DataFrame(np.random.choice(3, 500), columns=['Group'], dtype=str)

In [15]: print(groups.head())
  Group
0     2
1     0
2     2
3     2
4     0

In [16]: gate_obj = dml_irm_obj.gate(groups=groups)

In [17]: ci = gate_obj.confint()

In [18]: print(ci)
            2.5 %    effect    97.5 %
Group_0  0.231499  0.645472  1.059445
Group_1 -0.625735  0.363644  1.353023
Group_2  0.362585  0.771560  1.180534

A more detailed notebook on GATEs is available in the example gallery.
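Conceptually, each GATE reported above is the average of a doubly robust (orthogonal) score within its group. The following numpy sketch illustrates only that final aggregation step on toy inputs; `gate_estimates` and the scores `psi` are hypothetical stand-ins, not part of the package API:

```python
import numpy as np

def gate_estimates(psi, groups):
    """Group means of per-observation treatment effect scores.

    psi    : array of doubly robust scores, one per observation
    groups : array of group labels of the same length
    """
    psi = np.asarray(psi, dtype=float)
    groups = np.asarray(groups)
    return {g: psi[groups == g].mean() for g in np.unique(groups)}

# toy scores: group 'a' centered at 1.0, group 'b' at 0.0
psi = np.array([1.2, 0.8, 1.0, -0.1, 0.1])
groups = np.array(['a', 'a', 'a', 'b', 'b'])
estimates = gate_estimates(psi, groups)
```

In the package itself, the scores come from the fitted DoubleMLIRM object and the projection is handled by DoubleMLBLP, which additionally provides valid standard errors.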

4.2. Conditional average treatment effects (CATEs)#

Conditional Average Treatment Effects (CATEs) consider the target parameters

\[\theta_{0}(x) = \mathbb{E}[Y(1) - Y(0)| X=x]\]

for a low-dimensional feature \(X\), where \(Y(d)\) denotes the potential outcome with \(d \in \{0, 1\}\).

The DoubleMLIRM class contains the cate() method, which enables the estimation and construction of confidence intervals for CATEs after fitting the DoubleMLIRM object. To estimate CATEs, the user has to specify a pandas DataFrame containing the basis (e.g. B-splines) for the conditional treatment effects. This will construct and fit a DoubleMLBLP object. Confidence intervals can then be constructed via the confint() method. Jointly valid confidence intervals are based on a Gaussian multiplier bootstrap.

In [19]: import numpy as np

In [20]: import pandas as pd

In [21]: import patsy

In [22]: import doubleml as dml

In [23]: from doubleml.datasets import make_irm_data

In [24]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [25]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [26]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [27]: np.random.seed(3333)

In [28]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In [29]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [30]: dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)

In [31]: _ = dml_irm_obj.fit()

# define a basis with respect to the first variable
In [32]: design_matrix = patsy.dmatrix("bs(x, df=5, degree=2)", {"x":obj_dml_data.data["X1"]})

In [33]: spline_basis = pd.DataFrame(design_matrix)

In [34]: print(spline_basis.head())
     0         1         2         3         4    5
0  1.0  0.000000  0.191397  0.782646  0.025958  0.0
1  1.0  0.342467  0.653991  0.000000  0.000000  0.0
2  1.0  0.460535  0.511022  0.000000  0.000000  0.0
3  1.0  0.000000  0.456552  0.543358  0.000091  0.0
4  1.0  0.046405  0.778852  0.174743  0.000000  0.0

In [35]: cate_obj = dml_irm_obj.cate(basis=spline_basis)

In [36]: ci = cate_obj.confint()

In [37]: print(ci.head())
      2.5 %    effect    97.5 %
0  0.253713  0.740842  1.227971
1 -1.093402 -0.115726  0.861950
2 -1.937456 -0.406043  1.125370
3  0.269661  0.690793  1.111925
4  0.000336  0.570962  1.141588

A more detailed notebook on CATEs is available in the example gallery. The examples also include the construction of a two-dimensional basis with B-splines.
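Roughly speaking, the CATE estimate is the best linear predictor of a doubly robust pseudo-outcome on the supplied basis. The numpy sketch below shows only that projection step on toy data; the pseudo-outcomes `psi` and the quadratic basis are hypothetical stand-ins for what the fitted model and the user-supplied basis would provide:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical doubly robust pseudo-outcomes with true CATE 0.5 + 1.5 x
x = rng.uniform(-1, 1, size=200)
psi = 0.5 + 1.5 * x + rng.normal(scale=0.1, size=200)

# user-supplied basis: intercept, linear and quadratic term in x
B = np.column_stack([np.ones_like(x), x, x**2])

# best linear predictor: least squares of the pseudo-outcome on the basis
beta, *_ = np.linalg.lstsq(B, psi, rcond=None)
cate_hat = B @ beta  # fitted CATE at the observed points
```

DoubleMLBLP performs this projection on the orthogonal signal and additionally delivers (jointly) valid confidence intervals.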

4.3. Quantiles#

The DoubleML package includes (local) quantile estimation for potential outcomes for IRM and IIVM models.

4.3.1. Potential quantiles (PQs)#

For a quantile \(\tau \in (0,1)\) the target parameters \(\theta_{\tau}(d)\) of interest are the potential quantiles (PQs),

\[P(Y(d) \le \theta_{\tau}(d)) = \tau,\]

and local potential quantiles (LPQs),

\[P(Y(d) \le \theta_{\tau}(d)|\text{Compliers}) = \tau,\]

where \(Y(d)\) denotes the potential outcome with \(d \in \{0, 1\}\).
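For intuition, the defining condition \(P(Y(d) \le \theta_{\tau}(d)) = \tau\) is the usual quantile condition, whose sample analogue is the empirical quantile. A quick numpy illustration on toy data (a stand-in for one potential outcome, unrelated to the estimators below):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100_000)      # stand-in for one potential outcome Y(d)

theta = np.quantile(y, 0.5)       # empirical median, i.e. tau = 0.5
coverage = np.mean(y <= theta)    # in-sample analogue of P(Y <= theta)
```

In practice \(Y(d)\) is only partially observed, which is why DoubleMLPQ and DoubleMLLPQ combine the quantile condition with ML-based nuisance estimation.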

DoubleMLPQ implements potential quantile estimation, where the argument treatment indicates the potential outcome. Estimation is conducted via its fit() method:

In [38]: import numpy as np

In [39]: import doubleml as dml

In [40]: from doubleml.datasets import make_irm_data

In [41]: from sklearn.ensemble import RandomForestClassifier

In [42]: np.random.seed(3141)

In [43]: ml_g = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)

In [44]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)

In [45]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In [46]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [47]: dml_pq_obj = dml.DoubleMLPQ(obj_dml_data, ml_g, ml_m, treatment=1, quantile=0.5)

In [48]: dml_pq_obj.fit().summary
Out[48]: 
       coef   std err         t     P>|t|     2.5 %    97.5 %
d  0.553878  0.149858  3.696011  0.000219  0.260161  0.847595

DoubleMLLPQ implements local potential quantile estimation, where the argument treatment indicates the potential outcome. Estimation is conducted via its fit() method:

In [49]: import numpy as np

In [50]: import doubleml as dml

In [51]: from doubleml.datasets import make_iivm_data

In [52]: from sklearn.ensemble import RandomForestClassifier

In [53]: np.random.seed(3141)

In [54]: ml_g = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)

In [55]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)

In [56]: data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, return_type='DataFrame')

In [57]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='z')

In [58]: dml_lpq_obj = dml.DoubleMLLPQ(obj_dml_data, ml_g, ml_m, treatment=1, quantile=0.5)

In [59]: dml_lpq_obj.fit().summary
Out[59]: 
       coef   std err       t     P>|t|     2.5 %    97.5 %
d  0.327341  0.548862  0.5964  0.550908 -0.748408  1.403091

4.3.2. Quantile treatment effects (QTEs)#

For a quantile \(\tau \in (0,1)\) the target parameter \(\theta_{\tau}\) of interest is the quantile treatment effect (QTE),

\[\theta_{\tau} = \theta_{\tau}(1) - \theta_{\tau}(0),\]

where \(\theta_{\tau}(d)\) denotes the corresponding potential quantile.

Analogously, the local quantile treatment effect (LQTE) can be defined as the difference of the corresponding local potential quantiles.
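On toy data where both potential outcome distributions are observed directly (which they are not in practice; that is precisely what the DML machinery is for), the sample analogue of the QTE is just a difference of empirical quantiles:

```python
import numpy as np

rng = np.random.default_rng(0)
y0 = rng.normal(loc=0.0, size=50_000)   # stand-in for Y(0)
y1 = rng.normal(loc=0.5, size=50_000)   # stand-in for Y(1): shifted by 0.5

tau = 0.5
# QTE at tau: difference of the two potential quantiles
qte = np.quantile(y1, tau) - np.quantile(y0, tau)  # close to the true shift 0.5
```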

DoubleMLQTE implements quantile treatment effect estimation. Estimation is conducted via its fit() method:

In [60]: import numpy as np

In [61]: import doubleml as dml

In [62]: from doubleml.datasets import make_irm_data

In [63]: from sklearn.ensemble import RandomForestClassifier

In [64]: np.random.seed(3141)

In [65]: ml_g = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)

In [66]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)

In [67]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In [68]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [69]: dml_qte_obj = dml.DoubleMLQTE(obj_dml_data, ml_g, ml_m, score='PQ', quantiles=[0.25, 0.5, 0.75])

In [70]: dml_qte_obj.fit().summary
Out[70]: 
          coef   std err         t     P>|t|     2.5 %    97.5 %
0.25  0.274825  0.347310  0.791297  0.428771 -0.405890  0.955541
0.50  0.449150  0.192539  2.332782  0.019660  0.071782  0.826519
0.75  0.709606  0.193308  3.670867  0.000242  0.330731  1.088482

To estimate local quantile treatment effects, the score argument has to be set to 'LPQ'. A detailed notebook on PQs and QTEs is available in the example gallery.

4.4. Conditional value at risk (CVaR)#

The DoubleML package includes conditional value at risk estimation for IRM models.

4.4.1. CVaR of potential outcomes#

For a quantile \(\tau \in (0,1)\) the target parameters \(\theta_{\tau}(d)\) of interest are the conditional values at risk (CVaRs) of the potential outcomes,

\[\theta_{\tau}(d) = \frac{\mathbb{E}[Y(d) 1\{F_{Y(d)}(Y(d)) \ge \tau\}]}{1-\tau},\]

where \(Y(d)\) denotes the potential outcome with \(d \in \{0, 1\}\) and \(F_{Y(d)}(x)\) the corresponding cdf of \(Y(d)\).
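A plug-in sample analogue of this definition replaces \(F_{Y(d)}\) by the empirical cdf and averages the upper \(\tau\)-tail. A minimal numpy sketch on toy data (`empirical_cvar` is a hypothetical helper, not part of the package):

```python
import numpy as np

def empirical_cvar(y, tau):
    # plug-in estimate of E[Y 1{F_Y(Y) >= tau}] / (1 - tau)
    y = np.sort(np.asarray(y, dtype=float))
    n = y.size
    F = np.arange(1, n + 1) / n       # empirical cdf at the order statistics
    return y[F >= tau].sum() / (n * (1 - tau))

y = np.arange(100)                    # toy outcomes 0, 1, ..., 99
cvar = empirical_cvar(y, 0.8)         # average of roughly the top 20% of outcomes
```

DoubleMLCVAR targets the same functional for a partially observed potential outcome, using ML nuisance estimates instead of the raw empirical cdf.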

DoubleMLCVAR implements conditional value at risk estimation for potential outcomes, where the argument treatment indicates the potential outcome. Estimation is conducted via its fit() method:

In [71]: import numpy as np

In [72]: import doubleml as dml

In [73]: from doubleml.datasets import make_irm_data

In [74]: from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

In [75]: np.random.seed(3141)

In [76]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)

In [77]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)

In [78]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In [79]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [80]: dml_cvar_obj = dml.DoubleMLCVAR(obj_dml_data, ml_g, ml_m, treatment=1, quantile=0.5)

In [81]: dml_cvar_obj.fit().summary
Out[81]: 
       coef   std err          t         P>|t|     2.5 %    97.5 %
d  1.585192  0.096897  16.359623  3.714182e-60  1.395278  1.775106

4.4.2. CVaR treatment effects#

For a quantile \(\tau \in (0,1)\) the target parameter \(\theta_{\tau}\) of interest is the treatment effect on the conditional value at risk,

\[\theta_{\tau} = \theta_{\tau}(1) - \theta_{\tau}(0),\]

where \(\theta_{\tau}(d)\) denotes the corresponding conditional values at risk of the potential outcomes.
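As with QTEs, the sample analogue on toy data where both potential outcomes are directly observed (unlike in practice) is a difference of two upper-tail means:

```python
import numpy as np

rng = np.random.default_rng(0)
y0 = rng.normal(loc=0.0, size=50_000)   # stand-in for Y(0)
y1 = rng.normal(loc=0.5, size=50_000)   # stand-in for Y(1): shifted by 0.5

tau = 0.75
# plug-in CVaR of each potential outcome: mean of the upper tau-tail
cvar1 = y1[y1 >= np.quantile(y1, tau)].mean()
cvar0 = y0[y0 >= np.quantile(y0, tau)].mean()
cvar_te = cvar1 - cvar0                 # close to the true shift 0.5
```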

DoubleMLQTE implements CVaR treatment effect estimation if the score argument is set to 'CVaR' (the default is 'PQ'). Estimation is conducted via its fit() method:

In [82]: import numpy as np

In [83]: import doubleml as dml

In [84]: from doubleml.datasets import make_irm_data

In [85]: from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

In [86]: np.random.seed(3141)

In [87]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)

In [88]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)

In [89]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In [90]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [91]: dml_cvar_obj = dml.DoubleMLQTE(obj_dml_data, ml_g, ml_m, score='CVaR', quantiles=[0.25, 0.5, 0.75])

In [92]: dml_cvar_obj.fit().summary
Out[92]: 
          coef   std err         t         P>|t|     2.5 %    97.5 %
0.25  0.473467  0.244647  1.935309  5.295234e-02 -0.006032  0.952966
0.50  0.694429  0.142934  4.858391  1.183434e-06  0.414284  0.974575
0.75  1.001638  0.165765  6.042534  1.517128e-09  0.676745  1.326530

A detailed notebook on CVaR estimation for potential outcomes and treatment effects is available in the example gallery.