4. Heterogeneous treatment effects#
All implemented solutions focus on the IRM or IIVM models, since for the PLR and PLIV models heterogeneous treatment effects can usually be modelled via feature construction.
4.1. Group average treatment effects (GATEs)#
Group Average Treatment Effects (GATEs) consider the target parameters

\[\theta_{0,k} = \mathbb{E}[Y(1) - Y(0) \mid G_k], \quad k = 1, \dots, K,\]

where \(G_k\) denotes a group indicator and \(Y(d)\) the potential outcome with \(d \in \{0, 1\}\).
The DoubleMLIRM class contains the gate() method, which enables the estimation and construction of confidence intervals for GATEs after fitting the DoubleMLIRM object. To estimate GATEs, the user has to specify a pandas DataFrame containing the groups (dummy-coded or as one column of strings). This constructs and fits a DoubleMLBLP object. Confidence intervals can then be constructed via the confint() method. Jointly valid confidence intervals are based on a Gaussian multiplier bootstrap.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import doubleml as dml
In [4]: from doubleml.datasets import make_irm_data
In [5]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [6]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [7]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [8]: np.random.seed(3333)
In [9]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
In [10]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [11]: dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
In [12]: _ = dml_irm_obj.fit()
# define groups
In [13]: np.random.seed(42)
In [14]: groups = pd.DataFrame(np.random.choice(3, 500), columns=['Group'], dtype=str)
In [15]: print(groups.head())
  Group
0     2
1     0
2     2
3     2
4     0
In [16]: gate_obj = dml_irm_obj.gate(groups=groups)
In [17]: ci = gate_obj.confint()
In [18]: print(ci)
            2.5 %    effect    97.5 %
Group_0  0.232827  0.646743  1.060660
Group_1 -0.628611  0.361209  1.351030
Group_2  0.364709  0.772396  1.180083
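Jointly valid confidence intervals across all groups can be requested through the joint argument of confint(). A minimal sketch, reusing the objects from above (the number of bootstrap repetitions is an illustrative choice):

# jointly valid confidence intervals via Gaussian multiplier bootstrap
ci_joint = gate_obj.confint(level=0.95, joint=True, n_rep_boot=1000)
print(ci_joint)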
A more detailed notebook on GATEs is available in the example gallery.
4.2. Conditional average treatment effects (CATEs)#
Conditional Average Treatment Effects (CATEs) consider the target parameters

\[\theta_0(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]\]

for a low-dimensional feature \(X\), where \(Y(d)\) denotes the potential outcome with \(d \in \{0, 1\}\).
The DoubleMLIRM class contains the cate() method, which enables the estimation and construction of confidence intervals for CATEs after fitting the DoubleMLIRM object. To estimate CATEs, the user has to specify a pandas DataFrame containing the basis (e.g. B-splines) for the conditional treatment effects. This constructs and fits a DoubleMLBLP object. Confidence intervals can then be constructed via the confint() method. Jointly valid confidence intervals are based on a Gaussian multiplier bootstrap.
In [19]: import numpy as np
In [20]: import pandas as pd
In [21]: import patsy
In [22]: import doubleml as dml
In [23]: from doubleml.datasets import make_irm_data
In [24]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [25]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [26]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [27]: np.random.seed(3333)
In [28]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
In [29]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [30]: dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
In [31]: _ = dml_irm_obj.fit()
# define a basis with respect to the first variable
In [32]: design_matrix = patsy.dmatrix("bs(x, df=5, degree=2)", {"x":obj_dml_data.data["X1"]})
In [33]: spline_basis = pd.DataFrame(design_matrix)
In [34]: print(spline_basis.head())
     0         1         2         3         4    5
0  1.0  0.000000  0.191397  0.782646  0.025958  0.0
1  1.0  0.342467  0.653991  0.000000  0.000000  0.0
2  1.0  0.460535  0.511022  0.000000  0.000000  0.0
3  1.0  0.000000  0.456552  0.543358  0.000091  0.0
4  1.0  0.046405  0.778852  0.174743  0.000000  0.0
In [35]: cate_obj = dml_irm_obj.cate(basis=spline_basis)
In [36]: ci = cate_obj.confint()
In [37]: print(ci.head())
      2.5 %    effect    97.5 %
0  0.253947  0.741145  1.228343
1 -1.091937 -0.114516  0.862905
2 -1.936616 -0.405075  1.126466
3  0.270069  0.691246  1.112424
4  0.001276  0.571684  1.142092
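To trace the CATE along X1, the fitted DoubleMLBLP object can be evaluated on a basis built from a grid of values. A sketch under the assumption that confint() accepts a basis argument for evaluation at new points (grid size and bootstrap repetitions are illustrative):

# build the same spline basis on a grid of X1 values
grid = {"x": np.linspace(data["X1"].min(), data["X1"].max(), 100)}
grid_basis = pd.DataFrame(patsy.build_design_matrices([design_matrix.design_info], grid)[0])
ci_grid = cate_obj.confint(basis=grid_basis, joint=True, n_rep_boot=1000)
print(ci_grid.head())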
A more detailed notebook on CATEs is available in the example gallery. The examples also include the construction of a two-dimensional basis with B-splines.
4.3. Quantiles#
The DoubleML package includes (local) quantile estimation for potential outcomes for IRM and IIVM models.
4.3.1. Potential quantiles (PQs)#
For a quantile \(\tau \in (0,1)\) the target parameters \(\theta_{\tau}(d)\) of interest are the potential quantiles (PQs),

\[P(Y(d) \le \theta_{\tau}(d)) = \tau,\]

and local potential quantiles (LPQs),

\[P(Y(d) \le \theta_{\tau}(d) \mid \text{compliers}) = \tau,\]

where \(Y(d)\) denotes the potential outcome with \(d \in \{0, 1\}\).
DoubleMLPQ implements potential quantile estimation. Estimation is conducted via its fit() method:
In [38]: import numpy as np
In [39]: import doubleml as dml
In [40]: from doubleml.datasets import make_irm_data
In [41]: from sklearn.ensemble import RandomForestClassifier
In [42]: np.random.seed(3141)
In [43]: ml_g = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
In [44]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
In [45]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
In [46]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [47]: dml_pq_obj = dml.DoubleMLPQ(obj_dml_data, ml_g, ml_m, treatment=1, quantile=0.5)
In [48]: dml_pq_obj.fit().summary
Out[48]:
       coef   std err         t     P>|t|     2.5 %    97.5 %
d  0.553878  0.149858  3.696011  0.000219  0.260161  0.847595
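The quantile of the untreated potential outcome can be estimated analogously by setting treatment=0; the difference of the two potential quantiles then yields the QTE of Section 4.3.2. A minimal sketch reusing the data and learners from above:

# median of the untreated potential outcome Y(0)
dml_pq_obj0 = dml.DoubleMLPQ(obj_dml_data, ml_g, ml_m, treatment=0, quantile=0.5)
dml_pq_obj0.fit()
print(dml_pq_obj0.summary)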
DoubleMLLPQ implements local potential quantile estimation, where the argument treatment indicates the potential outcome. Estimation is conducted via its fit() method:
In [49]: import numpy as np
In [50]: import doubleml as dml
In [51]: from doubleml.datasets import make_iivm_data
In [52]: from sklearn.ensemble import RandomForestClassifier
In [53]: np.random.seed(3141)
In [54]: ml_g = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
In [55]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
In [56]: data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, return_type='DataFrame')
In [57]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='z')
In [58]: dml_lpq_obj = dml.DoubleMLLPQ(obj_dml_data, ml_g, ml_m, treatment=1, quantile=0.5)
In [59]: dml_lpq_obj.fit().summary
Out[59]:
       coef   std err         t    P>|t|    2.5 %    97.5 %
d  0.217244  0.636453  0.341336  0.73285 -1.03018  1.464668
4.3.2. Quantile treatment effects (QTEs)#
For a quantile \(\tau \in (0,1)\) the target parameter \(\theta_{\tau}\) of interest is the quantile treatment effect (QTE),

\[\theta_{\tau} = \theta_{\tau}(1) - \theta_{\tau}(0),\]

where \(\theta_{\tau}(d)\) denotes the corresponding potential quantile.
Analogously, the local quantile treatment effect (LQTE) can be defined as the difference of the corresponding local potential quantiles.
DoubleMLQTE implements quantile treatment effect estimation. Estimation is conducted via its fit() method:
In [60]: import numpy as np
In [61]: import doubleml as dml
In [62]: from doubleml.datasets import make_irm_data
In [63]: from sklearn.ensemble import RandomForestClassifier
In [64]: np.random.seed(3141)
In [65]: ml_g = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
In [66]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
In [67]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
In [68]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [69]: dml_qte_obj = dml.DoubleMLQTE(obj_dml_data, ml_g, ml_m, score='PQ', quantiles=[0.25, 0.5, 0.75])
In [70]: dml_qte_obj.fit().summary
Out[70]:
          coef   std err         t     P>|t|     2.5 %    97.5 %
0.25  0.274825  0.347310  0.791297  0.428771 -0.405890  0.955541
0.50  0.449150  0.192539  2.332782  0.019660  0.071782  0.826519
0.75  0.709606  0.193308  3.670867  0.000242  0.330731  1.088482
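Jointly valid confidence intervals across the three quantiles can be obtained with a multiplier bootstrap. A sketch, assuming the bootstrap() and confint() methods of DoubleMLQTE behave as for the other DoubleML classes:

# multiplier bootstrap for joint inference across quantiles
dml_qte_obj.bootstrap(n_rep_boot=1000)
ci_joint = dml_qte_obj.confint(level=0.95, joint=True)
print(ci_joint)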
To estimate local quantile treatment effects, the score argument has to be set to 'LPQ'.
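For illustration, a minimal sketch of LQTE estimation on IIVM example data with instrument z, reusing the classification learners from above:

# local quantile treatment effects for compliers (requires an instrument)
from doubleml.datasets import make_iivm_data
np.random.seed(3141)
data_iv = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, return_type='DataFrame')
obj_dml_data_iv = dml.DoubleMLData(data_iv, 'y', 'd', z_cols='z')
dml_lqte_obj = dml.DoubleMLQTE(obj_dml_data_iv, ml_g, ml_m, score='LPQ', quantiles=[0.25, 0.5, 0.75])
dml_lqte_obj.fit()
print(dml_lqte_obj.summary)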
A detailed notebook on PQs and QTEs is available in the example gallery.
4.4. Conditional value at risk (CVaR)#
The DoubleML package includes conditional value at risk estimation for IRM models.
4.4.1. CVaR of potential outcomes#
For a quantile \(\tau \in (0,1)\) the target parameters \(\theta_{\tau}(d)\) of interest are the conditional values at risk (CVaRs) of the potential outcomes,

\[\theta_{\tau}(d) = \frac{1}{1-\tau} \int_{\tau}^{1} F_{Y(d)}^{-1}(u)\,du,\]

where \(Y(d)\) denotes the potential outcome with \(d \in \{0, 1\}\) and \(F_{Y(d)}(x)\) the corresponding cdf of \(Y(d)\).
DoubleMLCVAR implements conditional value at risk estimation for potential outcomes, where the argument treatment indicates the potential outcome. Estimation is conducted via its fit() method:
In [71]: import numpy as np
In [72]: import doubleml as dml
In [73]: from doubleml.datasets import make_irm_data
In [74]: from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
In [75]: np.random.seed(3141)
In [76]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
In [77]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
In [78]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
In [79]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [80]: dml_cvar_obj = dml.DoubleMLCVAR(obj_dml_data, ml_g, ml_m, treatment=1, quantile=0.5)
In [81]: dml_cvar_obj.fit().summary
Out[81]:
       coef   std err          t         P>|t|     2.5 %   97.5 %
d  1.587273  0.097174  16.334375  5.620494e-60  1.396816  1.77773
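Setting treatment=0 analogously targets the CVaR of the untreated potential outcome. A minimal sketch reusing the objects from above:

# CVaR of the untreated potential outcome Y(0)
dml_cvar_obj0 = dml.DoubleMLCVAR(obj_dml_data, ml_g, ml_m, treatment=0, quantile=0.5)
dml_cvar_obj0.fit()
print(dml_cvar_obj0.summary)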
4.4.2. CVaR treatment effects#
For a quantile \(\tau \in (0,1)\) the target parameter \(\theta_{\tau}\) of interest is the treatment effect on the conditional value at risk,

\[\theta_{\tau} = \theta_{\tau}(1) - \theta_{\tau}(0),\]

where \(\theta_{\tau}(d)\) denotes the corresponding conditional value at risk of the potential outcome \(Y(d)\).
DoubleMLQTE implements CVaR treatment effect estimation if the score argument is set to 'CVaR' (default is 'PQ'). Estimation is conducted via its fit() method:
In [82]: import numpy as np
In [83]: import doubleml as dml
In [84]: from doubleml.datasets import make_irm_data
In [85]: from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
In [86]: np.random.seed(3141)
In [87]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
In [88]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
In [89]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
In [90]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [91]: dml_cvar_obj = dml.DoubleMLQTE(obj_dml_data, ml_g, ml_m, score='CVaR', quantiles=[0.25, 0.5, 0.75])
In [92]: dml_cvar_obj.fit().summary
Out[92]:
          coef   std err         t         P>|t|     2.5 %    97.5 %
0.25  0.468496  0.248803  1.883002  5.970008e-02 -0.019148  0.956140
0.50  0.688914  0.143697  4.794224  1.633053e-06  0.407274  0.970554
0.75  0.994769  0.165652  6.005186  1.911132e-09  0.670097  1.319440
A detailed notebook on CVaR estimation for potential outcomes and treatment effects is available in the example gallery.
4.5. Policy Learning with Trees#
Policy learning aims at finding an optimal decision policy. We consider deterministic binary treatment policies, which are defined as mappings

\[\pi: X \mapsto \{0,1\}.\]
Using the score component \(\psi_b(W_i,\hat{\eta})\) of the IRM score, we can find the optimal treatment policy by solving the weighted classification problem

\[\hat{\pi} = \mathop{\arg\max}_{\pi \in \Pi} \frac{1}{n} \sum_{i=1}^n \big(2\pi(X_i) - 1\big) \hat{\psi}_b(W_i, \hat{\eta}),\]

where \(\Pi\) denotes a policy class, which we define as depth-\(m\) classification trees. Thus, we estimate splits in the features \(X\) that reflect the heterogeneity of the treatment effect and consequently maximize the sum of the estimated individual treatment effects of all individuals by assigning different treatments.
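Conceptually, this argmax reduces to weighted classification: each observation receives the label \(\mathbb{1}\{\hat{\psi}_b > 0\}\) and the weight \(|\hat{\psi}_b|\). A minimal sketch with a scikit-learn tree, assuming a fitted DoubleMLIRM object as in the examples above; the psi_elements access path is an assumption for illustration:

from sklearn.tree import DecisionTreeClassifier

# orthogonal signal for the individual treatment effect
# (the access path below is assumed for illustration)
psi_b = np.squeeze(dml_irm_obj.psi_elements['psi_b'])

# reduction to weighted classification:
# label = sign of the signal, weight = its absolute value
labels = (psi_b > 0).astype(int)
weights = np.abs(psi_b)

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(data[["X1", "X2", "X3"]], labels, sample_weight=weights)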
The DoubleMLIRM class contains the policy_tree() method, which enables the estimation of a policy tree using weighted classification after fitting the DoubleMLIRM object. To estimate a policy tree, the user has to specify a pandas DataFrame containing the covariates based on which the policy will make treatment decisions. These can be the original covariates used in the DoubleMLIRM estimation, a subset of them, or new covariates. This constructs and fits a DoubleMLPolicyTree object. A plot of the decision rules can be displayed via the plot_tree() method. The predict() method enables the application of the estimated policy to new data. The depth parameter, which defaults to 2, can be used to adjust the maximum depth of the tree.
In [93]: import numpy as np
In [94]: import pandas as pd
In [95]: import doubleml as dml
In [96]: from doubleml.datasets import make_irm_data
In [97]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [98]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [99]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [100]: np.random.seed(3333)
In [101]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
In [102]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [103]: dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
In [104]: _ = dml_irm_obj.fit()
# define features to learn policy on
In [105]: np.random.seed(42)
In [106]: features = data[["X1","X2","X3"]]
In [107]: print(features.head())
         X1        X2        X3
0  0.305133  0.497298 -0.811398
1 -0.696770 -1.717860 -2.030087
2 -0.903135  0.174940  0.185585
3  0.095475 -0.653820 -1.800272
4 -0.198953  0.203893  0.204653
# fits a tree of depth 2
In [108]: policy_tree_obj = dml_irm_obj.policy_tree(features=features)
In [109]: policy_tree_obj.plot_tree();
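A sketch of applying the estimated policy via predict() and of growing a deeper tree through the depth parameter; reusing the training features as "new" data is purely illustrative:

# treatment recommendations for the given covariates
decisions = policy_tree_obj.predict(features=features)

# a policy tree with up to three splits along each path
deep_tree_obj = dml_irm_obj.policy_tree(features=features, depth=3)
deep_tree_obj.plot_tree();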
A more detailed notebook on Policy Trees is available in the example gallery.