doubleml.DoubleMLIRM
- class doubleml.DoubleMLIRM(obj_dml_data, ml_g, ml_m, n_folds=5, n_rep=1, score='ATE', dml_procedure='dml2', trimming_rule='truncate', trimming_threshold=1e-12, draw_sample_splitting=True, apply_cross_fitting=True)
Double machine learning for interactive regression models
- Parameters
obj_dml_data (DoubleMLData object) – The DoubleMLData object providing the data and specifying the variables for the causal model.
ml_g (estimator implementing fit() and predict()) – A machine learner implementing fit() and predict() methods (e.g. sklearn.ensemble.RandomForestRegressor) for the nuisance function \(g_0(D,X) = E[Y|X,D]\). For a binary outcome variable \(Y\) (with values 0 and 1), a classifier implementing fit() and predict_proba() can also be specified. If sklearn.base.is_classifier() returns True, predict_proba() is used; otherwise predict() is used.
ml_m (classifier implementing fit() and predict_proba()) – A machine learner implementing fit() and predict_proba() methods (e.g. sklearn.ensemble.RandomForestClassifier) for the nuisance function \(m_0(X) = E[D|X]\).
n_folds (int) – Number of folds. Default is 5.
n_rep (int) – Number of repetitions for the sample splitting. Default is 1.
score (str or callable) – A str ('ATE' or 'ATTE') specifying the score function, or a callable object / function with signature psi_a, psi_b = score(y, d, g_hat0, g_hat1, m_hat, smpls). Default is 'ATE'.
dml_procedure (str) – A str ('dml1' or 'dml2') specifying the double machine learning algorithm. Default is 'dml2'.
trimming_rule (str) – A str ('truncate' is the only choice) specifying the trimming approach. Default is 'truncate'.
trimming_threshold (float) – The threshold used for trimming. Default is 1e-12.
draw_sample_splitting (bool) – Indicates whether the sample splitting should be drawn during initialization of the object. Default is True.
apply_cross_fitting (bool) – Indicates whether cross-fitting should be applied. Default is True.
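To illustrate the callable form of score, the following sketch writes the standard doubly robust (AIPW) score for the ATE as a function with the documented signature. The name ate_score and the toy arrays are assumptions made for illustration; the code is not taken from the library's source, and smpls is accepted but unused.

```python
import numpy as np


def ate_score(y, d, g_hat0, g_hat1, m_hat, smpls):
    """Hypothetical custom score with the documented signature (sketch only)."""
    # Residuals of the outcome regressions under control and under treatment.
    u_hat0 = y - g_hat0
    u_hat1 = y - g_hat1
    # Linear score psi = psi_a * theta + psi_b; psi_a is constant for the ATE.
    psi_a = np.full_like(m_hat, -1.0)
    psi_b = (g_hat1 - g_hat0
             + d * u_hat1 / m_hat
             - (1.0 - d) * u_hat0 / (1.0 - m_hat))
    return psi_a, psi_b


# Toy inputs (made up): nuisance predictions that fit the outcomes exactly.
y = np.array([1.0, 2.0, 3.0, 5.0])
d = np.array([0.0, 1.0, 0.0, 1.0])
g_hat0 = np.array([1.0, 1.0, 3.0, 3.0])
g_hat1 = np.array([2.0, 2.0, 5.0, 5.0])
m_hat = np.full(4, 0.5)

psi_a, psi_b = ate_score(y, d, g_hat0, g_hat1, m_hat, smpls=None)
# Solve mean(psi_a) * theta + mean(psi_b) = 0 for theta.
theta_hat = -psi_b.mean() / psi_a.mean()
print(theta_hat)
```

A function of this shape could be passed as score=ate_score when constructing the model object.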
Examples
>>> import numpy as np
>>> import doubleml as dml
>>> from doubleml.datasets import make_irm_data
>>> from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
>>> np.random.seed(3141)
>>> ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
>>> ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
>>> data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
>>> obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
>>> dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
>>> dml_irm_obj.fit().summary
       coef   std err         t     P>|t|     2.5 %    97.5 %
d  0.414073  0.238529  1.735941  0.082574 -0.053436  0.881581
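The columns of the summary are related in a simple way. The sketch below recomputes the t-statistic, p-value and confidence bounds from the printed coef and std err, assuming a standard normal reference distribution; that assumption is consistent with the printed numbers, which are reproduced up to rounding. Only the Python standard library is used.

```python
from statistics import NormalDist

# Values copied from the summary output above.
coef, se = 0.414073, 0.238529
level = 0.95

std_normal = NormalDist()
t_stat = coef / se
# Two-sided p-value under a standard normal reference distribution.
p_value = 2.0 * (1.0 - std_normal.cdf(abs(t_stat)))
# Symmetric interval coef +/- z * se, with z the (1 + level) / 2 quantile.
z = std_normal.inv_cdf((1.0 + level) / 2.0)
ci_lower, ci_upper = coef - z * se, coef + z * se

print(round(t_stat, 4), round(p_value, 4), round(ci_lower, 4), round(ci_upper, 4))
```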
Notes
Interactive regression (IRM) models take the form
\[ \begin{aligned}Y = g_0(D, X) + U, & &\mathbb{E}(U | X, D) = 0,\\D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned} \]
where the treatment variable is binary, \(D \in \lbrace 0,1 \rbrace\). We consider estimation of the average treatment effects when treatment effects are fully heterogeneous. Target parameters of interest in this model are the average treatment effect (ATE),
\[\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X)]\]
and the average treatment effect of the treated (ATTE),
\[\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X) | D=1].\]
Methods
bootstrap([method, n_rep_boot]) – Multiplier bootstrap for DoubleML models.
confint([joint, level]) – Confidence intervals for DoubleML models.
draw_sample_splitting() – Draw sample splitting for DoubleML models.
fit([n_jobs_cv, keep_scores, ...]) – Estimate DoubleML models.
get_params(learner) – Get hyperparameters for the nuisance model of DoubleML models.
p_adjust([method]) – Multiple testing adjustment for DoubleML models.
set_ml_nuisance_params(learner, treat_var, ...) – Set hyperparameters for the nuisance models of DoubleML models.
set_sample_splitting(all_smpls) – Set the sample splitting for DoubleML models.
tune(param_grids[, tune_on_folds, ...]) – Hyperparameter-tuning for DoubleML models.
Attributes
all_coef – Estimates of the causal parameter(s) for the n_rep different sample splits after calling fit().
all_dml1_coef – Estimates of the causal parameter(s) for the n_rep x n_folds different folds after calling fit() with dml_procedure='dml1'.
all_se – Standard errors of the causal parameter(s) for the n_rep different sample splits after calling fit().
apply_cross_fitting – Indicates whether cross-fitting should be applied.
boot_coef – Bootstrapped coefficients for the causal parameter(s) after calling fit() and bootstrap().
boot_t_stat – Bootstrapped t-statistics for the causal parameter(s) after calling fit() and bootstrap().
coef – Estimates for the causal parameter(s) after calling fit().
dml_procedure – The double machine learning algorithm.
learner – The machine learners for the nuisance functions.
learner_names – The names of the learners.
models – The fitted nuisance models.
n_folds – Number of folds.
n_rep – Number of repetitions for the sample splitting.
n_rep_boot – The number of bootstrap replications.
params – The hyperparameters of the learners.
params_names – The names of the nuisance models with hyperparameters.
predictions – The predictions of the nuisance models.
psi – Values of the score function \(\psi(W; \theta, \eta) = \psi_a(W; \eta) \theta + \psi_b(W; \eta)\) after calling fit().
psi_a – Values of the score function component \(\psi_a(W; \eta)\) after calling fit().
psi_b – Values of the score function component \(\psi_b(W; \eta)\) after calling fit().
pval – p-values for the causal parameter(s) after calling fit().
score – The score function.
se – Standard errors for the causal parameter(s) after calling fit().
smpls – The partition used for cross-fitting.
smpls_cluster – The partition of clusters used for cross-fitting.
summary – A summary for the estimated causal effect after calling fit().
t_stat – t-statistics for the causal parameter(s) after calling fit().
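Because the score is linear in \(\theta\), an estimate can be recovered from psi_a and psi_b by solving the empirical moment condition. The sketch below contrasts the two dml_procedure options under the usual DML1/DML2 definitions: 'dml2' solves once on the pooled score, while 'dml1' solves per fold and averages (the per-fold values are what all_dml1_coef is described as holding). The helper names and numbers are made up for illustration and are not the library's code.

```python
from statistics import mean


def solve_linear_score(psi_a, psi_b):
    """Solve mean(psi_a) * theta + mean(psi_b) = 0 for theta."""
    return -mean(psi_b) / mean(psi_a)


def dml2_estimate(psi_a, psi_b):
    # 'dml2': pool the score over all observations, then solve once.
    return solve_linear_score(psi_a, psi_b)


def dml1_estimate(psi_a, psi_b, test_folds):
    # 'dml1': solve within each test fold, then average the fold estimates.
    fold_thetas = [
        solve_linear_score([psi_a[i] for i in idx], [psi_b[i] for i in idx])
        for idx in test_folds
    ]
    return mean(fold_thetas), fold_thetas


# Made-up score components for four observations in two test folds.
psi_a = [-1.0, -1.0, -2.0, -2.0]
psi_b = [1.0, 3.0, 2.0, 10.0]
test_folds = [[0, 1], [2, 3]]

theta_dml2 = dml2_estimate(psi_a, psi_b)
theta_dml1, fold_thetas = dml1_estimate(psi_a, psi_b, test_folds)
print(theta_dml2, theta_dml1, fold_thetas)
```

The two procedures coincide when psi_a is constant and the folds have equal size; they differ here because psi_a varies across folds.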
- DoubleMLIRM.bootstrap(method='normal', n_rep_boot=500)
Multiplier bootstrap for DoubleML models.
- DoubleMLIRM.confint(joint=False, level=0.95)
Confidence intervals for DoubleML models.
- DoubleMLIRM.draw_sample_splitting()
Draw sample splitting for DoubleML models.
The samples are drawn according to the attributes n_folds, n_rep and apply_cross_fitting.
- Returns
self
- Return type
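As a rough picture of what a drawn sample splitting looks like, the sketch below builds n_rep shuffled K-fold partitions with the standard library only. The helper draw_repeated_kfold is hypothetical and is not the library's implementation; it only mirrors the nested layout (one list of (train_ind, test_ind) tuples per repetition) that set_sample_splitting() accepts.

```python
import random


def draw_repeated_kfold(n_obs, n_folds=5, n_rep=1, seed=None):
    """Hypothetical helper: n_rep shuffled K-fold partitions of range(n_obs)."""
    rng = random.Random(seed)
    all_smpls = []
    for _ in range(n_rep):
        indices = list(range(n_obs))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_folds test sets of near-equal size.
        test_sets = [sorted(indices[k::n_folds]) for k in range(n_folds)]
        smpls = []
        for test_ind in test_sets:
            held_out = set(test_ind)
            train_ind = [i for i in range(n_obs) if i not in held_out]
            smpls.append((train_ind, test_ind))
        all_smpls.append(smpls)
    return all_smpls


all_smpls = draw_repeated_kfold(n_obs=10, n_folds=2, n_rep=2, seed=3141)
for smpls in all_smpls:
    print(smpls)
```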
- DoubleMLIRM.fit(n_jobs_cv=None, keep_scores=True, store_predictions=False, store_models=False)
Estimate DoubleML models.
- Parameters
n_jobs_cv (None or int) – The number of CPUs to use to fit the learners. None means 1. Default is None.
keep_scores (bool) – Indicates whether the score function evaluations should be stored in psi, psi_a and psi_b. Default is True.
store_predictions (bool) – Indicates whether the predictions for the nuisance functions should be stored in predictions. Default is False.
store_models (bool) – Indicates whether the fitted models for the nuisance functions should be stored in models. This makes it possible to analyze the fitted models or extract information such as variable importance. Default is False.
- Returns
self
- Return type
- DoubleMLIRM.get_params(learner)
Get hyperparameters for the nuisance model of DoubleML models.
- DoubleMLIRM.p_adjust(method='romano-wolf')
Multiple testing adjustment for DoubleML models.
- Parameters
method (str) – A str ('romano-wolf', 'bonferroni', 'holm', etc.) specifying the adjustment method. In addition to 'romano-wolf', all methods implemented in statsmodels.stats.multitest.multipletests() can be applied. Default is 'romano-wolf'.
- Returns
p_val – A data frame with adjusted p-values.
- Return type
pd.DataFrame
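To show what such an adjustment does to a set of p-values, the sketch below implements the textbook Bonferroni and Holm corrections in plain Python. It is an illustration with made-up p-values, not the code path used by p_adjust(), and the Romano-Wolf procedure is not shown.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The p-value at this rank is scaled by the number of remaining tests.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


p_values = [0.01, 0.04, 0.03]  # made-up unadjusted p-values
print(bonferroni(p_values))
print(holm(p_values))
```

Holm is never more conservative than Bonferroni: each Holm-adjusted value is at most the corresponding Bonferroni-adjusted value.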
- DoubleMLIRM.set_ml_nuisance_params(learner, treat_var, params)
Set hyperparameters for the nuisance models of DoubleML models.
- Parameters
learner (str) – The nuisance model / learner (see attribute params_names).
treat_var (str) – The treatment variable (hyperparameters can be set treatment-variable specific).
params (dict or list) – A dict with estimator parameters (used for all folds) or a nested list with fold specific parameters. The outer list needs to be of length n_rep and the inner list of length n_folds.
- Returns
self
- Return type
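The two accepted shapes of params can be sketched as follows. The hyperparameter names and values are illustrative only, and the learner name 'ml_m' and treatment variable 'd' in the commented call are assumptions; the valid learner names are listed in params_names.

```python
n_rep, n_folds = 2, 3  # assumed to match the model's sample splitting

# Option 1: a single dict, used for all folds.
shared_params = {"n_estimators": 200, "max_depth": 4}

# Option 2: a nested list with outer length n_rep and inner length n_folds,
# holding one dict of estimator parameters per fold.
fold_params = [
    [{"n_estimators": 200, "max_depth": 3 + k} for k in range(n_folds)]
    for _ in range(n_rep)
]

# A call could then look like this (names are assumptions, see above):
# dml_irm_obj.set_ml_nuisance_params("ml_m", "d", fold_params)
print(len(fold_params), [len(inner) for inner in fold_params])
```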
- DoubleMLIRM.set_sample_splitting(all_smpls)
Set the sample splitting for DoubleML models.
The attributes n_folds and n_rep are derived from the provided partition.
- Parameters
all_smpls (list or tuple) –
- If nested list of lists of tuples:
The outer list needs to provide an entry per repeated sample splitting (length of list is set as n_rep). The inner list needs to provide a tuple (train_ind, test_ind) per fold (length of list is set as n_folds). If tuples for more than one fold are provided, they must form a partition and apply_cross_fitting is set to True. Otherwise apply_cross_fitting is set to False and n_folds=2.
- If list of tuples:
The list needs to provide a tuple (train_ind, test_ind) per fold (length of list is set as n_folds). If tuples for more than one fold are provided, they must form a partition and apply_cross_fitting is set to True. Otherwise apply_cross_fitting is set to False and n_folds=2. n_rep=1 is always set.
- If tuple:
Must be a tuple with two elements train_ind and test_ind. No sample splitting is achieved if train_ind and test_ind both cover all observations, i.e. range(n_obs) with n_obs the number of observations. Otherwise n_folds=2. apply_cross_fitting=False and n_rep=1 are always set.
- Returns
self
- Return type
Examples
>>> import numpy as np
>>> import doubleml as dml
>>> from doubleml.datasets import make_plr_CCDDHNR2018
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.base import clone
>>> np.random.seed(3141)
>>> learner = RandomForestRegressor(max_depth=2, n_estimators=10)
>>> ml_g = learner
>>> ml_m = learner
>>> obj_dml_data = make_plr_CCDDHNR2018(n_obs=10, alpha=0.5)
>>> dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_g, ml_m)
>>> # simple sample splitting with two folds and without cross-fitting
>>> smpls = ([0, 1, 2, 3, 4], [5, 6, 7, 8, 9])
>>> dml_plr_obj.set_sample_splitting(smpls)
>>> # sample splitting with two folds and cross-fitting
>>> smpls = [([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]),
...          ([5, 6, 7, 8, 9], [0, 1, 2, 3, 4])]
>>> dml_plr_obj.set_sample_splitting(smpls)
>>> # sample splitting with two folds and repeated cross-fitting with n_rep = 2
>>> smpls = [[([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]),
...           ([5, 6, 7, 8, 9], [0, 1, 2, 3, 4])],
...          [([0, 2, 4, 6, 8], [1, 3, 5, 7, 9]),
...           ([1, 3, 5, 7, 9], [0, 2, 4, 6, 8])]]
>>> dml_plr_obj.set_sample_splitting(smpls)
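The requirement that the tuples "form a partition" can be checked before handing a splitting to set_sample_splitting(). The helper forms_partition below is a hypothetical stand-alone check, not a library function: it tests that train and test indices of each fold are disjoint and that every observation appears in exactly one test set.

```python
def forms_partition(smpls, n_obs):
    """Hypothetical check: do the test sets of smpls partition range(n_obs)?"""
    seen = []
    for train_ind, test_ind in smpls:
        # Train and test indices of a fold must not overlap.
        if set(train_ind) & set(test_ind):
            return False
        seen.extend(test_ind)
    # Every observation must appear in exactly one test set.
    return sorted(seen) == list(range(n_obs))


cross_fit = [([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]),
             ([5, 6, 7, 8, 9], [0, 1, 2, 3, 4])]
single_split = [([0, 1, 2, 3, 4], [5, 6, 7, 8, 9])]

print(forms_partition(cross_fit, n_obs=10))     # True
print(forms_partition(single_split, n_obs=10))  # False
```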
- DoubleMLIRM.tune(param_grids, tune_on_folds=False, scoring_methods=None, n_folds_tune=5, search_mode='grid_search', n_iter_randomized_search=100, n_jobs_cv=None, set_as_params=True, return_tune_res=False)
Hyperparameter-tuning for DoubleML models.
The hyperparameter-tuning is performed using either an exhaustive search over specified parameter values implemented in sklearn.model_selection.GridSearchCV or via a randomized search implemented in sklearn.model_selection.RandomizedSearchCV.
- Parameters
param_grids (dict) – A dict with a parameter grid for each nuisance model / learner (see attribute learner_names).
tune_on_folds (bool) – Indicates whether the tuning should be done fold-specific or globally. Default is False.
scoring_methods (None or dict) – The scoring method used to evaluate the predictions. The scoring method must be set per nuisance model via a dict (see attribute learner_names for the keys). If None, the estimator’s score method is used. Default is None.
n_folds_tune (int) – Number of folds used for tuning. Default is 5.
search_mode (str) – A str ('grid_search' or 'randomized_search') specifying whether hyperparameters are optimized via sklearn.model_selection.GridSearchCV or sklearn.model_selection.RandomizedSearchCV. Default is 'grid_search'.
n_iter_randomized_search (int) – The number of parameter settings that are sampled if search_mode == 'randomized_search'. Default is 100.
n_jobs_cv (None or int) – The number of CPUs to use to tune the learners. None means 1. Default is None.
set_as_params (bool) – Indicates whether the hyperparameters should be set in order to be used when fit() is called. Default is True.
return_tune_res (bool) – Indicates whether detailed tuning results should be returned. Default is False.
- Returns
self (object) – Returned if return_tune_res is False.
tune_res (list) – A list containing detailed tuning results and the proposed hyperparameters. Returned if return_tune_res is True.