# doubleml.DoubleMLIIVM

class doubleml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r, n_folds=5, n_rep=1, score='LATE', subgroups=None, dml_procedure='dml2', trimming_rule='truncate', trimming_threshold=1e-12, draw_sample_splitting=True, apply_cross_fitting=True)

Double machine learning for interactive IV regression models

Parameters
• obj_dml_data (DoubleMLData object) – The DoubleMLData object providing the data and specifying the variables for the causal model.

• ml_g (estimator implementing fit() and predict()) – A machine learner implementing fit() and predict() methods (e.g. sklearn.ensemble.RandomForestRegressor) for the nuisance function $$g_0(Z,X) = E[Y|X,Z]$$.

• ml_m (classifier implementing fit() and predict()) – A machine learner implementing fit() and predict() methods (e.g. sklearn.ensemble.RandomForestClassifier) for the nuisance function $$m_0(X) = E[Z|X]$$.

• ml_r (classifier implementing fit() and predict()) – A machine learner implementing fit() and predict() methods (e.g. sklearn.ensemble.RandomForestClassifier) for the nuisance function $$r_0(Z,X) = E[D|X,Z]$$.

• n_folds (int) – Number of folds. Default is 5.

• n_rep (int) – Number of repetitions for the sample splitting. Default is 1.

• score (str or callable) – A str ('LATE' is the only choice) specifying the score function or a callable object / function with signature psi_a, psi_b = score(y, z, d, g_hat0, g_hat1, m_hat, r_hat0, r_hat1, smpls). Default is 'LATE'.

• subgroups (dict or None) – Dictionary with options to adapt to cases with and without the subgroups of always-takers and never-takers. The logical item always_takers specifies whether there are always-takers in the sample. The logical item never_takers specifies whether there are never-takers in the sample. Default is {'always_takers': True, 'never_takers': True}.

• dml_procedure (str) – A str ('dml1' or 'dml2') specifying the double machine learning algorithm. Default is 'dml2'.

• trimming_rule (str) – A str ('truncate' is the only choice) specifying the trimming approach. Default is 'truncate'.

• trimming_threshold (float) – The threshold used for trimming. Default is 1e-12.

• draw_sample_splitting (bool) – Indicates whether the sample splitting should be drawn during initialization of the object. Default is True.

• apply_cross_fitting (bool) – Indicates whether cross-fitting should be applied. Default is True.
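
The 'truncate' trimming rule can be pictured as clipping the predicted instrument propensities away from 0 and 1, so that the inverse weights 1/m and 1/(1 - m) stay finite. The following is a minimal sketch under that assumption; the helper name truncate_propensity is hypothetical and not part of doubleml.

```python
def truncate_propensity(m_hat, trimming_threshold=1e-12):
    """Clip each predicted propensity into [threshold, 1 - threshold].

    Hypothetical helper for illustration; not a doubleml function.
    """
    lower = trimming_threshold
    upper = 1.0 - trimming_threshold
    return [min(max(p, lower), upper) for p in m_hat]


# Predictions of exactly 0 or 1 would cause a division by zero in
# 1 / m_hat or 1 / (1 - m_hat); after truncation both weights are finite.
m_hat = [0.0, 0.3, 1.0]
trimmed = truncate_propensity(m_hat, trimming_threshold=0.01)
weights = [1.0 / p for p in trimmed]
```

With the default threshold of 1e-12 the clipping only guards against exact zeros and ones; a larger threshold trades some bias for more stable weights.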

Examples

>>> import numpy as np
>>> import doubleml as dml
>>> from doubleml.datasets import make_iivm_data
>>> from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
>>> np.random.seed(3141)
>>> ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
>>> ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
>>> ml_r = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
>>> data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1.0, return_type='DataFrame')
>>> obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='z')
>>> dml_iivm_obj = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r)
>>> dml_iivm_obj.fit().summary
       coef   std err         t     P>|t|     2.5 %    97.5 %
d  0.378351  0.190648  1.984551  0.047194  0.004688  0.752015


Notes

Interactive IV regression (IIVM) models take the form

\begin{align}\begin{aligned}Y = \ell_0(D, X) + \zeta, & &\mathbb{E}(\zeta | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align}

where the treatment variable is binary, $$D \in \lbrace 0,1 \rbrace$$ and the instrument is binary, $$Z \in \lbrace 0,1 \rbrace$$. Consider the functions $$g_0$$, $$r_0$$ and $$m_0$$, where $$g_0$$ maps the support of $$(Z,X)$$ to $$\mathbb{R}$$ and $$r_0$$ and $$m_0$$ respectively map the support of $$(Z,X)$$ and $$X$$ to $$(\varepsilon, 1-\varepsilon)$$ for some $$\varepsilon \in (0, 1/2)$$, such that

\begin{align}\begin{aligned}Y = g_0(Z, X) + \nu, & &\mathbb{E}(\nu| Z, X) = 0,\\D = r_0(Z, X) + U, & &\mathbb{E}(U | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0.\end{aligned}\end{align}

The target parameter of interest in this model is the local average treatment effect (LATE),

$\theta_0 = \frac{\mathbb{E}[g_0(1, X)] - \mathbb{E}[g_0(0,X)]}{\mathbb{E}[r_0(1, X)] - \mathbb{E}[r_0(0,X)]}.$
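
To connect the LATE formula to the callable score signature listed under Parameters, the sketch below writes the standard orthogonal LATE score in the linear form psi = psi_a * theta + psi_b and solves it for theta. It is an illustration under the assumption that this standard form is the one intended; late_score and solve_linear_score are hypothetical names, and the toy numbers are made up. The functions use only arithmetic operators, so they work on floats as well as NumPy arrays.

```python
def late_score(y, z, d, g_hat0, g_hat1, m_hat, r_hat0, r_hat1, smpls=None):
    """Standard orthogonal LATE score components (illustrative sketch)."""
    # Numerator part: doubly robust estimate of E[g_0(1, X)] - E[g_0(0, X)].
    psi_b = (g_hat1 - g_hat0
             + z * (y - g_hat1) / m_hat
             - (1.0 - z) * (y - g_hat0) / (1.0 - m_hat))
    # Denominator part (negated): E[r_0(1, X)] - E[r_0(0, X)].
    psi_a = -(r_hat1 - r_hat0
              + z * (d - r_hat1) / m_hat
              - (1.0 - z) * (d - r_hat0) / (1.0 - m_hat))
    return psi_a, psi_b


def solve_linear_score(psi_a_values, psi_b_values):
    """Solve mean(psi_a) * theta + mean(psi_b) = 0 for theta."""
    mean_a = sum(psi_a_values) / len(psi_a_values)
    mean_b = sum(psi_b_values) / len(psi_b_values)
    return -mean_b / mean_a


# Two toy observations as (y, z, d, g_hat0, g_hat1, m_hat, r_hat0, r_hat1).
observations = [
    (2.5, 1, 1, 1.0, 2.0, 0.5, 0.2, 0.8),
    (1.0, 0, 0, 1.0, 2.0, 0.5, 0.2, 0.8),
]
scores = [late_score(*obs) for obs in observations]
theta_hat = solve_linear_score([a for a, _ in scores], [b for _, b in scores])
```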

Methods

• bootstrap([method, n_rep_boot]) – Multiplier bootstrap for DoubleML models.

• confint([joint, level]) – Confidence intervals for DoubleML models.

• draw_sample_splitting() – Draw sample splitting for DoubleML models.

• fit([n_jobs_cv, keep_scores, store_predictions]) – Estimate DoubleML models.

• get_params(learner) – Get hyperparameters for the nuisance model of DoubleML models.

• p_adjust([method]) – Multiple testing adjustment for DoubleML models.

• set_ml_nuisance_params(learner, treat_var, …) – Set hyperparameters for the nuisance models of DoubleML models.

• set_sample_splitting(all_smpls) – Set the sample splitting for DoubleML models.

• tune(param_grids[, tune_on_folds, …]) – Hyperparameter-tuning for DoubleML models.

Attributes

• all_coef – Estimates of the causal parameter(s) for the n_rep different sample splits after calling fit().

• all_dml1_coef – Estimates of the causal parameter(s) for the n_rep x n_folds different folds after calling fit() with dml_procedure='dml1'.

• all_se – Standard errors of the causal parameter(s) for the n_rep different sample splits after calling fit().

• apply_cross_fitting – Indicates whether cross-fitting should be applied.

• boot_coef – Bootstrapped coefficients for the causal parameter(s) after calling fit() and bootstrap().

• boot_t_stat – Bootstrapped t-statistics for the causal parameter(s) after calling fit() and bootstrap().

• coef – Estimates for the causal parameter(s) after calling fit().

• dml_procedure – The double machine learning algorithm.

• learner – The machine learners for the nuisance functions.

• learner_names – The names of the learners.

• n_folds – Number of folds.

• n_rep – Number of repetitions for the sample splitting.

• n_rep_boot – The number of bootstrap replications.

• params – The hyperparameters of the learners.

• params_names – The names of the nuisance models with hyperparameters.

• predictions – The predictions of the nuisance models.

• psi – Values of the score function $$\psi(W; \theta, \eta) = \psi_a(W; \eta) \theta + \psi_b(W; \eta)$$ after calling fit().

• psi_a – Values of the score function component $$\psi_a(W; \eta)$$ after calling fit().

• psi_b – Values of the score function component $$\psi_b(W; \eta)$$ after calling fit().

• pval – p-values for the causal parameter(s) after calling fit().

• score – The score function.

• se – Standard errors for the causal parameter(s) after calling fit().

• smpls – The partition used for cross-fitting.

• summary – A summary for the estimated causal effect after calling fit().

• t_stat – t-statistics for the causal parameter(s) after calling fit().
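
The difference between dml_procedure='dml1' and 'dml2' can be illustrated with the linear score components psi_a and psi_b. In the usual description of the two algorithms, 'dml1' solves the score equation within each fold and averages the fold-wise estimates (the quantities that all_dml1_coef refers to), whereas 'dml2' solves one pooled equation over all folds. The sketch below assumes that description; the function names and toy values are hypothetical.

```python
from statistics import mean


def solve_fold(psi_a, psi_b):
    """Solve mean(psi_a) * theta + mean(psi_b) = 0 for theta."""
    return -mean(psi_b) / mean(psi_a)


def dml1(psi_a_folds, psi_b_folds):
    """Average of the fold-wise solutions."""
    fold_estimates = [solve_fold(a, b) for a, b in zip(psi_a_folds, psi_b_folds)]
    return mean(fold_estimates)


def dml2(psi_a_folds, psi_b_folds):
    """Single solution of the score equation pooled over all folds."""
    pooled_a = [value for fold in psi_a_folds for value in fold]
    pooled_b = [value for fold in psi_b_folds for value in fold]
    return solve_fold(pooled_a, pooled_b)


# Toy score values for two folds with two observations each.
psi_a_folds = [[-1.0, -1.0], [-2.0, -2.0]]
psi_b_folds = [[1.0, 3.0], [2.0, 2.0]]

theta_dml1 = dml1(psi_a_folds, psi_b_folds)
theta_dml2 = dml2(psi_a_folds, psi_b_folds)
```

The two estimates coincide when mean(psi_a) is the same in every fold and generally differ otherwise, as in this toy example.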
DoubleMLIIVM.bootstrap(method='normal', n_rep_boot=500)

Multiplier bootstrap for DoubleML models.

Parameters
• method (str) – A str ('Bayes', 'normal' or 'wild') specifying the multiplier bootstrap method. Default is 'normal'.

• n_rep_boot (int) – The number of bootstrap replications. Default is 500.

Returns

self

Return type

object
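
The three bootstrap() method names correspond to different distributions for the random multipliers. The sketch below draws such multipliers under the assumption that the common conventions apply: centered unit-exponential weights for 'Bayes', standard normal weights for 'normal', and a combination of two independent normals for 'wild'. All three have mean 0 and variance 1. The helper draw_multiplier_weights is hypothetical and only illustrates the idea.

```python
import math
import random


def draw_multiplier_weights(method, n_obs, rng):
    """Draw n_obs mean-zero, unit-variance multipliers (illustrative sketch)."""
    if method == 'Bayes':
        # Exp(1) has mean 1 and variance 1; subtracting 1 centers it.
        return [rng.expovariate(1.0) - 1.0 for _ in range(n_obs)]
    if method == 'normal':
        return [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    if method == 'wild':
        # N1 / sqrt(2) + (N2^2 - 1) / 2 with independent standard normals.
        return [rng.gauss(0.0, 1.0) / math.sqrt(2.0)
                + (rng.gauss(0.0, 1.0) ** 2 - 1.0) / 2.0
                for _ in range(n_obs)]
    raise ValueError("method must be 'Bayes', 'normal' or 'wild'")


rng = random.Random(3141)
weights = {method: draw_multiplier_weights(method, 5, rng)
           for method in ('Bayes', 'normal', 'wild')}
```

Each bootstrap replication would then reweight the score evaluations with one such draw; n_rep_boot controls how many draws are made.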

DoubleMLIIVM.confint(joint=False, level=0.95)

Confidence intervals for DoubleML models.

Parameters
• joint (bool) – Indicates whether joint confidence intervals are computed. Default is False.

• level (float) – The confidence level. Default is 0.95.

Returns

df_ci – A data frame with the confidence interval(s).

Return type

pd.DataFrame
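
For joint=False, the interval reported in the Examples summary is consistent with a plain normal approximation, coef ± z * se. The sketch below reproduces the 2.5 % and 97.5 % bounds, the t-statistic and the p-value from the coef and std err values printed there. It assumes the normal approximation and is not the library's code; normal_confint is a hypothetical helper.

```python
from statistics import NormalDist


def normal_confint(coef, se, level=0.95):
    """Two-sided confidence interval based on the normal approximation."""
    quantile = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    return coef - quantile * se, coef + quantile * se


# coef and std err as printed in the summary of the Examples section.
coef, se = 0.378351, 0.190648

lower, upper = normal_confint(coef, se, level=0.95)
t_stat = coef / se
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
```

Up to rounding of the printed inputs, lower and upper agree with the 2.5 % and 97.5 % columns of the summary, and t_stat and p_value agree with the t and P>|t| columns.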

DoubleMLIIVM.draw_sample_splitting()

Draw sample splitting for DoubleML models.

The samples are drawn according to the attributes n_folds, n_rep and apply_cross_fitting.

Returns

self

Return type

object
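
The structure of a drawn sample splitting can be illustrated with the standard library alone. The sketch below builds, for each of n_rep repetitions, a random partition into n_folds folds and returns it as a nested list of (train_ind, test_ind) tuples, the same shape that set_sample_splitting() accepts. It is an assumption-laden illustration of the data structure, not the routine doubleml uses internally; draw_splits is a hypothetical name.

```python
import random


def draw_splits(n_obs, n_folds=5, n_rep=1, seed=None):
    """Return n_rep random K-fold partitions as lists of (train, test) tuples."""
    rng = random.Random(seed)
    all_smpls = []
    for _ in range(n_rep):
        indices = list(range(n_obs))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_folds test sets of near-equal size.
        folds = [sorted(indices[k::n_folds]) for k in range(n_folds)]
        smpls = []
        for k, test_ind in enumerate(folds):
            # The training set is everything outside the k-th test fold.
            train_ind = sorted(i for j, fold in enumerate(folds)
                               if j != k for i in fold)
            smpls.append((train_ind, test_ind))
        all_smpls.append(smpls)
    return all_smpls


all_smpls = draw_splits(n_obs=10, n_folds=2, n_rep=2, seed=3141)
```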

DoubleMLIIVM.fit(n_jobs_cv=None, keep_scores=True, store_predictions=False)

Estimate DoubleML models.

Parameters
• n_jobs_cv (None or int) – The number of CPUs to use to fit the learners. None means 1. Default is None.

• keep_scores (bool) – Indicates whether the score function evaluations should be stored in psi, psi_a and psi_b. Default is True.

• store_predictions (bool) – Indicates whether the predictions for the nuisance functions should be stored in predictions. Default is False.

Returns

self

Return type

object

DoubleMLIIVM.get_params(learner)

Get hyperparameters for the nuisance model of DoubleML models.

Parameters

learner (str) – The nuisance model / learner (see attribute params_names).

Returns

params – Parameters for the nuisance model / learner.

Return type

dict

DoubleMLIIVM.p_adjust(method='romano-wolf')

Multiple testing adjustment for DoubleML models.

Parameters

method (str) – A str ('romano-wolf', 'bonferroni', 'holm', etc.) specifying the adjustment method. In addition to 'romano-wolf', all methods implemented in statsmodels.stats.multitest.multipletests() can be applied. Default is 'romano-wolf'.

Returns

p_val – A data frame with adjusted p-values.

Return type

pd.DataFrame
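
To make the 'bonferroni' and 'holm' options of p_adjust() concrete, here is a plain-Python sketch of the two classical corrections. It only illustrates the textbook definitions and is not the statsmodels implementation; the example p-values are made up.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity of the adjusted p-values.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


p_values = [0.01, 0.04, 0.03]
p_bonferroni = bonferroni(p_values)
p_holm = holm(p_values)
```

Holm's adjusted p-values are never larger than the Bonferroni ones, which is why it is usually preferred when no bootstrap-based method is used.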

DoubleMLIIVM.set_ml_nuisance_params(learner, treat_var, params)

Set hyperparameters for the nuisance models of DoubleML models.

Parameters
• learner (str) – The nuisance model / learner (see attribute params_names).

• treat_var (str) – The treatment variable (hyperparameters can be set treatment-variable specific).

• params (dict or list) – A dict with estimator parameters (used for all folds) or a nested list with fold specific parameters. The outer list needs to be of length n_rep and the inner list of length n_folds.

Returns

self

Return type

object
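
The two accepted shapes of params can be illustrated as follows: a single dict applies to every fold, while a nested list carries one dict per repetition and fold. The sketch below expands the first form into the second and checks the lengths described above. It is an illustration of the documented shapes only; expand_params is a hypothetical helper and the hyperparameter names are borrowed from the random forest learners in the Examples section.

```python
def expand_params(params, n_rep, n_folds):
    """Return params as a nested list of shape [n_rep][n_folds]."""
    if isinstance(params, dict):
        # One dict for all folds: copy it into every (repetition, fold) slot.
        return [[dict(params) for _ in range(n_folds)] for _ in range(n_rep)]
    if len(params) != n_rep or any(len(inner) != n_folds for inner in params):
        raise ValueError("params must have outer length n_rep "
                         "and inner length n_folds")
    return params


shared = {'max_depth': 5, 'min_samples_leaf': 2}
nested = expand_params(shared, n_rep=2, n_folds=3)

# Fold-specific settings can then be changed individually.
nested[0][1]['max_depth'] = 3
```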

DoubleMLIIVM.set_sample_splitting(all_smpls)

Set the sample splitting for DoubleML models.

The attributes n_folds and n_rep are derived from the provided partition.

Parameters

all_smpls (list or tuple) –

If nested list of lists of tuples:

The outer list needs to provide an entry per repeated sample splitting (length of list is set as n_rep). The inner list needs to provide a tuple (train_ind, test_ind) per fold (length of list is set as n_folds). If tuples for more than one fold are provided, they must form a partition and apply_cross_fitting is set to True. Otherwise apply_cross_fitting is set to False and n_folds=2.

If list of tuples:

The list needs to provide a tuple (train_ind, test_ind) per fold (length of list is set as n_folds). If tuples for more than one fold are provided, they must form a partition and apply_cross_fitting is set to True. Otherwise apply_cross_fitting is set to False and n_folds=2. n_rep=1 is always set.

If tuple:

Must be a tuple with two elements train_ind and test_ind. No sample splitting is achieved if train_ind and test_ind are both range(n_obs), where n_obs is the number of observations. Otherwise n_folds=2. apply_cross_fitting=False and n_rep=1 are always set.

Returns

self

Return type

object

Examples

>>> import numpy as np
>>> import doubleml as dml
>>> from doubleml.datasets import make_plr_CCDDHNR2018
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.base import clone
>>> np.random.seed(3141)
>>> learner = RandomForestRegressor(max_depth=2, n_estimators=10)
>>> ml_g = learner
>>> ml_m = learner
>>> obj_dml_data = make_plr_CCDDHNR2018(n_obs=10, alpha=0.5)
>>> dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_g, ml_m)
>>> # simple sample splitting with two folds and without cross-fitting
>>> smpls = ([0, 1, 2, 3, 4], [5, 6, 7, 8, 9])
>>> dml_plr_obj.set_sample_splitting(smpls)
>>> # sample splitting with two folds and cross-fitting
>>> smpls = [([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]),
...          ([5, 6, 7, 8, 9], [0, 1, 2, 3, 4])]
>>> dml_plr_obj.set_sample_splitting(smpls)
>>> # sample splitting with two folds and repeated cross-fitting with n_rep = 2
>>> smpls = [[([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]),
...           ([5, 6, 7, 8, 9], [0, 1, 2, 3, 4])],
...          [([0, 2, 4, 6, 8], [1, 3, 5, 7, 9]),
...           ([1, 3, 5, 7, 9], [0, 2, 4, 6, 8])]]
>>> dml_plr_obj.set_sample_splitting(smpls)
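
The requirement that the tuples "form a partition" can be checked before calling set_sample_splitting(). The sketch below tests that the test sets of all folds cover every observation exactly once and that no fold trains on its own test indices. The helper forms_partition is hypothetical and is not the validation doubleml performs internally.

```python
def forms_partition(smpls, n_obs):
    """Check that the folds' test sets partition range(n_obs)."""
    test_indices = sorted(i for _, test_ind in smpls for i in test_ind)
    covers_once = test_indices == list(range(n_obs))
    no_leakage = all(set(train_ind).isdisjoint(test_ind)
                     for train_ind, test_ind in smpls)
    return covers_once and no_leakage


# Two folds with cross-fitting, as in the example above.
valid = [([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]),
         ([5, 6, 7, 8, 9], [0, 1, 2, 3, 4])]

# Index 4 appears in both test sets, so this is not a partition.
invalid = [([0, 1, 2, 3], [4, 5, 6, 7, 8, 9]),
           ([5, 6, 7, 8, 9], [0, 1, 2, 3, 4])]

valid_ok = forms_partition(valid, n_obs=10)
invalid_ok = forms_partition(invalid, n_obs=10)
```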

DoubleMLIIVM.tune(param_grids, tune_on_folds=False, scoring_methods=None, n_folds_tune=5, search_mode='grid_search', n_iter_randomized_search=100, n_jobs_cv=None, set_as_params=True, return_tune_res=False)

Hyperparameter-tuning for DoubleML models.

The hyperparameter-tuning is performed using either an exhaustive search over specified parameter values implemented in sklearn.model_selection.GridSearchCV or via a randomized search implemented in sklearn.model_selection.RandomizedSearchCV.

Parameters
• param_grids (dict) – A dict with a parameter grid for each nuisance model / learner (see attribute learner_names).

• tune_on_folds (bool) – Indicates whether the tuning should be done fold-specific or globally. Default is False.

• scoring_methods (None or dict) – The scoring method used to evaluate the predictions. The scoring method must be set per nuisance model via a dict (see attribute learner_names for the keys). If None, the estimator’s score method is used. Default is None.

• n_folds_tune (int) – Number of folds used for tuning. Default is 5.

• search_mode (str) – A str ('grid_search' or 'randomized_search') specifying whether hyperparameters are optimized via sklearn.model_selection.GridSearchCV or sklearn.model_selection.RandomizedSearchCV. Default is 'grid_search'.

• n_iter_randomized_search (int) – The number of parameter settings that are sampled. Only used if search_mode == 'randomized_search'. Default is 100.

• n_jobs_cv (None or int) – The number of CPUs to use to tune the learners. None means 1. Default is None.

• set_as_params (bool) – Indicates whether the hyperparameters should be set in order to be used when fit() is called. Default is True.

• return_tune_res (bool) – Indicates whether detailed tuning results should be returned. Default is False.

Returns

• self (object) – Returned if return_tune_res is False.

• tune_res (list) – A list containing detailed tuning results and the proposed hyperparameters. Returned if return_tune_res is True.
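
A param_grids argument might look like the dict below, with one grid per learner. The keys 'ml_g', 'ml_m' and 'ml_r' are an assumption based on the constructor arguments (the authoritative keys are in the attribute learner_names), and the hyperparameter names are those of the random forest learners used in the Examples section. The helper expand_grid is hypothetical and only shows how many candidate settings an exhaustive grid search would have to evaluate per learner.

```python
from itertools import product

# Assumed keys; check the learner_names attribute of the fitted object.
param_grids = {
    'ml_g': {'max_depth': [2, 5], 'min_samples_leaf': [1, 2, 4]},
    'ml_m': {'max_depth': [2, 5], 'n_estimators': [50, 100]},
    'ml_r': {'max_depth': [2, 5, 10]},
}


def expand_grid(grid):
    """List every combination of values in a parameter grid."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


# Number of candidate settings per learner under search_mode='grid_search'.
n_candidates = {learner: len(expand_grid(grid))
                for learner, grid in param_grids.items()}
```

With search_mode='randomized_search', only n_iter_randomized_search settings are sampled instead of evaluating every combination, which matters once the grids grow beyond a handful of values.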