doubleml.DoubleMLLPQ

class doubleml.DoubleMLLPQ(obj_dml_data, ml_g, ml_m, treatment=1, quantile=0.5, n_folds=5, n_rep=1, score='LPQ', dml_procedure='dml2', normalize_ipw=True, kde=None, trimming_rule='truncate', trimming_threshold=0.01, draw_sample_splitting=True, apply_cross_fitting=True)

Double machine learning for local potential quantiles

Parameters
  • obj_dml_data (DoubleMLData object) – The DoubleMLData object providing the data and specifying the variables for the causal model.

  • ml_g (classifier implementing fit() and predict_proba()) – A machine learner implementing fit() and predict_proba() methods (e.g. sklearn.ensemble.RandomForestClassifier) for the nuisance elements which depend on preliminary estimation.

  • ml_m (classifier implementing fit() and predict_proba()) – A machine learner implementing fit() and predict_proba() methods (e.g. sklearn.ensemble.RandomForestClassifier) for the treatment propensity nuisance functions.

  • treatment (int) – Binary treatment indicator. Has to be either 0 or 1. Determines the potential outcome to be considered. Default is 1.

  • quantile (float) – Quantile of the potential outcome. Has to be between 0 and 1. Default is 0.5.

  • n_folds (int) – Number of folds. Default is 5.

  • n_rep (int) – Number of repetitions for the sample splitting. Default is 1.

  • score (str) – A str ('LPQ' is the only choice) specifying the score function for local potential quantiles. Default is 'LPQ'.

  • dml_procedure (str) – A str ('dml1' or 'dml2') specifying the double machine learning algorithm. Default is 'dml2'.

  • normalize_ipw (bool) – Indicates whether the inverse probability weights are normalized. Default is True.

  • kde (callable or None) – A callable object / function with signature deriv = kde(u, weights) for weighted kernel density estimation. Here deriv should be the density evaluated at 0. Default is None, which uses statsmodels.nonparametric.kde.KDEUnivariate with a Gaussian kernel and Silverman's rule for bandwidth determination.

  • trimming_rule (str) – A str ('truncate' is the only choice) specifying the trimming approach. Default is 'truncate'.

  • trimming_threshold (float) – The threshold used for trimming. Default is 1e-2.

  • draw_sample_splitting (bool) – Indicates whether the sample splitting should be drawn during initialization of the object. Default is True.

  • apply_cross_fitting (bool) – Indicates whether cross-fitting should be applied (True is the only choice). Default is True.
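The kde parameter above only fixes a signature, deriv = kde(u, weights), and the requirement that the result is the density at 0. As a rough sketch of what a compatible callable could look like, the following pure-Python function computes a weighted Gaussian kernel density estimate at 0. The function name, the optional bandwidth argument and the simplified Silverman-style bandwidth are assumptions made for illustration; this is not the library's default implementation.

```python
import math


def gaussian_kde_at_zero(u, weights, bandwidth=None):
    """Weighted Gaussian kernel density estimate of u, evaluated at 0.

    Hypothetical helper: `bandwidth` is an illustrative extra argument.
    """
    u = [float(x) for x in u]
    w = [float(x) for x in weights]
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    w = [x / total for x in w]  # normalize the weights to sum to one

    if bandwidth is None:
        # Silverman-style rule of thumb using the weighted standard deviation
        # (a simplification: the interquartile range is ignored here).
        mean = sum(wi * ui for wi, ui in zip(w, u))
        var = sum(wi * (ui - mean) ** 2 for wi, ui in zip(w, u))
        bandwidth = 0.9 * math.sqrt(var) * len(u) ** (-0.2)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")

    # Density at 0: sum_i w_i * phi((0 - u_i) / h) / h
    norm_const = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    return norm_const * sum(
        wi * math.exp(-0.5 * (ui / bandwidth) ** 2) for wi, ui in zip(w, u)
    )
```

Under the documented signature, such a function would be passed as kde=gaussian_kde_at_zero when constructing the model; the optional third argument keeps the two-argument call kde(u, weights) valid.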

Examples

>>> import numpy as np
>>> import doubleml as dml
>>> from doubleml.datasets import make_iivm_data
>>> from sklearn.ensemble import RandomForestClassifier
>>> np.random.seed(3141)
>>> ml_g = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
>>> ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=10, min_samples_leaf=2)
>>> data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, return_type='DataFrame')
>>> obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='z')
>>> dml_lpq_obj = dml.DoubleMLLPQ(obj_dml_data, ml_g, ml_m, treatment=1, quantile=0.5)
>>> dml_lpq_obj.fit().summary
       coef   std err         t    P>|t|    2.5 %    97.5 %
d  0.217244  0.636453  0.341336  0.73285 -1.03018  1.464668

Methods

bootstrap([method, n_rep_boot])

Multiplier bootstrap for DoubleML models.

confint([joint, level])

Confidence intervals for DoubleML models.

draw_sample_splitting()

Draw sample splitting for DoubleML models.

evaluate_learners([learners, metric])

Evaluate fitted learners for DoubleML models on cross-validated predictions.

fit([n_jobs_cv, store_predictions, store_models])

Estimate DoubleML models.

get_params(learner)

Get hyperparameters for the nuisance model of DoubleML models.

p_adjust([method])

Multiple testing adjustment for DoubleML models.

set_ml_nuisance_params(learner, treat_var, ...)

Set hyperparameters for the nuisance models of DoubleML models.

set_sample_splitting(all_smpls)

Set the sample splitting for DoubleML models.

tune(param_grids[, tune_on_folds, ...])

Hyperparameter-tuning for DoubleML models.

Attributes

all_coef

Estimates of the causal parameter(s) for the n_rep different sample splits after calling fit().

all_dml1_coef

Estimates of the causal parameter(s) for the n_rep x n_folds different folds after calling fit() with dml_procedure='dml1'.

all_se

Standard errors of the causal parameter(s) for the n_rep different sample splits after calling fit().

apply_cross_fitting

Indicates whether cross-fitting should be applied.

boot_coef

Bootstrapped coefficients for the causal parameter(s) after calling fit() and bootstrap().

boot_method

The method to construct the bootstrap replications.

boot_t_stat

Bootstrapped t-statistics for the causal parameter(s) after calling fit() and bootstrap().

coef

Estimates for the causal parameter(s) after calling fit().

dml_procedure

The double machine learning algorithm.

kde

The kernel density estimation of the derivative.

learner

The machine learners for the nuisance functions.

learner_names

The names of the learners.

models

The fitted nuisance models.

n_folds

Number of folds.

n_rep

Number of repetitions for the sample splitting.

n_rep_boot

The number of bootstrap replications.

normalize_ipw

Indicates whether the inverse probability weights are normalized.

nuisance_targets

The outcome of the nuisance models.

params

The hyperparameters of the learners.

params_names

The names of the nuisance models with hyperparameters.

predictions

The predictions of the nuisance models.

psi

Values of the score function after calling fit(). For models with a linear score in the parameter (e.g., PLR, IRM, PLIV, IIVM), this is \(\psi(W; \theta, \eta) = \psi_a(W; \eta) \theta + \psi_b(W; \eta)\).

psi_deriv

Values of the derivative of the score function with respect to the parameter \(\theta\) after calling fit(). For models with a linear score in the parameter (e.g., PLR, IRM, PLIV, IIVM), this is \(\psi_a(W; \eta)\).

psi_elements

Values of the score function components after calling fit(). For models with a linear score in the parameter (e.g., PLR, IRM, PLIV, IIVM), this is a dictionary with entries psi_a and psi_b for \(\psi_a(W; \eta)\) and \(\psi_b(W; \eta)\).

pval

p-values for the causal parameter(s) after calling fit().

quantile

Quantile for potential outcome.

rmses

The root-mean-squared-errors of the nuisance models.

score

The score function.

se

Standard errors for the causal parameter(s) after calling fit().

smpls

The partition used for cross-fitting.

smpls_cluster

The partition of clusters used for cross-fitting.

summary

A summary for the estimated causal effect after calling fit().

t_stat

t-statistics for the causal parameter(s) after calling fit().

treatment

Treatment indicator for potential outcome.

trimming_rule

The trimming rule used.

trimming_threshold

The trimming threshold used.
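To illustrate how trimming_rule='truncate' and trimming_threshold interact, here is a minimal sketch. It assumes that truncation means clipping estimated propensity scores to the interval [trimming_threshold, 1 - trimming_threshold]; the helper name and the sample values are hypothetical and not taken from the library.

```python
def truncate_propensities(m_hat, trimming_threshold=0.01):
    """Clip estimated propensity scores to [threshold, 1 - threshold].

    Hypothetical helper illustrating the assumed 'truncate' rule.
    """
    if not 0.0 < trimming_threshold < 0.5:
        raise ValueError("trimming_threshold must lie in (0, 0.5)")
    lower, upper = trimming_threshold, 1.0 - trimming_threshold
    return [min(max(float(p), lower), upper) for p in m_hat]


propensities = [0.001, 0.25, 0.5, 0.999]  # made-up predictions
trimmed = truncate_propensities(propensities, trimming_threshold=0.01)

# After truncation, inverse probability weights are bounded by 1 / trimming_threshold.
ipw = [1.0 / p for p in trimmed]
```

The point of the bound is numerical stability: without truncation, a predicted propensity near 0 would produce an arbitrarily large weight.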

DoubleMLLPQ.bootstrap(method='normal', n_rep_boot=500)

Multiplier bootstrap for DoubleML models.

Parameters
  • method (str) – A str ('Bayes', 'normal' or 'wild') specifying the multiplier bootstrap method. Default is 'normal'.

  • n_rep_boot (int) – The number of bootstrap replications. Default is 500.

Returns

self

Return type

object
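The three method values differ in how the random multipliers are drawn. The sketch below shows one common set of choices from the multiplier bootstrap literature, each with mean zero and unit variance. That these match the library's internals exactly is an assumption, and the function name is hypothetical.

```python
import math
import random


def draw_multiplier_weights(method, n_obs, rng):
    """Draw one set of mean-zero, unit-variance bootstrap multipliers."""
    if method == "Bayes":
        # Standard exponential draws, shifted to have mean zero.
        return [rng.expovariate(1.0) - 1.0 for _ in range(n_obs)]
    if method == "normal":
        return [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    if method == "wild":
        # Combination of two independent standard normals:
        # variance 1/2 from each term, so unit variance overall.
        weights = []
        for _ in range(n_obs):
            x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            weights.append(x / math.sqrt(2.0) + (y ** 2 - 1.0) / 2.0)
        return weights
    raise ValueError("method must be 'Bayes', 'normal' or 'wild'")


rng = random.Random(3141)
weights = draw_multiplier_weights("normal", n_obs=1000, rng=rng)
```

One such weight vector would be drawn per bootstrap replication, so n_rep_boot=500 corresponds to 500 independent draws.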

DoubleMLLPQ.confint(joint=False, level=0.95)

Confidence intervals for DoubleML models.

Parameters
  • joint (bool) – Indicates whether joint confidence intervals are computed. Default is False.

  • level (float) – The confidence level. Default is 0.95.

Returns

df_ci – A data frame with the confidence interval(s).

Return type

pd.DataFrame
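As a worked check of the pointwise case (joint=False), the following sketch recomputes the interval, t-statistic and p-value from the coef and std err values printed in the class example's summary table. It assumes a standard normal approximation, which agrees with that table up to rounding; joint intervals, which rely on the bootstrap, are not covered.

```python
from statistics import NormalDist

coef, se = 0.217244, 0.636453  # values from the summary table in the class example
level = 0.95

# Two-sided normal critical value (about 1.96 for level = 0.95).
critical_value = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
ci_lower = coef - critical_value * se
ci_upper = coef + critical_value * se

# t-statistic and two-sided p-value under the same normal approximation.
t_stat = coef / se
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.6f}, p = {p_value:.5f}, CI = [{ci_lower:.5f}, {ci_upper:.5f}]")
```

A wider level moves the critical value up and therefore widens the interval, while coef and se stay fixed.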

DoubleMLLPQ.draw_sample_splitting()

Draw sample splitting for DoubleML models.

The samples are drawn according to the attributes n_folds, n_rep and apply_cross_fitting.

Returns

self

Return type

object

DoubleMLLPQ.evaluate_learners(learners=None, metric=<function _rmse>)

Evaluate fitted learners for DoubleML models on cross-validated predictions.

Parameters
  • learners (list) – A list of strings which correspond to the nuisance functions of the model.

  • metric (callable) – A callable function with inputs y_pred and y_true of shape (1, n), where n specifies the number of observations. Note that some models, such as IRM, cannot provide y_true for all learners, so the target vector might contain some nan values. Default is the root-mean-squared error.

Returns

dist – A dictionary containing the evaluated metric for each learner.

Return type

dict

Examples

>>> import numpy as np
>>> import doubleml as dml
>>> from sklearn.metrics import mean_absolute_error
>>> from doubleml.datasets import make_irm_data
>>> from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
>>> np.random.seed(3141)
>>> ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
>>> ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
>>> data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
>>> obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
>>> dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
>>> dml_irm_obj.fit()
>>> dml_irm_obj.evaluate_learners(metric=mean_absolute_error)
{'ml_g0': array([[1.13318973]]),
 'ml_g1': array([[0.91659939]]),
 'ml_m': array([[0.36350912]])}

DoubleMLLPQ.fit(n_jobs_cv=None, store_predictions=True, store_models=False)

Estimate DoubleML models.

Parameters
  • n_jobs_cv (None or int) – The number of CPUs to use to fit the learners. None means 1. Default is None.

  • store_predictions (bool) – Indicates whether the predictions for the nuisance functions should be stored in predictions. Default is True.

  • store_models (bool) – Indicates whether the fitted models for the nuisance functions should be stored in models. This makes it possible to analyze the fitted models or extract information like variable importance. Default is False.

Returns

self

Return type

object

DoubleMLLPQ.get_params(learner)

Get hyperparameters for the nuisance model of DoubleML models.

Parameters

learner (str) – The nuisance model / learner (see attribute params_names).

Returns

params – Parameters for the nuisance model / learner.

Return type

dict

DoubleMLLPQ.p_adjust(method='romano-wolf')

Multiple testing adjustment for DoubleML models.

Parameters

method (str) – A str ('romano-wolf', 'bonferroni', 'holm', etc.) specifying the adjustment method. In addition to 'romano-wolf', all methods implemented in statsmodels.stats.multitest.multipletests() can be applied. Default is 'romano-wolf'.

Returns

p_val – A data frame with adjusted p-values.

Return type

pd.DataFrame
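For intuition about two of the named adjustment methods, here is a small pure-Python sketch of the Bonferroni and Holm corrections. It is meant only to show the arithmetic; the function names are illustrative, the p-values are made up, and in practice these corrections come from statsmodels as noted above. The Romano-Wolf procedure depends on the bootstrap and is not sketched here.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of hypotheses, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # made-up p-values for three hypotheses
print(bonferroni(raw))
print(holm(raw))
```

Holm is never more conservative than Bonferroni: each Holm-adjusted p-value is at most the corresponding Bonferroni-adjusted one.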

DoubleMLLPQ.set_ml_nuisance_params(learner, treat_var, params)

Set hyperparameters for the nuisance models of DoubleML models.

Parameters
  • learner (str) – The nuisance model / learner (see attribute params_names).

  • treat_var (str) – The treatment variable (hyperparameters can be set treatment-variable specific).

  • params (dict or list) – A dict with estimator parameters (used for all folds) or a nested list with fold specific parameters. The outer list needs to be of length n_rep and the inner list of length n_folds.

Returns

self

Return type

object
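The two accepted shapes of params can be easy to mix up, so here is a brief sketch of both. The hyperparameter names are typical random forest settings chosen for illustration, and n_rep and n_folds are assumed to match the values the model was created with.

```python
n_rep, n_folds = 2, 3  # assumed to match the model's n_rep and n_folds

# Option 1: a single dict, used for every fold and every repetition.
shared_params = {"n_estimators": 200, "max_depth": 5}

# Option 2: a nested list with fold-specific dicts.
# Outer list: one entry per repetition; inner list: one dict per fold.
fold_specific_params = [
    [{"n_estimators": 200, "max_depth": 3 + fold} for fold in range(n_folds)]
    for _ in range(n_rep)
]


def has_expected_shape(params, n_rep, n_folds):
    """Check that a nested params list has length n_rep x n_folds."""
    return len(params) == n_rep and all(len(inner) == n_folds for inner in params)


print(has_expected_shape(fold_specific_params, n_rep, n_folds))
```

Either object would be passed as the params argument, together with a learner name from params_names and the treatment variable name.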

DoubleMLLPQ.set_sample_splitting(all_smpls)

Set the sample splitting for DoubleML models.

The attributes n_folds and n_rep are derived from the provided partition.

Parameters

all_smpls (list or tuple) –

If nested list of lists of tuples:

The outer list needs to provide an entry per repeated sample splitting (length of list is set as n_rep). The inner list needs to provide a tuple (train_ind, test_ind) per fold (length of list is set as n_folds). If tuples for more than one fold are provided, they must form a partition and apply_cross_fitting is set to True. Otherwise apply_cross_fitting is set to False and n_folds=2.

If list of tuples:

The list needs to provide a tuple (train_ind, test_ind) per fold (length of list is set as n_folds). If tuples for more than one fold are provided, they must form a partition and apply_cross_fitting is set to True. Otherwise apply_cross_fitting is set to False and n_folds=2. n_rep=1 is always set.

If tuple:

Must be a tuple with two elements train_ind and test_ind. No sample splitting is achieved if train_ind and test_ind are both range(n_obs), i.e. contain all observations. Otherwise n_folds=2. apply_cross_fitting=False and n_rep=1 are always set.

Returns

self

Return type

object

Examples

>>> import numpy as np
>>> import doubleml as dml
>>> from doubleml.datasets import make_plr_CCDDHNR2018
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.base import clone
>>> np.random.seed(3141)
>>> learner = RandomForestRegressor(max_depth=2, n_estimators=10)
>>> ml_g = learner
>>> ml_m = learner
>>> obj_dml_data = make_plr_CCDDHNR2018(n_obs=10, alpha=0.5)
>>> dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_g, ml_m)
>>> # simple sample splitting with two folds and without cross-fitting
>>> smpls = ([0, 1, 2, 3, 4], [5, 6, 7, 8, 9])
>>> dml_plr_obj.set_sample_splitting(smpls)
>>> # sample splitting with two folds and cross-fitting
>>> smpls = [([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]),
...          ([5, 6, 7, 8, 9], [0, 1, 2, 3, 4])]
>>> dml_plr_obj.set_sample_splitting(smpls)
>>> # sample splitting with two folds and repeated cross-fitting with n_rep = 2
>>> smpls = [[([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]),
...           ([5, 6, 7, 8, 9], [0, 1, 2, 3, 4])],
...          [([0, 2, 4, 6, 8], [1, 3, 5, 7, 9]),
...           ([1, 3, 5, 7, 9], [0, 2, 4, 6, 8])]]
>>> dml_plr_obj.set_sample_splitting(smpls)
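
For larger data sets, writing the index lists by hand as in the example above is impractical. The following sketch builds the nested list-of-lists-of-tuples structure programmatically and checks the partition requirement. Both helper names are hypothetical, and the random assignment of observations to folds is just one possible choice.

```python
import random


def make_cross_fitting_folds(n_obs, n_folds, n_rep, seed=0):
    """Build all_smpls: n_rep lists, each holding n_folds (train_ind, test_ind) tuples."""
    rng = random.Random(seed)
    all_smpls = []
    for _ in range(n_rep):
        indices = list(range(n_obs))
        rng.shuffle(indices)
        # Every k-th shuffled index goes to fold k, so the folds are disjoint.
        folds = [sorted(indices[k::n_folds]) for k in range(n_folds)]
        smpls = []
        for test_ind in folds:
            test_set = set(test_ind)
            train_ind = [i for i in range(n_obs) if i not in test_set]
            smpls.append((train_ind, test_ind))
        all_smpls.append(smpls)
    return all_smpls


def is_partition(smpls, n_obs):
    """Check that the test sets of one repetition cover each index exactly once."""
    combined = sorted(i for _, test_ind in smpls for i in test_ind)
    return combined == list(range(n_obs))


all_smpls = make_cross_fitting_folds(n_obs=10, n_folds=2, n_rep=2, seed=3141)
print(all(is_partition(smpls, 10) for smpls in all_smpls))
```

The resulting all_smpls has the same shape as the last smpls object in the example, so it is the kind of object the nested-list case of set_sample_splitting() describes.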

DoubleMLLPQ.tune(param_grids, tune_on_folds=False, scoring_methods=None, n_folds_tune=5, search_mode='grid_search', n_iter_randomized_search=100, n_jobs_cv=None, set_as_params=True, return_tune_res=False)

Hyperparameter-tuning for DoubleML models.

The hyperparameter-tuning is performed using either an exhaustive search over specified parameter values implemented in sklearn.model_selection.GridSearchCV or via a randomized search implemented in sklearn.model_selection.RandomizedSearchCV.

Parameters
  • param_grids (dict) – A dict with a parameter grid for each nuisance model / learner (see attribute learner_names).

  • tune_on_folds (bool) – Indicates whether the tuning should be done fold-specific or globally. Default is False.

  • scoring_methods (None or dict) – The scoring method used to evaluate the predictions. The scoring method must be set per nuisance model via a dict (see attribute learner_names for the keys). If None, the estimator’s score method is used. Default is None.

  • n_folds_tune (int) – Number of folds used for tuning. Default is 5.

  • search_mode (str) – A str ('grid_search' or 'randomized_search') specifying whether hyperparameters are optimized via sklearn.model_selection.GridSearchCV or sklearn.model_selection.RandomizedSearchCV. Default is 'grid_search'.

  • n_iter_randomized_search (int) – The number of parameter settings that are sampled. Only used if search_mode == 'randomized_search'. Default is 100.

  • n_jobs_cv (None or int) – The number of CPUs to use to tune the learners. None means 1. Default is None.

  • set_as_params (bool) – Indicates whether the hyperparameters should be set in order to be used when fit() is called. Default is True.

  • return_tune_res (bool) – Indicates whether detailed tuning results should be returned. Default is False.

Returns

  • self (object) – Returned if return_tune_res is False.

  • tune_res (list) – A list containing detailed tuning results and the proposed hyperparameters. Returned if return_tune_res is True.
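
To make the difference between the two search_mode values concrete, the sketch below enumerates a small parameter grid exhaustively and then samples a few settings from it without replacement. The grid values are made up, and the keys 'ml_g' and 'ml_m' are an assumption: the actual keys must be taken from the model's learner_names attribute. The sketch only mimics which candidates are considered; it does not fit or score anything.

```python
import itertools
import random

# Hypothetical grid for a random forest learner.
forest_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 2],
}
# Keys assumed to be the learner names; check learner_names on the model.
param_grids = {"ml_g": forest_grid, "ml_m": forest_grid}


def enumerate_grid(grid):
    """List every combination of values in a parameter grid."""
    names = sorted(grid)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(grid[name] for name in names))
    ]


# Grid search: every combination is a candidate (3 * 3 * 2 settings).
candidates = enumerate_grid(forest_grid)

# Randomized search: only n_iter settings are drawn from the grid.
rng = random.Random(0)
n_iter = 5
sampled = rng.sample(candidates, n_iter)

print(len(candidates), len(sampled))
```

With a grid this small an exhaustive search is cheap; randomized search becomes attractive once the number of combinations grows well beyond n_iter_randomized_search.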