Introduction to DoubleML for Python

Tools for Causality
Grenoble, Sept 25 - 29, 2023
Philipp Bach, Sven Klaassen

Introduction to Double Machine Learning

  • So far, we have focused on DML for the interactive regression model, using the doubly robust score to estimate the ATE

  • However, the DML framework of Chernozhukov et al. (2018) is much more general and works with any causal model with an orthogonal score

  • We will now look at more examples and their implementation

  • DML can also be used for (partially) linear models, instrumental variables, quantile regression, difference-in-differences models, \(\ldots\)

Figure: Causal models, DoubleML

  • All of these models are based on the three key ingredients of DML and, hence, share a common structure
    • Neyman-orthogonal score
    • High-quality ML learners
    • Sample splitting
  • We exploit the common structure of the causal models in the implementation of DoubleML (Bach et al. 2021, 2022)
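
To make the three ingredients concrete, here is a minimal sketch on simulated data for a partially linear model, written with scikit-learn only. It is not the DoubleML implementation. The data-generating process, the learner settings and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Simulated partially linear model (assumed for illustration): true effect 0.5
rng = np.random.default_rng(0)
n, theta_true = 2000, 0.5
X = rng.normal(size=(n, 5))
d = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)
y = theta_true * d + np.cos(X[:, 0]) + X[:, 2] ** 2 + rng.normal(size=n)

# Sample splitting: every nuisance prediction is made out-of-fold
l_hat = np.zeros(n)  # predictions of E[Y | X]
m_hat = np.zeros(n)  # predictions of E[D | X]
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # High-quality ML learners for the nuisance functions
    ml_l = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
    ml_m = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
    l_hat[test] = ml_l.fit(X[train], y[train]).predict(X[test])
    m_hat[test] = ml_m.fit(X[train], d[train]).predict(X[test])

# Neyman-orthogonal (partialling-out) score: regress residuals on residuals
u_hat = y - l_hat
v_hat = d - m_hat
theta_hat = np.sum(u_hat * v_hat) / np.sum(v_hat * v_hat)

# Standard error from the estimated score
psi = (u_hat - theta_hat * v_hat) * v_hat
se = np.sqrt(np.mean(psi**2) / np.mean(v_hat**2) ** 2 / n)
print(theta_hat, se)
```

The estimate should land close to the simulated effect of 0.5, even though neither random forest recovers its nuisance function exactly.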

  • The general statistical procedures for estimation of the causal parameter, confidence intervals, \(\ldots\) apply to all causal models within the DML framework

  • These methods are implemented in the abstract base class DoubleML

  • Model-specific parts are implemented in model classes which are subclasses of DoubleML

    • Example: DoubleMLIRM has an implementation of the doubly robust score for estimating the ATE in the interactive regression model
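
The split between a generic base class and model-specific subclasses can be sketched as follows. The class names LinearScoreModel and ToyIRM are hypothetical and this is not DoubleML's actual code. To keep the example short, the nuisance functions are taken as known rather than estimated.

```python
from abc import ABC, abstractmethod

import numpy as np


class LinearScoreModel(ABC):
    """Hypothetical base class: generic estimation for a linear score."""

    @abstractmethod
    def score_elements(self):
        """Return (psi_a, psi_b), one value per observation."""

    def fit(self):
        psi_a, psi_b = self.score_elements()
        n_obs = len(psi_a)
        # Solve mean(psi_a) * theta + mean(psi_b) = 0 for theta
        self.coef = -np.mean(psi_b) / np.mean(psi_a)
        psi = psi_a * self.coef + psi_b
        self.se = np.sqrt(np.mean(psi**2) / np.mean(psi_a) ** 2 / n_obs)
        return self


class ToyIRM(LinearScoreModel):
    """Hypothetical subclass: only the doubly robust ATE score lives here."""

    def __init__(self, y, d, g0_hat, g1_hat, m_hat):
        self.y, self.d = y, d
        self.g0_hat, self.g1_hat, self.m_hat = g0_hat, g1_hat, m_hat

    def score_elements(self):
        y, d, g0, g1, m = self.y, self.d, self.g0_hat, self.g1_hat, self.m_hat
        psi_b = g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m)
        psi_a = -np.ones_like(y)
        return psi_a, psi_b


# Simulated data with a true ATE of 2.0 and known nuisance functions
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
m = 1 / (1 + np.exp(-x))                    # propensity score
d = rng.binomial(1, m).astype(float)        # binary treatment
y = 2.0 * d + x + rng.normal(size=n)

toy = ToyIRM(y, d, g0_hat=x, g1_hat=x + 2.0, m_hat=m).fit()
print(toy.coef, toy.se)
```

In this design, adding a new causal model only requires a new score_elements(). The estimation and standard-error logic in fit() is inherited unchanged.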

Figure: Class structure of DoubleML

Implementation and Theoretical Framework of DML

Key ingredients

  • Orthogonal Score
    • Object-oriented implementation
    • Exploit common structure being centered around a (linear) score function \(\psi(\cdot)\)
  • High-quality ML
    • State-of-the-art ML prediction and tuning methods
    • Provided by scikit-learn and scikit-learn-like learners
  • Sample Splitting
    • General implementation of sample splitting
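
Written out, a linear score and the resulting estimator take the form below. The notation is an assumption for illustration: \(W\) is the data, \(\eta\) the nuisance functions, \(g(d, X) = E[Y \mid D = d, X]\) and \(m(X) = P(D = 1 \mid X)\). The last two lines give the components of the doubly robust ATE score in the interactive regression model.

```latex
\begin{align*}
\psi(W; \theta, \eta) &= \psi_a(W; \eta)\,\theta + \psi_b(W; \eta), \\
\hat{\theta} &= -\frac{\frac{1}{N}\sum_{i=1}^{N} \psi_b(W_i; \hat{\eta})}
                     {\frac{1}{N}\sum_{i=1}^{N} \psi_a(W_i; \hat{\eta})}, \\[6pt]
\text{IRM (ATE):}\quad \psi_a(W; \eta) &= -1, \\
\psi_b(W; \eta) &= g(1, X) - g(0, X)
   + \frac{D\,\bigl(Y - g(1, X)\bigr)}{m(X)}
   - \frac{(1 - D)\,\bigl(Y - g(0, X)\bigr)}{1 - m(X)}.
\end{align*}
```

Because \(\hat{\theta}\) depends on the model only through \(\psi_a\) and \(\psi_b\), the same estimation code can serve every model with a linear score.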

Papers, User Guide, Resources


Papers and Materials

  • R package - with a nontechnical introduction to DML: Bach et al. (2021)

  • Python package: Bach et al. (2022)

  • API documentation and examples: docs.doubleml.org


Software implementation:

Leave a 🌟 if you like 😄

DoubleML: Installation

  • Latest release via pip or conda
pip install -U DoubleML
conda install -c conda-forge doubleml
  • Development version from GitHub
git clone git@github.com:DoubleML/doubleml-for-py.git
cd doubleml-for-py
pip install --editable .

Object Orientation

  • DoubleML gives the user high flexibility in specifying DML models
    • Choice of ML learners for approximation of nuisance parameters
    • Different resampling schemes
    • DML algorithms (DML1, DML2)
    • Different Neyman-orthogonal score functions
  • DoubleML can be easily extended
    • New model classes with appropriate Neyman-orthogonal score functions can be inherited from abstract base class DoubleML
    • Score functions can be provided as callables (functions in R)
    • Resampling schemes are customizable in a flexible way
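
As a sketch of what a callable score could look like, the function below returns the two components of the partialling-out score for a partially linear model. The argument list (y, d, l_hat, m_hat, g_hat, smpls) is an assumption modelled on the callable-score interface described for DoubleMLPLR. The exact signature depends on the model class and the package version, so check it against docs.doubleml.org before use. The check at the bottom runs on simulated data with known nuisance functions and does not call DoubleML.

```python
import numpy as np


def partialling_out_score(y, d, l_hat, m_hat, g_hat, smpls):
    """Return (psi_a, psi_b) of the partialling-out score.

    g_hat and smpls are accepted only to match the assumed interface.
    """
    u_hat = y - l_hat          # outcome residual
    v_hat = d - m_hat          # treatment residual
    psi_a = -v_hat * v_hat
    psi_b = v_hat * u_hat
    return psi_a, psi_b


# Stand-alone check with oracle nuisances (simulated, true effect 0.5)
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
d = x + rng.normal(size=n)               # E[D | X] = x
y = 0.5 * d + x + rng.normal(size=n)     # E[Y | X] = 1.5 * x

psi_a, psi_b = partialling_out_score(y, d, l_hat=1.5 * x, m_hat=x,
                                     g_hat=None, smpls=None)
theta_hat = -np.mean(psi_b) / np.mean(psi_a)
print(theta_hat)

# Hypothetical usage, not executed here:
# DoubleMLPLR(dml_data, ml_l, ml_m, score=partialling_out_score)
```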

Getting Started with DoubleML

DoubleML Workflow Example

Workflow

  0. Problem Formulation

  1. Data-Backend

  2. Causal Model

  3. ML Methods

  4. DML Specification

  5. Estimation

  6. Inference

0. Problem Formulation

  • 401(k) Example

  • Goal: Estimate ATE of eligibility in 401(k) pension plans on employees’ net financial assets

Figure: DAG for the 401(k) example

1. Data-Backend

  • Declare the roles for the treatment variable, the outcome variable and controls
from doubleml import DoubleMLData
from doubleml.datasets import fetch_401K

data = fetch_401K(return_type='DataFrame')

# Construct DoubleMLData object
dml_data = DoubleMLData(data, 
                        y_col='net_tfa',
                        d_cols='e401',
                        x_cols=['age', 'inc', 'educ', 'fsize', 'marr',
                                'twoearn', 'db', 'pira', 'hown'])

2. Causal Model

Choice of causal model

3. ML Methods

  • Initialize the learners with hyperparameters

  • Internal tuning is optional

# Random forest learners
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

ml_l_rf = RandomForestRegressor(n_estimators = 500, max_depth = 7,
                                max_features = 3, min_samples_leaf = 3)

ml_m_rf = RandomForestClassifier(n_estimators = 500, max_depth = 5,
                                max_features = 4, min_samples_leaf = 7)

# XGBoost learners
from xgboost import XGBClassifier, XGBRegressor

ml_l_xgb = XGBRegressor(objective = "reg:squarederror", eta = 0.1,
                        n_estimators = 35)

ml_m_xgb = XGBClassifier(objective = "binary:logistic", eta = 0.1, n_estimators = 34, 
                         use_label_encoder = False, eval_metric = "logloss")
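
Since internal tuning is optional, hyperparameters can also be chosen beforehand with standard scikit-learn tools. The sketch below runs a small grid search for a random forest regressor on simulated data. The data, the grid and the name ml_l_tuned are assumptions for illustration. DoubleML objects also provide a tune() method, which is documented on docs.doubleml.org and not shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Simulated regression data with nine covariates (assumed for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 9))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(size=500)

# Small, hypothetical grid over two of the hyperparameters set above
param_grid = {"max_depth": [3, 7], "min_samples_leaf": [3, 10]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# Unfitted learner with the selected hyperparameters, ready to pass to a model
ml_l_tuned = RandomForestRegressor(n_estimators=100, random_state=0,
                                   **search.best_params_)
print(search.best_params_)
```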

4. DML Specification

  • Initialize the DoubleML model object for the causal model, here DoubleMLIRM
import numpy as np
import pandas as pd
from doubleml import DoubleMLIRM

np.random.seed(42)
# Default values
dml_irm_rf = DoubleMLIRM(dml_data,
                         ml_g = ml_l_rf,
                         ml_m = ml_m_rf)

np.random.seed(42)
# Parametrized by user
dml_irm_rf = DoubleMLIRM(dml_data,
                         ml_g = ml_l_rf,
                         ml_m = ml_m_rf,
                         n_folds = 3,
                         n_rep = 1,
                         score = 'ATE',
                         dml_procedure = 'dml2')

np.random.seed(42)
# Default values
dml_irm_xgb = DoubleMLIRM(dml_data,
                         ml_g = ml_l_xgb,
                         ml_m = ml_m_xgb)

np.random.seed(42)
# Parametrized by user
dml_irm_xgb = DoubleMLIRM(dml_data,
                         ml_g = ml_l_xgb,
                         ml_m = ml_m_xgb,
                         n_folds = 3,
                         n_rep = 1,
                         score = 'ATE',
                         dml_procedure = 'dml2')
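
The difference between the two DML algorithms can be sketched with plain NumPy. DML1 solves the score equation in each fold and averages the fold-wise estimates. DML2 solves one score equation pooled over all folds. The simulated residuals below stand in for cross-fitted nuisance predictions from a partially linear model and are an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Simulated residuals (assumed): u = theta * v + noise with theta = 1.0
rng = np.random.default_rng(42)
n = 900
v = rng.normal(size=n)
u = 1.0 * v + rng.normal(size=n)

# Components of the linear score psi = psi_a * theta + psi_b
psi_a = -v * v
psi_b = v * u

# Test indices of a 3-fold split, matching n_folds = 3
kf = KFold(n_splits=3, shuffle=True, random_state=0)
folds = [test for _, test in kf.split(v)]

# DML1: average of the fold-wise solutions
theta_dml1 = np.mean([-psi_b[k].mean() / psi_a[k].mean() for k in folds])

# DML2: single solution of the pooled score equation
theta_dml2 = -psi_b.mean() / psi_a.mean()

print(theta_dml1, theta_dml2)
```

In this setup the two estimates differ only slightly.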

5. Estimation

  • Use the fit() method to estimate the model
# Estimation
dml_irm_rf.fit()

# Coefficient estimate
dml_irm_rf.coef
array([8121.56476394])
# Standard error
dml_irm_rf.se
array([1106.55324897])
# Summary
dml_irm_rf.summary.round(2)
         coef  std err     t  P>|t|    2.5 %    97.5 %
e401  8121.56  1106.55  7.34    0.0  5952.76  10290.37
# Estimation
dml_irm_xgb.fit()

# Coefficient estimate
dml_irm_xgb.coef
array([8278.72210592])
# Standard error
dml_irm_xgb.se
array([1210.25410979])
# Summary
dml_irm_xgb.summary.round(2)
         coef  std err     t  P>|t|    2.5 %    97.5 %
e401  8278.72  1210.25  6.84    0.0  5906.67  10650.78

5. Estimation

  • For an overview of a DoubleML object, use print()
# Overview of the fitted model
print(dml_irm_rf)
================== DoubleMLIRM Object ==================

------------------ Data summary      ------------------
Outcome variable: net_tfa
Treatment variable(s): ['e401']
Covariates: ['age', 'inc', 'educ', 'fsize', 'marr', 'twoearn', 'db', 'pira', 'hown']
Instrument variable(s): None
No. Observations: 9915

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=7, max_features=3, min_samples_leaf=3,
                      n_estimators=500)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=4, min_samples_leaf=7,
                       n_estimators=500)
Out-of-sample Performance:
Learner ml_g0 RMSE: [[47342.82318784]]
Learner ml_g1 RMSE: [[64073.50906324]]
Learner ml_m RMSE: [[0.44281669]]

------------------ Resampling        ------------------
No. folds: 3
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
             coef      std err         t         P>|t|        2.5 %  \
e401  8121.564764  1106.553249  7.339516  2.143684e-13  5952.760249   

            97.5 %  
e401  10290.369279  
print(dml_irm_xgb)
================== DoubleMLIRM Object ==================

------------------ Data summary      ------------------
Outcome variable: net_tfa
Treatment variable(s): ['e401']
Covariates: ['age', 'inc', 'educ', 'fsize', 'marr', 'twoearn', 'db', 'pira', 'hown']
Instrument variable(s): None
No. Observations: 9915

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eta=0.1, eval_metric=None,
             feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, n_estimators=35, n_jobs=None,
             num_parallel_tree=None, predictor=None, ...)
Learner ml_m: XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eta=0.1, eval_metric='logloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=34, n_jobs=None,
              num_parallel_tree=None, predictor=None, ...)
Out-of-sample Performance:
Learner ml_g0 RMSE: [[47198.39376508]]
Learner ml_g1 RMSE: [[65926.53453118]]
Learner ml_m RMSE: [[0.44567814]]

------------------ Resampling        ------------------
No. folds: 3
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
             coef     std err         t         P>|t|        2.5 %  \
e401  8278.722106  1210.25411  6.840483  7.892687e-12  5906.667639   

            97.5 %  
e401  10650.776573  

6. Inference

  • For confidence intervals use the confint() method
# Summary
dml_irm_rf.summary
             coef      std err         t         P>|t|        2.5 %        97.5 %
e401  8121.564764  1106.553249  7.339516  2.143684e-13  5952.760249  10290.369279
# Confidence intervals
dml_irm_rf.confint(level=0.95)
            2.5 %        97.5 %
e401  5952.760249  10290.369279
# Summary
dml_irm_xgb.summary
             coef     std err         t         P>|t|        2.5 %        97.5 %
e401  8278.722106  1210.25411  6.840483  7.892687e-12  5906.667639  10650.776573
# Confidence intervals
dml_irm_xgb.confint(level=0.95)
            2.5 %        97.5 %
e401  5906.667639  10650.776573
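
The reported random forest interval can be reproduced by hand from the coefficient and standard error shown above, using a normal approximation. This is a small arithmetic check with the Python standard library only. The variable names are chosen for illustration.

```python
from statistics import NormalDist

# Values reported by dml_irm_rf.coef and dml_irm_rf.se above
coef, se = 8121.56476394, 1106.55324897
level = 0.95

std_normal = NormalDist()
z = std_normal.inv_cdf(1 - (1 - level) / 2)      # two-sided critical value

t_stat = coef / se
p_value = 2 * (1 - std_normal.cdf(abs(t_stat)))
ci_lower, ci_upper = coef - z * se, coef + z * se

print(round(t_stat, 4), round(ci_lower, 2), round(ci_upper, 2))
```

The result agrees with the 2.5 % and 97.5 % columns of the summary, which suggests that the interval is the usual coef ± z · se construction.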

6. Inference

  • For joint confidence intervals, run the multiplier bootstrap with bootstrap() first, then call confint(joint = True)
# Multiplier bootstrap (relevant in case with multiple treatment variables)
_ = dml_irm_rf.bootstrap()

dml_irm_rf.confint(joint = True)
            2.5 %        97.5 %
e401  6001.807407  10241.322121
# Multiplier bootstrap (relevant in case with multiple treatment variables)
_ = dml_irm_xgb.bootstrap()

dml_irm_xgb.confint(joint = True)
            2.5 %        97.5 %
e401  6024.960385  10532.483827
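
The idea behind the multiplier bootstrap for joint confidence intervals can be sketched with NumPy. The score values are perturbed with random Gaussian multipliers, and the critical value is the quantile of the maximum absolute bootstrapped t-statistic across parameters. The simulated score values for three parameters are an assumption for illustration, and this is not DoubleML's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_boot = 2000, 3, 1000

# Hypothetical centred score (influence-function) values for p parameters
psi = rng.normal(size=(n, p))
psi -= psi.mean(axis=0)
sigma = psi.std(axis=0)

# Gaussian multipliers: one weight per observation and bootstrap draw
xi = rng.normal(size=(n_boot, n))
t_boot = (xi @ psi) / (np.sqrt(n) * sigma)       # shape (n_boot, p)

# Joint critical value: 95% quantile of the maximum |t| over all parameters
crit_joint = np.quantile(np.abs(t_boot).max(axis=1), 0.95)

# Critical value for the first parameter alone, for comparison
crit_single = np.quantile(np.abs(t_boot[:, 0]), 0.95)

print(crit_joint, crit_single)
# Joint interval for parameter j: theta_j ± crit_joint * sigma[j] / sqrt(n)
```

With a single treatment variable, as in the 401(k) example, the maximum runs over one statistic only. The joint interval then differs from the pointwise one only through bootstrap noise, which is consistent with the similar intervals shown above.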

Full Example: 401(k) Data


A more detailed version of the workflow example is available on docs.doubleml.org and in the DoubleML example gallery.

Hands-On Example


Now it’s your turn!


… Open the Uplift Modeling Notebook and follow the workflow…

Appendix

FAQs

Relation to other libraries for Causal ML
  • DoubleML provides a general implementation of the Double Machine Learning approach by Chernozhukov et al. (2018) in Python and R

  • There are also other open source libraries available for causal machine learning

    • CausalML (Uber, https://github.com/uber/causalml, Chen et al. (2020)) - variety of causal ML learners, among others with a focus on uplift modeling, CATEs and IATEs

    • EconML (Microsoft Research, https://github.com/microsoft/EconML, Microsoft Research (2019)) - various causal estimators based on machine learning, some of them based on the double machine learning approach

CausalML and EconML have focused on treatment effect heterogeneity from the start

DoubleML focuses on implementing the DML approach and its extensions (for example, heterogeneity, diff-in-diff, quantile regression, …)

\(\rightarrow\) Object-oriented implementation based on the orthogonal score

\(\rightarrow\) Extensibility and flexibility

References

Bach, Philipp, Victor Chernozhukov, Malte S Kurz, and Martin Spindler. 2022. “DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python.” Journal of Machine Learning Research 23 (53): 1–6.
Bach, Philipp, Victor Chernozhukov, Malte S Kurz, Martin Spindler, and Sven Klaassen. 2021. “DoubleML - An Object-Oriented Implementation of Double Machine Learning in R.” https://arxiv.org/abs/2103.09603.
Chen, Huigang, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao. 2020. “Causalml: Python Package for Causal Machine Learning.” arXiv Preprint arXiv:2002.11631.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68. https://onlinelibrary.wiley.com/doi/abs/10.1111/ectj.12097.
Chernozhukov, Victor, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. forthcoming. Applied Causal Inference Powered by ML and AI. online.
Microsoft Research. 2019. “EconML: A Python Package for ML-Based Heterogeneous Treatment Effects Estimation.” https://github.com/microsoft/EconML.