Introduction to `DoubleML` for Python

Tools for Causality
Grenoble, Sept 25 - 29, 2023
Philipp Bach, Sven Klaassen

Introduction to Double Machine Learning

So far, we have focused on DML for the interactive regression model, using the doubly robust score to estimate the ATE
However, the DML framework of Chernozhukov et al. (2018) is much more general and works with any causal model that has an orthogonal score
We will now look at further examples and their implementation

DML can also be used for (partially) linear models, instrumental variables, quantile regression, difference-in-differences models, \(\ldots\)

Causal models, DoubleML

All of these models are based on the three key ingredients of DML and, hence, share a common structure
- Neyman-orthogonal score
- High-quality ML learners
- Sample splitting
We exploit the common structure of the causal models in the implementation of DoubleML (Bach et al. 2021, 2022)

The general statistical procedures for estimation of the causal parameter, confidence intervals, \(\ldots\) apply to all causal models within the DML framework
These methods are implemented in the abstract base class DoubleML
Model-specific parts are implemented in model classes which are subclasses of DoubleML
- Example: DoubleMLIRM has an implementation of the doubly robust score for estimating the ATE in the interactive regression model
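
The split between the base class and the model classes can be pictured with a small sketch. It is not the DoubleML source code: the names `LinearScoreModel`, `ToyIRM` and `score_elements` are hypothetical, the nuisance predictions are passed in directly instead of being learned with cross-fitting, and only the standard library is used.

```python
from abc import ABC, abstractmethod
from math import sqrt


class LinearScoreModel(ABC):
    """Hypothetical base class: generic estimation for a linear score
    psi = psi_a * theta + psi_b."""

    @abstractmethod
    def score_elements(self):
        """Return two equally long lists (psi_a, psi_b)."""

    def fit(self):
        psi_a, psi_b = self.score_elements()
        n = len(psi_a)
        mean_a = sum(psi_a) / n
        # Solve mean(psi_a) * theta + mean(psi_b) = 0 for theta
        self.coef = -sum(psi_b) / sum(psi_a)
        # Plug-in variance: mean(psi^2) / mean(psi_a)^2, then divided by n
        psi = [a * self.coef + b for a, b in zip(psi_a, psi_b)]
        var = sum(p * p for p in psi) / n / mean_a ** 2
        self.se = sqrt(var / n)
        return self


class ToyIRM(LinearScoreModel):
    """Hypothetical model class: doubly robust ATE score, binary treatment."""

    def __init__(self, y, d, g0_hat, g1_hat, m_hat):
        self.y, self.d = y, d
        self.g0_hat, self.g1_hat, self.m_hat = g0_hat, g1_hat, m_hat

    def score_elements(self):
        psi_a, psi_b = [], []
        for y, d, g0, g1, m in zip(self.y, self.d, self.g0_hat,
                                   self.g1_hat, self.m_hat):
            psi_a.append(-1.0)
            psi_b.append(g1 - g0
                         + d * (y - g1) / m
                         - (1 - d) * (y - g0) / (1 - m))
        return psi_a, psi_b


# Toy data with made-up nuisance predictions (not the 401(k) data)
model = ToyIRM(y=[4.0, 1.0, 4.0, 2.0], d=[1, 0, 1, 0],
               g0_hat=[1.0, 1.0, 2.0, 2.0], g1_hat=[3.0, 3.0, 4.0, 4.0],
               m_hat=[0.5, 0.5, 0.5, 0.5]).fit()
print(model.coef, model.se)
```

Only `score_elements` is model-specific here; the point estimate and the standard error come from the shared `fit` method.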

Class structure, DoubleML

Implementation and Theoretical Framework of DML

Key ingredients

Orthogonal Score
- Object-oriented implementation
- Exploit common structure being centered around a (linear) score function \(\psi(\cdot)\)
High-quality ML
- State-of-the-art ML prediction and tuning methods
- Provided by scikit-learn and scikit-learn-like learners
Sample Splitting
- General implementation of sample splitting
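
The linear score structure can be written out as follows. The notation is an assumption made for illustration, in the style of Chernozhukov et al. (2018): \(W\) denotes the data, \(\theta\) the causal parameter and \(\eta\) the nuisance functions; the last line is the estimator obtained by solving the empirical moment condition with estimated nuisance functions \(\hat{\eta}\).

```latex
\begin{aligned}
\psi(W; \theta, \eta) &= \psi_a(W; \eta)\,\theta + \psi_b(W; \eta), \\
\mathbb{E}\big[\psi(W; \theta_0, \eta_0)\big] &= 0, \\
\partial_\eta \mathbb{E}\big[\psi(W; \theta_0, \eta)\big]\Big|_{\eta = \eta_0} &= 0
\quad \text{(Neyman orthogonality)}, \\
\hat{\theta} &= -\frac{\sum_{i=1}^{n} \psi_b(W_i; \hat{\eta})}{\sum_{i=1}^{n} \psi_a(W_i; \hat{\eta})}.
\end{aligned}
```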

Papers, User Guide, Resources

Papers and Materials

R package - with a nontechnical introduction to DML: Bach et al. (2021)
Python package: Bach et al. (2022)
API documentation and examples: docs.doubleml.org

Software implementation: https://github.com/DoubleML/doubleml-for-py

Leave a 🌟 if you like 😄

DoubleML: Installation

DoubleML (Bach et al. 2021, 2022) is available on PyPI and conda-forge and can be installed via pip or conda
Latest PyPI release

pip install -U DoubleML

Latest conda-forge release

conda install -c conda-forge doubleml

Development version from GitHub

git clone git@github.com:DoubleML/doubleml-for-py.git
cd doubleml-for-py
pip install --editable .

Object Orientation

DoubleML gives the user a high degree of flexibility in specifying DML models
- Choice of ML learners for approximation of nuisance parameters
- Different resampling schemes
- DML algorithms (DML1, DML2)
- Different Neyman-orthogonal score functions
DoubleML can be easily extended
- New model classes with appropriate Neyman-orthogonal score functions can inherit from the abstract base class DoubleML
- Score functions can be provided as callables (functions in R)
- Resampling schemes are customizable in a flexible way
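
A score callable could look like the sketch below. The argument list `(y, d, l_hat, m_hat, g_hat, smpls)` is an assumption based on the user guide for the partially linear model class `DoubleMLPLR` and may differ between versions and model classes; the function name is hypothetical. It returns the two elements of a linear score, here the partialling-out score.

```python
def partialling_out_score(y, d, l_hat, m_hat, g_hat, smpls):
    """Return (psi_a, psi_b) of the linear score psi_a * theta + psi_b.

    Assumed argument order; g_hat and smpls are accepted but unused here.
    """
    u_hat = y - l_hat          # outcome residual
    v_hat = d - m_hat          # treatment residual
    psi_a = -v_hat * v_hat
    psi_b = v_hat * u_hat
    return psi_a, psi_b


# In practice the arguments are NumPy arrays; plain floats are enough
# to check the arithmetic for a single observation.
psi_a, psi_b = partialling_out_score(y=3.0, d=2.0, l_hat=1.0, m_hat=0.5,
                                     g_hat=None, smpls=None)
print(psi_a, psi_b)
```

Under the stated assumption, such a function would be handed to the model class through its `score` argument in place of a string.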

Getting Started with DoubleML

DoubleML Workflow Example

Workflow

0. Problem Formulation
1. Data-Backend
2. Causal Model
3. ML Methods
4. DML Specification
5. Estimation
6. Inference

0. Problem Formulation

401(k) Example
Goal: Estimate the ATE of eligibility for 401(k) pension plans on employees’ net financial assets
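
For reference, the target parameter and the doubly robust score can be written as below. This is a sketch in the usual notation of the interactive regression model, which is assumed here rather than taken from the slides: \(Y\) is net financial assets, \(D\) the binary eligibility indicator, \(X\) the controls, \(g\) the outcome regression and \(m\) the propensity score.

```latex
\begin{aligned}
Y &= g_0(D, X) + U, & \mathbb{E}[U \mid X, D] &= 0, \\
D &= m_0(X) + V, & \mathbb{E}[V \mid X] &= 0, \\
\theta_0 &= \mathbb{E}\big[g_0(1, X) - g_0(0, X)\big], \\
\psi(W; \theta, \eta) &= g(1, X) - g(0, X)
  + \frac{D\,\big(Y - g(1, X)\big)}{m(X)}
  - \frac{(1 - D)\,\big(Y - g(0, X)\big)}{1 - m(X)} - \theta .
\end{aligned}
```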

1. Data-Backend

Declare the roles for the treatment variable, the outcome variable and controls

from doubleml import DoubleMLData
from doubleml.datasets import fetch_401K

data = fetch_401K(return_type='DataFrame')

# Construct DoubleMLData object
dml_data = DoubleMLData(data, 
                        y_col='net_tfa',
                        d_cols='e401',
                        x_cols=['age', 'inc', 'educ', 'fsize', 'marr',
                                'twoearn', 'db', 'pira', 'hown'])

2. Causal Model

Choose the causal model in DoubleML, here the interactive regression model (IRM) implemented in DoubleMLIRM

3. ML Methods

Initialize the learners with hyperparameters
Internal tuning is optional

# Random forest learners
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

ml_l_rf = RandomForestRegressor(n_estimators = 500, max_depth = 7,
                                max_features = 3, min_samples_leaf = 3)

ml_m_rf = RandomForestClassifier(n_estimators = 500, max_depth = 5,
                                max_features = 4, min_samples_leaf = 7)

# XGBoost learners
from xgboost import XGBClassifier, XGBRegressor

ml_l_xgb = XGBRegressor(objective = "reg:squarederror", eta = 0.1,
                        n_estimators = 35)

ml_m_xgb = XGBClassifier(objective = "binary:logistic", eta = 0.1,
                         n_estimators = 34, use_label_encoder = False,
                         eval_metric = "logloss")

4. DML Specification

Initialize the DoubleML model object for the causal model, here DoubleMLIRM

Random forest

import numpy as np
from doubleml import DoubleMLIRM

# Default settings for resampling, score and DML algorithm
np.random.seed(42)
dml_irm_rf = DoubleMLIRM(dml_data,
                         ml_g = ml_l_rf,
                         ml_m = ml_m_rf)

# Same model with the settings chosen explicitly by the user;
# this assignment replaces the object created above.
# dml_procedure is accepted by the DoubleML release used for these slides;
# newer releases may no longer support it
np.random.seed(42)
dml_irm_rf = DoubleMLIRM(dml_data,
                         ml_g = ml_l_rf,
                         ml_m = ml_m_rf,
                         n_folds = 3,
                         n_rep = 1,
                         score = 'ATE',
                         dml_procedure = 'dml2')

XGBoost

# Default settings for resampling, score and DML algorithm
np.random.seed(42)
dml_irm_xgb = DoubleMLIRM(dml_data,
                          ml_g = ml_l_xgb,
                          ml_m = ml_m_xgb)

# Same model with the settings chosen explicitly by the user
np.random.seed(42)
dml_irm_xgb = DoubleMLIRM(dml_data,
                          ml_g = ml_l_xgb,
                          ml_m = ml_m_xgb,
                          n_folds = 3,
                          n_rep = 1,
                          score = 'ATE',
                          dml_procedure = 'dml2')
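
What `n_folds` and `n_rep` control can be illustrated with a standard-library sketch of repeated K-fold sample splitting. The helper `draw_sample_splits` is hypothetical and is not how DoubleML draws its partitions internally; it only shows the structure: `n_rep` independent partitions, each consisting of `n_folds` pairs of training and test indices.

```python
import random


def draw_sample_splits(n_obs, n_folds=3, n_rep=1, seed=42):
    """Return n_rep partitions; each is a list of (train_idx, test_idx) pairs."""
    rng = random.Random(seed)
    all_splits = []
    for _ in range(n_rep):
        idx = list(range(n_obs))
        rng.shuffle(idx)
        # Deal the shuffled indices into n_folds nearly equal folds
        folds = [sorted(idx[k::n_folds]) for k in range(n_folds)]
        partition = []
        for k, test_idx in enumerate(folds):
            # Train on every fold except the k-th one
            train_idx = sorted(i for j, fold in enumerate(folds) if j != k
                               for i in fold)
            partition.append((train_idx, test_idx))
        all_splits.append(partition)
    return all_splits


splits = draw_sample_splits(n_obs=10, n_folds=3, n_rep=1)
for train_idx, test_idx in splits[0]:
    print("train:", train_idx, "test:", test_idx)
```

In cross-fitting, the nuisance learners are fitted on each training set and used to predict on the matching test set, so every observation receives an out-of-sample prediction.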

5. Estimation

Use the fit() method to estimate the model

Random forest

# Estimation
dml_irm_rf.fit()

# Coefficient estimate
dml_irm_rf.coef

array([8121.56476394])

# Standard error
dml_irm_rf.se

array([1106.55324897])

# Summary
dml_irm_rf.summary.round(2)

         coef  std err     t  P>|t|    2.5 %    97.5 %
e401  8121.56  1106.55  7.34    0.0  5952.76  10290.37

XGBoost

# Estimation
dml_irm_xgb.fit()

# Coefficient estimate
dml_irm_xgb.coef

array([8278.72210592])

# Standard error
dml_irm_xgb.se

array([1210.25410979])

# Summary
dml_irm_xgb.summary.round(2)

         coef  std err     t  P>|t|    2.5 %    97.5 %
e401  8278.72  1210.25  6.84    0.0  5906.67  10650.78
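
The t statistic and p-value in the summary can be reproduced from the coefficient and standard error alone. The sketch below assumes a two-sided test against zero based on the standard normal distribution, which agrees with the printed random forest values; it uses only the standard library.

```python
from math import erfc, sqrt

coef = 8121.56476394   # dml_irm_rf.coef from the output above
se = 1106.55324897     # dml_irm_rf.se from the output above

t_stat = coef / se
# Two-sided p-value under a standard normal limit:
# 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
p_value = erfc(abs(t_stat) / sqrt(2))

print(round(t_stat, 2))
print(p_value)
```

The rounded summary shows the p-value as 0.0 only because of `.round(2)`; the unrounded value is small but positive.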

For an overview of a DoubleML object, use print()

Random forest

# Model overview
print(dml_irm_rf)

================== DoubleMLIRM Object ==================

------------------ Data summary      ------------------
Outcome variable: net_tfa
Treatment variable(s): ['e401']
Covariates: ['age', 'inc', 'educ', 'fsize', 'marr', 'twoearn', 'db', 'pira', 'hown']
Instrument variable(s): None
No. Observations: 9915

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=7, max_features=3, min_samples_leaf=3,
                      n_estimators=500)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=4, min_samples_leaf=7,
                       n_estimators=500)
Out-of-sample Performance:
Learner ml_g0 RMSE: [[47342.82318784]]
Learner ml_g1 RMSE: [[64073.50906324]]
Learner ml_m RMSE: [[0.44281669]]

------------------ Resampling        ------------------
No. folds: 3
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
             coef      std err         t         P>|t|        2.5 %  \
e401  8121.564764  1106.553249  7.339516  2.143684e-13  5952.760249   

            97.5 %  
e401  10290.369279

XGBoost

print(dml_irm_xgb)

================== DoubleMLIRM Object ==================

------------------ Data summary      ------------------
Outcome variable: net_tfa
Treatment variable(s): ['e401']
Covariates: ['age', 'inc', 'educ', 'fsize', 'marr', 'twoearn', 'db', 'pira', 'hown']
Instrument variable(s): None
No. Observations: 9915

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eta=0.1, eval_metric=None,
             feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, n_estimators=35, n_jobs=None,
             num_parallel_tree=None, predictor=None, ...)
Learner ml_m: XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eta=0.1, eval_metric='logloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=34, n_jobs=None,
              num_parallel_tree=None, predictor=None, ...)
Out-of-sample Performance:
Learner ml_g0 RMSE: [[47198.39376508]]
Learner ml_g1 RMSE: [[65926.53453118]]
Learner ml_m RMSE: [[0.44567814]]

------------------ Resampling        ------------------
No. folds: 3
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
             coef     std err         t         P>|t|        2.5 %  \
e401  8278.722106  1210.25411  6.840483  7.892687e-12  5906.667639   

            97.5 %  
e401  10650.776573

6. Inference

For confidence intervals use the confint() method

Random forest

# Summary
dml_irm_rf.summary

             coef      std err         t         P>|t|        2.5 %        97.5 %
e401  8121.564764  1106.553249  7.339516  2.143684e-13  5952.760249  10290.369279

# Confidence intervals
dml_irm_rf.confint(level=0.95)

            2.5 %        97.5 %
e401  5952.760249  10290.369279

XGBoost

# Summary
dml_irm_xgb.summary

             coef     std err         t         P>|t|        2.5 %        97.5 %
e401  8278.722106  1210.25411  6.840483  7.892687e-12  5906.667639  10650.776573

# Confidence intervals
dml_irm_xgb.confint(level=0.95)

            2.5 %        97.5 %
e401  5906.667639  10650.776573
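
The pointwise intervals above are consistent with the usual normal approximation, coefficient plus or minus a standard normal quantile times the standard error. A minimal sketch with the random forest values from the summary; the helper name `normal_confint` is hypothetical.

```python
from statistics import NormalDist


def normal_confint(coef, se, level=0.95):
    """Symmetric confidence interval based on the standard normal quantile."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return coef - z * se, coef + z * se


# Random forest estimate and standard error from the summary above
lower, upper = normal_confint(coef=8121.564764, se=1106.553249, level=0.95)
print(round(lower, 2), round(upper, 2))
```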

For joint confidence intervals, first run the multiplier bootstrap with the bootstrap() method, then call confint(joint = True)

Random forest

# Multiplier bootstrap (relevant with multiple treatment variables)
_ = dml_irm_rf.bootstrap()

dml_irm_rf.confint(joint = True)

            2.5 %        97.5 %
e401  6001.807407  10241.322121

XGBoost

# Multiplier bootstrap (relevant with multiple treatment variables)
_ = dml_irm_xgb.bootstrap()

dml_irm_xgb.confint(joint = True)

            2.5 %        97.5 %
e401  6024.960385  10532.483827
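
The idea behind the joint intervals can be sketched with a Gaussian multiplier bootstrap for the maximum absolute t statistic. This is an illustration under simplifying assumptions, not DoubleML's implementation: the function name, the toy score values and `n_boot = 500` are chosen here for the example, and the score values are taken as already evaluated at the estimate.

```python
import random
from math import ceil, sqrt


def bootstrap_critical_value(psi_columns, n_boot=500, level=0.95, seed=42):
    """Simulated critical value for the maximum absolute t statistic.

    psi_columns holds one list of score values per treatment variable,
    all of the same length n.
    """
    rng = random.Random(seed)
    n = len(psi_columns[0])
    # Scale of each score: sqrt(mean(psi^2))
    sigmas = [sqrt(sum(p * p for p in col) / n) for col in psi_columns]
    max_stats = []
    for _ in range(n_boot):
        xi = [rng.gauss(0.0, 1.0) for _ in range(n)]   # multiplier weights
        stats = [abs(sum(x * p for x, p in zip(xi, col))) / (sqrt(n) * s)
                 for col, s in zip(psi_columns, sigmas)]
        max_stats.append(max(stats))
    max_stats.sort()
    # Empirical quantile of the bootstrapped maxima
    return max_stats[ceil(level * n_boot) - 1]


# Made-up score values for two hypothetical treatment variables
data_rng = random.Random(0)
psi_1 = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
psi_2 = [data_rng.gauss(0.0, 1.0) for _ in range(200)]

crit_single = bootstrap_critical_value([psi_1])
crit_joint = bootstrap_critical_value([psi_1, psi_2])
print(crit_single, crit_joint)
```

A joint interval is then the coefficient plus or minus this critical value times the standard error. With a single treatment variable the simulated value is close to, but not exactly, the normal quantile 1.96, which is in line with the joint intervals above differing slightly from the pointwise ones.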

Full Example: 401(k) Data

A more detailed version of the workflow example is available on docs.doubleml.org and in the DoubleML example gallery

Hands-On Example

Now it’s your turn!

… Open the Uplift Modeling Notebook and follow the workflow…

Appendix

FAQs

Relation to other libraries for Causal ML

DoubleML provides a general implementation of the Double Machine Learning approach by Chernozhukov et al. (2018) in Python and R
There are also other open source libraries available for causal machine learning
- CausalML (Uber, https://github.com/uber/causalml, Chen et al. (2020)) - variety of causal ML learners, among others with a focus on uplift modeling, CATEs and IATEs
- EconML (Microsoft Research, https://github.com/microsoft/EconML, Microsoft Research (2019)) - various causal estimators based on machine learning, including ones based on the double machine learning approach
- …

CausalML and EconML have focused on heterogeneity of treatment effects from the start

DoubleML focuses on implementing the DML approach and its extensions (for example: heterogeneity, diff-in-diff, quantile regression, …)

\(\rightarrow\) Object-oriented implementation based on the orthogonal score

\(\rightarrow\) Extensibility and flexibility

References

Bach, Philipp, Victor Chernozhukov, Malte S Kurz, and Martin Spindler. 2022. “DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python.” Journal of Machine Learning Research 23 (53): 1–6.

Bach, Philipp, Victor Chernozhukov, Malte S Kurz, Martin Spindler, and Sven Klaassen. 2021. “DoubleML – An Object-Oriented Implementation of Double Machine Learning in R.” https://arxiv.org/abs/2103.09603.

Chen, Huigang, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao. 2020. “CausalML: Python Package for Causal Machine Learning.” arXiv Preprint arXiv:2002.11631.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68. https://onlinelibrary.wiley.com/doi/abs/10.1111/ectj.12097.

Chernozhukov, Victor, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. forthcoming. Applied Causal Inference Powered by ML and AI. online.

Microsoft Research. 2019. “EconML: A Python package for ML-based heterogeneous treatment effects estimation.” https://github.com/microsoft/EconML.
