Installation: DoubleML for Python
Please read the installation instructions and make sure you have installed the latest release of DoubleML on your local machine prior to the tutorial.
If you want to learn more about DoubleML upfront, feel free to read through our user guide.
Quick start
To install DoubleML via pip or conda without a virtual environment, type
pip install -U DoubleML
or
conda install -c conda-forge doubleml
More detailed installation instructions
For more information on installing DoubleML, read our online installation guide at docs.doubleml.org.
Load DoubleML
Load the DoubleML package after you have completed the installation.
import doubleml as dml
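If the import fails, the installation did not succeed. A quick, optional sanity check (illustrative only, not part of the tutorial) that reports whether the package can be found without raising an error:

```python
# Hedged sketch: check whether DoubleML is importable before the tutorial.
import importlib.util

spec = importlib.util.find_spec("doubleml")
if spec is None:
    print("DoubleML is not installed - run: pip install -U DoubleML")
else:
    import doubleml as dml
    print("DoubleML found at", spec.origin)
```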
Getting Ready for the Tutorial
To check whether you are ready for the tutorial, run the following example.
Load the Bonus data set.
from doubleml.datasets import fetch_bonus
# Load bonus data
df_bonus = fetch_bonus('DataFrame')
print(df_bonus.head(5))
## index abdt tg inuidur1 inuidur2 ... lusd husd muld dep1 dep2
## 0 0 10824 0 2.890372 18 ... 0 1 0 0.0 1.0
## 1 3 10824 0 0.000000 1 ... 1 0 0 0.0 0.0
## 2 4 10747 0 3.295837 27 ... 1 0 0 0.0 0.0
## 3 11 10607 1 2.197225 9 ... 0 0 1 0.0 0.0
## 4 12 10831 0 3.295837 27 ... 1 0 0 1.0 0.0
##
## [5 rows x 26 columns]
Create a data backend.
# Specify the data and variables for the causal model
from doubleml import DoubleMLData
dml_data_bonus = DoubleMLData(df_bonus,
y_col='inuidur1',
d_cols='tg',
x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
'durable', 'lusd', 'husd'])
print(dml_data_bonus)
## ================== DoubleMLData Object ==================
##
## ------------------ Data summary ------------------
## Outcome variable: inuidur1
## Treatment variable(s): ['tg']
## Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
## Instrument variable(s): None
## No. Observations: 5099
##
## ------------------ DataFrame info ------------------
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 5099 entries, 0 to 5098
## Columns: 26 entries, index to dep2
## dtypes: float64(3), int64(23)
## memory usage: 1.0 MB
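Conceptually, the data backend mainly records which columns play which causal role (outcome, treatment, covariates). A rough pandas-only sketch of that bookkeeping, using invented toy values and a hypothetical helper function (this is not the DoubleML implementation):

```python
import pandas as pd

# Toy frame mimicking the structure of the bonus data (values invented)
df = pd.DataFrame({
    'inuidur1': [2.89, 0.00, 3.30],   # outcome
    'tg':       [0, 1, 0],            # treatment
    'female':   [1, 0, 1],            # covariate
    'black':    [0, 0, 1],            # covariate
})

def split_roles(df, y_col, d_cols, x_cols):
    """Hypothetical helper: return outcome, treatment and covariate parts."""
    return df[y_col], df[d_cols], df[x_cols]

y, d, X = split_roles(df, 'inuidur1', ['tg'], ['female', 'black'])
print(y.shape, d.shape, X.shape)  # (3,) (3, 1) (3, 2)
```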
Create two learners for the nuisance components using scikit-learn.
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
learner = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=5)
ml_l_bonus = clone(learner)
ml_m_bonus = clone(learner)
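`clone` produces unfitted copies with the same hyperparameters but no shared state, so the two nuisance learners can be fitted independently. A quick check (illustrative, not required for the tutorial):

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

learner = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=5)
ml_l = clone(learner)
ml_m = clone(learner)

# Same hyperparameters, but distinct, independent estimator objects
print(ml_l.get_params() == ml_m.get_params())  # True
print(ml_l is ml_m)                            # False
```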
Create a new instance of a causal model, here a partially linear regression model via DoubleMLPLR.
import numpy as np
from doubleml import DoubleMLPLR
np.random.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR(dml_data_bonus, ml_l_bonus, ml_m_bonus)
obj_dml_plr_bonus.fit();
print(obj_dml_plr_bonus)
## ================== DoubleMLPLR Object ==================
##
## ------------------ Data summary ------------------
## Outcome variable: inuidur1
## Treatment variable(s): ['tg']
## Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
## Instrument variable(s): None
## No. Observations: 5099
##
## ------------------ Score & algorithm ------------------
## Score function: partialling out
## DML algorithm: dml2
##
## ------------------ Machine learner ------------------
## Learner ml_l: RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
## max_depth=5, max_features='sqrt', max_leaf_nodes=None,
## max_samples=None, min_impurity_decrease=0.0,
## min_impurity_split=None, min_samples_leaf=1,
## min_samples_split=2, min_weight_fraction_leaf=0.0,
## n_estimators=500, n_jobs=None, oob_score=False,
## random_state=None, verbose=0, warm_start=False)
## Learner ml_m: RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
## max_depth=5, max_features='sqrt', max_leaf_nodes=None,
## max_samples=None, min_impurity_decrease=0.0,
## min_impurity_split=None, min_samples_leaf=1,
## min_samples_split=2, min_weight_fraction_leaf=0.0,
## n_estimators=500, n_jobs=None, oob_score=False,
## random_state=None, verbose=0, warm_start=False)
##
## ------------------ Resampling ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: True
##
## ------------------ Fit summary ------------------
## coef std err t P>|t| 2.5 % 97.5 %
## tg -0.076691 0.035411 -2.165731 0.030332 -0.146096 -0.007286
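Under the hood, the "partialling out" score roughly amounts to regressing cross-fitted residuals of the outcome on cross-fitted residuals of the treatment. A hand-rolled sketch of that idea on synthetic data, using only scikit-learn (all names and the data-generating process here are illustrative, not the DoubleML API or the bonus data; the true effect is set to 0.5):

```python
# Hedged sketch: cross-fitted partialling-out on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
d = X[:, 0] + rng.normal(size=n)                  # treatment depends on X
y = 0.5 * d + X[:, 0] ** 2 + rng.normal(size=n)   # true effect = 0.5

res_y = np.empty(n)
res_d = np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    ml_l = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
    ml_m = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
    ml_l.fit(X[train], y[train])
    ml_m.fit(X[train], d[train])
    res_y[test] = y[test] - ml_l.predict(X[test])  # partial X out of y
    res_d[test] = d[test] - ml_m.predict(X[test])  # partial X out of d

# Final stage: OLS of outcome residuals on treatment residuals
theta = np.sum(res_d * res_y) / np.sum(res_d ** 2)
print(f"theta estimate: {theta:.2f}")
```

The DoubleMLPLR object above performs this kind of residual-on-residual estimation internally (with the configured folds and learners) and additionally provides valid standard errors and confidence intervals.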
Ready to go :-)
Once you are able to run this code, you are ready for our tutorial!