Installation: DoubleML for Python
Please read the installation instructions and make sure you have installed the latest release of DoubleML on your local machine prior to the tutorial.
If you want to learn more about DoubleML upfront, feel free to read through our user guide.
Quick start
To install DoubleML via pip or conda without a virtual environment, type
pip install -U DoubleML
or
conda install -c conda-forge doubleml
More detailed installation instructions
For more information on installing DoubleML, read our online installation guide at docs.doubleml.org.
Load DoubleML
Load the DoubleML package after you have completed the installation.
import doubleml as dml
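If the import fails, the installation did not succeed. A quick, optional sanity check (illustrative only, not part of the tutorial) that reports whether the package can be found without raising an error:

```python
# Hedged sketch: check whether DoubleML is importable before the tutorial.
import importlib.util

spec = importlib.util.find_spec("doubleml")
if spec is None:
    print("DoubleML is not installed - run: pip install -U DoubleML")
else:
    import doubleml as dml
    print("DoubleML found at", spec.origin)
```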
Getting Ready for the Tutorial
To check whether you are ready for the tutorial, run the following example.
Load the Bonus data set.
from doubleml.datasets import fetch_bonus
# Load bonus data
df_bonus = fetch_bonus('DataFrame')
print(df_bonus.head(5))
## index abdt tg inuidur1 inuidur2 ... lusd husd muld dep1 dep2
## 0 0 10824 0 2.890372 18 ... 0 1 0 0.0 1.0
## 1 3 10824 0 0.000000 1 ... 1 0 0 0.0 0.0
## 2 4 10747 0 3.295837 27 ... 1 0 0 0.0 0.0
## 3 11 10607 1 2.197225 9 ... 0 0 1 0.0 0.0
## 4 12 10831 0 3.295837 27 ... 1 0 0 1.0 0.0
##
## [5 rows x 26 columns]
Create a data backend.
# Specify the data and variables for the causal model
from doubleml import DoubleMLData
dml_data_bonus = DoubleMLData(df_bonus,
y_col='inuidur1',
d_cols='tg',
x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
'durable', 'lusd', 'husd'])
print(dml_data_bonus)
## ================== DoubleMLData Object ==================
##
## ------------------ Data summary ------------------
## Outcome variable: inuidur1
## Treatment variable(s): ['tg']
## Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
## Instrument variable(s): None
## No. Observations: 5099
##
## ------------------ DataFrame info ------------------
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 5099 entries, 0 to 5098
## Columns: 26 entries, index to dep2
## dtypes: float64(3), int64(23)
## memory usage: 1.0 MB
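Conceptually, the data backend mainly records which columns play which causal role (outcome, treatment, covariates). A rough pandas-only sketch of that bookkeeping, using invented toy values and a hypothetical helper function (this is not the DoubleML implementation):

```python
import pandas as pd

# Toy frame mimicking the structure of the bonus data (values invented)
df = pd.DataFrame({
    'inuidur1': [2.89, 0.00, 3.30],   # outcome
    'tg':       [0, 1, 0],            # treatment
    'female':   [1, 0, 1],            # covariate
    'black':    [0, 0, 1],            # covariate
})

def split_roles(df, y_col, d_cols, x_cols):
    """Hypothetical helper: return outcome, treatment and covariate parts."""
    return df[y_col], df[d_cols], df[x_cols]

y, d, X = split_roles(df, 'inuidur1', ['tg'], ['female', 'black'])
print(y.shape, d.shape, X.shape)  # (3,) (3, 1) (3, 2)
```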
Create two learners for the nuisance components using scikit-learn.
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
learner = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=5)
ml_l_bonus = clone(learner)
ml_m_bonus = clone(learner)
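`clone` produces unfitted copies with the same hyperparameters but no shared state, so the two nuisance learners can be fitted independently. A quick check (illustrative, not required for the tutorial):

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

learner = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=5)
ml_l = clone(learner)
ml_m = clone(learner)

# Same hyperparameters, but distinct, independent estimator objects
print(ml_l.get_params() == ml_m.get_params())  # True
print(ml_l is ml_m)                            # False
```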
Create a new instance of a causal model, here a partially linear regression model via DoubleMLPLR.
import numpy as np
from doubleml import DoubleMLPLR
np.random.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR(dml_data_bonus, ml_l_bonus, ml_m_bonus)
obj_dml_plr_bonus.fit();
print(obj_dml_plr_bonus)
## ================== DoubleMLPLR Object ==================
##
## ------------------ Data summary ------------------
## Outcome variable: inuidur1
## Treatment variable(s): ['tg']
## Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
## Instrument variable(s): None
## No. Observations: 5099
##
## ------------------ Score & algorithm ------------------
## Score function: partialling out
## DML algorithm: dml2
##
## ------------------ Machine learner ------------------
## Learner ml_l: RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
## max_depth=5, max_features='sqrt', max_leaf_nodes=None,
## max_samples=None, min_impurity_decrease=0.0,
## min_impurity_split=None, min_samples_leaf=1,
## min_samples_split=2, min_weight_fraction_leaf=0.0,
## n_estimators=500, n_jobs=None, oob_score=False,
## random_state=None, verbose=0, warm_start=False)
## Learner ml_m: RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
## max_depth=5, max_features='sqrt', max_leaf_nodes=None,
## max_samples=None, min_impurity_decrease=0.0,
## min_impurity_split=None, min_samples_leaf=1,
## min_samples_split=2, min_weight_fraction_leaf=0.0,
## n_estimators=500, n_jobs=None, oob_score=False,
## random_state=None, verbose=0, warm_start=False)
##
## ------------------ Resampling ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: True
##
## ------------------ Fit summary ------------------
## coef std err t P>|t| 2.5 % 97.5 %
## tg -0.076691 0.035411 -2.165731 0.030332 -0.146096 -0.007286
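Under the hood, the "partialling out" score roughly amounts to regressing cross-fitted residuals of the outcome on cross-fitted residuals of the treatment. A hand-rolled sketch of that idea on synthetic data, using only scikit-learn (all names and the data-generating process here are illustrative, not the DoubleML API or the bonus data; the true effect is set to 0.5):

```python
# Hedged sketch: cross-fitted partialling-out on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
d = X[:, 0] + rng.normal(size=n)                  # treatment depends on X
y = 0.5 * d + X[:, 0] ** 2 + rng.normal(size=n)   # true effect = 0.5

res_y = np.empty(n)
res_d = np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    ml_l = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
    ml_m = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
    ml_l.fit(X[train], y[train])
    ml_m.fit(X[train], d[train])
    res_y[test] = y[test] - ml_l.predict(X[test])  # partial X out of y
    res_d[test] = d[test] - ml_m.predict(X[test])  # partial X out of d

# Final stage: OLS of outcome residuals on treatment residuals
theta = np.sum(res_d * res_y) / np.sum(res_d ** 2)
print(f"theta estimate: {theta:.2f}")
```

The DoubleMLPLR object above performs this kind of residual-on-residual estimation internally (with the configured folds and learners) and additionally provides valid standard errors and confidence intervals.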
Ready to go :-)
Once you are able to run this code, you are ready for our tutorial!