Getting started#

The purpose of the following case studies is to demonstrate the core functionalities of DoubleML.

Data#

For our case study, we download the Bonus data set from the Pennsylvania Reemployment Bonus experiment; as a second example, we simulate data from a partially linear regression model.

In [1]: import numpy as np

In [2]: from doubleml.datasets import fetch_bonus

# Load bonus data
In [3]: df_bonus = fetch_bonus('DataFrame')

In [4]: print(df_bonus.head(5))
   index   abdt  tg  inuidur1  inuidur2  ...  lusd  husd  muld  dep1  dep2
0      0  10824   0  2.890372        18  ...     0     1     0   0.0   1.0
1      3  10824   0  0.000000         1  ...     1     0     0   0.0   0.0
2      4  10747   0  3.295837        27  ...     1     0     0   0.0   0.0
3     11  10607   1  2.197225         9  ...     0     0     1   0.0   0.0
4     12  10831   0  3.295837        27  ...     1     0     0   1.0   0.0

[5 rows x 26 columns]

# Simulate data
In [5]: np.random.seed(3141)

In [6]: n_obs = 500

In [7]: n_vars = 100

In [8]: theta = 3

In [9]: X = np.random.normal(size=(n_obs, n_vars))

In [10]: d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))

In [11]: y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
library(DoubleML)

# Load bonus data
df_bonus = fetch_bonus(return_type="data.table")
head(df_bonus)

# Simulate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(rnorm(n_obs*n_vars), nrow=n_obs, ncol=n_vars)
d = X[,1:3]%*%c(5,5,5) + rnorm(n_obs)
y = theta*d + X[, 1:3]%*%c(5,5,5) + rnorm(n_obs)
A data.table: 6 × 17 (all columns of type <dbl>)

   inuidur1 female black othrace dep1 dep2 q2 q3 q4 q5 q6 agelt35 agegt54 durable lusd husd tg
1: 2.890372      0     0       0    0    1  0  0  0  1  0       0       0       0    0    1  0
2: 0.000000      0     0       0    0    0  0  0  0  1  0       0       0       0    1    0  0
3: 3.295837      0     0       0    0    0  0  0  1  0  0       0       0       0    1    0  0
4: 2.197225      0     0       0    0    0  0  1  0  0  0       1       0       0    0    0  1
5: 3.295837      0     0       0    1    0  0  0  0  1  0       0       1       1    1    0  0
6: 3.295837      1     0       0    0    0  0  0  0  1  0       0       1       0    1    0  0

The causal model#

As an example, we specify a partially linear regression model (PLR). PLR models take the form

\[
\begin{aligned}
Y &= D \theta_0 + g_0(X) + \zeta, & \mathbb{E}(\zeta \mid D, X) &= 0, \\
D &= m_0(X) + V, & \mathbb{E}(V \mid X) &= 0,
\end{aligned}
\]

where \(Y\) is the outcome variable and \(D\) is the policy variable of interest. The high-dimensional vector \(X = (X_1, \ldots, X_p)\) consists of other confounding covariates, and \(\zeta\) and \(V\) are stochastic errors. For details about the implemented models in the DoubleML package we refer to the user guide Models.
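
In the simulated example above, both nuisance functions can be read off the data generating process: \(g_0(X) = m_0(X) = 5 X_1 + 5 X_2 + 5 X_3\) and \(\theta_0 = 3\). The following minimal sketch (reusing the arrays X, d and y defined above) makes this mapping explicit:

# In the simulation, g_0 and m_0 coincide and theta_0 = 3
def g_0(X):
    return X[:, :3] @ np.array([5., 5., 5.])

m_0 = g_0

# The model residuals recover the standard normal errors zeta and V,
# so both empirical variances should be close to 1.
print(np.var(d - m_0(X)))          # approx. 1 (variance of V)
print(np.var(y - 3 * d - g_0(X)))  # approx. 1 (variance of zeta)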

The data-backend DoubleMLData#

DoubleML provides interfaces to dataframes as well as arrays. Details on the data-backend and the interfaces can be found in the user guide. The DoubleMLData class serves as data-backend and can be initialized from a dataframe by specifying the column y_col='inuidur1' serving as outcome variable \(Y\), the column(s) d_cols='tg' serving as treatment variable \(D\), and the columns x_cols specifying the confounders. Alternatively, an array interface can be used, as shown below for the simulated data.

In [12]: from doubleml import DoubleMLData

# Specify the data and the variables for the causal model
In [13]: dml_data_bonus = DoubleMLData(df_bonus,
   ....:                                 y_col='inuidur1',
   ....:                                 d_cols='tg',
   ....:                                 x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
   ....:                                         'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
   ....:                                         'durable', 'lusd', 'husd'])
   ....: 

In [14]: print(dml_data_bonus)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Columns: 26 entries, index to dep2
dtypes: float64(3), int64(23)
memory usage: 1.0 MB


# array interface to DoubleMLData
In [15]: dml_data_sim = DoubleMLData.from_arrays(X, y, d)

In [16]: print(dml_data_sim)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 102 entries, X1 to d
dtypes: float64(102)
memory usage: 398.6 KB
# Specify the data and variables for the causal model
dml_data_bonus = DoubleMLData$new(df_bonus,
                            y_col = "inuidur1",
                            d_cols = "tg",
                            x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                        "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                        "durable", "lusd", "husd"))
print(dml_data_bonus)

# matrix interface to DoubleMLData
dml_data_sim = double_ml_data_from_matrix(X=X, y=y, d=d)
dml_data_sim
================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s): 
No. Observations: 5099
================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s): 
No. Observations: 500
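
Note that x_cols is specified explicitly for the bonus data because the raw dataframe contains additional columns (e.g., index and abdt) that should not enter as confounders. As a minimal sketch, and assuming that the data-backend falls back to using all remaining columns as covariates when x_cols is omitted (see the user guide for the authoritative behavior):

# Hedged sketch: without x_cols, all columns other than the outcome and
# treatment variable(s) are (to our understanding) treated as covariates.
dml_data_default = DoubleMLData(df_bonus, y_col='inuidur1', d_cols='tg')
print(dml_data_default.x_cols)  # would then include e.g. 'index' and 'abdt'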

Learners to estimate the nuisance models#

To estimate our partially linear regression (PLR) model with the double machine learning algorithm, we first have to specify learners to estimate \(m_0\) and \(g_0\). For the bonus data we use a random forest regression model, and for the simulated data from a sparse partially linear model we use a Lasso regression model. The implementation of DoubleML is based on the meta-packages scikit-learn (Pedregosa et al., 2011) for Python and mlr3 (Lang et al., 2019) for R. For details on the specification of learners and their hyperparameters, we refer to the user guide Learners, hyperparameters and hyperparameter tuning.

In [17]: from sklearn.base import clone

In [18]: from sklearn.ensemble import RandomForestRegressor

In [19]: from sklearn.linear_model import LassoCV

In [20]: learner = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=5)

In [21]: ml_l_bonus = clone(learner)

In [22]: ml_m_bonus = clone(learner)

In [23]: learner = LassoCV()

In [24]: ml_l_sim = clone(learner)

In [25]: ml_m_sim = clone(learner)
library(mlr3)
library(mlr3learners)
# suppress messages from the mlr3 package during fitting
lgr::get_logger("mlr3")$set_threshold("warn")

learner = lrn("regr.ranger", num.trees=500, max.depth=5, min.node.size=2)
ml_l_bonus = learner$clone()
ml_m_bonus = learner$clone()

learner = lrn("regr.cv_glmnet", s="lambda.min")
ml_l_sim = learner$clone()
ml_m_sim = learner$clone()
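
Since any learner with a scikit-learn-compatible interface can be used, hyperparameter tuning can also be delegated to a wrapped estimator. A minimal sketch using GridSearchCV (the grid shown is purely illustrative, not a recommendation):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# The wrapped grid search is itself a scikit-learn regressor and can be
# passed to DoubleML in place of a plain learner.
param_grid = {'max_depth': [3, 5, 7]}
ml_l_tuned = GridSearchCV(RandomForestRegressor(n_estimators=500), param_grid, cv=3)
ml_m_tuned = GridSearchCV(RandomForestRegressor(n_estimators=500), param_grid, cv=3)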

Cross-fitting, DML algorithms and Neyman-orthogonal score functions#

When initializing the object for PLR models, DoubleMLPLR, we can further set parameters specifying the resampling: the number of folds used for cross-fitting, n_folds (defaults to n_folds=5), and the number of repetitions when applying repeated cross-fitting, n_rep (defaults to n_rep=1). Additionally, one can choose between the algorithms 'dml1' and 'dml2' via dml_procedure (defaults to 'dml2'). Depending on the causal model, one can further choose between different Neyman-orthogonal score / moment functions. For the PLR model the default score is 'partialling out', i.e.,

\[\psi(W; \theta, \eta) := [Y - \ell(X) - \theta (D - m(X))] [D - m(X)].\]

Note that with this score, we do not estimate \(g_0(X)\) directly; instead, we estimate the conditional expectation of \(Y\) given \(X\), \(\ell = \mathbb{E}[Y|X]\). The user guide provides details about the Sample-splitting, cross-fitting and repeated cross-fitting, the Double machine learning algorithms and the Score functions.
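
The resampling settings are plain constructor arguments, and with the partialling-out score the target parameter has a simple closed-form solution once cross-fitted predictions of \(\ell\) and \(m\) are available. A minimal sketch (ell_hat and m_hat are hypothetical arrays of cross-fitted predictions; the constructor call is illustrative and shows the defaults):

# Illustrative constructor call with explicit resampling and score settings:
# DoubleMLPLR(dml_data_bonus, ml_l_bonus, ml_m_bonus,
#             n_folds=5, n_rep=1, score='partialling out', dml_procedure='dml2')

# Setting the sample average of psi(W; theta, eta) to zero and solving for
# theta yields the partialling-out estimator:
def plr_partialling_out(y, d, ell_hat, m_hat):
    u = y - ell_hat  # residual of Y after partialling out X
    v = d - m_hat    # residual of D after partialling out X
    return np.sum(u * v) / np.sum(v * v)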

Estimate double/debiased machine learning models#

We now initialize DoubleMLPLR objects for our examples using default parameters. The models are estimated by calling the fit() method, and we can, for example, inspect the estimated treatment effect using the summary property. A more detailed result summary can be obtained via the string representation of the object. Besides the fit() method, DoubleML model classes also provide functionalities to perform statistical inference, like bootstrap(), confint() and p_adjust(); a short sketch follows the fit summaries below. For details, see the user guide Variance estimation and confidence intervals.

In [26]: from doubleml import DoubleMLPLR

In [27]: np.random.seed(3141)

In [28]: obj_dml_plr_bonus = DoubleMLPLR(dml_data_bonus, ml_l_bonus, ml_m_bonus)

In [29]: obj_dml_plr_bonus.fit();

In [30]: print(obj_dml_plr_bonus)
================== DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features='sqrt', n_estimators=500)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features='sqrt', n_estimators=500)
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[1.200303]]
Learner ml_m RMSE: [[0.47419634]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
        coef   std err         t     P>|t|     2.5 %   97.5 %
tg -0.076684  0.035411 -2.165549  0.030346 -0.146087 -0.00728

In [31]: obj_dml_plr_sim = DoubleMLPLR(dml_data_sim, ml_l_sim, ml_m_sim)

In [32]: obj_dml_plr_sim.fit();

In [33]: print(obj_dml_plr_sim)
================== DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: LassoCV()
Learner ml_m: LassoCV()
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[3.24469564]]
Learner ml_m RMSE: [[1.02016117]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
      coef   std err          t  P>|t|     2.5 %    97.5 %
d  3.02092  0.045379  66.570722    0.0  2.931978  3.109861
set.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR$new(dml_data_bonus, ml_l=ml_l_bonus, ml_m=ml_m_bonus)
obj_dml_plr_bonus$fit()
print(obj_dml_plr_bonus)

obj_dml_plr_sim = DoubleMLPLR$new(dml_data_sim, ml_l=ml_l_sim, ml_m=ml_m_sim)
obj_dml_plr_sim$fit()
print(obj_dml_plr_sim)
================= DoubleMLPLR Object ==================


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s): 
No. Observations: 5099

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.ranger
ml_m: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
   Estimate. Std. Error t value Pr(>|t|)  
tg  -0.07561    0.03536  -2.139   0.0325 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


================= DoubleMLPLR Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s): 
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.cv_glmnet
ml_m: regr.cv_glmnet

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)    
d   3.03411    0.04486   67.64   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
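
As mentioned above, the fitted model objects also expose inference utilities beyond the summary. A minimal sketch for the Python model fitted above (the number of bootstrap repetitions is illustrative):

# Pointwise 95% confidence interval based on the asymptotic distribution
print(obj_dml_plr_bonus.confint(level=0.95))

# Multiplier bootstrap, e.g. as a basis for joint confidence intervals
obj_dml_plr_bonus.bootstrap(method='normal', n_rep_boot=500)
print(obj_dml_plr_bonus.confint(joint=True))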


References#

  • Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., Bischl, B. (2019), mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software, doi:10.21105/joss.01903.

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011), Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825–2830, https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html.