3. Models#

The DoubleML package includes the following models.

3.1. Partially linear regression model (PLR)#

Partially linear regression (PLR) models take the form

\[ \begin{align}\begin{aligned}Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}(\zeta | D,X) = 0,\\D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align} \]

where \(Y\) is the outcome variable and \(D\) is the policy variable of interest. The high-dimensional vector \(X = (X_1, \ldots, X_p)\) consists of other confounding covariates, and \(\zeta\) and \(V\) are stochastic errors.

digraph {
     nodesep=1;
     ranksep=1;
     rankdir=LR;
     { node [shape=circle, style=filled]
       Y [fillcolor="#56B4E9"]
       D [fillcolor="#F0E442"]
       V [fillcolor="#F0E442"]
       X [fillcolor="#D55E00"]
     }
     Y -> D -> V [dir="back"];
     X -> D;
     Y -> X [dir="back"];
}

Causal diagram#

DoubleMLPLR implements PLR models. Estimation is conducted via its fit() method:

In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.datasets import make_plr_CCDDHNR2018

In [4]: from sklearn.ensemble import RandomForestRegressor

In [5]: from sklearn.base import clone

In [6]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [7]: ml_l = clone(learner)

In [8]: ml_m = clone(learner)

In [9]: np.random.seed(1111)

In [10]: data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In [11]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [12]: dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m)

In [13]: print(dml_plr_obj.fit())
================== DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Learner ml_l RMSE: [[1.18346131]]
Learner ml_m RMSE: [[1.0600811]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
       coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.514216  0.044875  11.458956  2.120543e-30  0.426263  0.602168
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
lgr::get_logger("mlr3")$set_threshold("warn")

learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
set.seed(1111)
data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='data.table')
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_l, ml_m)
dml_plr_obj$fit()
print(dml_plr_obj)
================= DoubleMLPLR Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s): 
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.ranger
ml_m: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)    
d   0.47659    0.04166   11.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
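
After fitting, the estimated coefficient, standard error and confidence interval can be accessed directly from the fitted object. A minimal sketch for the Python object from above, relying on the coef and se attributes and the confint() method of DoubleML objects:

# point estimate and standard error for the treatment variable d
print(dml_plr_obj.coef)
print(dml_plr_obj.se)

# 95% confidence interval for the causal parameter
print(dml_plr_obj.confint(level=0.95))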


3.2. Partially linear IV regression model (PLIV)#

Partially linear IV regression (PLIV) models take the form

\[ \begin{align}\begin{aligned}Y - D \theta_0 = g_0(X) + \zeta, & &\mathbb{E}(\zeta | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align} \]

where \(Y\) is the outcome variable, \(D\) is the policy variable of interest and \(Z\) denotes one or multiple instrumental variables. The high-dimensional vector \(X = (X_1, \ldots, X_p)\) consists of other confounding covariates, and \(\zeta\) and \(V\) are stochastic errors.

digraph {
     nodesep=1;
     ranksep=1;
     rankdir=LR;
     { node [shape=circle, style=filled]
       Y [fillcolor="#56B4E9"]
       D [fillcolor="#56B4E9"]
       Z [fillcolor="#F0E442"]
       V [fillcolor="#F0E442"]
       X [fillcolor="#D55E00"]
     }

     Z -> V [dir="back"];
     D -> X [dir="back"];
     Y -> D [dir="both"];
     X -> Y;
     Z -> X [dir="back"];
     Z -> D;

     { rank=same; Y D }
     { rank=same; Z X }
         { rank=same; V }
}

Causal diagram#

DoubleMLPLIV implements PLIV models. Estimation is conducted via its fit() method:

In [14]: import numpy as np

In [15]: import doubleml as dml

In [16]: from doubleml.datasets import make_pliv_CHS2015

In [17]: from sklearn.ensemble import RandomForestRegressor

In [18]: from sklearn.base import clone

In [19]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [20]: ml_l = clone(learner)

In [21]: ml_m = clone(learner)

In [22]: ml_r = clone(learner)

In [23]: np.random.seed(2222)

In [24]: data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type='DataFrame')

In [25]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='Z1')

In [26]: dml_pliv_obj = dml.DoubleMLPLIV(obj_dml_data, ml_l, ml_m, ml_r)

In [27]: print(dml_pliv_obj.fit())
================== DoubleMLPLIV Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['Z1']
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Learner ml_l RMSE: [[1.48355347]]
Learner ml_m RMSE: [[0.53188141]]
Learner ml_r RMSE: [[1.25181913]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
       coef   std err        t         P>|t|     2.5 %    97.5 %
d  0.481267  0.084945  5.66564  1.464764e-08  0.314778  0.647756
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
ml_r = learner$clone()
set.seed(2222)
data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_col = "d", z_cols= "Z1")
dml_pliv_obj = DoubleMLPLIV$new(obj_dml_data, ml_l, ml_m, ml_r)
dml_pliv_obj$fit()
print(dml_pliv_obj)
================= DoubleMLPLIV Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s): Z1
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.ranger
ml_m: regr.ranger
ml_r: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)    
d   0.66184    0.07786     8.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
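
As noted above, \(Z\) may collect several instruments. A minimal sketch of the Python setup with two instruments, assuming the data generator returns columns named Z1 and Z2 when dim_z=2:

import numpy as np
import doubleml as dml
from doubleml.datasets import make_pliv_CHS2015
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import clone

learner = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=2)
np.random.seed(2222)

# generate data with two instruments and pass both columns via z_cols
data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=2, return_type='DataFrame')
obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols=['Z1', 'Z2'])

dml_pliv_obj = dml.DoubleMLPLIV(obj_dml_data, clone(learner), clone(learner), clone(learner))
dml_pliv_obj.fit()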


3.3. Interactive regression model (IRM)#

Interactive regression (IRM) models take the form

\[ \begin{align}\begin{aligned}Y = g_0(D, X) + U, & &\mathbb{E}(U | X, D) = 0,\\D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align} \]

where the treatment variable is binary, \(D \in \lbrace 0,1 \rbrace\). We consider estimation of the average treatment effects when treatment effects are fully heterogeneous. The target parameters of interest in this model are the average treatment effect (ATE),

\[\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X)]\]

and the average treatment effect of the treated (ATTE),

\[\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X) | D=1].\]

digraph {
     nodesep=1;
     ranksep=1;
     rankdir=LR;
     { node [shape=circle, style=filled]
       Y [fillcolor="#56B4E9"]
       D [fillcolor="#F0E442"]
       V [fillcolor="#F0E442"]
       X [fillcolor="#D55E00"]
     }
     Y -> D -> V [dir="back"];
     X -> D;
     Y -> X [dir="back"];
}

Causal diagram#

DoubleMLIRM implements IRM models. Estimation is conducted via its fit() method:

In [28]: import numpy as np

In [29]: import doubleml as dml

In [30]: from doubleml.datasets import make_irm_data

In [31]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [32]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [33]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [34]: np.random.seed(3333)

In [35]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In [36]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [37]: dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)

In [38]: print(dml_irm_obj.fit())
================== DoubleMLIRM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Learner ml_g0 RMSE: [[1.11796234]]
Learner ml_g1 RMSE: [[1.10851361]]
Learner ml_m RMSE: [[0.41907525]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
       coef   std err         t     P>|t|     2.5 %   97.5 %
d  0.593036  0.195602  3.031848  0.002431  0.209663  0.97641
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

set.seed(3333)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
dml_irm_obj$fit()
print(dml_irm_obj)
================= DoubleMLIRM Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s): 
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.ranger
ml_m: classif.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)   
d    0.6695     0.2097   3.192  0.00141 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
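
The example above uses the default ATE score. To target the ATTE instead, the score argument of DoubleMLIRM can be set accordingly; a minimal sketch reusing the data and learners from the Python example above:

# estimate the average treatment effect of the treated (ATTE) instead of the ATE
dml_irm_atte = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m, score='ATTE')
dml_irm_atte.fit()
print(dml_irm_atte.summary)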


3.4. Interactive IV model (IIVM)#

Interactive IV regression (IIVM) models take the form

\[ \begin{align}\begin{aligned}Y = \ell_0(D, X) + \zeta, & &\mathbb{E}(\zeta | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align} \]

where the treatment variable is binary, \(D \in \lbrace 0,1 \rbrace\) and the instrument is binary, \(Z \in \lbrace 0,1 \rbrace\). Consider the functions \(g_0\), \(r_0\) and \(m_0\), where \(g_0\) maps the support of \((Z,X)\) to \(\mathbb{R}\) and \(r_0\) and \(m_0\) respectively map the support of \((Z,X)\) and \(X\) to \((\varepsilon, 1-\varepsilon)\) for some \(\varepsilon \in (0, 1/2)\), such that

\[ \begin{align}\begin{aligned}Y = g_0(Z, X) + \nu, & &\mathbb{E}(\nu | Z, X) = 0,\\D = r_0(Z, X) + U, & &\mathbb{E}(U | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0.\end{aligned}\end{align} \]

The target parameter of interest in this model is the local average treatment effect (LATE),

\[\theta_0 = \frac{\mathbb{E}[g_0(1, X)] - \mathbb{E}[g_0(0,X)]}{\mathbb{E}[r_0(1, X)] - \mathbb{E}[r_0(0,X)]}.\]

digraph {
     nodesep=1;
     ranksep=1;
     rankdir=LR;
     { node [shape=circle, style=filled]
       Y [fillcolor="#56B4E9"]
       D [fillcolor="#56B4E9"]
       Z [fillcolor="#F0E442"]
       V [fillcolor="#F0E442"]
       X [fillcolor="#D55E00"]
     }

     Z -> V [dir="back"];
     D -> X [dir="back"];
     Y -> D [dir="both"];
     X -> Y;
     Z -> X [dir="back"];
     Z -> D;

     { rank=same; Y D }
     { rank=same; Z X }
         { rank=same; V }
}

Causal diagram#

DoubleMLIIVM implements IIVM models. Estimation is conducted via its fit() method:

In [39]: import numpy as np

In [40]: import doubleml as dml

In [41]: from doubleml.datasets import make_iivm_data

In [42]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [43]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [44]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [45]: ml_r = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [46]: np.random.seed(4444)

In [47]: data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1.0, return_type='DataFrame')

In [48]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='z')

In [49]: dml_iivm_obj = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r)

In [50]: print(dml_iivm_obj.fit())
================== DoubleMLIIVM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['z']
No. Observations: 1000

------------------ Score & algorithm ------------------
Score function: LATE
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Learner ml_g0 RMSE: [[1.12169423]]
Learner ml_g1 RMSE: [[1.1264581]]
Learner ml_m RMSE: [[0.49825336]]
Learner ml_r0 RMSE: [[0.50420135]]
Learner ml_r1 RMSE: [[0.36566158]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
      coef   std err         t     P>|t|     2.5 %   97.5 %
d  0.44928  0.224423  2.001934  0.045292  0.009419  0.88914
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

set.seed(4444)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_r = ml_m$clone()
data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d", z_cols="z")
dml_iivm_obj = DoubleMLIIVM$new(obj_dml_data, ml_g, ml_m, ml_r)
dml_iivm_obj$fit()
print(dml_iivm_obj)
================= DoubleMLIIVM Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20
Instrument(s): z
No. Observations: 1000

------------------ Score & algorithm ------------------
Score function: LATE
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.ranger
ml_m: classif.ranger
ml_r: classif.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)  
d    0.3569     0.1990   1.793    0.073 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


3.5. Difference-in-Differences Models (DID)#

The difference-in-differences (DID) models implemented in the package focus on the binary treatment case with two treatment periods.

Adopting the notation from Sant’Anna and Zhao (2020), let \(Y_{it}\) be the outcome of interest for unit \(i\) at time \(t\). Further, let \(D_{it}=1\) indicate if unit \(i\) is treated before time \(t\) (otherwise \(D_{it}=0\)). Since all units start as untreated (\(D_{i0}=0\)), define \(D_{i}=D_{i1}\). Relying on the potential outcome notation, denote \(Y_{it}(0)\) as the outcome of unit \(i\) at time \(t\) if the unit did not receive treatment up until time \(t\), and analogously \(Y_{it}(1)\) as the outcome with treatment. Consequently, the observed outcome for unit \(i\) at time \(t\) is \(Y_{it}=D_{it} Y_{it}(1) + (1-D_{it}) Y_{it}(0)\). Further, let \(X_i\) be a vector of pre-treatment covariates.

The target parameter of interest is the average treatment effect on the treated (ATTE),

\[\theta_0 = \mathbb{E}[Y_{i1}(1)- Y_{i1}(0)|D_i=1].\]

The corresponding identifying assumptions are

  • (Cond.) Parallel Trends: \(\mathbb{E}[Y_{i1}(0) - Y_{i0}(0)|X_i, D_i=1] = \mathbb{E}[Y_{i1}(0) - Y_{i0}(0)|X_i, D_i=0]\quad a.s.\)

  • Overlap: \(\exists\epsilon > 0\): \(P(D_i=1) > \epsilon\) and \(P(D_i=1|X_i) \le 1-\epsilon\quad a.s.\)

Note

For a more detailed introduction to the difference-in-differences literature and its recent developments, see e.g. Roth et al. (2022).

3.5.1. Panel data#

If panel data are available, the observations are assumed to be i.i.d. of the form \((Y_{i0}, Y_{i1}, D_i, X_i)\). Note that the difference \(\Delta Y_i = Y_{i1}-Y_{i0}\) has to be defined as the outcome y in the DoubleMLData object, for example as sketched below.
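
If the data set contains the outcomes of both periods rather than their difference, \(\Delta Y_i\) can be computed before constructing the DoubleMLData object. A minimal sketch with hypothetical columns y0 and y1 holding the pre- and post-treatment outcomes:

import numpy as np
import pandas as pd
import doubleml as dml

# hypothetical panel data with pre-/post-treatment outcomes y0 and y1,
# a binary treatment indicator d and covariates X1, X2
np.random.seed(123)
n = 500
df = pd.DataFrame({
    'y0': np.random.normal(size=n),
    'y1': np.random.normal(size=n),
    'd': np.random.binomial(1, 0.5, size=n),
    'X1': np.random.normal(size=n),
    'X2': np.random.normal(size=n),
})

# define the outcome as the difference Delta Y_i = Y_i1 - Y_i0
df['y'] = df['y1'] - df['y0']

# only the differenced outcome, the treatment and the covariates enter the data backend
obj_dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=['X1', 'X2'])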

DoubleMLDID implements difference-in-differences models for panel data. Estimation is conducted via its fit() method:

In [51]: import numpy as np

In [52]: import doubleml as dml

In [53]: from doubleml.datasets import make_did_SZ2020

In [54]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [55]: ml_g = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [56]: ml_m = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [57]: np.random.seed(42)

In [58]: data = make_did_SZ2020(n_obs=500, return_type='DataFrame')

# y is already defined as the difference of observed outcomes
In [59]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [60]: dml_did_obj = dml.DoubleMLDID(obj_dml_data, ml_g, ml_m)

In [61]: print(dml_did_obj.fit())
================== DoubleMLDID Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: observational
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, min_samples_leaf=5)
Learner ml_m: RandomForestClassifier(max_depth=5, min_samples_leaf=5)
Out-of-sample Performance:
Learner ml_g0 RMSE: [[16.1683004]]
Learner ml_g1 RMSE: [[14.14910752]]
Learner ml_m RMSE: [[0.48467874]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
       coef   std err         t     P>|t|     2.5 %    97.5 %
d -3.116908  2.029543 -1.535769  0.124595 -7.094739  0.860922

3.5.2. Repeated cross-sections#

For repeated cross-sections, the observations are assumed to be i.i.d. of the form \((Y_{i}, D_i, X_i, T_i)\), where \(T_i\) is a dummy variable indicating whether unit \(i\) is observed in the pre- or post-treatment period, such that the observed outcome can be written as

\[Y_i = T_i Y_{i1} + (1-T_i) Y_{i0}.\]

Further, treatment and covariates are assumed to be stationary, such that the joint distribution of \((D,X)\) is invariant to \(T\).

DoubleMLDIDCS implements difference-in-differences models for repeated cross-sections. Estimation is conducted via its fit() method:

In [62]: import numpy as np

In [63]: import doubleml as dml

In [64]: from doubleml.datasets import make_did_SZ2020

In [65]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [66]: ml_g = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [67]: ml_m = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [68]: np.random.seed(42)

In [69]: data = make_did_SZ2020(n_obs=500, cross_sectional_data=True, return_type='DataFrame')

In [70]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', t_col='t')

In [71]: dml_did_obj = dml.DoubleMLDIDCS(obj_dml_data, ml_g, ml_m)

In [72]: print(dml_did_obj.fit())
================== DoubleMLDIDCS Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: observational
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, min_samples_leaf=5)
Learner ml_m: RandomForestClassifier(max_depth=5, min_samples_leaf=5)
Out-of-sample Performance:
Learner ml_g_d0_t0 RMSE: [[17.66519949]]
Learner ml_g_d0_t1 RMSE: [[43.79590888]]
Learner ml_g_d1_t0 RMSE: [[33.27277427]]
Learner ml_g_d1_t1 RMSE: [[49.65172857]]
Learner ml_m RMSE: [[0.48909902]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
       coef   std err         t     P>|t|      2.5 %     97.5 %
d -6.604603  8.724009 -0.757061  0.449014 -23.703346  10.494139