# 3. Models

The DoubleML package includes the following models.

## 3.1. Partially linear models (PLM)

The partially linear models (PLM) take the form

$Y = D \theta_0 + g_0(X) + \zeta,$

where the treatment effect enters additively and linearly, while the confounders may affect the outcome through the fully nonparametric function $$g_0(X)$$.
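To make this structure concrete, the following minimal simulation sketches data of the PLM form; the nuisance functions chosen below are purely illustrative and are not the data-generating processes used by the package:

import numpy as np

# Illustrative PLM data-generating process: Y = D * theta_0 + g_0(X) + zeta,
# with D = m_0(X) + V. The choices of g_0 and m_0 here are arbitrary.
np.random.seed(0)
n_obs, theta_0 = 500, 0.5
X = np.random.normal(size=(n_obs, 5))
D = np.sin(X[:, 0]) + np.random.normal(size=n_obs)                     # D = m_0(X) + V
Y = D * theta_0 + np.cos(X[:, 1]) ** 2 + np.random.normal(size=n_obs)  # Y = D*theta_0 + g_0(X) + zeta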

### 3.1.1. Partially linear regression model (PLR)

Partially linear regression (PLR) models take the form

\begin{align}\begin{aligned}Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}(\zeta | D,X) = 0,\\D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align}

where $$Y$$ is the outcome variable and $$D$$ is the policy variable of interest. The high-dimensional vector $$X = (X_1, \ldots, X_p)$$ consists of other confounding covariates, and $$\zeta$$ and $$V$$ are stochastic errors.

DoubleMLPLR implements PLR models. Estimation is conducted via its fit() method:

In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.datasets import make_plr_CCDDHNR2018

In [4]: from sklearn.ensemble import RandomForestRegressor

In [5]: from sklearn.base import clone

In [6]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [7]: ml_l = clone(learner)

In [8]: ml_m = clone(learner)

In [9]: np.random.seed(1111)

In [10]: data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In [11]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [12]: dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m)

In [13]: print(dml_plr_obj.fit())
================== DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[1.18356413]]
Learner ml_m RMSE: [[1.06008533]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
coef  std err         t         P>|t|     2.5 %    97.5 %
d  0.512672  0.04491  11.41566  3.492417e-30  0.424651  0.600694
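In addition to the 95% confidence interval reported in the fit summary, intervals at other levels can be obtained from the fitted object via the confint() method; a short continuation of the session above:

# Continuation of the session above: 90% confidence interval for the
# coefficient on d (level=0.95 is the default).
print(dml_plr_obj.confint(level=0.90))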

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
lgr::get_logger("mlr3")$set_threshold("warn") learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5) ml_l = learner$clone()
ml_m = learner$clone() set.seed(1111) data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='data.table') obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_l, ml_m) dml_plr_obj$fit()
print(dml_plr_obj)

================= DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s):
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.ranger
ml_m: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d   0.47319    0.04165   11.36   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



### 3.1.2. Partially linear IV regression model (PLIV)

Partially linear IV regression (PLIV) models take the form

\begin{align}\begin{aligned}Y - D \theta_0 = g_0(X) + \zeta, & &\mathbb{E}(\zeta | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0.\end{aligned}\end{align}

where $$Y$$ is the outcome variable, $$D$$ is the policy variable of interest, and $$Z$$ denotes one or multiple instrumental variables. The high-dimensional vector $$X = (X_1, \ldots, X_p)$$ consists of other confounding covariates, and $$\zeta$$ and $$V$$ are stochastic errors.

DoubleMLPLIV implements PLIV models. Estimation is conducted via its fit() method:

In [14]: import numpy as np

In [15]: import doubleml as dml

In [16]: from doubleml.datasets import make_pliv_CHS2015

In [17]: from sklearn.ensemble import RandomForestRegressor

In [18]: from sklearn.base import clone

In [19]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [20]: ml_l = clone(learner)

In [21]: ml_m = clone(learner)

In [22]: ml_r = clone(learner)

In [23]: np.random.seed(2222)

In [24]: data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type='DataFrame')

In [25]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='Z1')

In [26]: dml_pliv_obj = dml.DoubleMLPLIV(obj_dml_data, ml_l, ml_m, ml_r)

In [27]: print(dml_pliv_obj.fit())
================== DoubleMLPLIV Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['Z1']
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[1.48390784]]
Learner ml_m RMSE: [[0.53209683]]
Learner ml_r RMSE: [[1.25240463]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
coef   std err         t         P>|t|     2.5 %    97.5 %
d  0.481705  0.084337  5.711638  1.118938e-08  0.316407  0.647004

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
ml_r = learner$clone()
set.seed(2222)
data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d", z_cols="Z1")
dml_pliv_obj = DoubleMLPLIV$new(obj_dml_data, ml_l, ml_m, ml_r)
dml_pliv_obj$fit()
print(dml_pliv_obj)

================= DoubleMLPLIV Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s): Z1
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.ranger
ml_m: regr.ranger
ml_r: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d   0.66133    0.07796   8.483   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



## 3.2. Interactive regression models (IRM)

The interactive regression model (IRM) takes the form

$Y = g_0(D, X) + U,$

where treatment effects are fully heterogeneous.

### 3.2.1. Binary Interactive Regression Model (IRM)

Interactive regression (IRM) models take the form

\begin{align}\begin{aligned}Y = g_0(D, X) + U, & &\mathbb{E}(U | X, D) = 0,\\D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align}

where the treatment variable is binary, $$D \in \lbrace 0,1 \rbrace$$. We consider estimation of the average treatment effects when treatment effects are fully heterogeneous.

Target parameters of interest in this model are the average treatment effect (ATE),

$\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X)]$

and the average treatment effect of the treated (ATTE),

$\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X) | D=1].$

DoubleMLIRM implements IRM models. Estimation is conducted via its fit() method:

In [28]: import numpy as np

In [29]: import doubleml as dml

In [30]: from doubleml.datasets import make_irm_data

In [31]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [32]: ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [33]: ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [34]: np.random.seed(3333)

In [35]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')

In [36]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [37]: dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)

In [38]: print(dml_irm_obj.fit())
================== DoubleMLIRM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: ATE

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=10, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=10, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[1.07085301]]
Learner ml_g1 RMSE: [[1.09682314]]
Classification:
Learner ml_m Log Loss: [[0.55863386]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
coef  std err         t     P>|t|     2.5 %    97.5 %
d  0.599297   0.1887  3.175931  0.001494  0.229452  0.969141
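The example above estimates the ATE, which corresponds to the default score of DoubleMLIRM. The ATTE defined earlier can be targeted via the score argument; a minimal sketch reusing the objects from the session above:

# Reusing obj_dml_data, ml_g and ml_m from above: target the ATTE
# by setting the score argument (default is score='ATE').
dml_irm_atte = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m, score='ATTE')
print(dml_irm_atte.fit().summary)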

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

set.seed(3333)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
dml_irm_obj$fit()
print(dml_irm_obj)

================= DoubleMLIRM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
Instrument(s):
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.ranger
ml_m: classif.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d   0.3958     0.2651   1.493    0.135



### 3.2.2. Average Potential Outcomes (APOs)

For general discrete-valued treatments $$D \in \lbrace d_0, \dots, d_l \rbrace$$ the model can be generalized to

\begin{align}\begin{aligned}Y = g_0(D, X) + U, & &\mathbb{E}(U | X, D) = 0,\\A_j = m_{0,j}(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align}

where $$A_j := 1\lbrace D = d_j\rbrace$$ is an indicator variable for treatment level $$d_j$$ and $$m_{0,j}(X)$$ denotes the corresponding propensity score.

Possible target parameters of interest in this model are the average potential outcomes (APOs)

$\theta_{0,j} = \mathbb{E}[g_0(d_j, X)].$

DoubleMLAPO implements the estimation of average potential outcomes. Estimation is conducted via its fit() method:

In [39]: import numpy as np

In [40]: import doubleml as dml

In [41]: from doubleml.datasets import make_irm_data

In [42]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [43]: ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [44]: ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [45]: np.random.seed(3333)

In [46]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')

In [47]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [48]: dml_apo_obj = dml.DoubleMLAPO(obj_dml_data, ml_g, ml_m, treatment_level=0)

In [49]: print(dml_apo_obj.fit())
================== DoubleMLAPO Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: APO

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=10, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=10, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[1.08191204]]
Learner ml_g1 RMSE: [[1.06694255]]
Classification:
Learner ml_m Log Loss: [[0.55863386]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
coef   std err         t     P>|t|    2.5 %    97.5 %
d -0.119669  0.137529 -0.870142  0.384223 -0.38922  0.149882


### 3.2.3. Average Potential Outcomes (APOs) for Multiple Treatment Levels

If multiple treatment levels are to be estimated simultaneously, further possible target parameters of interest in this model are the contrasts (or average treatment effects) between treatment levels $$d_j$$ and $$d_k$$:

$\theta_{0,jk} = \mathbb{E}[g_0(d_j, X) - g_0(d_k, X)].$

DoubleMLAPOS implements the estimation of average potential outcomes for multiple treatment levels. Estimation is conducted via its fit() method. The causal_contrast() method allows estimating causal contrasts between treatment levels:

In [50]: import numpy as np

In [51]: import doubleml as dml

In [52]: from doubleml.datasets import make_irm_data

In [53]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [54]: ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [55]: ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [56]: np.random.seed(3333)

In [57]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')

In [58]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [59]: dml_apos_obj = dml.DoubleMLAPOS(obj_dml_data, ml_g, ml_m, treatment_levels=[0, 1])

In [60]: print(dml_apos_obj.fit())
================== DoubleMLAPOS Object ==================

------------------ Fit summary       ------------------
coef   std err         t     P>|t|     2.5 %    97.5 %
0 -0.162784  0.180190 -0.903406  0.366310 -0.515950  0.190381
1  0.526102  0.137809  3.817628  0.000135  0.256002  0.796203

In [61]: causal_contrast_model = dml_apos_obj.causal_contrast(reference_levels=0)

In [62]: print(causal_contrast_model.summary)
coef   std err         t     P>|t|     2.5 %    97.5 %
1 vs 0  0.688887  0.228214  3.018602  0.002539  0.241596  1.136178


### 3.2.4. Interactive IV model (IIVM)

Interactive IV regression (IIVM) models take the form

\begin{align}\begin{aligned}Y = \ell_0(D, X) + \zeta, & &\mathbb{E}(\zeta | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align}

where the treatment variable is binary, $$D \in \lbrace 0,1 \rbrace$$, and the instrument is binary, $$Z \in \lbrace 0,1 \rbrace$$. Consider the functions $$g_0$$, $$r_0$$ and $$m_0$$, where $$g_0$$ maps the support of $$(Z,X)$$ to $$\mathbb{R}$$ and $$r_0$$ and $$m_0$$ respectively map the support of $$(Z,X)$$ and $$X$$ to $$(\varepsilon, 1-\varepsilon)$$ for some $$\varepsilon \in (0, 1/2)$$, such that

\begin{align}\begin{aligned}Y = g_0(Z, X) + \nu, & &\mathbb{E}(\nu | Z, X) = 0,\\D = r_0(Z, X) + U, & &\mathbb{E}(U | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0.\end{aligned}\end{align}

The target parameter of interest in this model is the local average treatment effect (LATE),

$\theta_0 = \frac{\mathbb{E}[g_0(1, X)] - \mathbb{E}[g_0(0,X)]}{\mathbb{E}[r_0(1, X)] - \mathbb{E}[r_0(0,X)]}.$

DoubleMLIIVM implements IIVM models. Estimation is conducted via its fit() method:

In [63]: import numpy as np

In [64]: import doubleml as dml

In [65]: from doubleml.datasets import make_iivm_data

In [66]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [67]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [68]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [69]: ml_r = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [70]: np.random.seed(4444)

In [71]: data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1.0, return_type='DataFrame')

In [72]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='z')

In [73]: dml_iivm_obj = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r)

In [74]: print(dml_iivm_obj.fit())
================== DoubleMLIIVM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['z']
No. Observations: 1000

------------------ Score & algorithm ------------------
Score function: LATE

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[1.12983057]]
Learner ml_g1 RMSE: [[1.13102231]]
Classification:
Learner ml_m Log Loss: [[0.69684828]]
Learner ml_r0 Log Loss: [[0.69508862]]
Learner ml_r1 Log Loss: [[0.43503345]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
coef  std err         t     P>|t|     2.5 %    97.5 %
d  0.389126  0.23113  1.683581  0.092263 -0.063881  0.842132

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

set.seed(4444)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_r = ml_m$clone()
data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d", z_cols="z")
dml_iivm_obj = DoubleMLIIVM$new(obj_dml_data, ml_g, ml_m, ml_r)
dml_iivm_obj$fit()
print(dml_iivm_obj)

================= DoubleMLIIVM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20
Instrument(s): z
No. Observations: 1000

------------------ Score & algorithm ------------------
Score function: LATE
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.ranger
ml_m: classif.ranger
ml_r: classif.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d    0.3568     0.1988   1.794   0.0727 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
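If one-sided compliance is known a priori, DoubleMLIIVM accepts a subgroups argument, a dictionary with keys 'always_takers' and 'never_takers', to encode which compliance subgroups are present; a sketch reusing the Python objects from the example above:

# Reusing obj_dml_data, ml_g, ml_m and ml_r from the Python example above:
# assume it is known that there are no always-takers.
dml_iivm_no_at = dml.DoubleMLIIVM(
    obj_dml_data, ml_g, ml_m, ml_r,
    subgroups={'always_takers': False, 'never_takers': True})
print(dml_iivm_no_at.fit().summary)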



## 3.3. Difference-in-Differences Models (DID)

Difference-in-Differences Models (DID) implemented in the package focus on the binary treatment case with two treatment periods.

Adopting the notation from Sant’Anna and Zhao (2020), let $$Y_{it}$$ be the outcome of interest for unit $$i$$ at time $$t$$. Further, let $$D_{it}=1$$ indicate if unit $$i$$ is treated before time $$t$$ (otherwise $$D_{it}=0$$). Since all units start as untreated ($$D_{i0}=0$$), define $$D_{i}=D_{i1}.$$ Relying on the potential outcome notation, denote $$Y_{it}(0)$$ as the outcome of unit $$i$$ at time $$t$$ if the unit did not receive treatment up until time $$t$$, and analogously $$Y_{it}(1)$$ with treatment. Consequently, the observed outcome for unit $$i$$ at time $$t$$ is $$Y_{it}=D_{it} Y_{it}(1) + (1-D_{it}) Y_{it}(0)$$. Further, let $$X_i$$ be a vector of pre-treatment covariates.

The target parameter of interest is the average treatment effect on the treated (ATTE)

$\theta_0 = \mathbb{E}[Y_{i1}(1)- Y_{i1}(0)|D_i=1].$

The corresponding identifying assumptions are

• (Cond.) Parallel Trends: $$\mathbb{E}[Y_{i1}(0) - Y_{i0}(0)|X_i, D_i=1] = \mathbb{E}[Y_{i1}(0) - Y_{i0}(0)|X_i, D_i=0]\quad a.s.$$

• Overlap: $$\exists\epsilon > 0$$: $$P(D_i=1) > \epsilon$$ and $$P(D_i=1|X_i) \le 1-\epsilon\quad a.s.$$

Note

For a more detailed introduction and recent developments of the difference-in-differences literature see e.g. Roth et al. (2022).

### 3.3.1. Panel data

If panel data are available, the observations are assumed to be i.i.d. of the form $$(Y_{i0}, Y_{i1}, D_i, X_i)$$. Note that the difference $$\Delta Y_i= Y_{i1}-Y_{i0}$$ has to be defined as the outcome y in the DoubleMLData object, as sketched below.
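For instance, starting from a wide panel with one row per unit, the difference can be constructed before building the data object; a sketch in which the column names y0, y1, d and X1 are purely illustrative:

import numpy as np
import pandas as pd
import doubleml as dml

# Sketch with illustrative column names: construct Delta Y_i = Y_i1 - Y_i0
# from a wide panel and pass it as the outcome y.
np.random.seed(0)
df_wide = pd.DataFrame({'y0': np.random.normal(size=100),
                        'y1': np.random.normal(size=100),
                        'd': np.random.binomial(1, 0.5, size=100),
                        'X1': np.random.normal(size=100)})
df_wide['y'] = df_wide['y1'] - df_wide['y0']
panel_data = dml.DoubleMLData(df_wide, y_col='y', d_cols='d', x_cols=['X1'])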

DoubleMLDID implements difference-in-differences models for panel data. Estimation is conducted via its fit() method:

In [75]: import numpy as np

In [76]: import doubleml as dml

In [77]: from doubleml.datasets import make_did_SZ2020

In [78]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [79]: ml_g = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [80]: ml_m = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [81]: np.random.seed(42)

In [82]: data = make_did_SZ2020(n_obs=500, return_type='DataFrame')

# y is already defined as the difference of observed outcomes
In [83]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [84]: dml_did_obj = dml.DoubleMLDID(obj_dml_data, ml_g, ml_m)

In [85]: print(dml_did_obj.fit())
================== DoubleMLDID Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: observational

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, min_samples_leaf=5)
Learner ml_m: RandomForestClassifier(max_depth=5, min_samples_leaf=5)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[16.27429763]]
Learner ml_g1 RMSE: [[13.35731523]]
Classification:
Learner ml_m Log Loss: [[0.66601815]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
coef   std err         t     P>|t|     2.5 %    97.5 %
d -2.840718  1.760386 -1.613691  0.106595 -6.291011  0.609575


### 3.3.2. Repeated cross-sections

For repeated cross-sections, the observations are assumed to be i.i.d. of the form $$(Y_{i}, D_i, X_i, T_i)$$, where $$T_i$$ is a dummy variable indicating whether unit $$i$$ is observed in the pre- or post-treatment period, such that the observed outcome can be defined as

$Y_i = T_i Y_{i1} + (1-T_i) Y_{i0}.$

Further, treatment and covariates are assumed to be stationary, such that the joint distribution of $$(D,X)$$ is invariant to $$T$$.

DoubleMLDIDCS implements difference-in-differences models for repeated cross-sections. Estimation is conducted via its fit() method:

In [86]: import numpy as np

In [87]: import doubleml as dml

In [88]: from doubleml.datasets import make_did_SZ2020

In [89]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [90]: ml_g = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [91]: ml_m = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [92]: np.random.seed(42)

In [93]: data = make_did_SZ2020(n_obs=500, cross_sectional_data=True, return_type='DataFrame')

In [94]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', t_col='t')

In [95]: dml_did_obj = dml.DoubleMLDIDCS(obj_dml_data, ml_g, ml_m)

In [96]: print(dml_did_obj.fit())
================== DoubleMLDIDCS Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: observational

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, min_samples_leaf=5)
Learner ml_m: RandomForestClassifier(max_depth=5, min_samples_leaf=5)
Out-of-sample Performance:
Regression:
Learner ml_g_d0_t0 RMSE: [[17.4915707]]
Learner ml_g_d0_t1 RMSE: [[44.85397773]]
Learner ml_g_d1_t0 RMSE: [[32.74938952]]
Learner ml_g_d1_t1 RMSE: [[53.7282094]]
Classification:
Learner ml_m Log Loss: [[0.67936506]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
coef   std err         t     P>|t|      2.5 %    97.5 %
d -4.9944  7.561785 -0.660479  0.508947 -19.815226  9.826426


## 3.4. Sample Selection Models (SSM)

Sample Selection Models (SSM) implemented in the package focus on the binary treatment case when outcomes are only observed for a subpopulation due to sample selection or outcome attrition.

The implementation and notation is based on Bia, Huber and Lafférs (2023). Let $$D_i$$ be the binary treatment indicator and $$Y_{i}(d)$$ the potential outcome under treatment value $$d$$. Further, define $$Y_{i}:=Y_{i}(D_i)$$ to be the realized outcome and $$S_{i}$$ as a binary selection indicator. The outcome $$Y_{i}$$ is only observed if $$S_{i}=1$$. Finally, let $$X_i$$ be a vector of observed covariates, measured prior to treatment assignment.

The target parameter of interest is the average treatment effect (ATE)

$\theta_0 = \mathbb{E}[Y_{i}(1)- Y_{i}(0)].$

The corresponding identifying assumption is

• Cond. Independence of Treatment: $$Y_i(d) \perp D_i|X_i\quad a.s.$$ for $$d=0,1$$

where further assumptions are made in the context of the respective sample selection model.

Note

A more detailed example can be found in the Example Gallery.

### 3.4.1. Missingness at Random

Consider the following two additional assumptions for the sample selection model:

• Cond. Independence of Selection: $$Y_i(d) \perp S_i|D_i=d, X_i\quad a.s.$$ for $$d=0,1$$

• Common Support: $$P(D_i=1|X_i)>0$$ and $$P(S_i=1|D_i=d, X_i)>0$$ for $$d=0,1$$

such that outcomes are missing at random (for the score see Scores).

DoubleMLSSM implements sample selection models. The argument score='missing-at-random' selects the corresponding score relying on the assumptions above. The DoubleMLData object has to be defined with the additional argument s_col for the selection indicator. Estimation is conducted via its fit() method:

In [97]: import numpy as np

In [98]: from sklearn.linear_model import LassoCV, LogisticRegressionCV

In [99]: from doubleml.datasets import make_ssm_data

In [100]: import doubleml as dml

In [101]: np.random.seed(42)

In [102]: n_obs = 2000

In [103]: df = make_ssm_data(n_obs=n_obs, mar=True, return_type='DataFrame')

In [104]: dml_data = dml.DoubleMLData(df, 'y', 'd', s_col='s')

In [105]: ml_g = LassoCV()

In [106]: ml_m = LogisticRegressionCV(penalty='l1', solver='liblinear')

In [107]: ml_pi = LogisticRegressionCV(penalty='l1', solver='liblinear')

In [108]: dml_ssm = dml.DoubleMLSSM(dml_data, ml_g, ml_m, ml_pi, score='missing-at-random')

In [109]: dml_ssm.fit()
Out[109]: <doubleml.irm.ssm.DoubleMLSSM at 0x7fe30f670fe0>

In [110]: print(dml_ssm)
================== DoubleMLSSM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
Selection variable: s
No. Observations: 2000

------------------ Score & algorithm ------------------
Score function: missing-at-random

------------------ Machine learner   ------------------
Learner ml_g: LassoCV()
Learner ml_pi: LogisticRegressionCV(penalty='l1', solver='liblinear')
Learner ml_m: LogisticRegressionCV(penalty='l1', solver='liblinear')
Out-of-sample Performance:
Regression:
Learner ml_g_d0 RMSE: [[1.10039862]]
Learner ml_g_d1 RMSE: [[1.11071087]]
Classification:
Learner ml_pi Log Loss: [[0.53791422]]
Learner ml_m Log Loss: [[0.63593298]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.965531  0.065969  14.636048  1.654070e-48  0.836234  1.094829


### 3.4.2. Nonignorable Nonresponse

When sample selection or outcome attrition is related to unobservables, identification generally requires an instrument for the selection indicator $$S_i$$. Consider the following additional assumptions for the instrumental variable:

• Cond. Correlation: $$\exists Z: \mathbb{E}[Z\cdot S|D,X] \neq 0$$

• Cond. Independence: $$Y_i(d,z)=Y_i(d)$$ and $$Y_i \perp Z_i|D_i=d, X_i\quad a.s.$$ for $$d=0,1$$

This requires the instrumental variable $$Z_i$$, which must not affect $$Y_i$$ or be associated with unobservables affecting $$Y_i$$ conditional on $$D_i$$ and $$X_i$$. Further, the selection is determined via an (unknown) threshold model:

• Threshold: $$S_i = 1\{V_i \le \xi(D,X,Z)\}$$ where $$\xi$$ is a general function and $$V_i$$ is a scalar with strictly monotonic cumulative distribution function conditional on $$X_i$$.

• Cond. Independence: $$Y_i \perp (Z_i, D_i)|X_i$$.

Let $$\Pi_i := P(S_i=1|D_i, X_i, Z_i)$$ denote the selection probability. Additionally, the following assumptions are required:

• Common Support for Treatment: $$P(D_i=1|X_i, \Pi)>0$$

• Cond. Effect Homogeneity: $$\mathbb{E}[Y_i(1)-Y_i(0)|S_i=1, X_i=x, V_i=v] = \mathbb{E}[Y_i(1)-Y_i(0)|X_i=x, V_i=v]$$

• Common Support for Selection: $$P(S_i=1|D_i=d, X_i=x, Z_i=z)>0\quad a.s.$$ for $$d=0,1$$

For further details, see Bia, Huber and Lafférs (2023).

DoubleMLSSM implements sample selection models. The argument score='nonignorable' selects the corresponding score relying on the assumptions above. The DoubleMLData object has to be defined with the additional arguments s_col for the selection indicator and z_cols for the instrument. Estimation is conducted via its fit() method:

In [111]: import numpy as np

In [112]: from sklearn.linear_model import LassoCV, LogisticRegressionCV

In [113]: from doubleml.datasets import make_ssm_data

In [114]: import doubleml as dml

In [115]: np.random.seed(42)

In [116]: n_obs = 2000

In [117]: df = make_ssm_data(n_obs=n_obs, mar=False, return_type='DataFrame')

In [118]: dml_data = dml.DoubleMLData(df, 'y', 'd', z_cols='z', s_col='s')

In [119]: ml_g = LassoCV()

In [120]: ml_m = LogisticRegressionCV(penalty='l1', solver='liblinear')

In [121]: ml_pi = LogisticRegressionCV(penalty='l1', solver='liblinear')

In [122]: dml_ssm = dml.DoubleMLSSM(dml_data, ml_g, ml_m, ml_pi, score='nonignorable')

In [123]: dml_ssm.fit()
Out[123]: <doubleml.irm.ssm.DoubleMLSSM at 0x7fe314f1f320>

In [124]: print(dml_ssm)
================== DoubleMLSSM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): ['z']
Selection variable: s
No. Observations: 2000

------------------ Score & algorithm ------------------
Score function: nonignorable

------------------ Machine learner   ------------------
Learner ml_g: LassoCV()
Learner ml_pi: LogisticRegressionCV(penalty='l1', solver='liblinear')
Learner ml_m: LogisticRegressionCV(penalty='l1', solver='liblinear')
Out-of-sample Performance:
Regression:
Learner ml_g_d0 RMSE: [[0.92827999]]
Learner ml_g_d1 RMSE: [[1.10079785]]
Classification:
Learner ml_pi Log Loss: [[0.44124313]]
Learner ml_m Log Loss: [[0.59854797]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
coef   std err         t         P>|t|     2.5 %    97.5 %
d  1.14268  0.183373  6.231467  4.620874e-10  0.783276  1.502084