3. Models#

The DoubleML-package includes the following models.

3.1. Partially linear models (PLM)#

The partially linear models (PLM) take the form

\[Y = D \theta_0 + g_0(X) + \zeta,\]

where treatment effects are additive with some sort of linear form.

3.1.1. Partially linear regression model (PLR)#

Partially linear regression (PLR) models take the form

\[ \begin{align}\begin{aligned}Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}(\zeta | D,X) = 0,\\D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align} \]

where \(Y\) is the outcome variable and \(D\) is the policy variable of interest. The high-dimensional vector \(X = (X_1, \ldots, X_p)\) consists of other confounding covariates, and \(\zeta\) and \(V\) are stochastic errors.

digraph {
     nodesep=1;
     ranksep=1;
     rankdir=LR;
     { node [shape=circle, style=filled]
       Y [fillcolor="#56B4E9"]
       D [fillcolor="#F0E442"]
       V [fillcolor="#F0E442"]
       X [fillcolor="#D55E00"]
     }
     Y -> D -> V [dir="back"];
     X -> D;
     Y -> X [dir="back"];
}

Causal diagram#

DoubleMLPLR implements PLR models. Estimation is conducted via its fit() method:

In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.datasets import make_plr_CCDDHNR2018

In [4]: from sklearn.ensemble import RandomForestRegressor

In [5]: from sklearn.base import clone

In [6]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [7]: ml_l = clone(learner)

In [8]: ml_m = clone(learner)

In [9]: np.random.seed(1111)

In [10]: data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In [11]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [12]: dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m)

In [13]: print(dml_plr_obj.fit())
================== DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[1.18356413]]
Learner ml_m RMSE: [[1.06008533]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
       coef  std err         t         P>|t|     2.5 %    97.5 %
d  0.512672  0.04491  11.41566  3.492417e-30  0.424651  0.600694
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
lgr::get_logger("mlr3")$set_threshold("warn")

learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
set.seed(1111)
data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='data.table')
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_l, ml_m)
dml_plr_obj$fit()
print(dml_plr_obj)
================= DoubleMLPLR Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s): 
Selection variable: 
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.ranger
ml_m: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)    
d   0.47319    0.04165   11.36   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


3.1.2. Partially linear IV regression model (PLIV)#

Partially linear IV regression (PLIV) models take the form

\[ \begin{align}\begin{aligned}Y - D \theta_0 = g_0(X) + \zeta, & &\mathbb{E}(\zeta | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0.\end{aligned}\end{align} \]

where \(Y\) is the outcome variable, \(D\) is the policy variable of interest and \(Z\) denotes one or multiple instrumental variables. The high-dimensional vector \(X = (X_1, \ldots, X_p)\) consists of other confounding covariates, and \(\zeta\) and \(V\) are stochastic errors.

digraph {
     nodesep=1;
     ranksep=1;
     rankdir=LR;
     { node [shape=circle, style=filled]
       Y [fillcolor="#56B4E9"]
       D [fillcolor="#56B4E9"]
       Z [fillcolor="#F0E442"]
       V [fillcolor="#F0E442"]
       X [fillcolor="#D55E00"]
     }

     Z -> V [dir="back"];
     D -> X [dir="back"];
     Y -> D [dir="both"];
     X -> Y;
     Z -> X [dir="back"];
     Z -> D;

     { rank=same; Y D }
     { rank=same; Z X }
         { rank=same; V }
}

Causal diagram#

DoubleMLPLIV implements PLIV models. Estimation is conducted via its fit() method:

In [14]: import numpy as np

In [15]: import doubleml as dml

In [16]: from doubleml.datasets import make_pliv_CHS2015

In [17]: from sklearn.ensemble import RandomForestRegressor

In [18]: from sklearn.base import clone

In [19]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [20]: ml_l = clone(learner)

In [21]: ml_m = clone(learner)

In [22]: ml_r = clone(learner)

In [23]: np.random.seed(2222)

In [24]: data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type='DataFrame')

In [25]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='Z1')

In [26]: dml_pliv_obj = dml.DoubleMLPLIV(obj_dml_data, ml_l, ml_m, ml_r)

In [27]: print(dml_pliv_obj.fit())
================== DoubleMLPLIV Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['Z1']
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[1.48390784]]
Learner ml_m RMSE: [[0.53209683]]
Learner ml_r RMSE: [[1.25240463]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
       coef   std err         t         P>|t|     2.5 %    97.5 %
d  0.481705  0.084337  5.711638  1.118938e-08  0.316407  0.647004
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
ml_r = learner$clone()
set.seed(2222)
data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_col = "d", z_cols= "Z1")
dml_pliv_obj = DoubleMLPLIV$new(obj_dml_data, ml_l, ml_m, ml_r)
dml_pliv_obj$fit()
print(dml_pliv_obj)
================= DoubleMLPLIV Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s): Z1
Selection variable: 
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.ranger
ml_m: regr.ranger
ml_r: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)    
d   0.66133    0.07796   8.483   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


3.2. Interactive regression models (IRM)#

The interactive regression model (IRM) take the form

\[Y = g_0(D, X) + U,\]

where treatment effects are fully heterogeneous.

3.2.1. Binary Interactive Regression Model (IRM)#

Interactive regression (IRM) models take the form

\[ \begin{align}\begin{aligned}Y = g_0(D, X) + U, & &\mathbb{E}(U | X, D) = 0,\\D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align} \]

where the treatment variable is binary, \(D \in \lbrace 0,1 \rbrace\). We consider estimation of the average treatment effects when treatment effects are fully heterogeneous.

Target parameters of interest in this model are the average treatment effect (ATE),

\[\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X)]\]

and the average treatment effect of the treated (ATTE),

\[\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X) | D=1].\]
digraph {
     nodesep=1;
     ranksep=1;
     rankdir=LR;
     { node [shape=circle, style=filled]
       Y [fillcolor="#56B4E9"]
       D [fillcolor="#F0E442"]
       V [fillcolor="#F0E442"]
       X [fillcolor="#D55E00"]
     }
     Y -> D -> V [dir="back"];
     X -> D;
     Y -> X [dir="back"];
}

Causal diagram#

DoubleMLIRM implements IRM models. Estimation is conducted via its fit() method:

In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.datasets import make_irm_data

In [4]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [5]: ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [6]: ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [7]: np.random.seed(3333)

In [8]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')

In [9]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [10]: dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)

In [11]: print(dml_irm_obj.fit())
================== DoubleMLIRM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: ATE

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=10, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=10, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[1.07085301]]
Learner ml_g1 RMSE: [[1.09682314]]
Classification:
Learner ml_m Log Loss: [[0.55863386]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
       coef  std err         t     P>|t|     2.5 %    97.5 %
d  0.599297   0.1887  3.175931  0.001494  0.229452  0.969141
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

set.seed(3333)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
dml_irm_obj$fit()
print(dml_irm_obj)
================= DoubleMLIRM Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
Instrument(s): 
Selection variable: 
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.ranger
ml_m: classif.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)
d    0.3958     0.2651   1.493    0.135


3.2.2. Average Potential Outcomes (APOs)#

For general discrete-values treatments \(D \in \lbrace d_0, \dots, d_l \rbrace\) the model can be generalized to

\[ \begin{align}\begin{aligned}Y = g_0(D, X) + U, & &\mathbb{E}(U | X, D) = 0,\\A_j = m_{0,j}(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align} \]

where \(A_j := 1\lbrace D = d_j\rbrace\) is an indicator variable for treatment level \(d_j\) and \(m_{0,j}(X)\) denotes the corresponding propensity score.

Possible target parameters of interest in this model are the average potential outcomes (APOs)

\[\theta_{0,j} = \mathbb{E}[g_0(d_j, X)].\]

DoubleMLAPO implements the estimation of average potential outcomes. Estimation is conducted via its fit() method:

In [12]: import numpy as np

In [13]: import doubleml as dml

In [14]: from doubleml.datasets import make_irm_data

In [15]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [16]: ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [17]: ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [18]: np.random.seed(3333)

In [19]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')

In [20]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [21]: dml_apo_obj = dml.DoubleMLAPO(obj_dml_data, ml_g, ml_m, treatment_level=0)

In [22]: print(dml_apo_obj.fit())
================== DoubleMLAPO Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: APO

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=10, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=10, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_g_d_lvl0 RMSE: [[1.09351167]]
Learner ml_g_d_lvl1 RMSE: [[1.07085301]]
Classification:
Learner ml_m Log Loss: [[0.55863386]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
      coef   std err         t     P>|t|     2.5 %    97.5 %
d -0.13084  0.137165 -0.953884  0.340142 -0.399679  0.137999

3.2.3. Average Potential Outcomes (APOs) for Multiple Treatment Levels#

If multiple treatment levels should be estimated simulatenously, another possible target parameter of interest in this model are contrasts (or average treatment effects) between treatment levels \(d_j\) and \(d_k\):

\[\theta_{0,jk} = \mathbb{E}[g_0(d_j, X) - g_0(d_k, X)].\]

DoubleMLAPOS implements the estimation of average potential outcomes for multiple treatment levels. Estimation is conducted via its fit() method. The causal_contrast() method allows to estimate causal contrasts between treatment levels:

In [23]: import numpy as np

In [24]: import doubleml as dml

In [25]: from doubleml.datasets import make_irm_data

In [26]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [27]: ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [28]: ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

In [29]: np.random.seed(3333)

In [30]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')

In [31]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [32]: dml_apos_obj = dml.DoubleMLAPOS(obj_dml_data, ml_g, ml_m, treatment_levels=[0, 1])

In [33]: print(dml_apos_obj.fit())
================== DoubleMLAPOS Object ==================

------------------ Fit summary       ------------------
       coef   std err         t     P>|t|     2.5 %    97.5 %
0 -0.152706  0.186689 -0.817967  0.413376 -0.518610  0.213199
1  0.533283  0.134784  3.956588  0.000076  0.269112  0.797454

In [34]: causal_contrast_model = dml_apos_obj.causal_contrast(reference_levels=0)

In [35]: print(causal_contrast_model.summary)
            coef   std err         t     P>|t|     2.5 %   97.5 %
1 vs 0  0.685989  0.231734  2.960236  0.003074  0.231798  1.14018

3.2.4. Interactive IV model (IIVM)#

Interactive IV regression (IIVM) models take the form

\[ \begin{align}\begin{aligned}Y = \ell_0(D, X) + \zeta, & &\mathbb{E}(\zeta | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align} \]

where the treatment variable is binary, \(D \in \lbrace 0,1 \rbrace\) and the instrument is binary, \(Z \in \lbrace 0,1 \rbrace\). Consider the functions \(g_0\), \(r_0\) and \(m_0\), where \(g_0\) maps the support of \((Z,X)\) to \(\mathbb{R}\) and \(r_0\) and \(m_0\) respectively map the support of \((Z,X)\) and \(X\) to \((\varepsilon, 1-\varepsilon)\) for some \(\varepsilon \in (0, 1/2)\), such that

\[ \begin{align}\begin{aligned}Y = g_0(Z, X) + \nu, & &\mathbb{E}(\nu | Z, X) = 0,\\D = r_0(Z, X) + U, & &\mathbb{E}(U | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0.\end{aligned}\end{align} \]

The target parameter of interest in this model is the local average treatment effect (LATE),

\[\theta_0 = \frac{\mathbb{E}[g_0(1, X)] - \mathbb{E}[g_0(0,X)]}{\mathbb{E}[r_0(1, X)] - \mathbb{E}[r_0(0,X)]}.\]
digraph {
     nodesep=1;
     ranksep=1;
     rankdir=LR;
     { node [shape=circle, style=filled]
       Y [fillcolor="#56B4E9"]
       D [fillcolor="#56B4E9"]
       Z [fillcolor="#F0E442"]
       V [fillcolor="#F0E442"]
       X [fillcolor="#D55E00"]
     }

     Z -> V [dir="back"];
     D -> X [dir="back"];
     Y -> D [dir="both"];
     X -> Y;
     Z -> X [dir="back"];
     Z -> D;

     { rank=same; Y D }
     { rank=same; Z X }
         { rank=same; V }
}

Causal diagram#

DoubleMLIIVM implements IIVM models. Estimation is conducted via its fit() method:

In [36]: import numpy as np

In [37]: import doubleml as dml

In [38]: from doubleml.datasets import make_iivm_data

In [39]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [40]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [41]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [42]: ml_r = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [43]: np.random.seed(4444)

In [44]: data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1.0, return_type='DataFrame')

In [45]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='z')

In [46]: dml_iivm_obj = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r)

In [47]: print(dml_iivm_obj.fit())
================== DoubleMLIIVM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['z']
No. Observations: 1000

------------------ Score & algorithm ------------------
Score function: LATE

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[1.12983057]]
Learner ml_g1 RMSE: [[1.13102231]]
Classification:
Learner ml_m Log Loss: [[0.69684828]]
Learner ml_r0 Log Loss: [[0.69508862]]
Learner ml_r1 Log Loss: [[0.43503345]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
       coef  std err         t     P>|t|     2.5 %    97.5 %
d  0.389126  0.23113  1.683581  0.092263 -0.063881  0.842132
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

set.seed(4444)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_r = ml_m$clone()
data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d", z_cols="z")
dml_iivm_obj = DoubleMLIIVM$new(obj_dml_data, ml_g, ml_m, ml_r)
dml_iivm_obj$fit()
print(dml_iivm_obj)
================= DoubleMLIIVM Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20
Instrument(s): z
Selection variable: 
No. Observations: 1000

------------------ Score & algorithm ------------------
Score function: LATE
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.ranger
ml_m: classif.ranger
ml_r: classif.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)  
d    0.3568     0.1988   1.794   0.0727 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


3.3. Difference-in-Differences Models (DID)#

Difference-in-Differences Models (DID) implemented in the package focus on the the binary treatment case with staggered adoption.

Note

The notation and identifying assumptions are based on Callaway and Sant’Anna (2021), but adjusted to better fit into the general package documentation conventions, sometimes slightly abusing notation. The underlying score functions are based on Sant’Anna and Zhao (2020), Zimmert (2018) and Chang (2020). For a more detailed introduction and recent developments of the difference-in-differences literature see e.g. Roth et al. (2022).

We consider \(n\) observed units at time periods \(t=1,\dots, \mathcal{T}\). The treatment status for unit \(i\) at time period \(t\) is denoted by the binary variable \(D_{i,t}=1\). The package considers the staggered adoption setting, where a unit stays treated after it has been treated once (Irreversibility of Treatment).

Let \(G^{\mathrm{g}}_i\) be an indicator variable that takes value one if unit \(i\) is treated at time period \(t=\mathrm{g}\), \(G^{\mathrm{g}}_i=1\{G_i=\mathrm{g}\}\) with \(G_i\) refering to the first post-treatment period. I units are never exposed to the treatment, define \(G_i=\infty\).

The target parameters are defined in terms of differences in potential outcomes. The observed and potential outcome for each unit \(i\) at time period \(t\) are assumed to be of the form

\[Y_{i,t} = Y_{i,t}(0) + \sum_{\mathrm{g}=2}^{\mathcal{T}} (Y_{i,t}(\mathrm{g}) - Y_{i,t}(0)) \cdot G^{\mathrm{g}}_i,\]

such that we observe one consistent potential outcome for each unit at each time period.

The corresponding target parameters are the average causal effects of the treatment

\[ATT(\mathrm{g},t):= \mathbb{E}[Y_{i,t}(\mathrm{g}) - Y_{i,t}(0)|G^{\mathrm{g}}_i=1].\]

This target parameter quantifies the average change in potential outcomes for units that are treated the first time in period \(\mathrm{g}\) with the difference in outcome being evaluated for time period \(t\). The corresponding control groups, defined by an indicator \(C\), can be typically set as either the never treated or not yet treated units. Let

\[\begin{split}\begin{align} C_{i,t}^{(\text{nev})} \equiv C_{i}^{(\text{nev})} &:= 1\{G_i=\infty\} \quad \text{(never treated)}, \\ C_{i,t}^{(\text{nyt})} &:= 1\{G_i > t\} \quad \text{(not yet treated)}. \end{align}\end{split}\]

The corresponding identifying assumptions are:

  1. Irreversibility of Treatment:

    \(D_{i,1} = 0 \quad a.s.\) For all \(t=2,\dots,\mathcal{T}\), \(D_{i,t-1} = 1\) implies \(D_{i,t} = 1 \quad a.s.\)

  2. Panel Data (Random Sampling):

    \((Y_{i,1},\dots, Y_{i,\mathcal{T}}, X_i, D_{i,1}, \dots, D_{i,\mathcal{T}})_{i=1}^n\) is independent and identically distributed.

  3. Limited Treatment Anticipation:

    There is a known \(\delta\ge 0\) such that \(\mathbb{E}[Y_{i,t}(\mathrm{g})|X_i, G_i^{\mathrm{g}}=1] = \mathbb{E}[Y_{i,t}(0)|X_i, G_i^{\mathrm{g}}=1]\quad a.s.\) for all \(\mathrm{g}\in\mathcal{G}, t\in\{1,\dots,\mathcal{T}\}\) such that \(t< \mathrm{g}-\delta\).

  4. Conditional Parallel Trends:

    Let \(\delta\) be defined as in Assumption 3.\ For each \(\mathrm{g}\in\mathcal{G}\) and \(t\in\{2,\dots,\mathcal{T}\}\) such that \(t\ge \mathrm{g}-\delta\):

    1. Never Treated:

      \(\mathbb{E}[Y_{i,t}(0) - Y_{i,t-1}(0)|X_i, G_i^{\mathrm{g}}=1] = \mathbb{E}[Y_{i,t}(0) - Y_{i,t-1}(0)|X_i,C_{i}^{(\text{nev})}=1] \quad a.s.\)

    2. Not Yet Treated:

      \(\mathbb{E}[Y_{i,t}(0) - Y_{i,t-1}(0)|X_i, G_i^{\mathrm{g}}=1] = \mathbb{E}[Y_{i,t}(0) - Y_{i,t-1}(0)|X_i,C_{i,t+\delta}^{(\text{nyt})}=1] \quad a.s.\)

  5. Overlap:

    For each time period \(t=2,\dots,\mathcal{T}\) and \(\mathrm{g}\in\mathcal{G}\) there exists a \(\epsilon > 0\) such that \(P(G_i^{\mathrm{g}}=1) > \epsilon\) and \(P(G_i^{\mathrm{g}}=1|X_i, G_i^{\mathrm{g}} + C_{i,t}^{(\text{nyt})}=1) < 1-\epsilon\quad a.s.\)

Note

For a detailed discussion of the assumptions see Callaway and Sant’Anna (2021).

Under the assumptions above (either Assumption 4.a or 4.b), the target parameter \(ATT(\mathrm{g},t)\) is identified see Theorem 1. Callaway and Sant’Anna (2021).

3.3.1. Panel data#

For the estimation of the target parameters \(ATT(\mathrm{g},t)\) the following nuisance functions are required:

\[\begin{split}\begin{align} g_{0, \mathrm{g}, t_\text{pre}, t_\text{eval}, \delta}(X_i) &:= \mathbb{E}[Y_{i,t_\text{eval}} - Y_{i,t_\text{pre}}|X_i, C_{i,t_\text{eval} + \delta}^{(\cdot)} = 1], \\ m_{0, \mathrm{g}, t_\text{eval} + \delta}(X_i) &:= P(G_i^{\mathrm{g}}=1|X_i, G_i^{\mathrm{g}} + C_{i,t_\text{eval} + \delta}^{(\cdot)}=1). \end{align}\end{split}\]

where \(g_{0, \mathrm{g}, t_\text{pre}, t_\text{eval},\delta}(\cdot)\) denotes the population outcome regression function and \(m_{0, \mathrm{g}, t_\text{eval} + \delta}(\cdot)\) the generalized propensity score. The interpretation of the parameters is as follows:

  • \(\mathrm{g}\) is the first post-treatment period of interest, i.e. the treatment group.

  • \(t_\text{pre}\) is the pre-treatment period, i.e. the time period from which the conditional parallel trends are assumed.

  • \(t_\text{eval}\) is the time period of interest or evaluation period, i.e. the time period where the treatment effect is evaluated.

  • \(\delta\) is number of anticipation periods, i.e. the number of time periods for which units are assumed to anticipate the treatment.

Note

Remark that the nuisance functions depend on the control group used for the estimation of the target parameter. By slight abuse of notation we use the same notation for both control groups \(C_{i,t}^{(\text{nev})}\) and \(C_{i,t}^{(\text{nyt})}\). More specifically, the control group only depends on \(\delta\) for not yet treated units.

Under these assumptions the target parameter \(ATT(\mathrm{g},t_\text{eval})\) can be estimated by choosing a suitable combination of \((\mathrm{g}, t_\text{pre}, t_\text{eval}, \delta)\) if \(t_\text{eval} - t_\text{pre} \ge 1 + \delta\), i.e. the parallel trends are assumed to hold at least one period more than the anticipation period.

Note

The choice \(t_\text{pre}= \min(\mathrm{g},t_\text{eval}) -\delta-1\) corresponds to the definition of \(ATT_{dr}(\mathrm{g},t_\text{eval};\delta)\) from Callaway and Sant’Anna (2021).

In the following, we will omit the subscript \(\delta\) in the notation of the nuisance functions and the control group (implicitly assuming \(\delta=0\)).

For a given tuple \((\mathrm{g}, t_\text{pre}, t_\text{eval})\) the target parameter \(ATT(\mathrm{g},t)\) is estimated by solving the empirical version of the the following linear moment condition:

\[ATT(\mathrm{g}, t_\text{pre}, t_\text{eval}):= -\frac{\mathbb{E}[\psi_b(W,\eta_0)]}{\mathbb{E}[\psi_a(W,\eta_0)]}\]

with nuisance elements \(\eta_0=(g_{0, \mathrm{g}, t_\text{pre}, t_\text{eval}}, m_{0, \mathrm{g}, t_\text{eval}})\) and score function \(\psi(W,\theta, \eta)\) being defined in section Panel Data. Under the identifying assumptions above

\[ATT(\mathrm{g}, t_\text{pre}, t_\text{eval}) = ATT(\mathrm{g},t).\]

DoubleMLDIDMulti implements the estimation of \(ATT(\mathrm{g}, t_\text{pre}, t_\text{eval})\) for multiple time periods and requires DoubleMLPanelData as input. Setting gt_combinations='standard' will estimate the target parameter for all (possible) combinations of \((\mathrm{g}, t_\text{pre}, t_\text{eval})\) with \(\mathrm{g}\in\{2,\dots,\mathcal{T}\}\) and \((t_\text{pre}, t_\text{eval})\) with \(t_\text{eval}\in\{2,\dots,\mathcal{T}\}\) and \(t_\text{pre}= \min(\mathrm{g},t_\text{eval}) -\delta-1\). This corresponds to the setting where all trends are set as short as possible, but still respecting the anticipation period.

Estimation is conducted via its fit() method:

In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.did.datasets import make_did_CS2021

In [4]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [5]: np.random.seed(42)

In [6]: df = make_did_CS2021(n_obs=500)

In [7]: dml_data = dml.data.DoubleMLPanelData(
   ...:     df,
   ...:     y_col="y",
   ...:     d_cols="d",
   ...:     id_col="id",
   ...:     t_col="t",
   ...:     x_cols=["Z1", "Z2", "Z3", "Z4"],
   ...:     datetime_unit="M"
   ...: )
   ...: 

In [8]: dml_did_obj = dml.did.DoubleMLDIDMulti(
   ...:     obj_dml_data=dml_data,
   ...:     ml_g=RandomForestRegressor(min_samples_split=10),
   ...:     ml_m=RandomForestClassifier(min_samples_split=10),
   ...:     gt_combinations="standard",
   ...:     control_group="never_treated",
   ...: )
   ...: 

In [9]: print(dml_did_obj.fit())
================== DoubleMLDIDMulti Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
Id variable: id
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: observational
Control group: never_treated
Anticipation periods: 0

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(min_samples_split=10)
Learner ml_m: RandomForestClassifier(min_samples_split=10)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[ 4.18516129  3.93074943  7.02260304 11.27461519  4.07788588  4.29168951
   3.95311164  8.07229774  3.79330022  3.90145324  4.00518448  3.89472978]]
Learner ml_g1 RMSE: [[3.11942111 3.42094064 6.87309461 9.51494845 4.31910229 3.59563003
  3.86897905 7.36696349 3.97619643 3.94427158 3.77746575 4.38470495]]
Classification:
Learner ml_m Log Loss: [[0.70344386 0.72269685 0.70305686 0.70557077 0.7204309  0.69572427
  0.70774361 0.72987186 0.76535102 0.78386025 0.74475816 0.79953099]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
                                  coef   std err  ...     2.5 %    97.5 %
ATT(2025-03,2025-01,2025-02) -0.698642  0.489488  ... -1.658021  0.260738
ATT(2025-03,2025-02,2025-03)  0.211002  0.496591  ... -0.762299  1.184303
ATT(2025-03,2025-02,2025-04)  0.900127  0.938263  ... -0.938836  2.739089
ATT(2025-03,2025-02,2025-05)  2.001603  1.568932  ... -1.073447  5.076653
ATT(2025-04,2025-01,2025-02)  0.046507  0.430465  ... -0.797189  0.890204
ATT(2025-04,2025-02,2025-03)  0.108870  0.479959  ... -0.831833  1.049573
ATT(2025-04,2025-03,2025-04)  0.952146  0.463816  ...  0.043082  1.861210
ATT(2025-04,2025-03,2025-05)  2.918293  0.891527  ...  1.170933  4.665653
ATT(2025-05,2025-01,2025-02) -0.082905  0.468449  ... -1.001049  0.835239
ATT(2025-05,2025-02,2025-03) -0.035689  0.643679  ... -1.297276  1.225899
ATT(2025-05,2025-03,2025-04)  0.270248  0.443672  ... -0.599334  1.139830
ATT(2025-05,2025-04,2025-05)  1.186775  0.542268  ...  0.123950  2.249601

[12 rows x 6 columns]

Note

Remark that the output contains two different outcome regressions \(g(0,X)\) and \(g(1,X)\). As in the IRM model the outcome regression \(g(0,X)\) refers to the control group, whereas \(g(1,X)\) refers to the outcome regression for the treatment group, i.e.

\[\begin{split}\begin{align} g(0,X) &\approx g_{0, \mathrm{g}, t_\text{pre}, t_\text{eval}, \delta}(X_i) = \mathbb{E}[Y_{i,t_\text{eval}} - Y_{i,t_\text{pre}}|X_i, C_{i,t_\text{eval} + \delta}^{(\cdot)} = 1],\\ g(1,X) &\approx \mathbb{E}[Y_{i,t_\text{eval}} - Y_{i,t_\text{pre}}|X_i, G_i^{\mathrm{g}} = 1]. \end{align}\end{split}\]

Further, \(g(1,X)\) is only required for Sensitivity Analysis and is not used for the estimation of the target parameter.

Note

A more detailed example is available in the Example Gallery.

3.3.2. Repeated cross-sections#

Note

Will be implemented soon.

3.3.3. Effect Aggregation#

The following section considers the aggregation of different \(ATT(\mathrm{g},t)\) to summary measures based on Callaway and Sant’Anna (2021). All implemented aggregation schemes take the form of a weighted average of the \(ATT(\mathrm{g},t)\) estimates

\[\theta = \sum_{\mathrm{g}\in \mathcal{G}} \sum_{t=2}^{\mathcal{T}} \omega(\mathrm{g},t) \cdot ATT(\mathrm{g},t)\]

where \(\omega(\mathrm{g},t)\) is a weight function based on the treatment group \(\mathrm{g}\) and time period \(t\). The aggragation schemes are implmented via the aggregate() method of the DoubleMLDIDMulti class.

In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.did.datasets import make_did_CS2021

In [4]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [5]: np.random.seed(42)

In [6]: df = make_did_CS2021(n_obs=500)

In [7]: dml_data = dml.data.DoubleMLPanelData(
   ...:     df,
   ...:     y_col="y",
   ...:     d_cols="d",
   ...:     id_col="id",
   ...:     t_col="t",
   ...:     x_cols=["Z1", "Z2", "Z3", "Z4"],
   ...:     datetime_unit="M"
   ...: )
   ...: 

In [8]: dml_did_obj = dml.did.DoubleMLDIDMulti(
   ...:     obj_dml_data=dml_data,
   ...:     ml_g=RandomForestRegressor(min_samples_split=10),
   ...:     ml_m=RandomForestClassifier(min_samples_split=10),
   ...:     gt_combinations="standard",
   ...:     control_group="never_treated",
   ...: )
   ...: 

In [9]: dml_did_obj.fit()
Out[9]: <doubleml.did.did_multi.DoubleMLDIDMulti at 0x7f475ae01b80>

In [10]: agg_did_obj = dml_did_obj.aggregate(aggregation="group")

In [11]: agg_did_obj.aggregated_frameworks.bootstrap()
Out[11]: <doubleml.double_ml_framework.DoubleMLFramework at 0x7f475ae02660>

In [12]: print(agg_did_obj)
================== DoubleMLDIDAggregation Object ==================
 Group Aggregation 

------------------ Overall Aggregated Effects ------------------
    coef  std err        t    P>|t|   2.5 %   97.5 %
1.389489 0.537438 2.585394 0.009727 0.33613 2.442847
------------------ Aggregated Effects         ------------------
             coef   std err         t     P>|t|     2.5 %    97.5 %
2025-03  1.037577  0.943693  1.099485  0.271556 -0.812028  2.887182
2025-04  1.935220  0.642648  3.011323  0.002601  0.675653  3.194786
2025-05  1.186775  0.542268  2.188541  0.028630  0.123950  2.249601
------------------ Additional Information     ------------------
Score function: observational
Control group: never_treated
Anticipation periods: 0

The method aggregate() requires the aggregation argument to be set to one of the following values:

  • 'group': aggregates \(ATT(\mathrm{g},t)\) estimates by the treatment group \(\mathrm{g}\).

  • 'time': aggregates \(ATT(\mathrm{g},t)\) estimates by the time period \(t\) (based on group size).

  • 'eventstudy': aggregates \(ATT(\mathrm{g},t)\) estimates based on time difference to first treatment assignment like an event study (based on group size).

  • dictionary: a dictionary with values containing the aggregation weights (as numpy.ma.MaskedArray).

Note

A more detailed example on effect aggregation is available in the example gallery. For a detailed discussion on different aggregation schemes, we refer to of Callaway and Sant’Anna (2021).

3.3.4. Two treatment periods#

Warning

This documentation refers to the deprecated implementation for two time periods. This functionality will be removed in a future version.

Note

We recommend using the implementation Panel data and Repeated cross-sections.

Difference-in-Differences Models (DID) implemented in the package focus on the the binary treatment case with with two treatment periods.

Adopting the notation from Sant’Anna and Zhao (2020), let \(Y_{it}\) be the outcome of interest for unit \(i\) at time \(t\). Further, let \(D_{it}=1\) indicate if unit \(i\) is treated before time \(t\) (otherwise \(D_{it}=0\)). Since all units start as untreated (\(D_{i0}=0\)), define \(D_{i}=D_{i1}.\) Relying on the potential outcome notation, denote \(Y_{it}(0)\) as the outcome of unit \(i\) at time \(t\) if the unit did not receive treatment up until time \(t\) and analogously for \(Y_{it}(1)\) with treatment. Consequently, the observed outcome for unit is \(i\) at time \(t\) is \(Y_{it}=D_{it} Y_{it}(1) + (1-D_{it}) Y_{it}(0)\). Further, let \(X_i\) be a vector of pre-treatment covariates.

Target parameter of interest is the average treatment effect on the treated (ATTE)

\[\theta_0 = \mathbb{E}[Y_{i1}(1)- Y_{i1}(0)|D_i=1].\]

The corresponding identifying assumptions are

  • (Cond.) Parallel Trends: \(\mathbb{E}[Y_{i1}(0) - Y_{i0}(0)|X_i, D_i=1] = \mathbb{E}[Y_{i1}(0) - Y_{i0}(0)|X_i, D_i=0]\quad a.s.\)

  • Overlap: \(\exists\epsilon > 0\): \(P(D_i=1) > \epsilon\) and \(P(D_i=1|X_i) \le 1-\epsilon\quad a.s.\)

Note

For a more detailed introduction and recent developments of the difference-in-differences literature see e.g. Roth et al. (2022).

3.3.4.1. Panel Data#

If panel data are available, the observations are assumed to be iid. of form \((Y_{i0}, Y_{i1}, D_i, X_i)\). Remark that the difference \(\Delta Y_i= Y_{i1}-Y_{i0}\) has to be defined as the outcome y in the DoubleMLData object.

DoubleMLIDID implements difference-in-differences models for panel data. Estimation is conducted via its fit() method:

In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.did.datasets import make_did_SZ2020

In [4]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [5]: ml_g = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [6]: ml_m = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [7]: np.random.seed(42)

In [8]: data = make_did_SZ2020(n_obs=500, return_type='DataFrame')

# y is already defined as the difference of observed outcomes
In [9]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [10]: dml_did_obj = dml.DoubleMLDID(obj_dml_data, ml_g, ml_m)

In [11]: print(dml_did_obj.fit())
================== DoubleMLDID Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: observational

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, min_samples_leaf=5)
Learner ml_m: RandomForestClassifier(max_depth=5, min_samples_leaf=5)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[16.27429763]]
Learner ml_g1 RMSE: [[13.35731523]]
Classification:
Learner ml_m Log Loss: [[0.66601815]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
       coef   std err         t     P>|t|     2.5 %    97.5 %
d -2.840718  1.760386 -1.613691  0.106595 -6.291011  0.609575

3.3.4.2. Repeated cross-sections#

For repeated cross-sections, the observations are assumed to be iid. of form \((Y_{i}, D_i, X_i, T_i)\), where \(T_i\) is a dummy variable if unit \(i\) is observed pre- or post-treatment period, such that the observed outcome can be defined as

\[Y_i = T_i Y_{i1} + (1-T_i) Y_{i0}.\]

Further, treatment and covariates are assumed to be stationary, such that the joint distribution of \((D,X)\) is invariant to \(T\).

DoubleMLIDIDCS implements difference-in-differences models for repeated cross-sections. Estimation is conducted via its fit() method:

In [12]: import numpy as np

In [13]: import doubleml as dml

In [14]: from doubleml.did.datasets import make_did_SZ2020

In [15]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [16]: ml_g = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [17]: ml_m = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=5)

In [18]: np.random.seed(42)

In [19]: data = make_did_SZ2020(n_obs=500, cross_sectional_data=True, return_type='DataFrame')

In [20]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', t_col='t')

In [21]: dml_did_obj = dml.DoubleMLDIDCS(obj_dml_data, ml_g, ml_m)

In [22]: print(dml_did_obj.fit())
================== DoubleMLDIDCS Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: observational

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, min_samples_leaf=5)
Learner ml_m: RandomForestClassifier(max_depth=5, min_samples_leaf=5)
Out-of-sample Performance:
Regression:
Learner ml_g_d0_t0 RMSE: [[17.4915707]]
Learner ml_g_d0_t1 RMSE: [[44.85397773]]
Learner ml_g_d1_t0 RMSE: [[32.74938952]]
Learner ml_g_d1_t1 RMSE: [[53.7282094]]
Classification:
Learner ml_m Log Loss: [[0.67936506]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
     coef   std err         t     P>|t|      2.5 %    97.5 %
d -4.9944  7.561785 -0.660479  0.508947 -19.815226  9.826426

3.4. Sample Selection Models (SSM)#

Sample Selection Models (SSM) implemented in the package focus on the the binary treatment case when outcomes are only observed for a subpopulation due to sample selection or outcome attrition.

The implementation and notation is based on Bia, Huber and Lafférs (2023). Let \(D_i\) be the binary treatment indicator and \(Y_{i}(d)\) the potential outcome under treatment value \(d\). Further, define \(Y_{i}:=Y_{i}(D)\) to be the realized outcome and \(S_{i}\) as a binary selection indicator. The outcome \(Y_{i}\) is only observed if \(S_{i}=1\). Finally, let \(X_i\) be a vector of observed covariates, measures prior to treatment assignment.

Target parameter of interest is the average treatment effect (ATE)

\[\theta_0 = \mathbb{E}[Y_{i}(1)- Y_{i}(0)].\]

The corresponding identifying assumption is

  • Cond. Independence of Treatment: \(Y_i(d) \perp D_i|X_i\quad a.s.\) for \(d=0,1\)

where further assmputions are made in the context of the respective sample selection model.

Note

A more detailed example can be found in the Example Gallery.

3.4.1. Missingness at Random#

Consider the following two additional assumptions for the sample selection model:

  • Cond. Independence of Selection: \(Y_i(d) \perp S_i|D_i=d, X_i\quad a.s.\) for \(d=0,1\)

  • Common Support: \(P(D_i=1|X_i)>0\) and \(P(S_i=1|D_i=d, X_i)>0\) for \(d=0,1\)

such that outcomes are missing at random (for the score see Scores).

DoubleMLSSM implements sample selection models. The score score='missing-at-random' refers to the correponding score relying on the assumptions above. The DoubleMLData object has to be defined with the additional argument s_col for the selection indicator. Estimation is conducted via its fit() method:

In [1]: import numpy as np

In [2]: from sklearn.linear_model import LassoCV, LogisticRegressionCV

In [3]: from doubleml.datasets import make_ssm_data

In [4]: import doubleml as dml

In [5]: np.random.seed(42)

In [6]: n_obs = 2000

In [7]: df = make_ssm_data(n_obs=n_obs, mar=True, return_type='DataFrame')

In [8]: dml_data = dml.DoubleMLData(df, 'y', 'd', s_col='s')

In [9]: ml_g = LassoCV()

In [10]: ml_m = LogisticRegressionCV(penalty='l1', solver='liblinear')

In [11]: ml_pi = LogisticRegressionCV(penalty='l1', solver='liblinear')

In [12]: dml_ssm = dml.DoubleMLSSM(dml_data, ml_g, ml_m, ml_pi, score='missing-at-random')

In [13]: dml_ssm.fit()
Out[13]: <doubleml.irm.ssm.DoubleMLSSM at 0x7f475a613050>

In [14]: print(dml_ssm)
================== DoubleMLSSM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
Score/Selection variable: s
No. Observations: 2000

------------------ Score & algorithm ------------------
Score function: missing-at-random

------------------ Machine learner   ------------------
Learner ml_g: LassoCV()
Learner ml_pi: LogisticRegressionCV(penalty='l1', solver='liblinear')
Learner ml_m: LogisticRegressionCV(penalty='l1', solver='liblinear')
Out-of-sample Performance:
Regression:
Learner ml_g_d0 RMSE: [[1.10039862]]
Learner ml_g_d1 RMSE: [[1.11071087]]
Classification:
Learner ml_pi Log Loss: [[0.53791422]]
Learner ml_m Log Loss: [[0.63593298]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
       coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.965531  0.065969  14.636048  1.654070e-48  0.836234  1.094829
library(DoubleML)
library(mlr3)
library(data.table)

set.seed(3141)
n_obs = 2000
df = make_ssm_data(n_obs=n_obs, mar=TRUE, return_type="data.table")
dml_data = DoubleMLData$new(df, y_col="y", d_cols="d", s_col="s")

ml_g = lrn("regr.cv_glmnet", nfolds = 5, s = "lambda.min")
ml_m = lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min")
ml_pi = lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min")

dml_ssm = DoubleMLSSM$new(dml_data, ml_g, ml_m, ml_pi, score="missing-at-random")
dml_ssm$fit()
print(dml_ssm)
================= DoubleMLSSM Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s): 
Selection variable: s
No. Observations: 2000

------------------ Score & algorithm ------------------
Score function: missing-at-random
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.cv_glmnet
ml_pi: classif.cv_glmnet
ml_m: classif.cv_glmnet

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)    
d   0.93300    0.05891   15.84   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


3.4.2. Nonignorable Nonresponse#

When sample selection or outcome attriction is realated to unobservables, identification generally requires an instrument for the selection indicator \(S_i\). Consider the following additional assumptions for the instrumental variable:

  • Cond. Correlation: \(\exists Z: \mathbb{E}[Z\cdot S|D,X] \neq 0\)

  • Cond. Independence: \(Y_i(d,z)=Y_i(d)\) and \(Y_i \perp Z_i|D_i=d, X_i\quad a.s.\) for \(d=0,1\)

This requires the instrumental variable \(Z_i\), which must not affect \(Y_i\) or be associated with unobservables affecting \(Y_i\) conditional on \(D_i\) and \(X_i\). Further, the selection is determined via a (unknown) threshold model:

  • Threshold: \(S_i = 1\{V_i \le \xi(D,X,Z)\}\) where \(\xi\) is a general function and \(V_i\) is a scalar with strictly monotonic cumulative distribution function conditional on \(X_i\).

  • Cond. Independence: \(Y_i \perp (Z_i, D_i)|X_i\).

Let \(\Pi_i := P(S_i=1|D_i, X_i, Z_i)\) denote the selection probability. Additionally, the following assumptions are required:

  • Common Support for Treatment: \(P(D_i=1|X_i, \Pi)>0\)

  • Cond. Effect Homogeneity: \(\mathbb{E}[Y_i(1)-Y_i(0)|S_i=1, X_i=x, V_i=v] = \mathbb{E}[Y_i(1)-Y_i(0)|X_i=x, V_i=v]\)

  • Common Support for Selection: \(P(S_i=1|D_i=d, X_i=x, Z_i=z)>0\quad a.s.\) for \(d=0,1\)

For further details, see Bia, Huber and Lafférs (2023).

DAG

Causal paths under nonignorable nonresponse#

DoubleMLSSM implements sample selection models. The score score='nonignorable' refers to the correponding score relying on the assumptions above. The DoubleMLData object has to be defined with the additional argument s_col for the selection indicator and z_cols for the instrument. Estimation is conducted via its fit() method:

In [15]: import numpy as np

In [16]: from sklearn.linear_model import LassoCV, LogisticRegressionCV

In [17]: from doubleml.datasets import make_ssm_data

In [18]: import doubleml as dml

In [19]: np.random.seed(42)

In [20]: n_obs = 2000

In [21]: df = make_ssm_data(n_obs=n_obs, mar=False, return_type='DataFrame')

In [22]: dml_data = dml.DoubleMLData(df, 'y', 'd', z_cols='z', s_col='s')

In [23]: ml_g = LassoCV()

In [24]: ml_m = LogisticRegressionCV(penalty='l1', solver='liblinear')

In [25]: ml_pi = LogisticRegressionCV(penalty='l1', solver='liblinear')

In [26]: dml_ssm = dml.DoubleMLSSM(dml_data, ml_g, ml_m, ml_pi, score='nonignorable')

In [27]: dml_ssm.fit()
Out[27]: <doubleml.irm.ssm.DoubleMLSSM at 0x7f475a612540>

In [28]: print(dml_ssm)
================== DoubleMLSSM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): ['z']
Score/Selection variable: s
No. Observations: 2000

------------------ Score & algorithm ------------------
Score function: nonignorable

------------------ Machine learner   ------------------
Learner ml_g: LassoCV()
Learner ml_pi: LogisticRegressionCV(penalty='l1', solver='liblinear')
Learner ml_m: LogisticRegressionCV(penalty='l1', solver='liblinear')
Out-of-sample Performance:
Regression:
Learner ml_g_d0 RMSE: [[0.92827999]]
Learner ml_g_d1 RMSE: [[1.10079785]]
Classification:
Learner ml_pi Log Loss: [[0.44124313]]
Learner ml_m Log Loss: [[0.59854797]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
      coef   std err         t         P>|t|     2.5 %    97.5 %
d  1.14268  0.183373  6.231467  4.620874e-10  0.783276  1.502084
library(DoubleML)
library(mlr3)
library(data.table)

set.seed(3141)
n_obs = 2000
df = make_ssm_data(n_obs=n_obs, mar=FALSE, return_type="data.table")
dml_data = DoubleMLData$new(df, y_col="y", d_cols="d", z_cols = "z", s_col="s")

ml_g = lrn("regr.cv_glmnet", nfolds = 5, s = "lambda.min")
ml_m = lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min")
ml_pi = lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min")

dml_ssm = DoubleMLSSM$new(dml_data, ml_g, ml_m, ml_pi, score="nonignorable")
dml_ssm$fit()
print(dml_ssm)
================= DoubleMLSSM Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s): z
Selection variable: s
No. Observations: 2000

------------------ Score & algorithm ------------------
Score function: nonignorable
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.cv_glmnet
ml_pi: classif.cv_glmnet
ml_m: classif.cv_glmnet

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)    
d   0.89273    0.06161   14.49   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


3.5. Regression Discontinuity Designs (RDD)#

Regression Discontinuity Designs (RDD) are causal inference methods used when treatment assignment is determined by a continuous running variable (“score”) crossing a known threshold (“cutoff”). These designs exploit discontinuities in the probability of receiving treatment at the cutoff to estimate the average treatment effect. RDDs are divided into two main types: Sharp and Fuzzy.

The key idea behind RDD is that units just above and just below the threshold are assumed to be comparable, differing only in the treatment assignment. This allows estimating the causal effect at the threshold by comparing outcomes of treated and untreated units.

Our implementation follows work from Noack, Olma and Rothe (2024).

Let \(Y_i\) be the observed outcome of an individual and \(D_i\) the treatment it received. By using a set of additional covariates \(X_i\) for each observation, \(Y_i\) and \(D_i\) can be adjusted in a first stage, to reduce the standard deviation in the estimation of the causal effect.

Note

To fit into the package syntax, our notation differs as follows from the one used in most standard RDD works (as for example Cattaneo and Titiunik (2022)):
  • \(S_i\) the score (instead of \(X_i\))

  • \(X_i\) the covariates (instead of \(Z_i\))

  • \(D_i\) the treatment received (in sharp RDD instead of \(T_i\))

  • \(T_i\) the treatment assigned (only relevant in fuzzy RDD)

Note

The doubleml.rdd module depends on rdrobust which can be installed via pip install rdrobust or pip install doubleml[rdd].

3.5.1. Sharp Regression Discontinuity Design#

In a Sharp RDD, the treatment \(D_i\) is deterministically assigned at the cutoff (\(D_i = \mathbb{1}\{S_i \geq c\}\)).

Let \(S_i\) represent the score, and let \(c\) denote the cutoff point. Further, let \(Y_i(1)\) and \(Y_i(0)\) denote the potential outcomes with and without treatment, respectively. Then, the treatment effect at the cutoff

\[\tau_0 = \mathbb{E}[Y_i(1)-Y_i(0)\mid S_i = c]\]

is identified as the difference in the conditional expectation of \(Y_i\) at the cutoff from both sides

\[\tau_0 = \lim_{s \to c^+} \mathbb{E}[Y_i \mid S_i = s] - \lim_{s \to c^-} \mathbb{E}[Y_i \mid S_i = s]\]

The key assumption for identifying this effect in a sharp RDD is:

  • Continuity: The conditional mean of the potential outcomes \(\mathbb{E}[Y_i(d)\mid S_i=s]\) for \(d \in \{0, 1\}\) is continuous at the cutoff level \(c\).

This includes the necessary condition of exogeneity, implying units cannot perfectly manipulate their value of \(S_i\) to either receive or avoid treatment exactly at the cutoff.

Without the use of covariates, \(\tau_{0}\) is typically estimated by running separate local linear regressions on each side of the cutoff, yielding an estimator of the form:

\[\hat{\tau}_{\text{base}}(h) = \sum_{i=1}^n w_i(h)Y_i,\]

where the \(w_i(h)\) are local linear regression weights that depend on the data through the realizations of the running variable only and \(h > 0\) is a bandwidth.

Under standard conditions, which include that the running variable is continuously distributed, and that the bandwidth \(h\) tends to zero at an appropriate rate, the estimator \(\hat{\tau}_{\text{base}}(h)\) is approximately normally distributed in large samples, with bias of order \(h^2\) and variance of order \((nh)^{-1}\):

\[\hat{\tau}_{\text{base}}(h) \stackrel{a}{\sim} N\left(\tau + h^2 B_{\text{base}},(nh)^{-1}V_{\text{base}}\right).\]

If covariates are available, they can be used to improve the accuracy of empirical RD estimates. The most popular strategy is to include them linearly and without kernel localization in the local linear regression. By simple least squares algebra, this “linear adjustment” estimator can be written as a no-covariates estimator with the covariate-adjusted outcome \(Y_i - X_i^{\top} \widehat{\gamma}_h\):

\[\widehat{\tau}_{\text{lin}}(h) = \sum_{i=1}^n w_i(h)\left(Y_i - X_i^{\top} \widehat{\gamma}_h\right).\]

Here, \(\widehat{\gamma}_h\) is the minimizer from the regression

\[\underset{\beta,\gamma}{\mathrm{arg\,min}} \, \sum_{i=1}^n K_h(S_i) (Y_i - Q_i^\top\beta- X_i^{\top}\gamma )^2,\]

with \(Q_i =(D_i, S_i, D_i S_i, 1)^T\) (see fs_specification in Implementation Details), \(K_h(v)=K(v/h)/h\) with \(K(\cdot)\) a kernel function.

If \(\mathbb{E}[X_i | S_i = s]\) is twice continuously differentiable around the cutoff, then the distribution of \(\widehat{\tau}_{\text{lin}}(h)\) is similar to the one of the base estimator with potentially smaller variance term \(V_{\text{lin}}\).

As this linear adjustment might not exploit the available covariate information efficiently, DoubleML features an RDD estimator with flexible covariate adjustment based on potentially nonlinear adjustment functions \(\eta\). The estimator takes the following form:

\[\widehat{\tau}_{\text{RDFlex}}(h; \eta) = \sum_{i=1}^n w_i(h) M_i(\eta), \quad M_i(\eta) = Y_i - \eta(X_i).\]

Similar to other algorithms in DoubleML, \(\eta\) is estimated by ML methods and with crossfitting. Different than in other models, there is no orthogonal score, but a similar global insensitivity property holds (for details see Noack, Olma and Rothe (2024)). We adjust the outcome variable by the influence of the covariates.

This reduces the variance in the estimation potentially even further to:

\[V(\eta) = \frac{\bar{\kappa}}{f_X(0)} \left( \mathbb{V}[M_i(\eta) | S_i = 0^+] + \mathbb{V}[M_i(\eta) | S_i = 0^-] \right).\]

with \(\bar{\kappa}\) being a kernel constant. To maximize the precision of the estimator \(\widehat\tau(h;\eta)\) for any particular bandwidth \(h\), \(\eta\) has to be chosen such that \(V(\eta)\) is as small as possible. The equally-weighted average of the left and right limits of the conditional expectation function \(\mathbb{E}[Y_i|S_i=s,X_i=x]\) at the cutoff achieves this goal. According to Noack, Olma and Rothe (2024), it holds:

\[V(\eta) \geq V(\eta_0) \text{ for all } \eta,\]

where:

\[\eta_0(x) = \frac{1}{2} \left( \mu_0^+(x) + \mu_0^-(x) \right), \quad \mu_0^\star(x) = \mathbb{E}[Y_i | S_i = 0^\star, X_i = x] \text{ for } \star \in \{+, -\}.\]

RDFlex implements this regression discontinuity design with \(\eta_0\) being estimated by user-specified ML methods. The indicator fuzzy=False indicates a sharp design. The DoubleMLData object has to be defined with the arguments:

  • y_col refers to the observed outcome, on which we want to estimate the effect at the cutoff

  • s_col refers to the score

  • x_cols refers to the covariates to be adjusted for

  • d_cols is an indicator of whether an observation is treated or not. In the sharp design, this should be identical to an indicator of whether an observation is left or right of the cutoff (\(D_i = \mathbb{I}[S_i > c]\))

Estimation is conducted via its fit() method:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: from sklearn.linear_model import LassoCV

In [4]: from doubleml.rdd.datasets import make_simple_rdd_data

In [5]: from doubleml.rdd import RDFlex

In [6]: import doubleml as dml

In [7]: np.random.seed(42)

In [8]: data_dict = make_simple_rdd_data(n_obs=1000, fuzzy=False)

In [9]: cov_names = ['x' + str(i) for i in range(data_dict['X'].shape[1])]

In [10]: df = pd.DataFrame(np.column_stack((data_dict['Y'], data_dict['D'], data_dict['score'], data_dict['X'])), columns=['y', 'd', 'score'] + cov_names)

In [11]: dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=cov_names, s_col='score')

In [12]: ml_g = LassoCV()

In [13]: rdflex_obj = RDFlex(dml_data, ml_g, fuzzy=False)

In [14]: rdflex_obj.fit()
Out[14]: <doubleml.rdd.rdd.RDFlex at 0x7f475ae1be00>

In [15]: print(rdflex_obj)
Method             Coef.     S.E.     t-stat       P>|t|           95% CI
-------------------------------------------------------------------------
Conventional      1.290     0.565     2.285    2.232e-02  [0.183, 2.396]
Robust                 -        -     2.053    4.005e-02  [0.062, 2.660]
Design Type:        Sharp
Cutoff:             0
First Stage Kernel: triangular
Final Bandwidth:    [0.63117637]

3.5.2. Fuzzy Regression Discontinuity Design#

In a Fuzzy RDD, treatment assignment \(T_i\) is identical to the sharp RDD (\(T_i = \mathbb{1}\{S_i \geq c\}\)), however, compliance is limited around the cutoff which leads to a different treatment received \(D_i\) than assigned (\(D_i \neq T_i\)) for some units.

The parameter of interest in the Fuzzy RDD is the average treatment effect at the cutoff, for all individuals that comply with the assignment

\[\theta_{0} = \mathbb{E}[Y_i(1)-Y_i(0)\mid S_i = c, \{i\in \text{compliers}\}]\]

with \(Y_i(D_i(T_i))\) being the potential outcome under the potential treatments. This effect is identified by

\[\theta_{0} = \frac{\lim_{s \to c^+} \mathbb{E}[Y_i \mid S_i = s] - \lim_{s \to c^-} \mathbb{E}[Y_i \mid S_i = s]}{\lim_{s \to c^+} \mathbb{E}[D_i \mid S_i = s] - \lim_{s \to c^-} \mathbb{E}[D_i \mid S_i = s]}\]

The assumptions for identifying the ATT in a fuzzy RDD are:

  • Continuity of Potential Outcomes: Similar to sharp RDD, the conditional mean of the potential outcomes \(\mathbb{E}[Y_i(d)\mid S_i=s]\) for \(d \in \{0, 1\}\) is continuous at the cutoff level \(c\).

  • Continuity of Treatment Assignment Probability: The probability of receiving treatment \(\mathbb{E}[D_i | S_i = s]\) must change discontinuously at the cutoff, but there should be no other jumps in the probability.

  • Monotonicity: There must be no “defiers”, meaning individuals for whom the treatment assignment goes in the opposite direction of the score.

Under similar considerations as in the sharp case, an estimator using flexible covariate adjustment can be derived as:

\[\hat{\theta}(h; \widehat{\eta}_Y, \widehat{\eta}_D) = \frac{\hat{\tau}_Y(h; \widehat{\eta}_Y)}{\hat{\tau}_D(h; \widehat{\eta}_D)} = \frac{\sum_{i=1}^n w_{i}(h) (Y_i - \widehat{\eta}_{Y}(X_i))}{\sum_{i=1}^n w_{i}(h) (T_i - \widehat{\eta}_{D}(X_i))},\]

where \(\eta_Y\) and \(\eta_D\) are defined as in the sharp RDD setting, with the respective outcome.

RDFlex implements this fuzzy RDD with flexible covariate adjustment. The indicator fuzzy=True indicates a fuzzy design. The DoubleMLData object has to be defined with the arguments:

  • y_col refers to the observed outcome, on which we want to estimate the effect at the cutoff

  • s_col refers to the score

  • x_cols refers to the covariates to be adjusted for

  • d_cols is an indicator of whether an observation is treated or not. In the fuzzy design, this should not be identical to an indicator of whether an observation is left or right of the cutoff (\(D_i \neq \mathbb{I}[S_i > c]\))

Estimation is conducted via its fit() method:

In [16]: import numpy as np

In [17]: import pandas as pd

In [18]: from sklearn.linear_model import LassoCV, LogisticRegressionCV

In [19]: from doubleml.rdd.datasets import make_simple_rdd_data

In [20]: from doubleml.rdd import RDFlex

In [21]: import doubleml as dml

In [22]: np.random.seed(42)

In [23]: data_dict = make_simple_rdd_data(n_obs=1000, fuzzy=True)

In [24]: cov_names = ['x' + str(i) for i in range(data_dict['X'].shape[1])]

In [25]: df = pd.DataFrame(np.column_stack((data_dict['Y'], data_dict['D'], data_dict['score'], data_dict['X'])), columns=['y', 'd', 'score'] + cov_names)

In [26]: dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=cov_names, s_col='score')

In [27]: ml_g = LassoCV()

In [28]: ml_m = LogisticRegressionCV()

In [29]: rdflex_obj = RDFlex(dml_data, ml_g, ml_m, fuzzy=True)

In [30]: rdflex_obj.fit()
Out[30]: <doubleml.rdd.rdd.RDFlex at 0x7f47599a3da0>

In [31]: print(rdflex_obj)
Method             Coef.     S.E.     t-stat       P>|t|           95% CI
-------------------------------------------------------------------------
Conventional      3.207     4.935     0.650    5.157e-01  [-6.464, 12.879]
Robust                 -        -     0.682    4.955e-01  [-7.313, 15.111]
Design Type:        Fuzzy
Cutoff:             0
First Stage Kernel: triangular
Final Bandwidth:    [0.61404894]

3.5.3. Implementation Details#

There are some specialities in the RDFlex implementation that differ from the rest of the package and thus deserve to be pointed out here.

  1. Bandwidth Selection: The bandwidth is a crucial tuning parameter for RDD algorithms. By default, our implementation uses the rdbwselect method from the rdrobust library for an initial selection. This can be overridden by the user using the parameter h_fs. Since covariate adjustment and RDD fitting are interacting, by default, we repeat the bandwidth selection and nuisance estimation steps once in the fit() method. This can be adjusted by n_iterations.

  2. Kernel Selection: Another crucial decision when estimating with RDD is the kernel determining the weights for observations around the cutoff. For this, the parameters fs_kernel and kernel are important. The latter is a key-worded argument and is used in the RDD estimation, while the fs_kernel specifies the kernel used in the nuisance estimation. By default, both of them are triangular.

  3. Local and Global Learners: RDFlex estimates the nuisance functions locally around the cutoff. In certain scenarios, it can be desirable to rather perform a global fit on the full support of the score \(S\). For this, the Global Learners in doubleml.utils can be used (see our example notebook in the Example Gallery).

  4. First Stage Specifications: In nuisance estimation, we have to add variable(s) to add information about the location of the observation left or right of the cutoff. Available options are: In the default case fs_specification="cutoff", this is an indicator of whether the observation is left or right. If fs_specification="cutoff and score", additionally the score is added. In the case of fs_specification="interacted cutoff and score", also an interaction term of the cutoff indicator and the score is added.

  5. Intention-to-Treat Effects: Above, we demonstrated how to estimate the ATE at the cutoff in a fuzzy RDD. To estimate an Intention-to-Treat effect instead, the parameter fuzzy=False can be selected.

  6. Key-worded Arguments: rdrobust as the underlying RDD library has additional parameters to tune the estimation. You can use **kwargs to add them via RDFlex.