3. Models#
The DoubleML includes the following models.
3.1. Partially linear models (PLM)#
The partially linear models (PLM) take the form
where treatment effects are additive with some sort of linear form.
3.1.1. Partially linear regression model (PLR)#
Partially linear regression (PLR) models take the form
where \(Y\) is the outcome variable and \(D\) is the policy variable of interest. The high-dimensional vector \(X = (X_1, \ldots, X_p)\) consists of other confounding covariates, and \(\zeta\) and \(V\) are stochastic errors.
DoubleMLPLR
implements PLR models.
Estimation is conducted via its fit()
method:
In [1]: import numpy as np
In [2]: import doubleml as dml
In [3]: from doubleml.datasets import make_plr_CCDDHNR2018
In [4]: from sklearn.ensemble import RandomForestRegressor
In [5]: from sklearn.base import clone
In [6]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [7]: ml_l = clone(learner)
In [8]: ml_m = clone(learner)
In [9]: np.random.seed(1111)
In [10]: data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
In [11]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [12]: dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m)
In [13]: print(dml_plr_obj.fit())
================== DoubleMLPLR Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: partialling out
------------------ Machine learner ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[1.18356413]]
Learner ml_m RMSE: [[1.06008533]]
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 0.512672 0.04491 11.41566 3.492417e-30 0.424651 0.600694
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
lgr::get_logger("mlr3")$set_threshold("warn")
learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
set.seed(1111)
data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='data.table')
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_l, ml_m)
dml_plr_obj$fit()
print(dml_plr_obj)
================= DoubleMLPLR Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s):
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2
------------------ Machine learner ------------------
ml_l: regr.ranger
ml_m: regr.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 0.47319 0.04165 11.36 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
3.1.2. Partially linear IV regression model (PLIV)#
Partially linear IV regression (PLIV) models take the form
where \(Y\) is the outcome variable, \(D\) is the policy variable of interest and \(Z\) denotes one or multiple instrumental variables. The high-dimensional vector \(X = (X_1, \ldots, X_p)\) consists of other confounding covariates, and \(\zeta\) and \(V\) are stochastic errors.
DoubleMLPLIV
implements PLIV models.
Estimation is conducted via its fit()
method:
In [14]: import numpy as np
In [15]: import doubleml as dml
In [16]: from doubleml.datasets import make_pliv_CHS2015
In [17]: from sklearn.ensemble import RandomForestRegressor
In [18]: from sklearn.base import clone
In [19]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [20]: ml_l = clone(learner)
In [21]: ml_m = clone(learner)
In [22]: ml_r = clone(learner)
In [23]: np.random.seed(2222)
In [24]: data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type='DataFrame')
In [25]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='Z1')
In [26]: dml_pliv_obj = dml.DoubleMLPLIV(obj_dml_data, ml_l, ml_m, ml_r)
In [27]: print(dml_pliv_obj.fit())
================== DoubleMLPLIV Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['Z1']
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: partialling out
------------------ Machine learner ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[1.48390784]]
Learner ml_m RMSE: [[0.53209683]]
Learner ml_r RMSE: [[1.25240463]]
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 0.481705 0.084337 5.711638 1.118938e-08 0.316407 0.647004
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
ml_r = learner$clone()
set.seed(2222)
data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_col = "d", z_cols= "Z1")
dml_pliv_obj = DoubleMLPLIV$new(obj_dml_data, ml_l, ml_m, ml_r)
dml_pliv_obj$fit()
print(dml_pliv_obj)
================= DoubleMLPLIV Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s): Z1
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2
------------------ Machine learner ------------------
ml_l: regr.ranger
ml_m: regr.ranger
ml_r: regr.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 0.66133 0.07796 8.483 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
3.2. Interactive regression models (IRM)#
The interactive regression model (IRM) take the form
where treatment effects are fully heterogeneous.
3.2.1. Binary Interactive Regression Model (IRM)#
Interactive regression (IRM) models take the form
where the treatment variable is binary, \(D \in \lbrace 0,1 \rbrace\). We consider estimation of the average treatment effects when treatment effects are fully heterogeneous.
Target parameters of interest in this model are the average treatment effect (ATE),
and the average treatment effect of the treated (ATTE),
DoubleMLIRM
implements IRM models.
Estimation is conducted via its fit()
method:
In [28]: import numpy as np
In [29]: import doubleml as dml
In [30]: from doubleml.datasets import make_irm_data
In [31]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [32]: ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
In [33]: ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
In [34]: np.random.seed(3333)
In [35]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')
In [36]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [37]: dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
In [38]: print(dml_irm_obj.fit())
================== DoubleMLIRM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10']
Instrument variable(s): None
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: ATE
------------------ Machine learner ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=10, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=10, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[1.07085301]]
Learner ml_g1 RMSE: [[1.09682314]]
Classification:
Learner ml_m Log Loss: [[0.55863386]]
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 0.599297 0.1887 3.175931 0.001494 0.229452 0.969141
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
set.seed(3333)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
dml_irm_obj$fit()
print(dml_irm_obj)
================= DoubleMLIRM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
Instrument(s):
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2
------------------ Machine learner ------------------
ml_g: regr.ranger
ml_m: classif.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 0.3958 0.2651 1.493 0.135
3.2.2. Average Potential Outcomes (APOs)#
For general discrete-values treatments \(D \in \lbrace d_0, \dots, d_l \rbrace\) the model can be generalized to
where \(A_j := 1\lbrace D = d_j\rbrace\) is an indicator variable for treatment level \(d_j\) and \(m_{0,j}(X)\) denotes the corresponding propensity score.
Possible target parameters of interest in this model are the average potential outcomes (APOs)
DoubleMLAPO
implements the estimation of average potential outcomes.
Estimation is conducted via its fit()
method:
In [39]: import numpy as np
In [40]: import doubleml as dml
In [41]: from doubleml.datasets import make_irm_data
In [42]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [43]: ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
In [44]: ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
In [45]: np.random.seed(3333)
In [46]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')
In [47]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [48]: dml_apo_obj = dml.DoubleMLAPO(obj_dml_data, ml_g, ml_m, treatment_level=0)
In [49]: print(dml_apo_obj.fit())
================== DoubleMLAPO Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10']
Instrument variable(s): None
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: APO
------------------ Machine learner ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=10, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=10, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[1.08191204]]
Learner ml_g1 RMSE: [[1.06694255]]
Classification:
Learner ml_m Log Loss: [[0.55863386]]
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d -0.119669 0.137529 -0.870142 0.384223 -0.38922 0.149882
3.2.3. Average Potential Outcomes (APOs) for Multiple Treatment Levels#
If multiple treatment levels should be estimated simulatenously, another possible target parameter of interest in this model are contrasts (or average treatment effects) between treatment levels \(d_j\) and \(d_k\):
DoubleMLAPOS
implements the estimation of average potential outcomes for multiple treatment levels.
Estimation is conducted via its fit()
method. The causal_contrast()
method allows to estimate causal contrasts between treatment levels:
In [50]: import numpy as np
In [51]: import doubleml as dml
In [52]: from doubleml.datasets import make_irm_data
In [53]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [54]: ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
In [55]: ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
In [56]: np.random.seed(3333)
In [57]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')
In [58]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [59]: dml_apos_obj = dml.DoubleMLAPOS(obj_dml_data, ml_g, ml_m, treatment_levels=[0, 1])
In [60]: print(dml_apos_obj.fit())
================== DoubleMLAPOS Object ==================
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
0 -0.162784 0.180190 -0.903406 0.366310 -0.515950 0.190381
1 0.526102 0.137809 3.817628 0.000135 0.256002 0.796203
In [61]: causal_contrast_model = dml_apos_obj.causal_contrast(reference_levels=0)
In [62]: print(causal_contrast_model.summary)
coef std err t P>|t| 2.5 % 97.5 %
1 vs 0 0.688887 0.228214 3.018602 0.002539 0.241596 1.136178
3.2.4. Interactive IV model (IIVM)#
Interactive IV regression (IIVM) models take the form
where the treatment variable is binary, \(D \in \lbrace 0,1 \rbrace\) and the instrument is binary, \(Z \in \lbrace 0,1 \rbrace\). Consider the functions \(g_0\), \(r_0\) and \(m_0\), where \(g_0\) maps the support of \((Z,X)\) to \(\mathbb{R}\) and \(r_0\) and \(m_0\) respectively map the support of \((Z,X)\) and \(X\) to \((\varepsilon, 1-\varepsilon)\) for some \(\varepsilon \in (0, 1/2)\), such that
The target parameter of interest in this model is the local average treatment effect (LATE),
DoubleMLIIVM
implements IIVM models.
Estimation is conducted via its fit()
method:
In [63]: import numpy as np
In [64]: import doubleml as dml
In [65]: from doubleml.datasets import make_iivm_data
In [66]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [67]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [68]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [69]: ml_r = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [70]: np.random.seed(4444)
In [71]: data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1.0, return_type='DataFrame')
In [72]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='z')
In [73]: dml_iivm_obj = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r)
In [74]: print(dml_iivm_obj.fit())
================== DoubleMLIIVM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['z']
No. Observations: 1000
------------------ Score & algorithm ------------------
Score function: LATE
------------------ Machine learner ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[1.12983057]]
Learner ml_g1 RMSE: [[1.13102231]]
Classification:
Learner ml_m Log Loss: [[0.69684828]]
Learner ml_r0 Log Loss: [[0.69508862]]
Learner ml_r1 Log Loss: [[0.43503345]]
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 0.389126 0.23113 1.683581 0.092263 -0.063881 0.842132
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
set.seed(4444)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_r = ml_m$clone()
data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d", z_cols="z")
dml_iivm_obj = DoubleMLIIVM$new(obj_dml_data, ml_g, ml_m, ml_r)
dml_iivm_obj$fit()
print(dml_iivm_obj)
================= DoubleMLIIVM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20
Instrument(s): z
No. Observations: 1000
------------------ Score & algorithm ------------------
Score function: LATE
DML algorithm: dml2
------------------ Machine learner ------------------
ml_g: regr.ranger
ml_m: classif.ranger
ml_r: classif.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 0.3568 0.1988 1.794 0.0727 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
3.3. Difference-in-Differences Models (DID)#
Difference-in-Differences Models (DID) implemented in the package focus on the the binary treatment case with with two treatment periods.
Adopting the notation from Sant’Anna and Zhao (2020), let \(Y_{it}\) be the outcome of interest for unit \(i\) at time \(t\). Further, let \(D_{it}=1\) indicate if unit \(i\) is treated before time \(t\) (otherwise \(D_{it}=0\)). Since all units start as untreated (\(D_{i0}=0\)), define \(D_{i}=D_{i1}.\) Relying on the potential outcome notation, denote \(Y_{it}(0)\) as the outcome of unit \(i\) at time \(t\) if the unit did not receive treatment up until time \(t\) and analogously for \(Y_{it}(1)\) with treatment. Consequently, the observed outcome for unit is \(i\) at time \(t\) is \(Y_{it}=D_{it} Y_{it}(1) + (1-D_{it}) Y_{it}(0)\). Further, let \(X_i\) be a vector of pre-treatment covariates.
Target parameter of interest is the average treatment effect on the treated (ATTE)
The corresponding identifying assumptions are
(Cond.) Parallel Trends: \(\mathbb{E}[Y_{i1}(0) - Y_{i0}(0)|X_i, D_i=1] = \mathbb{E}[Y_{i1}(0) - Y_{i0}(0)|X_i, D_i=0]\quad a.s.\)
Overlap: \(\exists\epsilon > 0\): \(P(D_i=1) > \epsilon\) and \(P(D_i=1|X_i) \le 1-\epsilon\quad a.s.\)
Note
For a more detailed introduction and recent developments of the difference-in-differences literature see e.g. Roth et al. (2022).
3.3.1. Panel data#
If panel data are available, the observations are assumed to be iid. of form \((Y_{i0}, Y_{i1}, D_i, X_i)\).
Remark that the difference \(\Delta Y_i= Y_{i1}-Y_{i0}\) has to be defined as the outcome y
in the DoubleMLData
object.
DoubleMLIDID
implements difference-in-differences models for panel data.
Estimation is conducted via its fit()
method:
In [75]: import numpy as np
In [76]: import doubleml as dml
In [77]: from doubleml.datasets import make_did_SZ2020
In [78]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [79]: ml_g = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=5)
In [80]: ml_m = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=5)
In [81]: np.random.seed(42)
In [82]: data = make_did_SZ2020(n_obs=500, return_type='DataFrame')
# y is already defined as the difference of observed outcomes
In [83]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [84]: dml_did_obj = dml.DoubleMLDID(obj_dml_data, ml_g, ml_m)
In [85]: print(dml_did_obj.fit())
================== DoubleMLDID Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: observational
------------------ Machine learner ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, min_samples_leaf=5)
Learner ml_m: RandomForestClassifier(max_depth=5, min_samples_leaf=5)
Out-of-sample Performance:
Regression:
Learner ml_g0 RMSE: [[16.27429763]]
Learner ml_g1 RMSE: [[13.35731523]]
Classification:
Learner ml_m Log Loss: [[0.66601815]]
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d -2.840718 1.760386 -1.613691 0.106595 -6.291011 0.609575
3.3.2. Repeated cross-sections#
For repeated cross-sections, the observations are assumed to be iid. of form \((Y_{i}, D_i, X_i, T_i)\), where \(T_i\) is a dummy variable if unit \(i\) is observed pre- or post-treatment period, such that the observed outcome can be defined as
Further, treatment and covariates are assumed to be stationary, such that the joint distribution of \((D,X)\) is invariant to \(T\).
DoubleMLIDIDCS
implements difference-in-differences models for repeated cross-sections.
Estimation is conducted via its fit()
method:
In [86]: import numpy as np
In [87]: import doubleml as dml
In [88]: from doubleml.datasets import make_did_SZ2020
In [89]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [90]: ml_g = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=5)
In [91]: ml_m = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=5)
In [92]: np.random.seed(42)
In [93]: data = make_did_SZ2020(n_obs=500, cross_sectional_data=True, return_type='DataFrame')
In [94]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', t_col='t')
In [95]: dml_did_obj = dml.DoubleMLDIDCS(obj_dml_data, ml_g, ml_m)
In [96]: print(dml_did_obj.fit())
================== DoubleMLDIDCS Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: observational
------------------ Machine learner ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, min_samples_leaf=5)
Learner ml_m: RandomForestClassifier(max_depth=5, min_samples_leaf=5)
Out-of-sample Performance:
Regression:
Learner ml_g_d0_t0 RMSE: [[17.4915707]]
Learner ml_g_d0_t1 RMSE: [[44.85397773]]
Learner ml_g_d1_t0 RMSE: [[32.74938952]]
Learner ml_g_d1_t1 RMSE: [[53.7282094]]
Classification:
Learner ml_m Log Loss: [[0.67936506]]
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d -4.9944 7.561785 -0.660479 0.508947 -19.815226 9.826426
3.4. Sample Selection Models (SSM)#
Sample Selection Models (SSM) implemented in the package focus on the the binary treatment case when outcomes are only observed for a subpopulation due to sample selection or outcome attrition.
The implementation and notation is based on Bia, Huber and Lafférs (2023). Let \(D_i\) be the binary treatment indicator and \(Y_{i}(d)\) the potential outcome under treatment value \(d\). Further, define \(Y_{i}:=Y_{i}(D)\) to be the realized outcome and \(S_{i}\) as a binary selection indicator. The outcome \(Y_{i}\) is only observed if \(S_{i}=1\). Finally, let \(X_i\) be a vector of observed covariates, measures prior to treatment assignment.
Target parameter of interest is the average treatment effect (ATE)
The corresponding identifying assumption is
Cond. Independence of Treatment: \(Y_i(d) \perp D_i|X_i\quad a.s.\) for \(d=0,1\)
where further assmputions are made in the context of the respective sample selection model.
Note
A more detailed example can be found in the Example Gallery.
3.4.1. Missingness at Random#
Consider the following two additional assumptions for the sample selection model:
Cond. Independence of Selection: \(Y_i(d) \perp S_i|D_i=d, X_i\quad a.s.\) for \(d=0,1\)
Common Support: \(P(D_i=1|X_i)>0\) and \(P(S_i=1|D_i=d, X_i)>0\) for \(d=0,1\)
such that outcomes are missing at random (for the score see Scores).
DoubleMLSSM
implements sample selection models. The score score='missing-at-random'
refers to the correponding score
relying on the assumptions above. The DoubleMLData
object has to be defined with the additional argument s_col
for the selection indicator.
Estimation is conducted via its fit()
method:
In [97]: import numpy as np
In [98]: from sklearn.linear_model import LassoCV, LogisticRegressionCV
In [99]: from doubleml.datasets import make_ssm_data
In [100]: import doubleml as dml
In [101]: np.random.seed(42)
In [102]: n_obs = 2000
In [103]: df = make_ssm_data(n_obs=n_obs, mar=True, return_type='DataFrame')
In [104]: dml_data = dml.DoubleMLData(df, 'y', 'd', s_col='s')
In [105]: ml_g = LassoCV()
In [106]: ml_m = LogisticRegressionCV(penalty='l1', solver='liblinear')
In [107]: ml_pi = LogisticRegressionCV(penalty='l1', solver='liblinear')
In [108]: dml_ssm = dml.DoubleMLSSM(dml_data, ml_g, ml_m, ml_pi, score='missing-at-random')
In [109]: dml_ssm.fit()
Out[109]: <doubleml.irm.ssm.DoubleMLSSM at 0x7f5a86d1a990>
In [110]: print(dml_ssm)
================== DoubleMLSSM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
Selection variable: s
No. Observations: 2000
------------------ Score & algorithm ------------------
Score function: missing-at-random
------------------ Machine learner ------------------
Learner ml_g: LassoCV()
Learner ml_pi: LogisticRegressionCV(penalty='l1', solver='liblinear')
Learner ml_m: LogisticRegressionCV(penalty='l1', solver='liblinear')
Out-of-sample Performance:
Regression:
Learner ml_g_d0 RMSE: [[1.10039862]]
Learner ml_g_d1 RMSE: [[1.11071087]]
Classification:
Learner ml_pi Log Loss: [[0.53791422]]
Learner ml_m Log Loss: [[0.63593298]]
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 0.965531 0.065969 14.636048 1.654070e-48 0.836234 1.094829
3.4.2. Nonignorable Nonresponse#
When sample selection or outcome attriction is realated to unobservables, identification generally requires an instrument for the selection indicator \(S_i\). Consider the following additional assumptions for the instrumental variable:
Cond. Correlation: \(\exists Z: \mathbb{E}[Z\cdot S|D,X] \neq 0\)
Cond. Independence: \(Y_i(d,z)=Y_i(d)\) and \(Y_i \perp Z_i|D_i=d, X_i\quad a.s.\) for \(d=0,1\)
This requires the instrumental variable \(Z_i\), which must not affect \(Y_i\) or be associated with unobservables affecting \(Y_i\) conditional on \(D_i\) and \(X_i\). Further, the selection is determined via a (unknown) threshold model:
Threshold: \(S_i = 1\{V_i \le \xi(D,X,Z)\}\) where \(\xi\) is a general function and \(V_i\) is a scalar with strictly monotonic cumulative distribution function conditional on \(X_i\).
Cond. Independence: \(Y_i \perp (Z_i, D_i)|X_i\).
Let \(\Pi_i := P(S_i=1|D_i, X_i, Z_i)\) denote the selection probability. Additionally, the following assumptions are required:
Common Support for Treatment: \(P(D_i=1|X_i, \Pi)>0\)
Cond. Effect Homogeneity: \(\mathbb{E}[Y_i(1)-Y_i(0)|S_i=1, X_i=x, V_i=v] = \mathbb{E}[Y_i(1)-Y_i(0)|X_i=x, V_i=v]\)
Common Support for Selection: \(P(S_i=1|D_i=d, X_i=x, Z_i=z)>0\quad a.s.\) for \(d=0,1\)
For further details, see Bia, Huber and Lafférs (2023).
DoubleMLSSM
implements sample selection models. The score score='nonignorable'
refers to the correponding score
relying on the assumptions above. The DoubleMLData
object has to be defined with the additional argument s_col
for the selection indicator
and z_cols
for the instrument.
Estimation is conducted via its fit()
method:
In [111]: import numpy as np
In [112]: from sklearn.linear_model import LassoCV, LogisticRegressionCV
In [113]: from doubleml.datasets import make_ssm_data
In [114]: import doubleml as dml
In [115]: np.random.seed(42)
In [116]: n_obs = 2000
In [117]: df = make_ssm_data(n_obs=n_obs, mar=False, return_type='DataFrame')
In [118]: dml_data = dml.DoubleMLData(df, 'y', 'd', z_cols='z', s_col='s')
In [119]: ml_g = LassoCV()
In [120]: ml_m = LogisticRegressionCV(penalty='l1', solver='liblinear')
In [121]: ml_pi = LogisticRegressionCV(penalty='l1', solver='liblinear')
In [122]: dml_ssm = dml.DoubleMLSSM(dml_data, ml_g, ml_m, ml_pi, score='nonignorable')
In [123]: dml_ssm.fit()
Out[123]: <doubleml.irm.ssm.DoubleMLSSM at 0x7f5a879c5d30>
In [124]: print(dml_ssm)
================== DoubleMLSSM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): ['z']
Selection variable: s
No. Observations: 2000
------------------ Score & algorithm ------------------
Score function: nonignorable
------------------ Machine learner ------------------
Learner ml_g: LassoCV()
Learner ml_pi: LogisticRegressionCV(penalty='l1', solver='liblinear')
Learner ml_m: LogisticRegressionCV(penalty='l1', solver='liblinear')
Out-of-sample Performance:
Regression:
Learner ml_g_d0 RMSE: [[0.92827999]]
Learner ml_g_d1 RMSE: [[1.10079785]]
Classification:
Learner ml_pi Log Loss: [[0.44124313]]
Learner ml_m Log Loss: [[0.59854797]]
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 1.14268 0.183373 6.231467 4.620874e-10 0.783276 1.502084