3. Models#
3.1. Partially linear regression model (PLR)#
Partially linear regression (PLR) models take the form

\[Y = D \theta_0 + g_0(X) + \zeta, \quad \mathbb{E}[\zeta \mid D, X] = 0,\]

\[D = m_0(X) + V, \quad \mathbb{E}[V \mid X] = 0,\]
where \(Y\) is the outcome variable and \(D\) is the policy variable of interest. The high-dimensional vector \(X = (X_1, \ldots, X_p)\) consists of other confounding covariates, and \(\zeta\) and \(V\) are stochastic errors.
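Under these assumptions, \(\theta_0\) satisfies the standard Robinson-style partialling-out identity: writing \(\ell_0(X) := \mathbb{E}[Y \mid X]\),

\[\theta_0 = \frac{\mathbb{E}[(Y - \ell_0(X))(D - m_0(X))]}{\mathbb{E}[(D - m_0(X))^2]},\]

which is why the examples below fit one learner for \(\ell_0\) (ml_l) and one for \(m_0\) (ml_m).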
![Causal diagram](../_images/graphviz-8852e5db087f49410d0a5212d9b7fdcb58f0aaf9.png)
Causal diagram#
DoubleMLPLR implements PLR models. Estimation is conducted via its fit() method:
In [1]: import numpy as np
In [2]: import doubleml as dml
In [3]: from doubleml.datasets import make_plr_CCDDHNR2018
In [4]: from sklearn.ensemble import RandomForestRegressor
In [5]: from sklearn.base import clone
In [6]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [7]: ml_l = clone(learner)
In [8]: ml_m = clone(learner)
In [9]: np.random.seed(1111)
In [10]: data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
In [11]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [12]: dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m)
In [13]: print(dml_plr_obj.fit())
================== DoubleMLPLR Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2
------------------ Machine learner ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 0.54137 0.042959 12.602064 2.056917e-36 0.457172 0.625568
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
lgr::get_logger("mlr3")$set_threshold("warn")
learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
set.seed(1111)
data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_l, ml_m)
dml_plr_obj$fit()
print(dml_plr_obj)
================= DoubleMLPLR Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s):
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2
------------------ Machine learner ------------------
ml_l: regr.ranger
ml_m: regr.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 0.47659 0.04166 11.44 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
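Once fitted, the object also exposes further inference utilities. Continuing the Python example above, a minimal sketch using the confint() and bootstrap() methods of the doubleml package:

# two-sided confidence interval for the coefficient on d
print(dml_plr_obj.confint(level=0.95))
# multiplier bootstrap, required before requesting joint confidence intervals
dml_plr_obj.bootstrap(method='normal', n_rep_boot=500)
print(dml_plr_obj.confint(joint=True))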
3.2. Partially linear IV regression model (PLIV)#
Partially linear IV regression (PLIV) models take the form

\[Y - D \theta_0 = g_0(X) + \zeta, \quad \mathbb{E}[\zeta \mid Z, X] = 0,\]

\[Z = m_0(X) + V, \quad \mathbb{E}[V \mid X] = 0,\]
where \(Y\) is the outcome variable, \(D\) is the policy variable of interest and \(Z\) denotes one or multiple instrumental variables. The high-dimensional vector \(X = (X_1, \ldots, X_p)\) consists of other confounding covariates, and \(\zeta\) and \(V\) are stochastic errors.
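Denoting \(\ell_0(X) := \mathbb{E}[Y \mid X]\) and \(r_0(X) := \mathbb{E}[D \mid X]\), the partialling-out moment condition underlying estimation reads

\[\mathbb{E}\big[\big(Y - \ell_0(X) - \theta_0 (D - r_0(X))\big)\big(Z - m_0(X)\big)\big] = 0,\]

so the example below fits three learners: ml_l for \(\ell_0\), ml_m for \(m_0\) and ml_r for \(r_0\).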
![Causal diagram](../_images/graphviz-21b721be23729673da52c6a08d58e43dd1769d11.png)
Causal diagram#
DoubleMLPLIV implements PLIV models. Estimation is conducted via its fit() method:
In [14]: import numpy as np
In [15]: import doubleml as dml
In [16]: from doubleml.datasets import make_pliv_CHS2015
In [17]: from sklearn.ensemble import RandomForestRegressor
In [18]: from sklearn.base import clone
In [19]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [20]: ml_l = clone(learner)
In [21]: ml_m = clone(learner)
In [22]: ml_r = clone(learner)
In [23]: np.random.seed(2222)
In [24]: data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type='DataFrame')
In [25]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='Z1')
In [26]: dml_pliv_obj = dml.DoubleMLPLIV(obj_dml_data, ml_l, ml_m, ml_r)
In [27]: print(dml_pliv_obj.fit())
================== DoubleMLPLIV Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['Z1']
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2
------------------ Machine learner ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 0.472449 0.091465 5.165344 2.399971e-07 0.293181 0.651717
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
ml_r = learner$clone()
set.seed(2222)
data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_col = "d", z_cols= "Z1")
dml_pliv_obj = DoubleMLPLIV$new(obj_dml_data, ml_l, ml_m, ml_r)
dml_pliv_obj$fit()
print(dml_pliv_obj)
================= DoubleMLPLIV Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s): Z1
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2
------------------ Machine learner ------------------
ml_l: regr.ranger
ml_m: regr.ranger
ml_r: regr.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 0.66184 0.07786 8.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
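DoubleMLData accepts a list of instrument columns, so PLIV models with several instruments follow the same pattern. A minimal Python sketch reusing the learner from above (the column names Z1, Z2 follow the naming convention of make_pliv_CHS2015):

np.random.seed(2222)
# generate data with two instruments instead of one
data_multi = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=2, return_type='DataFrame')
obj_dml_data_multi = dml.DoubleMLData(data_multi, 'y', 'd', z_cols=['Z1', 'Z2'])
dml_pliv_multi = dml.DoubleMLPLIV(obj_dml_data_multi, clone(learner), clone(learner), clone(learner))
print(dml_pliv_multi.fit().summary)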
3.3. Interactive regression model (IRM)#
Interactive regression (IRM) models take the form

\[Y = g_0(D, X) + U, \quad \mathbb{E}[U \mid X, D] = 0,\]

\[D = m_0(X) + V, \quad \mathbb{E}[V \mid X] = 0,\]

where the treatment variable is binary, \(D \in \lbrace 0,1 \rbrace\). We consider estimation of average treatment effects when treatment effects are fully heterogeneous. Target parameters of interest in this model are the average treatment effect (ATE),

\[\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0, X)],\]

and the average treatment effect of the treated (ATTE),

\[\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0, X) \mid D = 1].\]
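For the ATE, the estimate reported below rests on the doubly robust (AIPW) score

\[\psi(W; \theta, \eta) = g_0(1, X) - g_0(0, X) + \frac{D (Y - g_0(1, X))}{m_0(X)} - \frac{(1 - D)(Y - g_0(0, X))}{1 - m_0(X)} - \theta,\]

where \(g_0\) is estimated by the regression learner ml_g and the propensity score \(m_0(X) = \Pr(D = 1 \mid X)\) by the classification learner ml_m.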
![Causal diagram](../_images/graphviz-8852e5db087f49410d0a5212d9b7fdcb58f0aaf9.png)
Causal diagram#
DoubleMLIRM implements IRM models. Estimation is conducted via its fit() method:
In [28]: import numpy as np
In [29]: import doubleml as dml
In [30]: from doubleml.datasets import make_irm_data
In [31]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [32]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [33]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [34]: np.random.seed(3333)
In [35]: data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')
In [36]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [37]: dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
In [38]: print(dml_irm_obj.fit())
================== DoubleMLIRM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2
------------------ Machine learner ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 0.667284 0.188651 3.537135 0.000404 0.297535 1.037034
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
set.seed(3333)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
dml_irm_obj$fit()
print(dml_irm_obj)
================= DoubleMLIRM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s):
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2
------------------ Machine learner ------------------
ml_g: regr.ranger
ml_m: classif.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 0.6695 0.2097 3.192 0.00141 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
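The ATTE is targeted by passing a different score when constructing the model object. A minimal Python sketch reusing the objects from above (score='ATTE' is an option of DoubleMLIRM):

np.random.seed(3333)
# same data and learners, but with the ATTE score instead of the default ATE
dml_irm_atte = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m, score='ATTE')
print(dml_irm_atte.fit().summary)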
3.4. Interactive IV model (IIVM)#
Interactive IV regression (IIVM) models take the form

\[Y = \ell_0(D, X) + \zeta, \quad \mathbb{E}[\zeta \mid Z, X] = 0,\]

\[Z = m_0(X) + V, \quad \mathbb{E}[V \mid X] = 0,\]

where the treatment variable is binary, \(D \in \lbrace 0,1 \rbrace\), and the instrument is binary, \(Z \in \lbrace 0,1 \rbrace\). Consider the functions \(g_0\), \(r_0\) and \(m_0\), where \(g_0\) maps the support of \((Z,X)\) to \(\mathbb{R}\) and \(r_0\) and \(m_0\) respectively map the support of \((Z,X)\) and \(X\) to \((\varepsilon, 1-\varepsilon)\) for some \(\varepsilon \in (0, 1/2)\), such that

\[Y = g_0(Z, X) + \nu, \quad \mathbb{E}[\nu \mid Z, X] = 0,\]

\[D = r_0(Z, X) + U, \quad \mathbb{E}[U \mid Z, X] = 0.\]

The target parameter of interest in this model is the local average treatment effect (LATE),

\[\theta_0 = \frac{\mathbb{E}[g_0(1, X)] - \mathbb{E}[g_0(0, X)]}{\mathbb{E}[r_0(1, X)] - \mathbb{E}[r_0(0, X)]}.\]
![Causal diagram](../_images/graphviz-21b721be23729673da52c6a08d58e43dd1769d11.png)
Causal diagram#
DoubleMLIIVM implements IIVM models. Estimation is conducted via its fit() method:
In [39]: import numpy as np
In [40]: import doubleml as dml
In [41]: from doubleml.datasets import make_iivm_data
In [42]: from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
In [43]: ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [44]: ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [45]: ml_r = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [46]: np.random.seed(4444)
In [47]: data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1.0, return_type='DataFrame')
In [48]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='z')
In [49]: dml_iivm_obj = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r)
In [50]: print(dml_iivm_obj.fit())
================== DoubleMLIIVM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['z']
No. Observations: 1000
------------------ Score & algorithm ------------------
Score function: LATE
DML algorithm: dml2
------------------ Machine learner ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 0.488343 0.222121 2.198546 0.02791 0.052994 0.923692
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
set.seed(4444)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_r = ml_m$clone()
data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d", z_cols="z")
dml_iivm_obj = DoubleMLIIVM$new(obj_dml_data, ml_g, ml_m, ml_r)
dml_iivm_obj$fit()
print(dml_iivm_obj)
================= DoubleMLIIVM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20
Instrument(s): z
No. Observations: 1000
------------------ Score & algorithm ------------------
Score function: LATE
DML algorithm: dml2
------------------ Machine learner ------------------
ml_g: regr.ranger
ml_m: classif.ranger
ml_r: classif.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 0.3569 0.1990 1.793 0.073 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
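If one-sided compliance is known a priori, the Python class DoubleMLIIVM also accepts a subgroups argument indicating whether always-takers and never-takers are present in the sample. A sketch reusing the objects from the Python example above, under the assumption that there are no always-takers:

np.random.seed(4444)
# same data and learners; encode the assumed absence of always-takers
dml_iivm_no_at = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r,
                                  subgroups={'always_takers': False, 'never_takers': True})
print(dml_iivm_no_at.fit().summary)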