# 3. Models

## 3.1. Partially linear regression model (PLR)

Partially linear regression (PLR) models take the form

\begin{align}\begin{aligned}Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}(\zeta | D,X) = 0,\\D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align}

where $$Y$$ is the outcome variable and $$D$$ is the policy variable of interest. The high-dimensional vector $$X = (X_1, \ldots, X_p)$$ consists of other confounding covariates, and $$\zeta$$ and $$V$$ are stochastic errors.
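To make the two equations concrete, the following is a minimal sketch of a cross-fitted residual-on-residual ("partialling out") estimate of $$\theta_0$$ on simulated data. It is not DoubleML's implementation. Everything in it is an assumption made for illustration: the scalar covariate, the choices $$m_0(x) = \sin(x)$$ and $$g_0(x) = x$$, the binned-mean learner standing in for a machine learning method, and the helper names.

```python
import math
import random


def simulate_plr(n_obs, theta, rng):
    """Draw (y, d, x) from a toy PLR model with one covariate (illustrative design)."""
    rows = []
    for _ in range(n_obs):
        x = rng.uniform(-2.0, 2.0)
        d = math.sin(x) + rng.gauss(0.0, 1.0)        # D = m_0(X) + V
        y = theta * d + x + rng.gauss(0.0, 1.0)      # Y = D*theta_0 + g_0(X) + zeta
        rows.append((y, d, x))
    return rows


def bin_index(x, n_bins=20, lo=-2.0, hi=2.0):
    """Map x to one of n_bins equal-width bins on [lo, hi]."""
    return min(n_bins - 1, max(0, int((x - lo) / (hi - lo) * n_bins)))


def fit_binned_mean(xs, targets, n_bins=20):
    """Toy stand-in for an ML learner: the average target within each bin of x."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, t in zip(xs, targets):
        b = bin_index(x, n_bins)
        sums[b] += t
        counts[b] += 1
    overall = sum(targets) / len(targets)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bin_index(x, n_bins)]


def plr_partialling_out(rows, n_folds=5):
    """Cross-fitted residual-on-residual estimate of theta_0."""
    num = den = 0.0
    for k in range(n_folds):
        # Nuisance functions are fitted on the other folds only.
        train = [r for i, r in enumerate(rows) if i % n_folds != k]
        test = [r for i, r in enumerate(rows) if i % n_folds == k]
        xs = [x for _, _, x in train]
        l_hat = fit_binned_mean(xs, [y for y, _, _ in train])   # approximates E[Y | X]
        m_hat = fit_binned_mean(xs, [d for _, d, _ in train])   # approximates E[D | X]
        for y, d, x in test:
            v_hat = d - m_hat(x)                 # residual of D
            num += v_hat * (y - l_hat(x))        # times residual of Y
            den += v_hat * v_hat
    return num / den


rows = simulate_plr(n_obs=4000, theta=0.5, rng=random.Random(1111))
theta_hat = plr_partialling_out(rows)
print(f"theta_hat = {theta_hat:.3f}")
```

Because both $$Y$$ and $$D$$ are residualized on $$X$$ before the final regression, the confounding that runs through $$X$$ is removed; regressing $$Y$$ on $$D$$ alone in this toy design would not recover $$\theta_0 = 0.5$$.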

DoubleMLPLR implements PLR models. Estimation is conducted via its fit() method:

In : import numpy as np

In : import doubleml as dml

In : from doubleml.datasets import make_plr_CCDDHNR2018

In : from sklearn.ensemble import RandomForestRegressor

In : from sklearn.base import clone

In : learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In : ml_l = clone(learner)

In : ml_m = clone(learner)

In : np.random.seed(1111)

In : data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In : obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In : dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m)

In : print(dml_plr_obj.fit())
================== DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.541398  0.042944  12.606942  1.933512e-36  0.457229  0.625568

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
lgr::get_logger("mlr3")$set_threshold("warn")

learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
set.seed(1111)
data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='data.table')
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_l, ml_m)
dml_plr_obj$fit()
print(dml_plr_obj)

================= DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s):
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.ranger
ml_m: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d   0.47659    0.04166   11.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



## 3.2. Partially linear IV regression model (PLIV)

Partially linear IV regression (PLIV) models take the form

\begin{align}\begin{aligned}Y - D \theta_0 = g_0(X) + \zeta, & &\mathbb{E}(\zeta | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align}

where $$Y$$ is the outcome variable, $$D$$ is the policy variable of interest and $$Z$$ denotes one or multiple instrumental variables. The high-dimensional vector $$X = (X_1, \ldots, X_p)$$ consists of other confounding covariates, and $$\zeta$$ and $$V$$ are stochastic errors.
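The role of the instrument can be illustrated with a minimal sketch on simulated data. It is not DoubleML's implementation. The design is an assumption made for illustration: a discrete covariate, so that conditional means reduce to cell averages, an unobserved confounder `u` that enters both $$D$$ and the error $$\zeta$$, and no cross-fitting for brevity. The sketch compares a residual-based IV estimate, which uses $$Z - \mathbb{E}[Z | X]$$ as the instrument, with a residual regression that ignores $$Z$$.

```python
import random


def simulate_pliv(n_obs, theta, rng):
    """Draw (y, d, z, x) from a toy PLIV design with a discrete covariate (illustrative)."""
    rows = []
    for _ in range(n_obs):
        x = rng.randrange(4)
        u = rng.gauss(0.0, 1.0)                  # unobserved confounder of D and Y
        z = 0.5 * x + rng.gauss(0.0, 1.0)        # Z = m_0(X) + V
        d = z + x + u + rng.gauss(0.0, 1.0)      # D depends on Z, X and u
        y = theta * d + x * x + 2.0 * u + rng.gauss(0.0, 1.0)   # zeta = 2u + noise
        rows.append((y, d, z, x))
    return rows


def cell_means(keys, values):
    """Conditional mean of values given a discrete key."""
    sums, counts = {}, {}
    for k, v in zip(keys, values):
        sums[k] = sums.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


def pliv_vs_no_iv(rows):
    """Return (IV estimate, residual regression without instrument) of theta_0."""
    xs = [x for _, _, _, x in rows]
    l_hat = cell_means(xs, [y for y, _, _, _ in rows])   # E[Y | X]
    r_hat = cell_means(xs, [d for _, d, _, _ in rows])   # E[D | X]
    m_hat = cell_means(xs, [z for _, _, z, _ in rows])   # E[Z | X]
    zy = zd = dy = dd = 0.0
    for y, d, z, x in rows:
        y_res, d_res, z_res = y - l_hat[x], d - r_hat[x], z - m_hat[x]
        zy += z_res * y_res
        zd += z_res * d_res
        dy += d_res * y_res
        dd += d_res * d_res
    return zy / zd, dy / dd


rows = simulate_pliv(n_obs=10000, theta=0.5, rng=random.Random(2222))
theta_iv, theta_no_iv = pliv_vs_no_iv(rows)
print(f"IV: {theta_iv:.3f}, without instrument: {theta_no_iv:.3f}")
```

In this design the estimate that ignores $$Z$$ is pulled away from $$\theta_0 = 0.5$$ because `u` moves $$D$$ and $$Y$$ together, while the IV estimate only uses the variation in $$D$$ that is driven by the instrument.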

DoubleMLPLIV implements PLIV models. Estimation is conducted via its fit() method:

In : import numpy as np

In : import doubleml as dml

In : from doubleml.datasets import make_pliv_CHS2015

In : from sklearn.ensemble import RandomForestRegressor

In : from sklearn.base import clone

In : learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In : ml_l = clone(learner)

In : ml_m = clone(learner)

In : ml_r = clone(learner)

In : np.random.seed(2222)

In : data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type='DataFrame')

In : obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='Z1')

In : dml_pliv_obj = dml.DoubleMLPLIV(obj_dml_data, ml_l, ml_m, ml_r)

In : print(dml_pliv_obj.fit())
================== DoubleMLPLIV Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['Z1']
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
coef   std err         t         P>|t|     2.5 %    97.5 %
d  0.470604  0.091988  5.115925  3.122075e-07  0.290311  0.650897

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
ml_r = learner$clone()
set.seed(2222)
data = make_pliv_CHS2015(alpha=0.5, n_obs=500, dim_x=20, dim_z=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d", z_cols="Z1")
dml_pliv_obj = DoubleMLPLIV$new(obj_dml_data, ml_l, ml_m, ml_r)
dml_pliv_obj$fit()
print(dml_pliv_obj)

================= DoubleMLPLIV Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s): Z1
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.ranger
ml_m: regr.ranger
ml_r: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d   0.66184    0.07786     8.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



## 3.3. Interactive regression model (IRM)

Interactive regression (IRM) models take the form

\begin{align}\begin{aligned}Y = g_0(D, X) + U, & &\mathbb{E}(U | X, D) = 0,\\D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align}

where the treatment variable is binary, $$D \in \lbrace 0,1 \rbrace$$. We consider estimation of the average treatment effects when treatment effects are fully heterogeneous. Target parameters of interest in this model are the average treatment effect (ATE),

$\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X)]$

and the average treatment effect of the treated (ATTE),

$\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X) | D=1].$
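The difference between the two target parameters can be seen in a minimal sketch on simulated data. It is a plug-in illustration of the two formulas, not a description of how DoubleMLIRM computes its estimates. The design is an assumption made for illustration: a discrete covariate with three values, a treatment probability $$m_0(x) = 0.2 + 0.3x$$, and $$g_0(d, x) = 2x + dx$$, so that the treatment effect grows with $$x$$ and units with larger effects are treated more often.

```python
import random


def simulate_irm(n_obs, rng):
    """Draw (y, d, x) from a toy IRM design with heterogeneous effects (illustrative)."""
    rows = []
    for _ in range(n_obs):
        x = rng.randrange(3)
        d = 1 if rng.random() < 0.2 + 0.3 * x else 0   # m_0(x) = P(D=1 | X=x)
        y = 2.0 * x + d * x + rng.gauss(0.0, 1.0)      # g_0(d, x) = 2x + d*x
        rows.append((y, d, x))
    return rows


def fit_g(rows):
    """Estimate g_0(d, x) by the average outcome in each (d, x) cell."""
    sums, counts = {}, {}
    for y, d, x in rows:
        sums[(d, x)] = sums.get((d, x), 0.0) + y
        counts[(d, x)] = counts.get((d, x), 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def plug_in_effects(rows):
    """Average g_hat(1, X) - g_hat(0, X) over everyone (ATE) and over the treated (ATTE)."""
    g_hat = fit_g(rows)
    diffs = [(g_hat[(1, x)] - g_hat[(0, x)], d) for _, d, x in rows]
    ate = sum(diff for diff, _ in diffs) / len(diffs)
    treated = [diff for diff, d in diffs if d == 1]
    atte = sum(treated) / len(treated)
    return ate, atte


rows = simulate_irm(n_obs=6000, rng=random.Random(3333))
ate_hat, atte_hat = plug_in_effects(rows)
print(f"ATE: {ate_hat:.3f}, ATTE: {atte_hat:.3f}")
```

In this design the population ATE is 1.0 and the ATTE is 1.4: the treated group over-represents large values of $$x$$, where the effect is largest, so conditioning on $$D=1$$ changes the average.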

DoubleMLIRM implements IRM models. Estimation is conducted via its fit() method:

In : import numpy as np

In : import doubleml as dml

In : from doubleml.datasets import make_irm_data

In : from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In : ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In : ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In : np.random.seed(3333)

In : data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

In : obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In : dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)

In : print(dml_irm_obj.fit())
================== DoubleMLIRM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
coef   std err         t     P>|t|    2.5 %    97.5 %
d  0.660971  0.188626  3.504133  0.000458  0.29127  1.030671

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

set.seed(3333)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
data = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
dml_irm_obj$fit()
print(dml_irm_obj)

================= DoubleMLIRM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s):
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.ranger
ml_m: classif.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d    0.6695     0.2097   3.192  0.00141 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



## 3.4. Interactive IV model (IIVM)

Interactive IV regression (IIVM) models take the form

\begin{align}\begin{aligned}Y = \ell_0(D, X) + \zeta, & &\mathbb{E}(\zeta | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0,\end{aligned}\end{align}

where the treatment variable is binary, $$D \in \lbrace 0,1 \rbrace$$ and the instrument is binary, $$Z \in \lbrace 0,1 \rbrace$$. Consider the functions $$g_0$$, $$r_0$$ and $$m_0$$, where $$g_0$$ maps the support of $$(Z,X)$$ to $$\mathbb{R}$$ and $$r_0$$ and $$m_0$$ respectively map the support of $$(Z,X)$$ and $$X$$ to $$(\varepsilon, 1-\varepsilon)$$ for some $$\varepsilon \in (0, 1/2)$$, such that

\begin{align}\begin{aligned}Y = g_0(Z, X) + \nu, & &\mathbb{E}(\nu | Z, X) = 0,\\D = r_0(Z, X) + U, & &\mathbb{E}(U | Z, X) = 0,\\Z = m_0(X) + V, & &\mathbb{E}(V | X) = 0.\end{aligned}\end{align}

The target parameter of interest in this model is the local average treatment effect (LATE),

$\theta_0 = \frac{\mathbb{E}[g_0(1, X)] - \mathbb{E}[g_0(0,X)]}{\mathbb{E}[r_0(1, X)] - \mathbb{E}[r_0(0,X)]}.$

DoubleMLIIVM implements IIVM models. Estimation is conducted via its fit() method:

In : import numpy as np

In : import doubleml as dml

In : from doubleml.datasets import make_iivm_data

In : from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In : ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In : ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In : ml_r = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In : np.random.seed(4444)

In : data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1.0, return_type='DataFrame')

In : obj_dml_data = dml.DoubleMLData(data, 'y', 'd', z_cols='z')

In : dml_iivm_obj = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r)

In : print(dml_iivm_obj.fit())
================== DoubleMLIIVM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): ['z']
No. Observations: 1000

------------------ Score & algorithm ------------------
Score function: LATE
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_r: RandomForestClassifier(max_depth=5, max_features=20, min_samples_leaf=2)

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
coef   std err         t     P>|t|     2.5 %    97.5 %
d  0.48992  0.222138  2.205476  0.027421  0.054537  0.925302

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

set.seed(4444)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_r = ml_m$clone()
data = make_iivm_data(theta=0.5, n_obs=1000, dim_x=20, alpha_x=1, return_type="data.table")
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d", z_cols="z")
dml_iivm_obj = DoubleMLIIVM$new(obj_dml_data, ml_g, ml_m, ml_r)
dml_iivm_obj$fit()
print(dml_iivm_obj)

================= DoubleMLIIVM Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20
Instrument(s): z
No. Observations: 1000

------------------ Score & algorithm ------------------
Score function: LATE
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.ranger
ml_m: classif.ranger
ml_r: classif.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d    0.3569     0.1990   1.793    0.073 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
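
The LATE ratio defined in this section can be illustrated with a minimal plug-in sketch on simulated data. It is not DoubleML's implementation. The design is an assumption made for illustration: a binary covariate, an instrument with $$m_0(x) = 0.4 + 0.2x$$, and three latent groups (compliers, always-takers, never-takers) with different treatment effects. The functions $$g_0$$ and $$r_0$$ are estimated by cell averages over $$(Z, X)$$.

```python
import random


def simulate_iivm(n_obs, rng):
    """Draw (y, d, z, x) from a toy design with compliers, always- and never-takers."""
    rows = []
    for _ in range(n_obs):
        x = rng.randrange(2)
        z = 1 if rng.random() < 0.4 + 0.2 * x else 0   # m_0(x) = P(Z=1 | X=x)
        kind = rng.random()
        if kind < 0.6:        # complier: treated if and only if Z = 1
            d, effect = z, 0.5
        elif kind < 0.8:      # always-taker: treated regardless of Z
            d, effect = 1, 2.0
        else:                 # never-taker: untreated regardless of Z
            d, effect = 0, 0.0
        y = x + effect * d + rng.gauss(0.0, 1.0)
        rows.append((y, d, z, x))
    return rows


def cell_means(rows, pick):
    """Mean of pick(row) within each (z, x) cell."""
    sums, counts = {}, {}
    for row in rows:
        key = (row[2], row[3])
        sums[key] = sums.get(key, 0.0) + pick(row)
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def late_plug_in(rows):
    """Ratio of the averaged contrasts in g_hat and r_hat between Z=1 and Z=0."""
    g_hat = cell_means(rows, lambda r: r[0])   # approximates E[Y | Z, X]
    r_hat = cell_means(rows, lambda r: r[1])   # approximates E[D | Z, X]
    num = sum(g_hat[(1, x)] - g_hat[(0, x)] for _, _, _, x in rows)
    den = sum(r_hat[(1, x)] - r_hat[(0, x)] for _, _, _, x in rows)
    return num / den


rows = simulate_iivm(n_obs=20000, rng=random.Random(4444))
late_hat = late_plug_in(rows)
print(f"LATE estimate: {late_hat:.3f}")
```

Always-takers and never-takers do not change their treatment status with $$Z$$, so they cancel out of both the numerator and the denominator. The ratio therefore targets the complier effect (0.5 in this design) rather than an average over all units.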