7. Variance estimation and confidence intervals for a causal parameter of interest#
7.1. Variance estimation#
Under regularity conditions the estimator \(\tilde{\theta}_0\) concentrates in a \(1/\sqrt{N}\)-neighborhood of \(\theta_0\) and the sampling error \(\sqrt{N}(\tilde{\theta}_0 - \theta_0)\) is approximately normal

\[\sqrt{N}(\tilde{\theta}_0 - \theta_0) \leadsto N(0, \sigma^2),\]

with mean zero and variance given by

\[\sigma^2 := J_0^{-2} \mathbb{E}\big(\psi^2(W; \theta_0, \eta_0)\big), \quad J_0 = \mathbb{E}\big(\psi_a(W; \eta_0)\big).\]

Estimates of the variance are obtained by

\[\hat{\sigma}^2 = \hat{J}_0^{-2} \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \big[\psi(W_i; \tilde{\theta}_0, \hat{\eta}_{0,k})\big]^2, \quad \hat{J}_0 = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \psi_a(W_i; \hat{\eta}_{0,k}).\]

An approximate confidence interval is given by

\[\Big[\tilde{\theta}_0 \pm \Phi^{-1}(1 - \alpha/2) \, \hat{\sigma} / \sqrt{N}\Big].\]
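To make the formulas concrete, the following sketch (not the DoubleML implementation) evaluates them for the simplest possible linear score, \(\psi(W; \theta) = W - \theta\) with \(\psi_a(W) = -1\), in which case the estimator is the sample mean and the plug-in variance reduces to the usual empirical variance:

```python
import numpy as np

# Illustrative sketch: for the score psi(W; theta) = W - theta with
# psi_a(W) = -1, theta_hat is the sample mean and the plug-in variance
# formula reduces to mean((W - theta_hat)^2).
rng = np.random.default_rng(42)
W = rng.normal(loc=0.5, scale=2.0, size=10_000)
N = W.size

theta_hat = W.mean()                      # solves mean(psi(W; theta)) = 0
psi = W - theta_hat                       # evaluated score
psi_a = -np.ones(N)                       # derivative component of the score

J0_hat = psi_a.mean()                     # estimate of J_0
sigma2_hat = np.mean(psi**2) / J0_hat**2  # estimate of sigma^2
se = np.sqrt(sigma2_hat / N)              # asymptotic standard error

# 95% CI: theta_hat +/- Phi^{-1}(0.975) * sigma_hat / sqrt(N)
z = 1.959964
ci = (theta_hat - z * se, theta_hat + z * se)
```

In the general cross-fitted case the sums run over the folds \(I_k\) with fold-specific nuisance estimates \(\hat{\eta}_{0,k}\); the one-sample version above omits that bookkeeping.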
As an example we consider a partially linear regression model (PLR) implemented in DoubleMLPLR.
In [1]: import doubleml as dml
In [2]: from doubleml.datasets import make_plr_CCDDHNR2018
In [3]: from sklearn.ensemble import RandomForestRegressor
In [4]: from sklearn.base import clone
In [5]: import numpy as np; np.random.seed(3141)
In [6]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
In [7]: ml_l = clone(learner)
In [8]: ml_m = clone(learner)
In [9]: data = make_plr_CCDDHNR2018(alpha=0.5, return_type='DataFrame')
In [10]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
In [11]: dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m)
In [12]: dml_plr_obj.fit();
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
lgr::get_logger("mlr3")$set_threshold("warn")
learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
set.seed(3141)
obj_dml_data = make_plr_CCDDHNR2018(alpha=0.5)
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_l, ml_m)
dml_plr_obj$fit()
The fit() method of DoubleMLPLR stores the estimate \(\tilde{\theta}_0\) in its coef attribute.
In [13]: print(dml_plr_obj.coef)
[0.46283393]
print(dml_plr_obj$coef)
d
0.5443965
The asymptotic standard error \(\hat{\sigma}/\sqrt{N}\) is stored in its se attribute.
In [14]: print(dml_plr_obj.se)
[0.04104737]
print(dml_plr_obj$se)
d
0.04512331
Additionally, the value of the \(t\)-statistic and the corresponding p-value are provided in the attributes t_stat and pval.
In [15]: print(dml_plr_obj.t_stat)
[11.27560486]
In [16]: print(dml_plr_obj.pval)
[1.73184249e-29]
print(dml_plr_obj$t_stat)
d
12.06464
print(dml_plr_obj$pval)
d
1.623681e-33
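The reported values can be reproduced by hand from coef and se: the \(t\)-statistic is the estimate divided by its standard error, and the two-sided p-value follows from the standard normal tail. A short check using the Python output shown above:

```python
from scipy.stats import norm

# Reproduce the reported t-statistic and p-value from the stored
# coefficient and standard error (numbers taken from the Python output above).
coef = 0.46283393
se = 0.04104737

t_stat = coef / se                 # under H0: theta_0 = 0
pval = 2 * norm.sf(abs(t_stat))    # two-sided p-value via the survival function

print(t_stat)  # approx. 11.2756, matching dml_plr_obj.t_stat
print(pval)    # approx. 1.73e-29, matching dml_plr_obj.pval
```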
Note

In Python, an overview of all these estimates, together with a 95 % confidence interval, is stored in the attribute summary. In R, a summary can be obtained by using the method summary(). The confint() method performs estimation of confidence intervals.
In [17]: print(dml_plr_obj.summary)
coef std err t P>|t| 2.5 % 97.5 %
d 0.462834 0.041047 11.275605 1.731842e-29 0.382383 0.543285
dml_plr_obj$summary()
dml_plr_obj$confint()
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 0.54440 0.04512 12.06 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      2.5 %    97.5 %
d 0.4559565 0.6328366
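The interval endpoints reported by confint() follow directly from the stored coefficient and standard error via \(\tilde{\theta}_0 \pm \Phi^{-1}(1 - \alpha/2)\,\hat{\sigma}/\sqrt{N}\). A quick check using the R output above:

```python
from scipy.stats import norm

# Reproduce the 95% confidence interval from coef and se
# (numbers taken from the R output above).
coef = 0.5443965
se = 0.04512331

q = norm.ppf(0.975)            # approx. 1.959964
ci_lower = coef - q * se
ci_upper = coef + q * se

print(ci_lower, ci_upper)  # approx. 0.4559565 0.6328366
```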
A more detailed overview of the fitted model, its specification, and the summary can be obtained via the string representation of the object.
In [18]: print(dml_plr_obj)
================== DoubleMLPLR Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2
------------------ Machine learner ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True
------------------ Fit summary ------------------
coef std err t P>|t| 2.5 % 97.5 %
d 0.462834 0.041047 11.275605 1.731842e-29 0.382383 0.543285
print(dml_plr_obj)
================= DoubleMLPLR Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s):
No. Observations: 500
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2
------------------ Machine learner ------------------
ml_l: regr.ranger
ml_m: regr.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 0.54440 0.04512 12.06 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1