# 7. Variance estimation and confidence intervals for a causal parameter of interest

## 7.1. Variance estimation

Under regularity conditions, the estimator $$\tilde{\theta}_0$$ concentrates in a $$1/\sqrt{N}$$-neighborhood of $$\theta_0$$, and the sampling error $$\sqrt{N}(\tilde{\theta}_0 - \theta_0)$$ is approximately normal:

$\sqrt{N}(\tilde{\theta}_0 - \theta_0) \leadsto N(0, \sigma^2),$

with mean zero and variance given by

\begin{align}
\sigma^2 &:= J_0^{-2} \mathbb{E}(\psi^2(W; \theta_0, \eta_0)),\\
J_0 &= \mathbb{E}(\psi_a(W; \eta_0)).
\end{align}

Estimates of the variance are obtained by

\begin{align}
\hat{\sigma}^2 &= \hat{J}_0^{-2} \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \big[\psi(W_i; \tilde{\theta}_0, \hat{\eta}_{0,k})\big]^2,\\
\hat{J}_0 &= \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \psi_a(W_i; \hat{\eta}_{0,k}).
\end{align}
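Once the score and its derivative have been evaluated on the cross-fitted folds, the plug-in estimator above reduces to a few lines of NumPy. The arrays `psi` and `psi_a` below are hypothetical stand-ins for the pooled score evaluations, not output of the DoubleML package:

```python
import numpy as np

# Hypothetical stand-ins for psi(W_i; theta_tilde, eta_hat_k) and
# psi_a(W_i; eta_hat_k), pooled over the K folds
rng = np.random.default_rng(3141)
psi = rng.normal(size=500)
psi_a = -np.ones(500)

J_hat = np.mean(psi_a)                   # \hat{J}_0: average score derivative
sigma2_hat = np.mean(psi**2) / J_hat**2  # \hat{\sigma}^2: plug-in variance
se = np.sqrt(sigma2_hat / len(psi))      # asymptotic standard error \hat{\sigma}/\sqrt{N}
```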

An approximate confidence interval is given by

$\big[\tilde{\theta}_0 \pm \Phi^{-1}(1 - \alpha/2) \hat{\sigma} / \sqrt{N}\big].$
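This is the usual two-sided normal interval; a minimal sketch using scipy's `norm.ppf` for $$\Phi^{-1}$$, with placeholder values for $$\tilde{\theta}_0$$, $$\hat{\sigma}$$, and $$N$$:

```python
import numpy as np
from scipy.stats import norm

theta_tilde, sigma_hat, N = 0.46, 0.92, 500  # placeholder values, not fitted output
alpha = 0.05

q = norm.ppf(1 - alpha / 2)                  # Phi^{-1}(1 - alpha/2), approx. 1.96
half_width = q * sigma_hat / np.sqrt(N)
ci = (theta_tilde - half_width, theta_tilde + half_width)
```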

As an example, we consider a partially linear regression model (PLR) implemented in DoubleMLPLR.

```python
import numpy as np
import doubleml as dml
from doubleml.datasets import make_plr_CCDDHNR2018
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import clone

np.random.seed(3141)
learner = RandomForestRegressor(n_estimators=100, max_features=20,
                                max_depth=5, min_samples_leaf=2)
ml_l = clone(learner)
ml_m = clone(learner)
data = make_plr_CCDDHNR2018(alpha=0.5, return_type='DataFrame')
obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m)
dml_plr_obj.fit()
```

```r
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

lgr::get_logger("mlr3")$set_threshold("warn")

learner = lrn("regr.ranger", num.trees = 100, mtry = 20,
              min.node.size = 2, max.depth = 5)
ml_l = learner$clone()
ml_m = learner$clone()
set.seed(3141)
obj_dml_data = make_plr_CCDDHNR2018(alpha = 0.5)
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_l, ml_m)
dml_plr_obj$fit()
```

The `fit()` method of DoubleMLPLR stores the estimate $$\tilde{\theta}_0$$ in its `coef` attribute.

```python
print(dml_plr_obj.coef)
```

```
[0.46283393]
```

```r
print(dml_plr_obj$coef)
```

```
        d
0.5443965
```

The asymptotic standard error $$\hat{\sigma}/\sqrt{N}$$ is stored in its `se` attribute.

```python
print(dml_plr_obj.se)
```

```
[0.04104737]
```

```r
print(dml_plr_obj$se)
```

```
         d
0.04512331
```

Additionally, the value of the $$t$$-statistic and the corresponding p-value are provided in the attributes `t_stat` and `pval`.

```python
print(dml_plr_obj.t_stat)
print(dml_plr_obj.pval)
```

```
[11.27560486]
[1.73184249e-29]
```

```r
print(dml_plr_obj$t_stat)
print(dml_plr_obj$pval)
```

```
       d
12.06464
           d
1.623681e-33
```

Note:

- In Python, an overview of all these estimates, together with a 95 % confidence interval, is stored in the attribute `summary`.
- In R, a summary can be obtained by using the method `summary()`. The `confint()` method performs estimation of confidence intervals.

```python
print(dml_plr_obj.summary)
```

```
       coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.462834  0.041047  11.275605  1.731842e-29  0.382383  0.543285
```

```r
dml_plr_obj$summary()
dml_plr_obj$confint()
```

```
Estimates and significance testing of the effect of target variables
   Estimate. Std. Error t value Pr(>|t|)
d    0.54440    0.04512   12.06   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

       2.5 %    97.5 %
d  0.4559565 0.6328366
```
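As a sanity check, the reported 95 % interval can be reproduced by hand from the `coef` and `se` attributes, using scipy's `norm.ppf` for the normal quantile. The numbers below are copied from the Python output shown above:

```python
from scipy.stats import norm

coef, se = 0.462834, 0.041047            # values from the Python fit above
q = norm.ppf(0.975)                      # two-sided 95 % normal quantile
lower, upper = coef - q * se, coef + q * se
print(round(lower, 6), round(upper, 6))  # -> 0.382383 0.543285, the 2.5 % / 97.5 % columns
```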

A more detailed overview of the fitted model, its specifications, and the summary can be obtained via the string representation of the object.

```python
print(dml_plr_obj)
```

```
================== DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
       coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.462834  0.041047  11.275605  1.731842e-29  0.382383  0.543285
```

```r
print(dml_plr_obj)
```

```
================= DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s):
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_l: regr.ranger
ml_m: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
Estimates and significance testing of the effect of target variables
   Estimate. Std. Error t value Pr(>|t|)
d    0.54440    0.04512   12.06   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```