7. Variance estimation and confidence intervals for a causal parameter of interest

7.1. Variance estimation

Under regularity conditions, the estimator \(\tilde{\theta}_0\) concentrates in a \(1/\sqrt{N}\)-neighborhood of \(\theta_0\), and the sampling error \(\sqrt{N}(\tilde{\theta}_0 - \theta_0)\) is approximately normal,

\[\sqrt{N}(\tilde{\theta}_0 - \theta_0) \leadsto N(0, \sigma^2),\]

with mean zero and variance given by

\[\sigma^2 := J_0^{-2} \, \mathbb{E}\big(\psi^2(W; \theta_0, \eta_0)\big), \qquad J_0 = \mathbb{E}\big(\psi_a(W; \eta_0)\big).\]

An estimate of the variance is obtained by

\[\hat{\sigma}^2 = \hat{J}_0^{-2} \, \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \big[\psi(W_i; \tilde{\theta}_0, \hat{\eta}_{0,k})\big]^2, \qquad \hat{J}_0 = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \psi_a(W_i; \hat{\eta}_{0,k}).\]
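As an illustration, the following minimal numpy sketch implements this plug-in variance estimate; the arrays psi and psi_a are hypothetical inputs collecting the cross-fitted score values \(\psi(W_i; \tilde{\theta}_0, \hat{\eta}_{0,k})\) and derivatives \(\psi_a(W_i; \hat{\eta}_{0,k})\) over all observations (they are not part of the DoubleML API):

import numpy as np

def plugin_variance(psi, psi_a):
    # psi:   score values psi(W_i; theta_tilde, eta_hat_k), shape (N,)
    # psi_a: derivative values psi_a(W_i; eta_hat_k), shape (N,)
    J_hat = np.mean(psi_a)                 # estimate of J_0
    return np.mean(psi ** 2) / J_hat ** 2  # sigma^2_hat = J_hat^{-2} * mean(psi^2)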

An approximate \((1 - \alpha)\) confidence interval is then given by

\[\big[\tilde{\theta}_0 \pm \Phi^{-1}(1 - \alpha/2) \, \hat{\sigma} / \sqrt{N}\big].\]
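A minimal sketch of this interval computation, assuming the estimates theta_tilde and sigma_hat and the sample size n_obs are given (the helper name ci_normal is hypothetical):

import numpy as np
from scipy.stats import norm

def ci_normal(theta_tilde, sigma_hat, n_obs, alpha=0.05):
    # Phi^{-1}(1 - alpha/2): standard normal quantile for a two-sided interval
    z = norm.ppf(1 - alpha / 2)
    half_width = z * sigma_hat / np.sqrt(n_obs)
    return theta_tilde - half_width, theta_tilde + half_width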

As an example, we consider the partially linear regression model (PLR) implemented in DoubleMLPLR.

In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.datasets import make_plr_CCDDHNR2018

In [4]: from sklearn.ensemble import RandomForestRegressor

In [5]: from sklearn.base import clone

In [6]: np.random.seed(3141)

In [7]: learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

In [8]: ml_g = clone(learner)

In [9]: ml_m = clone(learner)

In [10]: data = make_plr_CCDDHNR2018(alpha=0.5, return_type='DataFrame')

In [11]: obj_dml_data = dml.DoubleMLData(data, 'y', 'd')

In [12]: dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_g, ml_m)

In [13]: dml_plr_obj.fit();
The same example in R:

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
lgr::get_logger("mlr3")$set_threshold("warn")

learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_g = learner$clone()
ml_m = learner$clone()

set.seed(3141)
obj_dml_data = make_plr_CCDDHNR2018(alpha=0.5)
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_g, ml_m)
dml_plr_obj$fit()

The fit() method of DoubleMLPLR stores the estimate \(\tilde{\theta}_0\) in its coef attribute.

In [14]: print(dml_plr_obj.coef)
[0.46280962]

In R:

print(dml_plr_obj$coef)
        d 
0.5443965 

The asymptotic standard error \(\hat{\sigma}/\sqrt{N}\) is stored in its se attribute.

In [15]: print(dml_plr_obj.se)
[0.04105558]

In R:

print(dml_plr_obj$se)
         d 
0.04512331 

Additionally, the value of the \(t\)-statistic and the corresponding p-value are provided in the attributes t_stat and pval.

In [16]: print(dml_plr_obj.t_stat)
[11.27275855]

In [17]: print(dml_plr_obj.pval)
[1.78876317e-29]

In R:

print(dml_plr_obj$t_stat)
       d 
12.06464 

print(dml_plr_obj$pval)
           d 
1.623681e-33 
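Both statistics follow directly from coef and se; the following sketch reproduces them with the usual two-sided normal test (which may differ in minor numerical details from the library's internals):

import numpy as np
from scipy.stats import norm

t_stat = dml_plr_obj.coef / dml_plr_obj.se  # should match dml_plr_obj.t_stat
pval = 2 * norm.sf(np.abs(t_stat))          # two-sided p-value under asymptotic normality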

Note

  • In Python, an overview of all these estimates, together with a 95% confidence interval, is stored in the attribute summary.

  • In R, a summary can be obtained by using the method summary(). The confint() method computes confidence intervals.

In [18]: print(dml_plr_obj.summary)
      coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.46281  0.041056  11.272759  1.788763e-29  0.382342  0.543277
In R:

dml_plr_obj$summary()
dml_plr_obj$confint()

Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)    
d   0.54440    0.04512   12.06   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

       2.5 %    97.5 %
d  0.4559565 0.6328366
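In Python, confidence intervals can also be computed directly with the confint() method; a short usage sketch, assuming it accepts a level argument for the confidence level (check the doubleml version you have installed):

# two-sided confidence intervals from the fitted Python object
print(dml_plr_obj.confint(level=0.95))
print(dml_plr_obj.confint(level=0.99))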

A more detailed overview of the fitted model, its specifications, and the summary can be obtained via the string representation of the object.

In [19]: print(dml_plr_obj)
================== DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']
Instrument variable(s): None
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
      coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.46281  0.041056  11.272759  1.788763e-29  0.382342  0.543277
In R:

print(dml_plr_obj)
================= DoubleMLPLR Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
Instrument(s): 
No. Observations: 500

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
ml_g: regr.ranger
ml_m: regr.ranger

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE

------------------ Fit summary       ------------------
 Estimates and significance testing of the effect of target variables
  Estimate. Std. Error t value Pr(>|t|)    
d   0.54440    0.04512   12.06   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1