Python: Sensitivity Analysis#

This notebook illustrates the sensitivity analysis tools with the partiallly linear regression model (PLR). The DoubleML package implements sensitivity analysis based on Chernozhukov et al. (2022).

Simulation Example#

For illustration purposes, we will work with generated data. This enables us to set the counfounding strength, such that we can correctly access quality of e.g. the robustness values.

[1]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

import doubleml as dml
from doubleml.datasets import make_confounded_plr_data, fetch_401K

Data#

The data will be generated via make_confounded_plr_data and set the confounding values to cf_y=0.1 and cf_d=0.1.

Both parameters determine the strength of the confounding

  • cf_y measures the proportion of residual variance in the outcome explained by unobserved confounders

  • cf_d measires the porportion of residual variance of the treatment representer generated by unobserved confounders.

For further details, see Chernozhukov et al. (2022) or User guide. Further, the true treatment effect is set to theta=5.0.

[2]:
cf_y = 0.1
cf_d = 0.1
theta = 5.0
[3]:
np.random.seed(42)
dpg_dict = make_confounded_plr_data(n_obs=1000, cf_y=cf_y, cf_d=cf_d, theta=theta)
x_cols = [f'X{i + 1}' for i in np.arange(dpg_dict['x'].shape[1])]
df = pd.DataFrame(np.column_stack((dpg_dict['x'], dpg_dict['y'], dpg_dict['d'])), columns=x_cols + ['y', 'd'])
dml_data = dml.DoubleMLData(df, 'y', 'd')
print(dml_data)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4']
Instrument variable(s): None
No. Observations: 1000

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 6 entries, X1 to d
dtypes: float64(6)
memory usage: 47.0 KB

DoubleML Object#

We fit a DoubleMLPLR object and use basic random forest to flexibly estimate the nuisance elements.

[4]:
np.random.seed(42)
dml_obj = dml.DoubleMLPLR(dml_data,
                          ml_l=RandomForestRegressor(),
                          ml_m=RandomForestRegressor(),
                          n_folds=5,
                          score='partialling out')
dml_obj.fit()
print(dml_obj)
================== DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4']
Instrument variable(s): None
No. Observations: 1000

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor()
Learner ml_m: RandomForestRegressor()
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[11.82684324]]
Learner ml_m RMSE: [[1.11552911]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
       coef   std err        t         P>|t|     2.5 %    97.5 %
d  4.130122  0.455448  9.06827  1.209219e-19  3.237461  5.022783

The effect estimate is biased due to the unobserved confounding and the corresponding confidence interval does not cover the true parameter theta=5.0

[5]:
print(dml_obj.confint())
      2.5 %    97.5 %
d  3.237461  5.022783

Sensitivity Analysis#

To perform a sensitivity analysis with the DoubleML package you can use the sensitivity_analysis() method. The sensitivity analysis is based on the strength of the confounding cf_y and cf_d (default values \(0.03\)) and the parameter rho, which measures the correlation between the difference of the long and short form of the outcome regression and the Riesz representer (the default value \(1.0\) is conservative and considers adversarial counfounding). To additionally incorporate statistical uncertainty, a significance level (default \(0.95\)) is used.

These input parameters are used to calculate upper and lower bounds (including the corresponding confidence level) on the treatment effect estimate.

Further, the sensitivity analysis includes a null hypothesis (default \(0.0\)), which is used to compute robustness values. The robustness value \(RV\) is defined as the required confounding strength (cf_y=rv and cf_d=rv), such that the lower or upper bound of the causal parameter includes the null hypothesis. The robustness value \(RVa\) defined analogous but additionally incorporates statistical uncertainty (as it is based on the confidence intervals of the bounds). For a more detailed explanation see the User Guide or Chernozhukov et al. (2022).

The results of the analysis can be printed via the sensitivity_summary property or directly accessed via the sensitivity_params property.

[6]:
dml_obj.sensitivity_analysis()
print(dml_obj.sensitivity_summary)
================== Sensitivity Analysis ==================

------------------ Scenario          ------------------
Significance Level: level=0.95
Sensitivity parameters: cf_y=0.03; cf_d=0.03, rho=1.0

------------------ Bounds with CI    ------------------
   CI lower  theta lower     theta  theta upper  CI upper
0  3.098308     3.832693  4.130122     4.427551  5.192526

------------------ Robustness Values ------------------
   H_0     RV (%)    RVa (%)
0  0.0  34.287815  29.784405
[7]:
dml_obj.sensitivity_params
[7]:
{'theta': {'lower': array([3.8326928]), 'upper': array([4.42755087])},
 'se': {'lower': array([0.44647451]), 'upper': array([0.46507214])},
 'ci': {'lower': array([3.09830758]), 'upper': array([5.19252647])},
 'rv': array([0.34287815]),
 'rva': array([0.29784405]),
 'input': {'cf_y': 0.03,
  'cf_d': 0.03,
  'rho': 1.0,
  'level': 0.95,
  'null_hypothesis': array([0.])}}

The robustness value \(RV\) means that unobserved counfounders would have to account for \(34\%\) of the residual variation (in treatment and outcome) to explain away the treatment effect.

We can also consider a contour plot to consider different values of confounding cf_y and cf_d. The contour plot and robustness values are based on the upper or lower bounds based on the null hypothesis (in this case this results in the lower bound).

Remark that both, the robustness values and the contour plot use the prespecified value of rho and the null_hypothesis.

[8]:
dml_obj.sensitivity_plot()