Note: Download the Jupyter notebook from https://docs.doubleml.org/stable/examples/py_double_ml_basic_iv.ipynb
Python: Basic Instrumental Variables calculation
In this example we show how to use the DoubleML functionality for instrumental variables (IVs) in the basic setting shown in the graph below, where:
Z is the instrument
C is a vector of unobserved confounders
D is the decision or treatment variable
Y is the outcome
We will first generate synthetic data using linear models compatible with the diagram, and then use the DoubleML package to estimate the causal effect of D on Y.
We assume that you have basic knowledge of instrumental variables and linear regression.
[1]:
from numpy.random import seed, normal, binomial, uniform
from pandas import DataFrame
from sklearn.linear_model import LinearRegression, LogisticRegression
import doubleml as dml
seed(1234)
Instrumental Variables Directed Acyclic Graph (IV - DAG)
[Figure: DAG with arrows Z → D, D → Y, C → D, and C → Y.]
Data Simulation
This code generates n samples in which there is a single binary confounder. The treatment is also a binary variable, while the outcome is a continuous linear model.
The quantity we want to recover using IVs is decision_impact, the causal effect of the decision variable on the outcome.
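Restating the simulation below in equation form (with $\mathbf{1}\{\cdot\}$ the indicator function and $U$ a uniform draw), the data-generating process is

$$D = \mathbf{1}\{\,U \le 0.7\,Z + 0.4\,C\,\}, \qquad U \sim \text{Uniform}(0, 1),$$
$$Y = 30 - 2\,D + 10\,C + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 2^2),$$

so the true effect we want to recover is decision_impact = -2.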
[2]:
n = 1000
instrument_impact = 0.7
decision_impact = -2

# Binary unobserved confounder and binary instrument
confounder = binomial(1, 0.3, n)
instrument = binomial(1, 0.5, n)

# The decision depends on both the instrument and the confounder
decision = (uniform(0, 1, n) <= instrument_impact*instrument + 0.4*confounder).astype(int)

# The outcome depends on the decision and the confounder
outcome = 30 + decision_impact*decision + 10*confounder + normal(0, 2, n)

df = DataFrame({
    'instrument': instrument,
    'decision': decision,
    'outcome': outcome
})
Naive estimation
We can see that if we directly estimate the impact of the decision on the outcome, through the difference of the average outcomes between the two decision groups, we obtain a biased estimate.
[3]:
# Difference in mean outcomes between the two decision groups
outcome_1 = df[df.decision==1].outcome.mean()
outcome_0 = df[df.decision==0].outcome.mean()
print(outcome_1 - outcome_0)
1.1099472942084532
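The bias arises because the confounder both increases the outcome (by 10) and makes the decision more likely. Since we simulated the confounder ourselves, we can verify this imbalance directly; with real data this check would be impossible because C is unobserved:

# Share of confounded units in each decision group (diagnostic only;
# the confounder would not be observable in a real application)
print(confounder[decision == 1].mean())
print(confounder[decision == 0].mean())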
Using DoubleML
DoubleML assumes that there is at least one observed confounder. For this reason, we create a dummy variable, obs_confounders, that does not bring any information to the model.
To use DoubleML we need to specify the machine learning methods we want to use to estimate the different relationships between variables:

ml_g models the functional relationship between the outcome and the pair instrument and observed confounders obs_confounders. In this case we choose a LinearRegression because the outcome is continuous.

ml_m models the functional relationship between the instrument and the observed confounders obs_confounders. In this case we choose a LogisticRegression because the instrument is dichotomic.

ml_r models the functional relationship between the decision and the pair instrument and observed confounders obs_confounders. In this case we choose a LogisticRegression because the decision is dichotomic.
Notice that instead of using linear and logistic regression, we could use more flexible models capable of dealing with non-linearities, such as random forests or boosting, as sketched below.
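For illustration, swapping in random forests could look like the following sketch (not part of the original example; the choice of learners and n_estimators is an assumption for demonstration):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Flexible, non-linear learners; hyperparameters are illustrative
ml_g_flex = RandomForestRegressor(n_estimators=100)
ml_m_flex = RandomForestClassifier(n_estimators=100)
ml_r_flex = RandomForestClassifier(n_estimators=100)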
[4]:
# Constant dummy covariate: DoubleML requires at least one observed confounder
df['obs_confounders'] = 1

ml_g = LinearRegression()
ml_m = LogisticRegression(penalty=None)
ml_r = LogisticRegression(penalty=None)

obj_dml_data = dml.DoubleMLData(
    df, y_col='outcome', d_cols='decision',
    z_cols='instrument', x_cols='obs_confounders'
)
dml_iivm_obj = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r)
print(dml_iivm_obj.fit().summary)
coef std err t P>|t| 2.5 % 97.5 %
decision -1.950545 0.487872 -3.998063 0.000064 -2.906757 -0.994332
We can see that the causal effect is estimated without bias: the point estimate of about -1.95 is close to the true decision_impact of -2, and the confidence interval covers it.
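As a cross-check (not part of the original example), the classical Wald estimator for a binary instrument, the ratio of the reduced-form effect of Z on Y to the first-stage effect of Z on D, should give a similar point estimate:

# Wald / simple IV estimate: (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0])
y_diff = df[df.instrument == 1].outcome.mean() - df[df.instrument == 0].outcome.mean()
d_diff = df[df.instrument == 1].decision.mean() - df[df.instrument == 0].decision.mean()
print(y_diff / d_diff)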