{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python: Basic Instrumental Variables calculation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we show how to use the DoubleML functionality of Instrumental Variables (IVs) in the basic setting shown in the graph below, where:\n", "\n", "- Z is the instrument\n", "- C is a vector of unobserved confounders\n", "- D is the decision or treatment variable\n", "- Y is the outcome\n", "\n", "So, we will first generate synthetic data using linear models compatible with the diagram, and then use the DoubleML package to estimate the causal effect from D to Y. \n", "\n", "We assume that you have basic knowledge of instrumental variables and linear regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from numpy.random import seed, normal, binomial, uniform\n", "from pandas import DataFrame\n", "from sklearn.linear_model import LinearRegression, LogisticRegression\n", "import doubleml as dml\n", "\n", "seed(1234)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Instrumental Variables Directed Acyclic Graph (IV - DAG)" ] }, { "cell_type": "markdown", "id": "14d56698", "metadata": {}, "source": [ "![basic_iv_example_nb.png](../_static/basic_iv_example_nb.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Simulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code generates `n` samples in which there is a unique binary confounder. The treatment is also a binary variable, while the outcome is a continuous linear model. \n", "\n", "The quantity we want to recover using IVs is the `decision_impact`, which is the impact of the decision variable into the outcome. " ] }, { "cell_type": "code", "execution_count": null, "id": "5f8b1555", "metadata": {}, "outputs": [], "source": [ "n = 1000\n", "instrument_impact = 0.7\n", "decision_impact = - 2\n", "\n", "confounder = binomial(1, 0.3, n)\n", "instrument = binomial(1, 0.5, n)\n", "decision = (uniform(0, 1, n) <= instrument_impact*instrument + 0.4*confounder).astype(int)\n", "outcome = 30 + decision_impact*decision + 10 * confounder + normal(0, 2, n)\n", "\n", "df = DataFrame({\n", " 'instrument': instrument,\n", " 'decision': decision,\n", " 'outcome': outcome\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Naive estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that if we make a direct estimation of the impact of the `decision` into the `outcome`, though the difference of the averages of outcomes between the two decision groups, we obtain a biased estimate. " ] }, { "cell_type": "code", "execution_count": null, "id": "2d00221a", "metadata": {}, "outputs": [], "source": [ "outcome_1 = df[df.decision==1].outcome.mean()\n", "outcome_0 = df[df.decision==0].outcome.mean()\n", "print(outcome_1 - outcome_0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using DoubleML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DoubleML assumes that there is at least one observed confounder. For this reason, we create a fake variable that doesn't bring any kind of information to the model, called `obs_confounder`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the DoubleML we need to specify the Machine Learning methods we want to use to estimate the different relationships between variables:\n", "\n", "- `ml_g` models the functional relationship betwen the `outcome` and the pair `instrument` and observed confounders `obs_confounders`. In this case we choose a `LinearRegression` because the outcome is continuous. \n", "- `ml_m` models the functional relationship betwen the `obs_confounders` and the `instrument`. In this case we choose a `LogisticRegression` because the outcome is dichotomic.\n", "- `ml_r` models the functional relationship betwen the `decision` and the pair `instrument` and observed confounders `obs_confounders`. In this case we choose a `LogisticRegression` because the outcome is dichotomic.\n", "\n", "\n", "Notice that instead of using linear and logistic regression, we could use more flexible models capable of dealing with non-linearities such as random forests, boosting, ... " ] }, { "cell_type": "code", "execution_count": null, "id": "600b8196", "metadata": {}, "outputs": [], "source": [ "df['obs_confounders'] = 1\n", "\n", "ml_g = LinearRegression()\n", "ml_m = LogisticRegression(penalty=None)\n", "ml_r = LogisticRegression(penalty=None)\n", "\n", "obj_dml_data = dml.DoubleMLData(\n", " df, y_col='outcome', d_cols='decision', \n", " z_cols='instrument', x_cols='obs_confounders'\n", ")\n", "dml_iivm_obj = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r)\n", "print(dml_iivm_obj.fit().summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the causal effect is estimated without bias." ] }, { "cell_type": "markdown", "id": "c84ca8b9", "metadata": {}, "source": [ "## References\n", "\n", "Ruiz de Villa, A. Causal Inference for Data Science, Manning Publications, 2024." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }