{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python: Basic Instrumental Variables calculation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we show how to use the DoubleML functionality of Instrumental Variables (IVs) in the basic setting shown in the graph below, where:\n",
    "\n",
    "- Z is the instrument\n",
    "- C is a vector of unobserved confounders\n",
    "- D is the decision or treatment variable\n",
    "- Y is the outcome\n",
    "\n",
    "So, we will first generate synthetic data using linear models compatible with the diagram, and then use the DoubleML package to estimate the causal effect from D to Y. \n",
    "\n",
    "We assume that you have basic knowledge of instrumental variables and linear regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.random import seed, normal, binomial, uniform\n",
    "from pandas import DataFrame\n",
    "from sklearn.linear_model import LinearRegression, LogisticRegression\n",
    "import doubleml as dml\n",
    "\n",
    "seed(1234)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instrumental Variables Directed Acyclic Graph (IV - DAG)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14d56698",
   "metadata": {},
   "source": [
    "![basic_iv_example_nb.png](../_static/basic_iv_example_nb.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Simulation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code generates `n` samples in which there is a unique binary confounder. The treatment is also a binary variable, while the outcome is a continuous linear model. \n",
    "\n",
    "The quantity we want to recover using IVs is the `decision_impact`, which is the impact of the decision variable into the outcome. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f8b1555",
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 1000\n",
    "instrument_impact = 0.7\n",
    "decision_impact = - 2\n",
    "\n",
    "confounder = binomial(1, 0.3, n)\n",
    "instrument = binomial(1, 0.5, n)\n",
    "decision = (uniform(0, 1, n) <= instrument_impact*instrument + 0.4*confounder).astype(int)\n",
    "outcome = 30 + decision_impact*decision + 10 * confounder + normal(0, 2, n)\n",
    "\n",
    "df = DataFrame({\n",
    "    'instrument': instrument,\n",
    "    'decision': decision,\n",
    "    'outcome': outcome\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Naive estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that if we make a direct estimation of the impact of the `decision` into the `outcome`, though the difference of the averages of outcomes between the two decision groups, we obtain a biased estimate. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d00221a",
   "metadata": {},
   "outputs": [],
   "source": [
    "outcome_1 = df[df.decision==1].outcome.mean()\n",
    "outcome_0 = df[df.decision==0].outcome.mean()\n",
    "print(outcome_1 - outcome_0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using DoubleML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DoubleML assumes that there is at least one observed confounder. For this reason, we create a fake variable that doesn't bring any kind of information to the model, called `obs_confounder`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use the DoubleML we need to specify the Machine Learning methods we want to use to estimate the different relationships between variables:\n",
    "\n",
    "- `ml_g` models the functional relationship betwen the `outcome` and the pair `instrument` and observed confounders `obs_confounders`. In this case we choose a `LinearRegression` because the outcome is continuous. \n",
    "- `ml_m` models the functional relationship betwen the `obs_confounders` and the `instrument`. In this case we choose a `LogisticRegression` because the outcome is dichotomic.\n",
    "- `ml_r` models the functional relationship betwen the `decision` and the pair `instrument` and observed confounders `obs_confounders`. In this case we choose a `LogisticRegression` because the outcome is dichotomic.\n",
    "\n",
    "\n",
    "Notice that instead of using linear and logistic regression, we could use more flexible models capable of dealing with non-linearities such as random forests, boosting, ... "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "600b8196",
   "metadata": {},
   "outputs": [],
   "source": [
    "df['obs_confounders'] = 1\n",
    "\n",
    "ml_g = LinearRegression()\n",
    "ml_m = LogisticRegression(penalty=None)\n",
    "ml_r = LogisticRegression(penalty=None)\n",
    "\n",
    "obj_dml_data = dml.DoubleMLData(\n",
    "    df, y_col='outcome', d_cols='decision', \n",
    "    z_cols='instrument', x_cols='obs_confounders'\n",
    ")\n",
    "dml_iivm_obj = dml.DoubleMLIIVM(obj_dml_data, ml_g, ml_m, ml_r)\n",
    "print(dml_iivm_obj.fit().summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the causal effect is estimated without bias."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c84ca8b9",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "Ruiz de Villa, A. Causal Inference for Data Science, Manning Publications, 2024."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}