{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e401eb2c",
   "metadata": {},
   "source": [
    "# R: Basic Instrumental Variables Calculation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we show how to use the DoubleML functionality of Instrumental Variables (IVs) in the basic setting shown in the graph below, where:\n",
    "\n",
    "- Z is the instrument\n",
    "- C is a vector of unobserved confounders\n",
    "- D is the decision or treatment variable\n",
    "- Y is the outcome\n",
    "\n",
    "So, we will first generate synthetic data using linear models compatible with the diagram, and then use the DoubleML package to estimate the causal effect from D to Y. \n",
    "\n",
    "We assume that you have basic knowledge of instrumental variables and linear regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "library(DoubleML)\n",
    "library(mlr3learners)\n",
    "\n",
    "set.seed(1234)\n",
    "options(warn=-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instrumental Variables Directed Acyclic Graph (IV - DAG)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3250ef4",
   "metadata": {},
   "source": [
    "![basic_iv_example_nb.png](../_static/basic_iv_example_nb.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Simulation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code generates `n` samples in which there is a unique binary confounder. The treatment is also a binary variable, while the outcome is a continuous linear model. \n",
    "\n",
    "The quantity we want to recover using IVs is the `decision_impact`, which is the impact of the decision variable into the outcome. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f8b1555",
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "n <- 10000\n",
    "decision_effect <- -2\n",
    "instrument_effect <- 0.7\n",
    "\n",
    "confounder <- rbinom(n, 1, 0.3)\n",
    "instrument <- rbinom(n, 1, 0.5)\n",
    "decision <- as.numeric(runif(n) <= instrument_effect*instrument + 0.4*confounder)\n",
    "outcome <- 30 + decision_effect*decision + 10 * confounder + rnorm(n, sd=2)\n",
    "df <- data.frame(instrument, decision, outcome)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Naive estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that if we make a direct estimation of the impact of the `decision` into the `outcome`, though the difference of the averages of outcomes between the two decision groups, we obtain a biased estimate. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d00221a",
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "mean(df[df$decision==1, 'outcome']) - mean(df[df$decision==0, 'outcome'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using DoubleML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DoubleML assumes that there is at least one observed confounder. For this reason, we create a fake variable that doesn't bring any kind of information to the model, called `obs_confounder`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use the DoubleML we need to specify the Machine Learning methods we want to use to estimate the different relationships between variables:\n",
    "\n",
    "- `ml_g` models the functional relationship betwen the `outcome` and the pair `instrument` and observed confounders `obs_confounders`. In this case we choose a `LinearRegression` because the outcome is continuous. \n",
    "- `ml_m` models the functional relationship betwen the `obs_confounders` and the `instrument`. In this case we choose a `LogisticRegression` because the outcome is dichotomic.\n",
    "- `ml_r` models the functional relationship betwen the `decision` and the pair `instrument` and observed confounders `obs_confounders`. In this case we choose a `LogisticRegression` because the outcome is dichotomic.\n",
    "\n",
    "\n",
    "Notice that instead of using linear and logistic regression, we could use more flexible models capable of dealing with non-linearities such as random forests, boosting, ... "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "600b8196",
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "df['obs_confounders'] <- 1\n",
    "\n",
    "obj_dml_data = DoubleMLData$new(\n",
    "  df, y_col=\"outcome\", d_col = \"decision\", \n",
    "  z_cols= \"instrument\", x_cols = \"obs_confounders\"\n",
    ")\n",
    "\n",
    "ml_g = lrn(\"regr.lm\")\n",
    "ml_m = lrn(\"classif.log_reg\")\n",
    "ml_r = ml_m$clone()\n",
    "\n",
    "iv_2 = DoubleMLIIVM$new(obj_dml_data, ml_g, ml_m, ml_r)\n",
    "result <- iv_2$fit()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the causal effect is estimated without bias."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc9390cd",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "Ruiz de Villa, A. Causal Inference for Data Science, Manning Publications, 2024."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "4.2.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}