{ "cells": [ { "cell_type": "markdown", "id": "e401eb2c", "metadata": {}, "source": [ "# R: Basic Instrumental Variables Calculation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we show how to use the instrumental variables (IV) functionality of DoubleML in the basic setting shown in the graph below, where:\n", "\n", "- Z is the instrument\n", "- C is a vector of unobserved confounders\n", "- D is the decision or treatment variable\n", "- Y is the outcome\n", "\n", "We first generate synthetic data using linear models compatible with the diagram, and then use the DoubleML package to estimate the causal effect of D on Y.\n", "\n", "We assume that you have basic knowledge of instrumental variables and linear regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "library(DoubleML)\n", "library(mlr3learners)\n", "\n", "set.seed(1234)\n", "# Suppress warnings to keep the notebook output readable\n", "options(warn=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Instrumental Variables Directed Acyclic Graph (IV - DAG)" ] }, { "cell_type": "markdown", "id": "a3250ef4", "metadata": {}, "source": [ "![basic_iv_example_nb.png](../_static/basic_iv_example_nb.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Simulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code generates `n` samples with a single binary confounder. The treatment is also binary, while the outcome is continuous and generated by a linear model.\n", "\n", "The quantity we want to recover using IVs is `decision_effect`, the effect of the decision variable on the outcome. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "5f8b1555", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "n <- 10000\n", "decision_effect <- -2  # true effect of the decision on the outcome\n", "instrument_effect <- 0.7  # shift in decision probability caused by the instrument\n", "\n", "# The confounder drives the simulation but is left out of df (unobserved)\n", "confounder <- rbinom(n, 1, 0.3)\n", "instrument <- rbinom(n, 1, 0.5)\n", "decision <- as.numeric(runif(n) <= instrument_effect*instrument + 0.4*confounder)\n", "outcome <- 30 + decision_effect*decision + 10 * confounder + rnorm(n, sd=2)\n", "df <- data.frame(instrument, decision, outcome)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Naive estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that a direct estimate of the effect of the `decision` on the `outcome`, computed as the difference in average outcomes between the two decision groups, is biased because the unobserved confounder affects both variables. " ] }, { "cell_type": "code", "execution_count": null, "id": "2d00221a", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "mean(df[df$decision==1, 'outcome']) - mean(df[df$decision==0, 'outcome'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using DoubleML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DoubleML assumes that there is at least one observed confounder. For this reason, we create a constant placeholder variable, called `obs_confounders`, that carries no information for the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use DoubleML we need to specify the machine learning methods used to estimate the relationships between variables:\n", "\n", "- `ml_g` models the functional relationship between the `outcome` and the pair of `instrument` and observed confounders `obs_confounders`. In this case we choose a linear regression (`regr.lm`) because the outcome is continuous. \n", "- `ml_m` models the functional relationship between the `obs_confounders` and the `instrument`. 
In this case we choose a logistic regression (`classif.log_reg`) because the instrument is binary.\n", "- `ml_r` models the functional relationship between the `decision` and the pair of `instrument` and observed confounders `obs_confounders`. In this case we choose a logistic regression (`classif.log_reg`) because the decision is binary.\n", "\n", "\n", "Notice that instead of linear and logistic regression, we could use more flexible models capable of dealing with non-linearities, such as random forests or boosting." ] }, { "cell_type": "code", "execution_count": null, "id": "600b8196", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Constant placeholder covariate: carries no information\n", "df['obs_confounders'] <- 1\n", "\n", "obj_dml_data = DoubleMLData$new(\n", " df, y_col=\"outcome\", d_cols = \"decision\", \n", " z_cols= \"instrument\", x_cols = \"obs_confounders\"\n", ")\n", "\n", "# Learners for the outcome (ml_g), the instrument (ml_m) and the decision (ml_r)\n", "ml_g = lrn(\"regr.lm\")\n", "ml_m = lrn(\"classif.log_reg\")\n", "ml_r = ml_m$clone()\n", "\n", "iv_2 = DoubleMLIIVM$new(obj_dml_data, ml_g, ml_m, ml_r)\n", "result <- iv_2$fit()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The estimated coefficient should be close to the true `decision_effect` of -2, unlike the biased naive estimate." ] }, { "cell_type": "markdown", "id": "bc9390cd", "metadata": {}, "source": [ "## References\n", "\n", "Ruiz de Villa, A. Causal Inference for Data Science, Manning Publications, 2024." ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.2.2" } }, "nbformat": 4, "nbformat_minor": 5 }