Getting Started

Installation

Please read the following installation instructions and make sure you have installed the latest release of DoubleML on your local machine before the tutorial.

If you would like to learn more about DoubleML beforehand, feel free to read through our user guide.

Install latest release from CRAN

Install the latest release from CRAN via

install.packages("DoubleML")

Install development version from GitHub

The development version of the DoubleML package for R can be installed from GitHub with the following command (the remotes package must be installed first).

remotes::install_github("DoubleML/doubleml-for-r")

Load DoubleML

Load the package after the installation has completed.

library(DoubleML)
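To verify that the installation succeeded, you can optionally check the installed version against the latest release listed on CRAN. This sketch uses `packageVersion()`, which is part of base R's utils package:

```r
# Print the installed version of DoubleML; compare it with the
# latest release listed on CRAN to confirm you are up to date
packageVersion("DoubleML")
```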

Install packages for learners

As described in our user guide section on learners and the corresponding chapter of the mlr3book, we need to install the packages required by the ML learners. In this tutorial, we will use the R packages ranger, glmnet and xgboost.

install.packages("ranger")
install.packages("glmnet")
install.packages("xgboost")

Example

Once you have installed all packages, try running the following example. First, load the DoubleML package.

library(DoubleML)

Load the Bonus data set.

df_bonus = fetch_bonus(return_type="data.table")
head(df_bonus)
##    inuidur1 female black othrace dep1 dep2 q2 q3 q4 q5 q6 agelt35 agegt54
## 1: 2.890372      0     0       0    0    1  0  0  0  1  0       0       0
## 2: 0.000000      0     0       0    0    0  0  0  0  1  0       0       0
## 3: 3.295837      0     0       0    0    0  0  0  1  0  0       0       0
## 4: 2.197225      0     0       0    0    0  0  1  0  0  0       1       0
## 5: 3.295837      0     0       0    1    0  0  0  0  1  0       0       1
## 6: 3.295837      1     0       0    0    0  0  0  0  1  0       0       1
##    durable lusd husd tg
## 1:       0    0    1  0
## 2:       0    1    0  0
## 3:       0    1    0  0
## 4:       0    0    0  1
## 5:       1    1    0  0
## 6:       0    1    0  0

Create a data backend.

# Specify the data and variables for the causal model
dml_data_bonus = DoubleMLData$new(df_bonus,
                                  y_col = "inuidur1",
                                  d_cols = "tg",
                                  x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                             "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                             "durable", "lusd", "husd"))
print(dml_data_bonus)
## ================= DoubleMLData Object ==================
## 
## 
## ------------------ Data summary      ------------------
## Outcome variable: inuidur1
## Treatment variable(s): tg
## Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
## Instrument(s): 
## No. Observations: 5099

Create two learners for the nuisance components using mlr3 and mlr3learners.

library(mlr3)
library(mlr3learners)
# suppress messages from mlr3 package during fitting
lgr::get_logger("mlr3")$set_threshold("warn")

learner = lrn("regr.ranger", num.trees=500, max.depth=5, min.node.size=2)
ml_l_bonus = learner$clone()
ml_m_bonus = learner$clone()

Create a new instance of a causal model, here a partially linear regression model via DoubleMLPLR.

set.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR$new(dml_data_bonus, ml_l=ml_l_bonus, ml_m=ml_m_bonus)
obj_dml_plr_bonus$fit()
print(obj_dml_plr_bonus)
## ================= DoubleMLPLR Object ==================
## 
## 
## ------------------ Data summary      ------------------
## Outcome variable: inuidur1
## Treatment variable(s): tg
## Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
## Instrument(s): 
## No. Observations: 5099
## 
## ------------------ Score & algorithm ------------------
## Score function: partialling out
## DML algorithm: dml2
## 
## ------------------ Machine learner   ------------------
## ml_l: regr.ranger
## ml_m: regr.ranger
## 
## ------------------ Resampling        ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: TRUE
## 
## ------------------ Fit summary       ------------------
##  Estimates and significance testing of the effect of target variables
##    Estimate. Std. Error t value Pr(>|t|)  
## tg  -0.07561    0.03536  -2.139   0.0325 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
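Beyond the printed summary, the fitted object exposes the estimate, its standard error, and confidence intervals directly. A brief sketch (the `confint()` method uses a 95% level by default):

```r
# Point estimate and standard error of the treatment effect
obj_dml_plr_bonus$coef
obj_dml_plr_bonus$se

# 95% confidence interval for the effect of tg
obj_dml_plr_bonus$confint()
```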

Ready to go :-)

Once you are able to run this code, you are ready for our tutorial!