:parenttoc: True

.. _intro:

Getting started
===============

The purpose of the following case studies is to demonstrate the core functionalities of :ref:`DoubleML `.

Data
----

For our case study we download the Bonus data set from the Pennsylvania Reemployment Bonus experiment and, as a second example, we simulate data from a partially linear regression model.

.. tab-set::

    .. tab-item:: Python
        :sync: py

        .. ipython:: python

            import numpy as np
            from doubleml.datasets import fetch_bonus

            # Load bonus data
            df_bonus = fetch_bonus('DataFrame')
            print(df_bonus.head(5))

            # Simulate data
            np.random.seed(3141)
            n_obs = 500
            n_vars = 100
            theta = 3
            X = np.random.normal(size=(n_obs, n_vars))
            d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
            y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))

    .. tab-item:: R
        :sync: r

        .. jupyter-execute::

            library(DoubleML)

            # Load bonus data
            df_bonus = fetch_bonus(return_type="data.table")
            head(df_bonus)

            # Simulate data
            set.seed(3141)
            n_obs = 500
            n_vars = 100
            theta = 3
            X = matrix(rnorm(n_obs*n_vars), nrow=n_obs, ncol=n_vars)
            d = X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs)
            y = theta*d + X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs)

The causal model
----------------

As an example, we specify a partially linear regression model (PLR).

**Partially linear regression (PLR)** models take the form

.. math::

    Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}(\zeta | D,X) = 0,

    D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,

where :math:`Y` is the outcome variable and :math:`D` is the policy variable of interest.
The high-dimensional vector :math:`X = (X_1, \ldots, X_p)` consists of other confounding covariates, and :math:`\zeta` and :math:`V` are stochastic errors.
Note that the simulated data above follows exactly this form with :math:`\theta_0 = 3` and :math:`g_0(X) = m_0(X) = 5 (X_1 + X_2 + X_3)`.
For details about the implemented models in the :ref:`DoubleML ` package we refer to the user guide :ref:`models`.

The data-backend DoubleMLData
-----------------------------

:ref:`DoubleML ` provides interfaces to dataframes as well as arrays.
Details on the data-backend and the interfaces can be found in the :ref:`user guide `.
The ``DoubleMLData`` class serves as data-backend and can be initialized from a dataframe by specifying the column ``y_col='inuidur1'`` serving as outcome variable :math:`Y`, the column(s) ``d_cols = 'tg'`` serving as treatment variable :math:`D` and the columns ``x_cols`` specifying the confounders.
Alternatively, an array interface can be used, as shown below for the simulated data.

.. tab-set::

    .. tab-item:: Python
        :sync: py

        .. ipython:: python

            from doubleml import DoubleMLData

            # Specify the data and the variables for the causal model
            dml_data_bonus = DoubleMLData(df_bonus,
                                          y_col='inuidur1',
                                          d_cols='tg',
                                          x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
                                                  'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
                                                  'durable', 'lusd', 'husd'])
            print(dml_data_bonus)

            # array interface to DoubleMLData
            dml_data_sim = DoubleMLData.from_arrays(X, y, d)
            print(dml_data_sim)

    .. tab-item:: R
        :sync: r

        .. jupyter-execute::

            # Specify the data and variables for the causal model
            dml_data_bonus = DoubleMLData$new(df_bonus,
                                              y_col = "inuidur1",
                                              d_cols = "tg",
                                              x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                                         "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                                         "durable", "lusd", "husd"))
            print(dml_data_bonus)

            # matrix interface to DoubleMLData
            dml_data_sim = double_ml_data_from_matrix(X=X, y=y, d=d)
            dml_data_sim

Learners to estimate the nuisance models
----------------------------------------

To estimate our partially linear regression (PLR) model with the double machine learning algorithm, we first have to specify learners to estimate :math:`m_0` and :math:`g_0`.
For the bonus data we use a random forest regression model and for our simulated data from a sparse partially linear model we use a Lasso regression model.
The implementation of :ref:`DoubleML ` is based on the meta-packages `scikit-learn <https://scikit-learn.org/>`_ (Pedregosa et al., 2011) for Python and `mlr3 <https://mlr3.mlr-org.com/>`_ (Lang et al., 2019) for R.
For details on the specification of learners and their hyperparameters we refer to the user guide :ref:`learners`.

.. tab-set::

    .. tab-item:: Python
        :sync: py

        .. ipython:: python

            from sklearn.base import clone
            from sklearn.ensemble import RandomForestRegressor
            from sklearn.linear_model import LassoCV

            learner = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=5)
            ml_l_bonus = clone(learner)
            ml_m_bonus = clone(learner)

            learner = LassoCV()
            ml_l_sim = clone(learner)
            ml_m_sim = clone(learner)

    .. tab-item:: R
        :sync: r

        .. jupyter-execute::

            library(mlr3)
            library(mlr3learners)

            # suppress messages from mlr3 package during fitting
            lgr::get_logger("mlr3")$set_threshold("warn")

            learner = lrn("regr.ranger", num.trees=500, max.depth=5, min.node.size=2)
            ml_l_bonus = learner$clone()
            ml_m_bonus = learner$clone()

            learner = lrn("regr.cv_glmnet", s="lambda.min")
            ml_l_sim = learner$clone()
            ml_m_sim = learner$clone()

Cross-fitting, DML algorithms and Neyman-orthogonal score functions
-------------------------------------------------------------------

When initializing the object for PLR models, ``DoubleMLPLR``, we can further set parameters specifying the resampling:
the number of folds used for cross-fitting ``n_folds`` (defaults to ``n_folds = 5``)
as well as the number of repetitions when applying repeated cross-fitting ``n_rep`` (defaults to ``n_rep = 1``).
Additionally, one can choose between the algorithms ``'dml1'`` and ``'dml2'`` via ``dml_procedure`` (defaults to ``'dml2'``).

Depending on the causal model, one can further choose between different Neyman-orthogonal score / moment functions.
For the PLR model the default ``score`` is ``'partialling out'``, i.e.,

.. math::

    \psi(W; \theta, \eta) := [Y - \ell(X) - \theta (D - m(X))] [D - m(X)].

Note that with this score, we do not estimate :math:`g_0(X)` directly, but the conditional expectation of :math:`Y` given :math:`X`, :math:`\ell = \mathbb{E}[Y|X]`.
The user guide provides details about the :ref:`resampling`, the :ref:`algorithms` and the :ref:`scores`.
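As a minimal sketch of these options, the following Python snippet passes the parameters described above as keyword arguments at initialization. The object name ``obj_dml_plr_custom`` and the non-default values are purely illustrative:

.. code-block:: python

    from doubleml import DoubleMLPLR

    # Illustrative non-default resampling and algorithm settings:
    # 10 folds, 2 repetitions of the cross-fitting, the 'dml1' algorithm
    # and the 'partialling out' score (the default) stated explicitly.
    obj_dml_plr_custom = DoubleMLPLR(dml_data_bonus, ml_l_bonus, ml_m_bonus,
                                     n_folds=10,
                                     n_rep=2,
                                     dml_procedure='dml1',
                                     score='partialling out')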
Estimate double/debiased machine learning models
------------------------------------------------

We now initialize ``DoubleMLPLR`` objects for our examples using default parameters.
The models are estimated by calling the ``fit()`` method and we can, for example, inspect the estimated treatment effect using the ``summary`` property.
A more detailed result summary can be obtained via the string representation of the object.
Besides the ``fit()`` method, :ref:`DoubleML ` model classes also provide functionalities to perform statistical inference, such as ``bootstrap()``, ``confint()`` and ``p_adjust()``; a short illustration follows the code below, and details can be found in the user guide :ref:`se_confint`.

.. tab-set::

    .. tab-item:: Python
        :sync: py

        .. ipython:: python

            from doubleml import DoubleMLPLR

            np.random.seed(3141)
            obj_dml_plr_bonus = DoubleMLPLR(dml_data_bonus, ml_l_bonus, ml_m_bonus)
            obj_dml_plr_bonus.fit();
            print(obj_dml_plr_bonus)

            obj_dml_plr_sim = DoubleMLPLR(dml_data_sim, ml_l_sim, ml_m_sim)
            obj_dml_plr_sim.fit();
            print(obj_dml_plr_sim)

    .. tab-item:: R
        :sync: r

        .. jupyter-execute::

            set.seed(3141)
            obj_dml_plr_bonus = DoubleMLPLR$new(dml_data_bonus, ml_l=ml_l_bonus, ml_m=ml_m_bonus)
            obj_dml_plr_bonus$fit()
            print(obj_dml_plr_bonus)

            obj_dml_plr_sim = DoubleMLPLR$new(dml_data_sim, ml_l=ml_l_sim, ml_m=ml_m_sim)
            obj_dml_plr_sim$fit()
            print(obj_dml_plr_sim)
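As a brief Python sketch of these inference functionalities applied to the fitted object, one can, for example, compute pointwise and bootstrap-based confidence intervals. The argument names ``level``, ``n_rep_boot`` and ``joint`` are assumptions not shown in this section; see the user guide :ref:`se_confint` for the exact interface:

.. code-block:: python

    # Pointwise 95% confidence interval for the treatment effect
    print(obj_dml_plr_bonus.confint(level=0.95))

    # Multiplier bootstrap as a basis for simultaneous inference
    obj_dml_plr_bonus.bootstrap(n_rep_boot=1000)
    print(obj_dml_plr_bonus.confint(joint=True))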
References
++++++++++

* Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., Bischl, B. (2019), mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software, `doi:10.21105/joss.01903 <https://doi.org/10.21105/joss.01903>`_.

* Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011), Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825--2830, https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html.