:parenttoc: True

.. _intro:

Getting started
===============

The purpose of the following case studies is to demonstrate the core functionalities of :ref:`DoubleML `.

Data
----

For our case study we download the Bonus data set from the Pennsylvania Reemployment Bonus experiment and, as a second example, we simulate data from a partially linear regression model.

.. tab-set::

    .. tab-item:: Python
        :sync: py

        .. ipython:: python

            import numpy as np
            from doubleml.datasets import fetch_bonus

            # Load bonus data
            df_bonus = fetch_bonus('DataFrame')
            print(df_bonus.head(5))

            # Simulate data
            np.random.seed(3141)
            n_obs = 500
            n_vars = 100
            theta = 3
            X = np.random.normal(size=(n_obs, n_vars))
            d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
            y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))

    .. tab-item:: R
        :sync: r

        .. jupyter-execute::

            library(DoubleML)

            # Load bonus data
            df_bonus = fetch_bonus(return_type="data.table")
            head(df_bonus)

            # Simulate data
            set.seed(3141)
            n_obs = 500
            n_vars = 100
            theta = 3
            X = matrix(rnorm(n_obs*n_vars), nrow=n_obs, ncol=n_vars)
            d = X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs)
            y = theta*d + X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs)

The causal model
----------------

As an example, we specify a partially linear regression model (PLR).

**Partially linear regression (PLR)** models take the form

.. math::

    Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}(\zeta | D,X) = 0,

    D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,

where :math:`Y` is the outcome variable and :math:`D` is the policy variable of interest.
The high-dimensional vector :math:`X = (X_1, \ldots, X_p)` consists of other confounding covariates, and :math:`\zeta` and :math:`V` are stochastic errors.
Note that the simulated data above follows exactly this form with :math:`\theta_0 = 3` and :math:`g_0(X) = m_0(X) = 5 (X_1 + X_2 + X_3)`.
For details about the implemented models in the :ref:`DoubleML ` package we refer to the user guide :ref:`models`.

The data-backend DoubleMLData
-----------------------------

:ref:`DoubleML ` provides interfaces to dataframes as well as arrays.
Details on the data-backend and the interfaces can be found in the :ref:`user guide `.
The ``DoubleMLData`` class serves as data-backend and can be initialized from a dataframe by specifying the column ``y_col='inuidur1'`` serving as outcome variable :math:`Y`, the column(s) ``d_cols = 'tg'`` serving as treatment variable :math:`D` and the columns ``x_cols`` specifying the confounders.
Alternatively, an array interface can be used, as shown below for the simulated data.

.. tab-set::

    .. tab-item:: Python
        :sync: py

        .. ipython:: python

            from doubleml import DoubleMLData

            # Specify the data and the variables for the causal model
            dml_data_bonus = DoubleMLData(df_bonus,
                                          y_col='inuidur1',
                                          d_cols='tg',
                                          x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
                                                  'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
                                                  'durable', 'lusd', 'husd'])
            print(dml_data_bonus)

            # array interface to DoubleMLData
            dml_data_sim = DoubleMLData.from_arrays(X, y, d)
            print(dml_data_sim)

    .. tab-item:: R
        :sync: r

        .. jupyter-execute::

            # Specify the data and variables for the causal model
            dml_data_bonus = DoubleMLData$new(df_bonus,
                                              y_col = "inuidur1",
                                              d_cols = "tg",
                                              x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                                         "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                                         "durable", "lusd", "husd"))
            print(dml_data_bonus)

            # matrix interface to DoubleMLData
            dml_data_sim = double_ml_data_from_matrix(X=X, y=y, d=d)
            dml_data_sim

Learners to estimate the nuisance models
----------------------------------------

To estimate our partially linear regression (PLR) model with the double machine learning algorithm, we first have to specify learners to estimate :math:`m_0` and :math:`g_0`.
For the bonus data we use a random forest regression model and for our simulated data from a sparse partially linear model we use a Lasso regression model.
The implementation of :ref:`DoubleML ` is based on the meta-packages `scikit-learn <https://scikit-learn.org/>`_ (Pedregosa et al., 2011) for Python and `mlr3 <https://mlr3.mlr-org.com/>`_ (Lang et al., 2019) for R.
For details on the specification of learners and their hyperparameters we refer to the user guide :ref:`learners`.

.. tab-set::

    .. tab-item:: Python
        :sync: py

        .. ipython:: python

            from sklearn.base import clone
            from sklearn.ensemble import RandomForestRegressor
            from sklearn.linear_model import LassoCV

            learner = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=5)
            ml_l_bonus = clone(learner)
            ml_m_bonus = clone(learner)

            learner = LassoCV()
            ml_l_sim = clone(learner)
            ml_m_sim = clone(learner)

    .. tab-item:: R
        :sync: r

        .. jupyter-execute::

            library(mlr3)
            library(mlr3learners)

            # suppress messages from mlr3 package during fitting
            lgr::get_logger("mlr3")$set_threshold("warn")

            learner = lrn("regr.ranger", num.trees=500, max.depth=5, min.node.size=2)
            ml_l_bonus = learner$clone()
            ml_m_bonus = learner$clone()

            learner = lrn("regr.cv_glmnet", s="lambda.min")
            ml_l_sim = learner$clone()
            ml_m_sim = learner$clone()

Cross-fitting, DML algorithms and Neyman-orthogonal score functions
-------------------------------------------------------------------

When initializing the object for PLR models, ``DoubleMLPLR``, we can further set parameters specifying the resampling:
the number of folds used for cross-fitting ``n_folds`` (defaults to ``n_folds = 5``)
as well as the number of repetitions when applying repeated cross-fitting ``n_rep`` (defaults to ``n_rep = 1``).
Additionally, one can choose between the algorithms ``'dml1'`` and ``'dml2'`` via ``dml_procedure`` (defaults to ``'dml2'``).

Depending on the causal model, one can further choose between different Neyman-orthogonal score / moment functions.
For the PLR model the default ``score`` is ``'partialling out'``, i.e.,

.. math::

    \psi(W; \theta, \eta) := [Y - \ell(X) - \theta (D - m(X))] [D - m(X)].

Note that with this score, we do not estimate :math:`g_0(X)` directly, but the conditional expectation of :math:`Y` given :math:`X`, :math:`\ell = \mathbb{E}[Y|X]`.
The user guide provides details about the :ref:`resampling`, the :ref:`algorithms` and the :ref:`scores`.
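As a minimal sketch of these options, the following Python snippet passes the parameters described above as keyword arguments at initialization. The object name ``obj_dml_plr_custom`` and the non-default values are purely illustrative:

.. code-block:: python

    from doubleml import DoubleMLPLR

    # Illustrative non-default resampling and algorithm settings:
    # 10 folds, 2 repetitions of the cross-fitting, the 'dml1' algorithm
    # and the 'partialling out' score (the default) stated explicitly.
    obj_dml_plr_custom = DoubleMLPLR(dml_data_bonus, ml_l_bonus, ml_m_bonus,
                                     n_folds=10,
                                     n_rep=2,
                                     dml_procedure='dml1',
                                     score='partialling out')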
Estimate double/debiased machine learning models
------------------------------------------------

We now initialize ``DoubleMLPLR`` objects for our examples using default parameters.
The models are estimated by calling the ``fit()`` method and we can, for example, inspect the estimated treatment effect using the ``summary`` property.
A more detailed result summary can be obtained via the string representation of the object.
Besides the ``fit()`` method, :ref:`DoubleML ` model classes also provide functionalities to perform statistical inference, such as ``bootstrap()``, ``confint()`` and ``p_adjust()``; a short illustration follows the code below, and details can be found in the user guide :ref:`se_confint`.

.. tab-set::

    .. tab-item:: Python
        :sync: py

        .. ipython:: python

            from doubleml import DoubleMLPLR

            np.random.seed(3141)
            obj_dml_plr_bonus = DoubleMLPLR(dml_data_bonus, ml_l_bonus, ml_m_bonus)
            obj_dml_plr_bonus.fit();
            print(obj_dml_plr_bonus)

            obj_dml_plr_sim = DoubleMLPLR(dml_data_sim, ml_l_sim, ml_m_sim)
            obj_dml_plr_sim.fit();
            print(obj_dml_plr_sim)

    .. tab-item:: R
        :sync: r

        .. jupyter-execute::

            set.seed(3141)
            obj_dml_plr_bonus = DoubleMLPLR$new(dml_data_bonus, ml_l=ml_l_bonus, ml_m=ml_m_bonus)
            obj_dml_plr_bonus$fit()
            print(obj_dml_plr_bonus)

            obj_dml_plr_sim = DoubleMLPLR$new(dml_data_sim, ml_l=ml_l_sim, ml_m=ml_m_sim)
            obj_dml_plr_sim$fit()
            print(obj_dml_plr_sim)
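As a brief Python sketch of these inference functionalities applied to the fitted object, one can, for example, compute pointwise and bootstrap-based confidence intervals. The argument names ``level``, ``n_rep_boot`` and ``joint`` are assumptions not shown in this section; see the user guide :ref:`se_confint` for the exact interface:

.. code-block:: python

    # Pointwise 95% confidence interval for the treatment effect
    print(obj_dml_plr_bonus.confint(level=0.95))

    # Multiplier bootstrap as a basis for simultaneous inference
    obj_dml_plr_bonus.bootstrap(n_rep_boot=1000)
    print(obj_dml_plr_bonus.confint(joint=True))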
References
++++++++++

* Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., Bischl, B. (2019), mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software, `doi:10.21105/joss.01903 <https://doi.org/10.21105/joss.01903>`_.

* Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011), Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825--2830, https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html.