1.1. doubleml.DoubleMLData

class doubleml.DoubleMLData(data, y_col, d_cols, x_cols=None, z_cols=None, t_col=None, s_col=None, use_other_treat_as_covariate=True, force_all_x_finite=True)

Double machine learning data-backend.

DoubleMLData objects can be initialized from a pandas.DataFrame or, via from_arrays, from numpy.ndarray objects.

  • data (pandas.DataFrame) – The data.

  • y_col (str) – The outcome variable.

  • d_cols (str or list) – The treatment variable(s).

  • x_cols (None, str or list) – The covariates. If None, all variables (columns of data) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates. Default is None.

  • z_cols (None, str or list) – The instrumental variable(s). Default is None.

  • t_col (None or str) – The time variable (only relevant/used for DiD estimators). Default is None.

  • s_col (None or str) – The score or selection variable (only relevant/used for RDD or SSM estimators). Default is None.

  • use_other_treat_as_covariate (bool) – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates. Default is True.

  • force_all_x_finite (bool or str) – Indicates whether to raise an error on infinite values and/or missing values in the covariates x. Possible values are: True (neither missing values np.nan, pd.NA nor infinite values np.inf are allowed), False (missing and infinite values are allowed), 'allow-nan' (only missing values are allowed). Note that False and 'allow-nan' are only reasonable if the machine learning methods used for the nuisance functions can provide valid predictions with missing and/or infinite values in the covariates x. Default is True.
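The default for x_cols described above can be sketched in plain Python on column names alone. This is a minimal illustration of the documented rule, not the library's implementation: default_x_cols is a hypothetical helper, and because the rule as stated mentions only y_col, d_cols and z_cols, the sketch ignores t_col and s_col.

```python
def default_x_cols(columns, y_col, d_cols, z_cols=None):
    """Hypothetical helper: columns left over after removing y, d and z."""
    # d_cols and z_cols may be a single name or a list of names
    d_cols = [d_cols] if isinstance(d_cols, str) else list(d_cols)
    if z_cols is None:
        z_cols = []
    elif isinstance(z_cols, str):
        z_cols = [z_cols]
    reserved = {y_col, *d_cols, *z_cols}
    # keep the original column order
    return [col for col in columns if col not in reserved]


columns = ["y", "d", "X1", "X2", "Z1"]
print(default_x_cols(columns, "y", "d"))               # ['X1', 'X2', 'Z1']
print(default_x_cols(columns, "y", "d", z_cols="Z1"))  # ['X1', 'X2']
```

Passing x_cols explicitly avoids picking up columns that happen to be in the DataFrame but are not meant as covariates.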


>>> from doubleml import DoubleMLData
>>> from doubleml.datasets import make_plr_CCDDHNR2018
>>> # initialization from pandas.DataFrame
>>> df = make_plr_CCDDHNR2018(return_type='DataFrame')
>>> obj_dml_data_from_df = DoubleMLData(df, 'y', 'd')
>>> # initialization from np.ndarray
>>> (x, y, d) = make_plr_CCDDHNR2018(return_type='array')
>>> obj_dml_data_from_array = DoubleMLData.from_arrays(x, y, d)


Methods

  • from_arrays(x, y, d[, z, t, s, ...]) – Initialize DoubleMLData from numpy.ndarray objects.

  • set_x_d(treatment_var) – Function that assigns the role for the treatment variables in the multiple-treatment case.

Attributes

  • all_variables – All variables available in the dataset.

  • binary_outcome – Logical indicating whether the outcome variable is binary with values 0 and 1.

  • binary_treats – Series with logical(s) indicating whether the treatment variable(s) are binary with values 0 and 1.

  • d – Array of the treatment variable. Dynamic: depends on the currently set treatment variable. To get an array of all treatment variables (independent of the currently set treatment variable), call obj.data[obj.d_cols].values.

  • d_cols – The treatment variable(s).

  • data – The data.

  • force_all_x_finite – Indicates whether to raise an error on infinite values and/or missing values in the covariates x.

  • n_coefs – The number of coefficients to be estimated.

  • n_instr – The number of instruments.

  • n_obs – The number of observations.

  • n_treat – The number of treatment variables.

  • s – Array of the score or selection variable.

  • s_col – The score or selection variable.

  • t – Array of the time variable.

  • t_col – The time variable.

  • use_other_treat_as_covariate – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates.

  • x – Array of covariates. Dynamic: may depend on the currently set treatment variable. To get an array of all covariates (independent of the currently set treatment variable), call obj.data[obj.x_cols].values.

  • x_cols – The covariates.

  • y – Array of the outcome variable.

  • y_col – The outcome variable.

  • z – Array of instrumental variables.

  • z_cols – The instrumental variable(s).

classmethod DoubleMLData.from_arrays(x, y, d, z=None, t=None, s=None, use_other_treat_as_covariate=True, force_all_x_finite=True)

Initialize DoubleMLData from numpy.ndarray objects.

  • x (numpy.ndarray) – Array of covariates.

  • y (numpy.ndarray) – Array of the outcome variable.

  • d (numpy.ndarray) – Array of treatment variables.

  • z (None or numpy.ndarray) – Array of instrumental variables. Default is None.

  • t (None or numpy.ndarray) – Array of the time variable (only relevant/used for DiD models). Default is None.

  • s (None or numpy.ndarray) – Array of the score or selection variable (only relevant/used for RDD and SSM models). Default is None.

  • use_other_treat_as_covariate (bool) – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates. Default is True.

  • force_all_x_finite (bool or str) – Indicates whether to raise an error on infinite values and/or missing values in the covariates x. Possible values are: True (neither missing values np.nan, pd.NA nor infinite values np.inf are allowed), False (missing and infinite values are allowed), 'allow-nan' (only missing values are allowed). Note that False and 'allow-nan' are only reasonable if the machine learning methods used for the nuisance functions can provide valid predictions with missing and/or infinite values in the covariates x. Default is True.
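The three settings of force_all_x_finite can be read as the following check. This is a pure-Python sketch of the documented behavior, assuming the covariates are given as rows of floats. The name check_x_finite is hypothetical, pd.NA is not handled, and the library's own validation may differ in detail.

```python
import math


def check_x_finite(x_rows, force_all_x_finite=True):
    """Hypothetical check mirroring the three documented options."""
    if force_all_x_finite is False:
        return  # missing and infinite values are both allowed
    if force_all_x_finite not in (True, "allow-nan"):
        raise ValueError("force_all_x_finite must be True, False or 'allow-nan'")
    for row in x_rows:
        for value in row:
            # infinite values are rejected under both True and 'allow-nan'
            if math.isinf(value):
                raise ValueError("infinite value in covariates x")
            # missing values are rejected only under True
            if math.isnan(value) and force_all_x_finite is True:
                raise ValueError("missing value in covariates x")


x = [[0.5, float("nan")], [1.0, 2.0]]
check_x_finite(x, "allow-nan")  # passes: only a missing value is present
check_x_finite(x, False)        # passes: nothing is checked
try:
    check_x_finite(x, True)
except ValueError as err:
    print(err)                  # missing value in covariates x
```

Relaxing the check only makes sense when the learners used for the nuisance functions can themselves handle such values.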


>>> from doubleml import DoubleMLData
>>> from doubleml.datasets import make_plr_CCDDHNR2018
>>> (x, y, d) = make_plr_CCDDHNR2018(return_type='array')
>>> obj_dml_data_from_array = DoubleMLData.from_arrays(x, y, d)

DoubleMLData.set_x_d(treatment_var)

Function that assigns the role for the treatment variables in the multiple-treatment case.

  • treatment_var (str) – Active treatment variable that will be set to d.
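How the active treatment changes d and x in the multiple-treatment case can be sketched on column names alone. This is an illustration under assumptions, not the library's code: active_roles is a hypothetical helper, and appending the other treatments after x_cols is an assumed ordering.

```python
def active_roles(d_cols, x_cols, treatment_var, use_other_treat_as_covariate=True):
    """Hypothetical helper: column names behind d and x for one active treatment."""
    if treatment_var not in d_cols:
        raise ValueError(f"{treatment_var!r} is not one of the treatment variables {d_cols}")
    x_active = list(x_cols)
    if use_other_treat_as_covariate:
        # the remaining treatments act as additional covariates
        x_active += [d for d in d_cols if d != treatment_var]
    return treatment_var, x_active


d_cols, x_cols = ["d1", "d2"], ["X1", "X2"]
print(active_roles(d_cols, x_cols, "d1"))
# ('d1', ['X1', 'X2', 'd2'])
print(active_roles(d_cols, x_cols, "d2", use_other_treat_as_covariate=False))
# ('d2', ['X1', 'X2'])
```

On a DoubleMLData object, obj.d and obj.x follow the currently set treatment variable in this way, while obj.data[obj.d_cols].values and obj.data[obj.x_cols].values stay independent of it.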