doubleml.DoubleMLData

class doubleml.DoubleMLData(data, y_col, d_cols, x_cols=None, z_cols=None, t_col=None, s_col=None, use_other_treat_as_covariate=True, force_all_x_finite=True)

Double machine learning data-backend.

DoubleMLData objects can be initialized from a pandas.DataFrame via the constructor, or from numpy.ndarray objects via from_arrays.

Parameters:
  • data (pandas.DataFrame) – The data.

  • y_col (str) – The outcome variable.

  • d_cols (str or list) – The treatment variable(s).

  • x_cols (None, str or list) – The covariates. If None, all variables (columns of data) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates. Default is None.

  • z_cols (None, str or list) – The instrumental variable(s). Default is None.

  • t_col (None or str) – The time variable (only relevant/used for DiD estimators). Default is None.

  • s_col (None or str) – The selection variable (only relevant/used for SSM estimators). Default is None.

  • use_other_treat_as_covariate (bool) – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates. Default is True.

  • force_all_x_finite (bool or str) – Indicates whether to raise an error on infinite values and/or missing values in the covariates x. Possible values are: True (neither missing values np.nan, pd.NA nor infinite values np.inf are allowed), False (missing and infinite values are allowed), 'allow-nan' (only missing values are allowed). Note that False and 'allow-nan' are only reasonable if the machine learning methods used for the nuisance functions can provide valid predictions with missing and/or infinite values in the covariates x. Default is True.
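The default for x_cols described above can be illustrated with a small standalone sketch. The helper default_x_cols is hypothetical and only mirrors the documented rule (every column that is not an outcome, treatment, or instrument becomes a covariate); it is not DoubleML's own implementation.

```python
def as_list(value):
    """Normalize None, a single name, or a list of names to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def default_x_cols(columns, y_col, d_cols, z_cols=None):
    """Hypothetical helper: columns that are neither outcome, treatment, nor instrument."""
    reserved = {y_col, *as_list(d_cols), *as_list(z_cols)}
    # Keep the original column order of the data.
    return [col for col in columns if col not in reserved]


columns = ["y", "d", "x1", "x2", "z"]
print(default_x_cols(columns, "y", "d"))              # ['x1', 'x2', 'z']
print(default_x_cols(columns, "y", "d", z_cols="z"))  # ['x1', 'x2']
```

Under this rule, a column that is meant to be an instrument but is not listed in z_cols ends up among the covariates, so it can be worth passing x_cols explicitly when the data contains extra columns.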

Examples

>>> from doubleml import DoubleMLData
>>> from doubleml.datasets import make_plr_CCDDHNR2018
>>> # initialization from pandas.DataFrame
>>> df = make_plr_CCDDHNR2018(return_type='DataFrame')
>>> obj_dml_data_from_df = DoubleMLData(df, 'y', 'd')
>>> # initialization from np.ndarray
>>> (x, y, d) = make_plr_CCDDHNR2018(return_type='array')
>>> obj_dml_data_from_array = DoubleMLData.from_arrays(x, y, d)

Methods

from_arrays(x, y, d[, z, t, s, ...])

Initialize DoubleMLData from numpy.ndarray objects.

set_x_d(treatment_var)

Function that assigns the role for the treatment variables in the multiple-treatment case.

Attributes

all_variables

All variables available in the dataset.

binary_outcome

Logical indicating whether the outcome variable is binary with values 0 and 1.

binary_treats

Series with logical(s) indicating whether the treatment variable(s) are binary with values 0 and 1.

d

Array of the currently active treatment variable. This attribute is dynamic: it depends on the currently set treatment variable. To get an array of all treatment variables (independent of the currently set treatment variable), call obj.data[obj.d_cols].values.

d_cols

The treatment variable(s).

data

The data.

force_all_x_finite

Indicates whether to raise an error on infinite values and / or missings in the covariates x.

n_coefs

The number of coefficients to be estimated.

n_instr

The number of instruments.

n_obs

The number of observations.

n_treat

The number of treatment variables.

s

Array of selection variable.

s_col

The selection variable.

t

Array of time variable.

t_col

The time variable.

use_other_treat_as_covariate

Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates.

x

Array of covariates. This attribute is dynamic: it may depend on the currently set treatment variable. To get an array of all covariates (independent of the currently set treatment variable), call obj.data[obj.x_cols].values.

x_cols

The covariates.

y

Array of outcome variable.

y_col

The outcome variable.

z

Array of instrumental variables.

z_cols

The instrumental variable(s).

classmethod DoubleMLData.from_arrays(x, y, d, z=None, t=None, s=None, use_other_treat_as_covariate=True, force_all_x_finite=True)

Initialize DoubleMLData from numpy.ndarray objects.

Parameters:
  • x (numpy.ndarray) – Array of covariates.

  • y (numpy.ndarray) – Array of the outcome variable.

  • d (numpy.ndarray) – Array of treatment variables.

  • z (None or numpy.ndarray) – Array of instrumental variables. Default is None.

  • t (None or numpy.ndarray) – Array of the time variable (only relevant/used for DiD models). Default is None.

  • s (None or numpy.ndarray) – Array of the selection variable (only relevant/used for SSM models). Default is None.

  • use_other_treat_as_covariate (bool) – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates. Default is True.

  • force_all_x_finite (bool or str) – Indicates whether to raise an error on infinite values and/or missing values in the covariates x. Possible values are: True (neither missing values np.nan, pd.NA nor infinite values np.inf are allowed), False (missing and infinite values are allowed), 'allow-nan' (only missing values are allowed). Note that False and 'allow-nan' are only reasonable if the machine learning methods used for the nuisance functions can provide valid predictions with missing and/or infinite values in the covariates x. Default is True.
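The three modes of force_all_x_finite can be summarized in a standalone sketch. The function check_x_finite is a hypothetical stand-in written for this illustration; it is not DoubleML's own check, and the error messages are invented. It only handles float arrays, so pd.NA is out of scope here.

```python
import numpy as np


def check_x_finite(x, force_all_x_finite=True):
    """Hypothetical stand-in for the covariate check; not DoubleML's own code."""
    x = np.asarray(x, dtype=float)
    if force_all_x_finite is False:
        return  # missing and infinite values are both allowed
    if force_all_x_finite == "allow-nan":
        bad = np.isinf(x)      # only infinite values are rejected
    elif force_all_x_finite is True:
        bad = ~np.isfinite(x)  # NaN and infinite values are rejected
    else:
        raise ValueError("force_all_x_finite must be True, False or 'allow-nan'")
    if bad.any():
        raise ValueError("covariates contain disallowed values")


x = np.array([[1.0, np.nan], [2.0, 3.0]])
check_x_finite(x, "allow-nan")  # passes: NaN is tolerated
check_x_finite(x, False)        # passes: nothing is checked
try:
    check_x_finite(x, True)     # fails: NaN is not allowed
except ValueError as err:
    print("rejected:", err)
```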

Examples

>>> from doubleml import DoubleMLData
>>> from doubleml.datasets import make_plr_CCDDHNR2018
>>> (x, y, d) = make_plr_CCDDHNR2018(return_type='array')
>>> obj_dml_data_from_array = DoubleMLData.from_arrays(x, y, d)

DoubleMLData.set_x_d(treatment_var)

Function that assigns the role for the treatment variables in the multiple-treatment case.

Parameters:
  • treatment_var (str) – Active treatment variable that will be set to d.