1.3. doubleml.data.DoubleMLPanelData

class doubleml.data.DoubleMLPanelData(data, y_col, d_cols, t_col, id_col, x_cols=None, z_cols=None, static_panel=False, use_other_treat_as_covariate=True, force_all_x_finite=True, datetime_unit='M')

Double machine learning data-backend for panel data in long format.

DoubleMLPanelData objects can be initialized from pandas.DataFrame as well as numpy.ndarray objects.

Parameters:

data (pandas.DataFrame) – The data.
y_col (str) – The outcome variable.
d_cols (str or list) – The treatment variable(s) indicating the treatment groups in terms of first time of treatment exposure.
t_col (str) – The time variable identifying the period of each observation.
id_col (str) – Unique unit identifier.
x_cols (None, str or list) – The covariates. If None, all variables (columns of data) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates. Default is None.
z_cols (None, str or list) – The instrumental variable(s). Default is None.
static_panel (bool) – Indicates whether the data model corresponds to a static panel data approach (True) or to staggered adoption panel data (False). In the latter case, the treatment groups/values are defined in terms of the first time of treatment exposure. Default is False.
use_other_treat_as_covariate (bool) – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates. Default is True.
force_all_x_finite (bool or str) – Indicates whether to raise an error on infinite values and / or missings in the covariates x. Possible values are: True (neither missings np.nan, pd.NA nor infinite values np.inf are allowed), False (missings and infinite values are allowed), 'allow-nan' (only missings are allowed). Note that the choices False and 'allow-nan' are only reasonable if the machine learning methods used for the nuisance functions are capable of providing valid predictions with missings and / or infinite values in the covariates x. Default is True.
datetime_unit (str) – The unit of the time and treatment variables (if they are of datetime type). Default is 'M'.
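To make the phrase "first time of treatment exposure" concrete, the following sketch derives a group value per unit from long-format records. It is an illustration only and not part of the DoubleML API: the record layout, the helper name `first_exposure_groups`, and the use of `None` for never-treated units are assumptions made for this example.

```python
def first_exposure_groups(records):
    """Map each unit id to the first period in which it is treated.

    records: iterable of (unit_id, period, treated) tuples in long format,
    i.e. one tuple per unit and period. Units that are never treated map
    to None (an assumption of this sketch).
    """
    groups = {}
    for unit_id, period, treated in records:
        # Register the unit even if it is never treated.
        groups.setdefault(unit_id, None)
        if treated and (groups[unit_id] is None or period < groups[unit_id]):
            groups[unit_id] = period
    return groups


records = [
    ("a", 1, 0), ("a", 2, 1), ("a", 3, 1),  # first treated in period 2
    ("b", 1, 0), ("b", 2, 0), ("b", 3, 1),  # first treated in period 3
    ("c", 1, 0), ("c", 2, 0), ("c", 3, 0),  # never treated
]
groups = first_exposure_groups(records)
print(groups)  # {'a': 2, 'b': 3, 'c': None}
```

In a data set prepared this way, the column passed as d_cols would hold the group value for every row of the unit, alongside the id_col and t_col columns.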

Examples

>>> from doubleml.did.datasets import make_did_CS2021
>>> from doubleml import DoubleMLPanelData
>>> df = make_did_CS2021(n_obs=500)
>>> dml_data = DoubleMLPanelData(
...     df,
...     y_col="y",
...     d_cols="d",
...     id_col="id",
...     t_col="t",
...     x_cols=["Z1", "Z2", "Z3", "Z4"],
...     datetime_unit="M"
... )

Methods

`from_arrays`(x, y, d, t, identifier[, z, s, ...])	Initialize `DoubleMLData` from `numpy.ndarray` objects.
`set_x_d`(treatment_var)	Function that assigns the role for the treatment variables in the multiple-treatment case.

Attributes

`all_variables`	All variables available in the dataset.
`binary_outcome`	Logical indicating whether the outcome variable is binary with values 0 and 1.
`binary_treats`	Series with logical(s) indicating whether the treatment variable(s) are binary with values 0 and 1.
`cluster_cols`	The cluster variable(s).
`cluster_vars`	Array of cluster variable(s).
`d`	Array of treatment variable; Dynamic! Depends on the currently set treatment variable; To get an array of all treatment variables (independent of the currently set treatment variable) call `obj.data[obj.d_cols].values`.
`d_cols`	The treatment variable(s).
`data`	The data.
`datetime_unit`	The unit of the time variable.
`force_all_d_finite`	Indicates whether to raise an error on infinite values and / or missings in the treatment variables `d`.
`force_all_x_finite`	Indicates whether to raise an error on infinite values and / or missings in the covariates `x`.
`g_col`	The treatment variable indicating the time of treatment exposure.
`g_values`	The unique values of the treatment variable (groups) `d`.
`id_col`	The id variable.
`id_var`	Array of id variable.
`id_var_unique`	Unique values of id variable.
`is_cluster_data`	Flag indicating whether this data object is being used for cluster data.
`n_cluster_vars`	The number of cluster variables.
`n_coefs`	The number of coefficients to be estimated.
`n_groups`	The number of groups.
`n_ids`	The number of unique values for id_col.
`n_instr`	The number of instruments.
`n_obs`	The number of observations.
`n_t_periods`	The number of time periods.
`n_treat`	The number of treatment variables.
`static_panel`	Indicates whether the data model corresponds to a static panel data approach.
`t`	Array of time variable.
`t_col`	The time variable.
`t_values`	The unique values of the time variable `t`.
`use_other_treat_as_covariate`	Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates.
`x`	Array of covariates; Dynamic! May depend on the currently set treatment variable; To get an array of all covariates (independent of the currently set treatment variable) call `obj.data[obj.x_cols].values`.
`x_cols`	The covariates.
`y`	Array of outcome variable.
`y_col`	The outcome variable.
`z`	Array of instrumental variables.
`z_cols`	The instrumental variable(s).

classmethod DoubleMLPanelData.from_arrays(x, y, d, t, identifier, z=None, s=None, use_other_treat_as_covariate=True, force_all_x_finite=True)

Initialize DoubleMLData from numpy.ndarray objects.

Parameters:

x (numpy.ndarray) – Array of covariates.
y (numpy.ndarray) – Array of the outcome variable.
d (numpy.ndarray) – Array of treatment variables.
z (None or numpy.ndarray) – Array of instrumental variables. Default is None.
cluster_vars (None or numpy.ndarray) – Array of cluster variables. Default is None.
use_other_treat_as_covariate (bool) – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates. Default is True.
force_all_x_finite (bool or str) – Indicates whether to raise an error on infinite values and / or missings in the covariates x. Possible values are: True (neither missings np.nan, pd.NA nor infinite values np.inf are allowed), False (missings and infinite values are allowed), 'allow-nan' (only missings are allowed). Note that the choices False and 'allow-nan' are only reasonable if the machine learning methods used for the nuisance functions are capable of providing valid predictions with missings and / or infinite values in the covariates x. Default is True.
force_all_d_finite (bool or str) – Indicates whether to raise an error on infinite values and / or missings in the treatment variables d. Possible values are: True (neither missings np.nan, pd.NA nor infinite values np.inf are allowed), False (missings and infinite values are allowed), 'allow-nan' (only missings are allowed). Note that the choices False and 'allow-nan' are only reasonable if the model used allows for missing and / or infinite values in the treatment variables d (e.g. panel data models). Default is True.
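The three settings shared by force_all_x_finite and force_all_d_finite can be illustrated with a small pure-Python check. This is a sketch of the documented semantics only, not DoubleML's implementation; the helper name `check_finite` and its error messages are hypothetical.

```python
import math


def check_finite(values, force_all_finite=True):
    """Hypothetical helper mirroring the documented modes.

    True        -> reject NaN and +/-inf
    'allow-nan' -> reject +/-inf only
    False       -> accept everything
    """
    if not isinstance(force_all_finite, bool) and force_all_finite != "allow-nan":
        raise ValueError("force_all_finite must be True, False or 'allow-nan'")
    if force_all_finite is False:
        return  # nothing is checked
    for v in values:
        if math.isinf(v):
            raise ValueError("infinite value found")
        if force_all_finite is True and math.isnan(v):
            raise ValueError("missing value (NaN) found")


check_finite([1.0, 2.0])                                 # passes
check_finite([1.0, float("nan")], "allow-nan")           # passes: NaN tolerated
check_finite([float("inf"), float("nan")], False)        # passes: no checks
```

With the default True, the second and third inputs above would raise a ValueError, which is the safe choice unless the learners are known to handle such values.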

Examples

>>> from doubleml import DoubleMLData
>>> from doubleml.plm.datasets import make_plr_CCDDHNR2018
>>> (x, y, d) = make_plr_CCDDHNR2018(return_type='array')
>>> obj_dml_data_from_array = DoubleMLData.from_arrays(x, y, d)

DoubleMLPanelData.set_x_d(treatment_var)

Function that assigns the role for the treatment variables in the multiple-treatment case.

Parameters:

treatment_var (str) – Active treatment variable that will be set to d.
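How the active treatment interacts with use_other_treat_as_covariate can be sketched in plain Python. The sketch models only the documented behavior, namely that the attributes d and x depend on the currently set treatment variable. The function `assign_roles`, its return layout, and the column order it produces are hypothetical and do not describe DoubleML's internals.

```python
def assign_roles(d_cols, x_cols, treatment_var, use_other_treat_as_covariate=True):
    """Return (active treatment column, covariate columns) for one treatment."""
    if treatment_var not in d_cols:
        raise ValueError(
            f"{treatment_var!r} is not one of the treatment columns {d_cols}"
        )
    covariates = list(x_cols)
    if use_other_treat_as_covariate:
        # The remaining treatment columns are used as additional covariates.
        covariates += [d for d in d_cols if d != treatment_var]
    return treatment_var, covariates


print(assign_roles(["d1", "d2"], ["x1", "x2"], "d2"))
# ('d2', ['x1', 'x2', 'd1'])
print(assign_roles(["d1", "d2"], ["x1", "x2"], "d2", use_other_treat_as_covariate=False))
# ('d2', ['x1', 'x2'])
```

With a single treatment variable the two settings coincide, since there are no other treatment columns to add.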