1.3. doubleml.data.DoubleMLPanelData
- class doubleml.data.DoubleMLPanelData(data, y_col, d_cols, t_col, id_col, x_cols=None, z_cols=None, use_other_treat_as_covariate=True, force_all_x_finite=True, datetime_unit='M')
Double machine learning data-backend for panel data in long format.
DoubleMLPanelData objects can be initialized from pandas.DataFrame as well as numpy.ndarray objects.
- Parameters:
data (pandas.DataFrame) – The data.
y_col (str) – The outcome variable.
d_cols (str or list) – The treatment variable(s) indicating the treatment groups in terms of first time of treatment exposure.
t_col (str) – The time variable.
id_col (str) – Unique unit identifier.
x_cols (None, str or list) – The covariates. If None, all variables (columns of data) which are specified neither as outcome variable y_col, nor as treatment variables d_cols, nor as instrumental variables z_cols are used as covariates. Default is None.
z_cols (None, str or list) – The instrumental variable(s). Default is None.
use_other_treat_as_covariate (bool) – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates. Default is True.
force_all_x_finite (bool or str) – Indicates whether to raise an error on infinite values and / or missing values in the covariates x. Possible values are: True (neither missing values np.nan, pd.NA nor infinite values np.inf are allowed), False (missing and infinite values are allowed), 'allow-nan' (only missing values are allowed). Note that the choices False and 'allow-nan' are only reasonable if the machine learning methods used for the nuisance functions are capable of providing valid predictions with missing and / or infinite values in the covariates x. Default is True.
datetime_unit (str) – The unit of the time and treatment variable (if datetime type). Default is 'M'.
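The x_cols=None default described above can be illustrated with a small pandas sketch. The helper default_x_cols is hypothetical and is not part of DoubleML. The documented rule names only y_col, d_cols and z_cols; excluding t_col and id_col as well is an assumption made here because those columns already carry a role in panel data.

```python
import pandas as pd


def default_x_cols(df, y_col, d_cols, t_col, id_col, z_cols=None):
    """Hypothetical helper: columns left over once every assigned role is removed."""
    d_cols = [d_cols] if isinstance(d_cols, str) else list(d_cols)
    if z_cols is None:
        z_cols = []
    elif isinstance(z_cols, str):
        z_cols = [z_cols]
    # Assumption: t_col and id_col are excluded in addition to the documented roles.
    used = {y_col, t_col, id_col, *d_cols, *z_cols}
    # Keep the column order of the DataFrame.
    return [col for col in df.columns if col not in used]


# Toy long-format panel: two units, each observed in two periods.
panel = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "t": [1, 2, 1, 2],
        "y": [0.3, 0.9, 0.1, 0.2],
        "d": [1, 1, 2, 2],  # first period of treatment exposure per unit
        "Z1": [0.5, 0.5, -1.2, -1.2],
        "Z2": [1.0, 1.1, 0.4, 0.3],
    }
)

x_cols = default_x_cols(panel, y_col="y", d_cols="d", t_col="t", id_col="id")
print(x_cols)  # ['Z1', 'Z2']
```

Passing x_cols explicitly, as in the example that follows, avoids relying on this fallback.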
Examples
>>> from doubleml.did.datasets import make_did_CS2021
>>> from doubleml import DoubleMLPanelData
>>> df = make_did_CS2021(n_obs=500)
>>> dml_data = DoubleMLPanelData(
...     df,
...     y_col="y",
...     d_cols="d",
...     id_col="id",
...     t_col="t",
...     x_cols=["Z1", "Z2", "Z3", "Z4"],
...     datetime_unit="M"
... )
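The example passes datetime_unit="M". How DoubleMLPanelData uses this unit internally is not described on this page, so the following is only a sketch of what a datetime unit such as 'M' means in NumPy terms: values are expressed at monthly resolution, so timestamps within the same calendar month become the same period.

```python
import numpy as np

# Daily timestamps that fall into two calendar months.
t_raw = np.array(["2025-01-03", "2025-01-28", "2025-02-14"], dtype="datetime64[D]")

# Casting to a coarser unit truncates each value to its containing month.
t_month = t_raw.astype("datetime64[M]")

print(t_month)                  # ['2025-01' '2025-01' '2025-02']
print(np.unique(t_month).size)  # 2
```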
Methods
from_arrays(x, y, d, t, identifier[, z, s, ...])
Initialize DoubleMLData from numpy.ndarray's.
set_x_d(treatment_var)
Function that assigns the role for the treatment variables in the multiple-treatment case.
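To make the multiple-treatment role assignment concrete, here is a minimal pandas sketch. The helper roles_for_treatment is hypothetical and is not the library's implementation; in particular, appending the other treatment columns after the covariates is an assumption about column order. It shows how the covariate array can grow when use_other_treat_as_covariate is True.

```python
import pandas as pd


def roles_for_treatment(df, x_cols, d_cols, treatment_var, use_other_treat_as_covariate=True):
    """Hypothetical helper returning (d, x) arrays for one active treatment."""
    if treatment_var not in d_cols:
        raise ValueError(f"{treatment_var!r} is not one of the treatment variables {d_cols}")
    d = df[treatment_var].to_numpy()
    # Treatment columns that are not currently active.
    other = [col for col in d_cols if col != treatment_var]
    cols = list(x_cols) + other if use_other_treat_as_covariate else list(x_cols)
    x = df[cols].to_numpy()
    return d, x


df = pd.DataFrame({"x1": [0.1, 0.2, 0.3], "d1": [0, 1, 1], "d2": [1, 0, 1]})

# With d1 active, d2 joins the covariates.
d, x = roles_for_treatment(df, x_cols=["x1"], d_cols=["d1", "d2"], treatment_var="d1")
print(x.shape)  # (3, 2)

# With the option switched off, only the declared covariates remain.
d, x_only = roles_for_treatment(
    df, ["x1"], ["d1", "d2"], "d1", use_other_treat_as_covariate=False
)
print(x_only.shape)  # (3, 1)
```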
Attributes
all_variables
All variables available in the dataset.
binary_outcome
Logical indicating whether the outcome variable is binary with values 0 and 1.
binary_treats
Series with logical(s) indicating whether the treatment variable(s) are binary with values 0 and 1.
d
Array of treatment variable; Dynamic! Depends on the currently set treatment variable; To get an array of all treatment variables (independent of the currently set treatment variable) call obj.data[obj.d_cols].values.
d_cols
The treatment variable(s).
data
The data.
datetime_unit
The unit of the time variable.
force_all_d_finite
Indicates whether to raise an error on infinite values and / or missings in the treatment variables d.
force_all_x_finite
Indicates whether to raise an error on infinite values and / or missings in the covariates x.
g_col
The treatment variable indicating the time of treatment exposure.
g_values
The unique values of the treatment variable (groups) d.
id_col
The id variable.
id_var
Array of id variable.
id_var_unique
Unique values of id variable.
n_coefs
The number of coefficients to be estimated.
n_groups
The number of groups.
n_instr
The number of instruments.
n_obs
The number of observations.
n_t_periods
The number of time periods.
n_treat
The number of treatment variables.
s
Array of score or selection variable.
s_col
The score or selection variable.
t
Array of time variable.
t_col
The time variable.
t_values
The unique values of the time variable t.
use_other_treat_as_covariate
Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates.
x
Array of covariates; Dynamic! May depend on the currently set treatment variable; To get an array of all covariates (independent of the currently set treatment variable) call obj.data[obj.x_cols].values.
x_cols
The covariates.
y
Array of outcome variable.
y_col
The outcome variable.
z
Array of instrumental variables.
z_cols
The instrumental variable(s).
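Several of the attributes above summarize the long-format panel. The sketch below computes analogous quantities directly with pandas and NumPy on a toy frame. The variable names mirror the attribute names for readability only; the library may treat special cases such as never-treated units or missing values differently, so this is an illustration rather than a description of its internals.

```python
import numpy as np
import pandas as pd

# Toy long-format panel: two units observed in three periods each.
panel = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2],
        "t": [1, 2, 3, 1, 2, 3],
        "d": [2, 2, 2, 3, 3, 3],  # first period of treatment exposure per unit
        "y": [0.1, 0.8, 0.9, 0.2, 0.1, 0.7],
    }
)

g_values = np.unique(panel["d"].to_numpy())        # distinct treatment groups
t_values = np.unique(panel["t"].to_numpy())        # distinct time periods
id_var_unique = np.unique(panel["id"].to_numpy())  # distinct units

n_groups = len(g_values)
n_t_periods = len(t_values)

print(g_values)     # [2 3]
print(t_values)     # [1 2 3]
print(n_groups)     # 2
print(n_t_periods)  # 3
```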
- classmethod DoubleMLPanelData.from_arrays(x, y, d, t, identifier, z=None, s=None, use_other_treat_as_covariate=True, force_all_x_finite=True)
Initialize DoubleMLData from numpy.ndarray's.
- Parameters:
x (numpy.ndarray) – Array of covariates.
y (numpy.ndarray) – Array of the outcome variable.
d (numpy.ndarray) – Array of treatment variables.
z (None or numpy.ndarray) – Array of instrumental variables. Default is None.
t (numpy.ndarray) – Array of the time variable (only relevant/used for DiD models). Default is None.
s (numpy.ndarray) – Array of the score or selection variable (only relevant/used for RDD and SSM models). Default is None.
use_other_treat_as_covariate (bool) – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates. Default is True.
force_all_x_finite (bool or str) – Indicates whether to raise an error on infinite values and / or missing values in the covariates x. Possible values are: True (neither missing values np.nan, pd.NA nor infinite values np.inf are allowed), False (missing and infinite values are allowed), 'allow-nan' (only missing values are allowed). Note that the choices False and 'allow-nan' are only reasonable if the machine learning methods used for the nuisance functions are capable of providing valid predictions with missing and / or infinite values in the covariates x. Default is True.
force_all_d_finite (bool or str) – Indicates whether to raise an error on infinite values and / or missing values in the treatment variables d. Possible values are: True (neither missing values np.nan, pd.NA nor infinite values np.inf are allowed), False (missing and infinite values are allowed), 'allow-nan' (only missing values are allowed). Note that the choices False and 'allow-nan' are only reasonable if the model used allows for missing and / or infinite values in the treatment variables d (e.g. panel data models). Default is True.
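The three settings of force_all_x_finite and force_all_d_finite can be read as a small decision rule. The sketch below is a hypothetical re-implementation in NumPy, not the check DoubleML actually runs, and it only covers np.nan and np.inf in float arrays (pd.NA is not handled).

```python
import numpy as np


def check_finite(arr, force_all_finite=True):
    """Hypothetical check mirroring the documented options True, False and 'allow-nan'."""
    arr = np.asarray(arr, dtype=float)
    if force_all_finite is False:
        return  # missing and infinite values are both allowed
    if force_all_finite is True:
        if not np.all(np.isfinite(arr)):
            raise ValueError("array contains missing or infinite values")
    elif force_all_finite == "allow-nan":
        if np.any(np.isinf(arr)):
            raise ValueError("array contains infinite values")
    else:
        raise ValueError("force_all_finite must be True, False or 'allow-nan'")


x_with_nan = np.array([1.0, np.nan, 3.0])

check_finite(x_with_nan, force_all_finite="allow-nan")  # passes: only NaN present
check_finite(x_with_nan, force_all_finite=False)        # passes: nothing is checked

try:
    check_finite(x_with_nan, force_all_finite=True)
except ValueError as exc:
    print("rejected:", exc)
```

As the parameter description notes, relaxing the check only makes sense when the downstream learners or models can handle such values.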
Examples
>>> from doubleml import DoubleMLData
>>> from doubleml.datasets import make_plr_CCDDHNR2018
>>> (x, y, d) = make_plr_CCDDHNR2018(return_type='array')
>>> obj_dml_data_from_array = DoubleMLData.from_arrays(x, y, d)