1.4. doubleml.data.DoubleMLSSMData
- class doubleml.data.DoubleMLSSMData(data, y_col, d_cols, x_cols=None, z_cols=None, s_col=None, cluster_cols=None, use_other_treat_as_covariate=True, force_all_x_finite=True, force_all_d_finite=True)
Double machine learning data-backend for Sample Selection Models.
DoubleMLSSMData objects can be initialized from pandas.DataFrame's as well as numpy.ndarray's.
- Parameters:
  data (pandas.DataFrame) – The data.
  y_col (str) – The outcome variable.
  s_col (str) – The selection variable for sample selection models.
  x_cols (None, str or list) – The covariates. If None, all variables (columns of data) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols, nor selection variable s_col are used as covariates. Default is None.
  z_cols (None, str or list) – The instrumental variable(s). Default is None.
  cluster_cols (None, str or list) – The cluster variable(s). Default is None.
  use_other_treat_as_covariate (bool) – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates. Default is True.
  force_all_x_finite (bool or str) – Indicates whether to raise an error on infinite values and / or missings in the covariates x. Possible values are: True (neither missings np.nan, pd.NA nor infinite values np.inf are allowed), False (missings and infinite values are allowed), 'allow-nan' (only missings are allowed). Note that the choices False and 'allow-nan' are only reasonable if the machine learning methods used for the nuisance functions are capable of providing valid predictions with missings and / or infinite values in the covariates x. Default is True.
  force_all_d_finite (bool) – Indicates whether to raise an error on infinite values and / or missings in the treatment variables d. Default is True.
Examples
>>> from doubleml import DoubleMLSSMData
>>> from doubleml.irm.datasets import make_ssm_data
>>> # initialization from pandas.DataFrame
>>> df = make_ssm_data(return_type='DataFrame')
>>> obj_dml_data_from_df = DoubleMLSSMData(df, 'y', 'd', s_col='s')
>>> # initialization from np.ndarray
>>> (x, y, d, _, s) = make_ssm_data(return_type='array')
>>> obj_dml_data_from_array = DoubleMLSSMData.from_arrays(x, y, d, s=s)
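The default for x_cols described in the parameter list can be pictured with a short sketch. `infer_x_cols` is a hypothetical helper written for illustration, not part of DoubleML, and the column names are made up. It only mimics the documented rule that every column without another role (outcome, treatment, instrument, selection) is used as a covariate; how the library treats cluster columns here is not covered.

```python
def infer_x_cols(columns, y_col, d_cols, z_cols=None, s_col=None):
    """Hypothetical helper: mimic the documented default for x_cols=None."""
    def as_list(value):
        # Accept None, a single column name, or a list of column names.
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    taken = set(as_list(y_col) + as_list(d_cols) + as_list(z_cols) + as_list(s_col))
    # Keep the original column order of the data.
    return [col for col in columns if col not in taken]


columns = ["X1", "X2", "y", "d", "z", "s"]
x_cols = infer_x_cols(columns, y_col="y", d_cols="d", z_cols="z", s_col="s")
print(x_cols)
```

With the made-up columns above, only `X1` and `X2` remain as covariates, because every other column already has a role.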
Methods
from_arrays(x, y, d[, z, s, cluster_vars, ...]) – Initialize DoubleMLSSMData object from numpy.ndarray's.
set_x_d(treatment_var) – Function that assigns the role for the treatment variables in the multiple-treatment case.
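How use_other_treat_as_covariate interacts with the treatment roles in the multiple-treatment case can be sketched on column names alone. `assign_roles` is a hypothetical stand-in, not the library's `set_x_d`, and it assumes that the inactive treatments are appended after the regular covariates.

```python
def assign_roles(x_cols, d_cols, treatment_var, use_other_treat_as_covariate=True):
    """Hypothetical stand-in: return (active treatment, covariate names)."""
    if treatment_var not in d_cols:
        raise ValueError(f"{treatment_var!r} is not a treatment variable")
    covariates = list(x_cols)
    if use_other_treat_as_covariate:
        # Treatments that are not currently active join the covariates
        # (assumed ordering: regular covariates first).
        covariates += [d for d in d_cols if d != treatment_var]
    return treatment_var, covariates


x_cols, d_cols = ["X1", "X2"], ["d1", "d2"]
active, covariates = assign_roles(x_cols, d_cols, "d1")
print(active, covariates)

active_only, plain = assign_roles(x_cols, d_cols, "d2", use_other_treat_as_covariate=False)
print(active_only, plain)
```

This is the sense in which the attributes d and x are described as dynamic: under this sketch, switching the active treatment changes which columns count as covariates.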
Attributes
all_variables – All variables available in the dataset.
binary_outcome – Logical indicating whether the outcome variable is binary with values 0 and 1.
binary_treats – Series with logical(s) indicating whether the treatment variable(s) are binary with values 0 and 1.
cluster_cols – The cluster variable(s).
cluster_vars – Array of cluster variable(s).
d – Array of treatment variable; Dynamic! Depends on the currently set treatment variable; To get an array of all treatment variables (independent of the currently set treatment variable) call obj.data[obj.d_cols].values.
d_cols – The treatment variable(s).
data – The data.
force_all_d_finite – Indicates whether to raise an error on infinite values and / or missings in the treatment variables d.
force_all_x_finite – Indicates whether to raise an error on infinite values and / or missings in the covariates x.
is_cluster_data – Flag indicating whether this data object is being used for cluster data.
n_cluster_vars – The number of cluster variables.
n_coefs – The number of coefficients to be estimated.
n_instr – The number of instruments.
n_obs – The number of observations.
n_treat – The number of treatment variables.
s – Array of score or selection variable.
s_col – The selection variable.
use_other_treat_as_covariate – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates.
x – Array of covariates; Dynamic! May depend on the currently set treatment variable; To get an array of all covariates (independent of the currently set treatment variable) call obj.data[obj.x_cols].values.
x_cols – The covariates.
y – Array of outcome variable.
y_col – The outcome variable.
z – Array of instrumental variables.
z_cols – The instrumental variable(s).
- classmethod DoubleMLSSMData.from_arrays(x, y, d, z=None, s=None, cluster_vars=None, use_other_treat_as_covariate=True, force_all_x_finite=True, force_all_d_finite=True)
Initialize DoubleMLSSMData object from numpy.ndarray's.
- Parameters:
  x (numpy.ndarray) – Array of covariates.
  y (numpy.ndarray) – Array of the outcome variable.
  d (numpy.ndarray) – Array of treatment variables.
  s (numpy.ndarray) – Array of the selection variable for sample selection models.
  z (None or numpy.ndarray) – Array of instrumental variables. Default is None.
  cluster_vars (None or numpy.ndarray) – Array of cluster variables. Default is None.
  use_other_treat_as_covariate (bool) – Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates. Default is True.
  force_all_x_finite (bool or str) – Indicates whether to raise an error on infinite values and / or missings in the covariates x. Possible values are: True (neither missings np.nan, pd.NA nor infinite values np.inf are allowed), False (missings and infinite values are allowed), 'allow-nan' (only missings are allowed). Note that the choices False and 'allow-nan' are only reasonable if the machine learning methods used for the nuisance functions are capable of providing valid predictions with missings and / or infinite values in the covariates x. Default is True.
  force_all_d_finite (bool) – Indicates whether to raise an error on infinite values and / or missings in the treatment variables d. Default is True.
Examples
>>> from doubleml import DoubleMLSSMData
>>> from doubleml.irm.datasets import make_ssm_data
>>> (x, y, d, _, s) = make_ssm_data(return_type='array')
>>> obj_dml_data_from_array = DoubleMLSSMData.from_arrays(x, y, d, s=s)
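The three force_all_x_finite modes listed in the parameters can be illustrated with a small standalone check. `check_covariates` and `is_accepted` are hypothetical helpers, not DoubleML functions; the sketch works on plain Python floats only (so `pd.NA` is not covered) and says nothing about how the library performs its validation internally.

```python
import math


def check_covariates(values, force_all_x_finite=True):
    """Hypothetical check mirroring the three documented modes."""
    if force_all_x_finite not in (True, False, "allow-nan"):
        raise ValueError("force_all_x_finite must be True, False or 'allow-nan'")
    if force_all_x_finite is False:
        return  # missings and infinite values are both allowed
    for value in values:
        if math.isinf(value):
            # Infinite values are rejected under True and 'allow-nan'.
            raise ValueError("infinite value in covariates")
        if math.isnan(value) and force_all_x_finite is True:
            # Missings are rejected only under True.
            raise ValueError("missing value in covariates")


def is_accepted(values, mode):
    """Return True if the hypothetical check passes for the given mode."""
    try:
        check_covariates(values, mode)
        return True
    except ValueError:
        return False


row_with_nan = [0.5, float("nan")]
row_with_inf = [0.5, float("inf")]
for mode in (True, "allow-nan", False):
    print(mode, is_accepted(row_with_nan, mode), is_accepted(row_with_inf, mode))
```

Under these assumptions, a missing value passes with 'allow-nan' and False, while an infinite value passes only with False.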