2. Data Backend#
DoubleML provides a unified data interface via the doubleml.data module.
It supports both pandas.DataFrame objects and numpy.ndarray arrays and now allows
clustered data to be handled directly via DoubleMLData.
2.1. DoubleMLData#
The usage of both interfaces is demonstrated in the following. We download the Bonus data set from the Pennsylvania Reemployment Bonus experiment.
Note
In Python we use pandas.DataFrame and numpy.ndarray. The data can be fetched via doubleml.datasets.fetch_bonus().
In R we use data.table::data.table(), data.frame(), and matrix(). The data can be fetched via DoubleML::fetch_bonus().
Important
Cluster-robust analyses no longer require the dedicated DoubleMLClusterData backend.
Use doubleml.DoubleMLData with the cluster_cols argument (or cluster_vars for arrays)
instead. The wrapper DoubleMLClusterData remains available for backwards
compatibility but is deprecated and scheduled for removal with version 0.12.0.
In [1]: from doubleml.datasets import fetch_bonus
# Load data
In [2]: df_bonus = fetch_bonus('DataFrame')
In [3]: df_bonus.head(5)
Out[3]:
index abdt tg inuidur1 inuidur2 ... lusd husd muld dep1 dep2
0 0 10824 0 2.890372 18 ... 0 1 0 0.0 1.0
1 3 10824 0 0.000000 1 ... 1 0 0 0.0 0.0
2 4 10747 0 3.295837 27 ... 1 0 0 0.0 0.0
3 11 10607 1 2.197225 9 ... 0 0 1 0.0 0.0
4 12 10831 0 3.295837 27 ... 1 0 0 1.0 0.0
[5 rows x 26 columns]
library(DoubleML)
# Load data as data.table
dt_bonus = fetch_bonus(return_type = "data.table")
head(dt_bonus)
# Load data as data.frame
df_bonus = fetch_bonus(return_type = "data.frame")
head(df_bonus)
| inuidur1 | female | black | othrace | dep1 | dep2 | q2 | q3 | q4 | q5 | q6 | agelt35 | agegt54 | durable | lusd | husd | tg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 2.890372 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3.295837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2.197225 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3.295837 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 3.295837 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| | inuidur1 | female | black | othrace | dep1 | dep2 | q2 | q3 | q4 | q5 | q6 | agelt35 | agegt54 | durable | lusd | husd | tg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 2.890372 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 3.295837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 12 | 2.197225 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 13 | 3.295837 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 14 | 3.295837 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
2.1.1. DoubleMLData from dataframes#
The DoubleMLData class serves as the data backend and can be initialized from a dataframe by
specifying the column y_col='inuidur1' serving as the outcome variable \(Y\), the column(s) d_cols='tg'
serving as the treatment variable \(D\), and the columns x_cols specifying the confounders.
Note
In Python we use pandas.DataFrame; see the API reference for doubleml.DoubleMLData.
In R we use data.table::data.table(); see the API reference for DoubleML::DoubleMLData.
For initialization from the R base class data.frame(), see the API reference for DoubleML::double_ml_data_from_data_frame().
In [4]: from doubleml import DoubleMLData
# Specify the data and the variables for the causal model
In [5]: obj_dml_data_bonus = DoubleMLData(df_bonus,
...: y_col='inuidur1',
...: d_cols='tg',
...: x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
...: 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
...: 'durable', 'lusd', 'husd'],
...: use_other_treat_as_covariate=True)
...:
In [6]: print(obj_dml_data_bonus)
================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Columns: 26 entries, index to dep2
dtypes: float64(3), int64(23)
memory usage: 1.0 MB
# Specify the data and the variables for the causal model
# From data.table object
obj_dml_data_bonus = DoubleMLData$new(dt_bonus,
y_col = "inuidur1",
d_cols = "tg",
x_cols = c("female", "black", "othrace", "dep1", "dep2",
"q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
"durable", "lusd", "husd"),
use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus
# From data.frame object
obj_dml_data_bonus_df = double_ml_data_from_data_frame(df_bonus,
y_col = "inuidur1",
d_cols = "tg",
x_cols = c("female", "black", "othrace", "dep1", "dep2",
"q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
"durable", "lusd", "husd"),
use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus_df
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s):
Selection variable:
No. Observations: 5099
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s):
Selection variable:
No. Observations: 5099
Comments on detailed specifications:
- If x_cols is not specified, all variables (columns of the dataframe) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates.
- In case of multiple treatment variables, the boolean use_other_treat_as_covariate indicates whether the other treatment variables should be added as covariates in each treatment-variable-specific learning task.
- Instrumental variables for IV models have to be provided as z_cols.
- Cluster variables can directly be added via cluster_cols; they must be distinct from all variables in y_col, d_cols, x_cols and z_cols. The object exposes cluster_cols and cluster_vars properties for convenience.
- The optional force_all_d_finite flag mirrors force_all_x_finite and controls how missing or infinite values in the treatment variables are handled, which is especially relevant for panel models.
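The column-role rules above can be sketched in plain Python. This is an illustrative reading of the rules, not DoubleML's implementation: the helpers resolve_covariates and features_per_treatment are hypothetical names, and excluding cluster columns from the default covariate set is an assumption.

```python
from typing import Dict, List, Optional, Sequence


def resolve_covariates(
    columns: Sequence[str],
    y_col: str,
    d_cols: Sequence[str],
    x_cols: Optional[Sequence[str]] = None,
    z_cols: Optional[Sequence[str]] = None,
    cluster_cols: Optional[Sequence[str]] = None,
) -> List[str]:
    """Hypothetical helper mirroring the default rule for x_cols."""
    z_cols = list(z_cols or [])
    cluster_cols = list(cluster_cols or [])
    if x_cols is None:
        # Default: every column without another role becomes a covariate.
        # Assumption: cluster columns are excluded from the default as well.
        taken = {y_col, *d_cols, *z_cols, *cluster_cols}
        x_cols = [c for c in columns if c not in taken]
    # Cluster columns must be distinct from y_col, d_cols, x_cols and z_cols.
    overlap = set(cluster_cols) & {y_col, *d_cols, *x_cols, *z_cols}
    if overlap:
        raise ValueError(f"cluster columns reused in another role: {sorted(overlap)}")
    return list(x_cols)


def features_per_treatment(
    d_cols: Sequence[str],
    x_cols: Sequence[str],
    use_other_treat_as_covariate: bool = True,
) -> Dict[str, List[str]]:
    """Feature set used in each treatment-variable-specific learning task."""
    tasks = {}
    for d in d_cols:
        others = [o for o in d_cols if o != d] if use_other_treat_as_covariate else []
        tasks[d] = list(x_cols) + others
    return tasks


columns = ["y", "d1", "d2", "x1", "x2", "z", "cl"]
x_cols = resolve_covariates(columns, "y", ["d1", "d2"], z_cols=["z"], cluster_cols=["cl"])
print(x_cols)  # ['x1', 'x2']
print(features_per_treatment(["d1", "d2"], x_cols))
# {'d1': ['x1', 'x2', 'd2'], 'd2': ['x1', 'x2', 'd1']}
```

With use_other_treat_as_covariate=False, each task would use only x1 and x2.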
2.1.2. DoubleMLData from arrays and matrices#
To introduce the array interface, we generate a data set consisting of confounding variables X, an outcome
variable y and a treatment variable d.
Note
In Python we use numpy.ndarray; see the API reference for doubleml.DoubleMLData.from_arrays().
In R we use the R base class matrix(); see the API reference for DoubleML::double_ml_data_from_matrix().
In [7]: import numpy as np
# Generate data
In [8]: np.random.seed(3141)
In [9]: n_obs = 500
In [10]: n_vars = 100
In [11]: theta = 3
In [12]: X = np.random.normal(size=(n_obs, n_vars))
In [13]: d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
In [14]: y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
# Generate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(stats::rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
y = theta * d + X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
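Written out, the simulation code above corresponds to the following data generating process. The symbols \(u_i\) and \(v_i\) are introduced here for the two independent standard normal noise draws and do not appear in the code:

\[
d_i = 5\,(X_{i1} + X_{i2} + X_{i3}) + v_i, \qquad
y_i = \theta\, d_i + 5\,(X_{i1} + X_{i2} + X_{i3}) + u_i, \qquad \theta = 3,
\]

with \(X_{ij} \sim \mathcal{N}(0, 1)\) for \(i = 1, \ldots, 500\) and \(j = 1, \ldots, 100\). Only the first three of the 100 columns of \(X\) enter both equations, so they are the only true confounders in this example.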
To specify the data and the variables for the causal model from arrays, we call:
In [15]: from doubleml import DoubleMLData
In [16]: obj_dml_data_sim = DoubleMLData.from_arrays(X, y, d)
In [17]: print(obj_dml_data_sim)
================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 102 entries, X1 to d
dtypes: float64(102)
memory usage: 398.6 KB
library(DoubleML)
obj_dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
obj_dml_data_sim
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s):
Selection variable:
No. Observations: 500
In Python, cluster assignments can be supplied through the optional cluster_cols argument (or cluster_vars in the from_arrays method).
In R, one has to create a DoubleMLClusterData object instead of DoubleMLData to use clustering.
In [18]: cluster_vars = (np.arange(n_obs) // 5).reshape(-1, 1)
In [19]: obj_dml_data_sim_cluster = DoubleMLData.from_arrays(X, y, d, cluster_vars=cluster_vars)
In [20]: obj_dml_data_sim_cluster.cluster_cols
Out[20]: ['cluster_var']
In [21]: print(obj_dml_data_sim_cluster)
================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
Cluster variable(s): ['cluster_var']
Is cluster data: True
No. Observations: 500
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 103 entries, X1 to cluster_var
dtypes: float64(102), int64(1)
memory usage: 402.5 KB
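The integer division in the cluster example above assigns consecutive observations to blocks of five. A minimal pure-Python sketch of the same arithmetic (no NumPy; the names cluster_size and sizes are illustrative only) makes the resulting structure explicit:

```python
from collections import Counter

n_obs = 500
cluster_size = 5  # block length used in the example above

# Pure-Python equivalent of np.arange(n_obs) // 5
cluster_ids = [i // cluster_size for i in range(n_obs)]

# Number of observations per cluster
sizes = Counter(cluster_ids)
n_clusters = len(sizes)

print(cluster_ids[:7])      # [0, 0, 0, 0, 0, 1, 1]
print(n_clusters)           # 100
print(set(sizes.values()))  # {5}
```

In the NumPy version, reshape(-1, 1) turns these ids into a column of shape (500, 1), which from_arrays stores as the single cluster column shown in the output above.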
2.2. Special Data Types#
The DoubleMLData class is extended by the following classes to support special data types or allow for additional parameters.
2.2.1. DoubleMLDIDData#
The DoubleMLDIDData class tailors DoubleMLData to difference-in-differences
applications. It handles both panel settings and repeated cross-sections by tracking an optional time indicator.
2.2.1.1. Key arguments#
- t_col: column containing the time variable for repeated cross-sections. It must be distinct from y_col, d_cols, x_cols, z_cols and cluster_cols.
- cluster_cols: optional cluster identifiers inherited from doubleml.DoubleMLData.
- force_all_d_finite: controls how missing or infinite treatment values are handled. For standard DiD applications it defaults to True.
DoubleMLDIDData exposes additional helpers such as the t property and
an extended from_arrays constructor that accepts the t array (and
cluster_vars) alongside the standard covariates.
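To make the role of force_all_d_finite more concrete, the following standard-library sketch shows one plausible reading of the flag. It assumes three modes, True, False and 'allow-nan', in the style of scikit-learn's input validation; this is an assumption, so consult the API reference for the values actually accepted. The function check_treatment_values is a hypothetical helper and not part of DoubleML.

```python
import math
from typing import Iterable, Union


def check_treatment_values(
    d_values: Iterable[float],
    force_all_d_finite: Union[bool, str] = True,
) -> None:
    """Hypothetical validator sketching the flag's semantics (not DoubleML code)."""
    if force_all_d_finite not in (True, False, "allow-nan"):
        raise ValueError("force_all_d_finite must be True, False or 'allow-nan'")
    if force_all_d_finite is False:
        return  # no checks at all
    for value in d_values:
        if math.isnan(value):
            if force_all_d_finite == "allow-nan":
                continue  # missing values are tolerated in this mode
            raise ValueError("treatment contains missing values")
        if math.isinf(value):
            # infinite values are rejected in both checking modes
            raise ValueError("treatment contains infinite values")


check_treatment_values([0.0, 1.0, 1.0])  # passes silently
check_treatment_values([0.0, float("nan")], force_all_d_finite="allow-nan")  # passes
```

Under this reading, the default True rejects both missing and infinite treatment values, which matches the strict default described above for standard DiD applications.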
2.2.1.2. Example usage#
In [1]: import doubleml as dml
In [2]: from doubleml.did.datasets import make_did_SZ2020
In [3]: df = make_did_SZ2020(n_obs=500, return_type="DataFrame")
In [4]: print(df.head())
Z1 Z2 Z3 Z4 y d
0 -0.044447 -0.088288 -0.538513 -0.166322 197.508756 1.0
1 0.275538 0.485468 -0.658612 0.366188 219.146214 0.0
2 -0.659799 1.107467 1.131928 0.689542 231.559712 1.0
3 1.000916 -0.471454 -2.471666 -1.128412 179.077920 1.0
4 0.425414 -0.079761 0.382024 0.179548 227.360965 1.0
In [5]: dml_data = dml.DoubleMLDIDData(
...: df,
...: y_col="y",
...: d_cols="d",
...: )
...:
# from arrays
In [6]: x, y, d, t = make_did_SZ2020(n_obs=200, return_type="array")
In [7]: dml_data_arrays = dml.DoubleMLDIDData.from_arrays(x, y, d)
In [8]: print(dml_data)
================== DoubleMLDIDData Object ==================
Time variable: None
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
No. Observations: 500
2.2.2. DoubleMLPanelData#
The DoubleMLPanelData class serves as data-backend for DiD models and can be initialized from a dataframe.
The class is a subclass of DoubleMLData and inherits all methods and attributes.
Furthermore, it provides additional methods and attributes to handle panel data.
2.2.2.1. Key arguments#
- id_col: column with unique identifiers for each unit
- t_col: column specifying the time period of each observation
- datetime_unit: unit of the time periods (e.g. 'Y', 'M', 'D', 'h', 'm', 's')
Note
The t_col can contain float, int or datetime values.
2.2.2.2. Example usage#
In [1]: import numpy as np
In [2]: import doubleml as dml
In [3]: from doubleml.did.datasets import make_did_CS2021
In [4]: np.random.seed(42)
In [5]: df = make_did_CS2021(n_obs=500)
In [6]: dml_data = dml.data.DoubleMLPanelData(
...: df,
...: y_col="y",
...: d_cols="d",
...: id_col="id",
...: t_col="t",
...: x_cols=["Z1", "Z2", "Z3", "Z4"],
...: datetime_unit="M"
...: )
...:
In [7]: print(dml_data)
================== DoubleMLPanelData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
Id variable: id
No. Unique Ids: 500
No. Observations: 2500
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Columns: 10 entries, id to Z4
dtypes: datetime64[s](2), float64(7), int64(1)
memory usage: 195.4 KB
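The summary above reports 500 unique ids and 2500 observations, which is five rows per unit on average. The long format behind id_col and t_col (one row per unit and period) can be illustrated with a small standard-library sketch; panel_summary is a hypothetical helper and the toy rows are made up:

```python
from collections import defaultdict


def panel_summary(rows, id_col="id", t_col="t"):
    """Count units, periods and rows of a long-format panel (illustrative helper)."""
    periods_by_id = defaultdict(set)
    for row in rows:
        unit, period = row[id_col], row[t_col]
        if period in periods_by_id[unit]:
            raise ValueError(f"duplicate observation for id={unit!r}, t={period!r}")
        periods_by_id[unit].add(period)
    all_periods = set().union(*periods_by_id.values())
    return {
        "n_ids": len(periods_by_id),
        "n_periods": len(all_periods),
        "n_obs": len(rows),
        # balanced: every unit is observed in every period
        "balanced": all(p == all_periods for p in periods_by_id.values()),
    }


# Toy data: 2 units observed in 3 monthly periods each (values are made up)
rows = [
    {"id": unit, "t": period, "y": 0.0}
    for unit in (1, 2)
    for period in ("2025-01", "2025-02", "2025-03")
]
summary = panel_summary(rows)
print(summary)
# {'n_ids': 2, 'n_periods': 3, 'n_obs': 6, 'balanced': True}
```

In a balanced panel the number of observations equals the number of ids times the number of periods, here 2 x 3 = 6.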
2.2.3. DoubleMLRDDData#
The DoubleMLRDDData class specialises DoubleMLData for
regression discontinuity designs. In addition to the standard causal roles it
tracks a mandatory running variable.
2.2.3.1. Key arguments#
- score_col: column with the running/score variable.
- cluster_cols: optional cluster identifiers inherited from the base data class.
- from_arrays: expects an additional score array alongside x, y and d.
DoubleMLRDDData ensures that the running variable is kept separate from the
other feature sets and exposes the score property for convenient access.
2.2.3.2. Example usage#
In [1]: import doubleml as dml
In [2]: from doubleml.rdd.datasets import make_simple_rdd_data
In [3]: dict_rdd = make_simple_rdd_data(n_obs=500, return_type="DataFrame")
In [4]: dml_data = dml.DoubleMLRDDData.from_arrays(
...: x=dict_rdd["X"],
...: y=dict_rdd["Y"],
...: d=dict_rdd["D"],
...: score=dict_rdd["score"]
...: )
...:
In [5]: print(dml_data)
================== DoubleMLRDDData Object ==================
Score variable: score
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3']
Instrument variable(s): None
No. Observations: 500
2.2.4. DoubleMLSSMData#
The DoubleMLSSMData class covers the sample selection model backend.
It extends DoubleMLData with a dedicated selection indicator and inherits support for clustered data.
2.2.4.1. Key arguments#
- s_col: column containing the selection indicator.
- cluster_cols: optional cluster identifiers.
- from_arrays: expects an additional s array together with x, y and d.
The object exposes the s property and keeps the selection indicator
separate from covariates and treatment variables.
2.2.4.2. Example usage#
In [1]: import doubleml as dml
In [2]: from doubleml.irm.datasets import make_ssm_data
In [3]: df = make_ssm_data(n_obs=500, return_type="DataFrame")
In [4]: dml_data = dml.DoubleMLSSMData(
...: df,
...: y_col="y",
...: d_cols="d",
...: s_col="s"
...: )
...:
In [5]: x, y, d, _, s = make_ssm_data(n_obs=200, return_type="array")
In [6]: dml_data_arrays = dml.DoubleMLSSMData.from_arrays(x, y, d, s=s)
In [7]: print(dml_data)
================== DoubleMLSSMData Object ==================
Selection variable: s
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500