2. Data Backend

DoubleML provides a unified data interface via the doubleml.data module. It supports both pandas.DataFrame objects and numpy.ndarray arrays, and clustered data can now be handled directly via DoubleMLData.

2.1. DoubleMLData

Both interfaces are demonstrated below. As an example, we download the Bonus data set from the Pennsylvania Reemployment Bonus experiment.

Note

In Python we use pandas.DataFrame and numpy.ndarray. The data can be fetched via doubleml.datasets.fetch_bonus().
In R we use data.table::data.table(), data.frame(), and matrix(). The data can be fetched via DoubleML::fetch_bonus().

Important

Cluster-robust analyses no longer require the dedicated DoubleMLClusterData backend. Use doubleml.DoubleMLData with the cluster_cols argument (or cluster_vars for arrays) instead. The wrapper DoubleMLClusterData remains available for backwards compatibility but is deprecated and scheduled for removal in version 0.12.0.

Python

In [1]: from doubleml.datasets import fetch_bonus

# Load data
In [2]: df_bonus = fetch_bonus('DataFrame')

In [3]: df_bonus.head(5)
Out[3]: 
   index   abdt  tg  inuidur1  inuidur2  ...  lusd  husd  muld  dep1  dep2
0      0  10824   0  2.890372        18  ...     0     1     0   0.0   1.0
1      3  10824   0  0.000000         1  ...     1     0     0   0.0   0.0
2      4  10747   0  3.295837        27  ...     1     0     0   0.0   0.0
3     11  10607   1  2.197225         9  ...     0     0     1   0.0   0.0
4     12  10831   0  3.295837        27  ...     1     0     0   1.0   0.0

[5 rows x 26 columns]

R

library(DoubleML)

# Load data as data.table
dt_bonus = fetch_bonus(return_type = "data.table")
head(dt_bonus)

# Load data as data.frame
df_bonus = fetch_bonus(return_type = "data.frame")
head(df_bonus)

A data.table: 6 × 17
inuidur1	female	black	othrace	dep1	dep2	q2	q3	q4	q5	q6	agelt35	agegt54	durable	lusd	husd	tg
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
2.890372	0	0	0	0	1	0	0	0	1	0	0	0	0	0	1	0
0.000000	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0
3.295837	0	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0
2.197225	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	1
3.295837	0	0	0	1	0	0	0	0	1	0	0	1	1	1	0	0
3.295837	1	0	0	0	0	0	0	0	1	0	0	1	0	1	0	0

A data.frame: 6 × 17
	inuidur1	female	black	othrace	dep1	dep2	q2	q3	q4	q5	q6	agelt35	agegt54	durable	lusd	husd	tg
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	2.890372	0	0	0	0	1	0	0	0	1	0	0	0	0	0	1	0
4	0.000000	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0
5	3.295837	0	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0
12	2.197225	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	1
13	3.295837	0	0	0	1	0	0	0	0	1	0	0	1	1	1	0	0
14	3.295837	1	0	0	0	0	0	0	0	1	0	0	1	0	1	0	0

2.1.1. DoubleMLData from dataframes

The DoubleMLData class serves as the data backend. It can be initialized from a dataframe by specifying the column y_col='inuidur1' as the outcome variable \(Y\), the column(s) d_cols='tg' as the treatment variable \(D\), and the columns x_cols as the confounders.

Note

In Python we use pandas.DataFrame and the API reference can be found here doubleml.DoubleMLData.
In R we use data.table::data.table() and the API reference can be found here DoubleML::DoubleMLData.
For initialization from the R base class data.frame() the API reference can be found here DoubleML::double_ml_data_from_data_frame().

Python

In [4]: from doubleml import DoubleMLData

# Specify the data and the variables for the causal model
In [5]: obj_dml_data_bonus = DoubleMLData(df_bonus,
   ...:                                   y_col='inuidur1',
   ...:                                   d_cols='tg',
   ...:                                   x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
   ...:                                           'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
   ...:                                           'durable', 'lusd', 'husd'],
   ...:                                   use_other_treat_as_covariate=True)
   ...: 

In [6]: print(obj_dml_data_bonus)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099
------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Columns: 26 entries, index to dep2
dtypes: float64(3), int64(23)
memory usage: 1.0 MB

R

# Specify the data and the variables for the causal model

# From data.table object
obj_dml_data_bonus = DoubleMLData$new(dt_bonus,
                            y_col = "inuidur1",
                            d_cols = "tg",
                            x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                          "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                          "durable", "lusd", "husd"),
                            use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus

# From data.frame object
obj_dml_data_bonus_df = double_ml_data_from_data_frame(df_bonus,
                            y_col = "inuidur1",
                            d_cols = "tg",
                            x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                          "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                          "durable", "lusd", "husd"),
                            use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus_df

================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s): 
Selection variable: 
No. Observations: 5099

================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s): 
Selection variable: 
No. Observations: 5099

Comments on detailed specifications:

If x_cols is not specified, all variables (columns of the dataframe) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates.
In case of multiple treatment variables, the boolean use_other_treat_as_covariate indicates whether the other treatment variables should be added as covariates in each treatment-variable-specific learning task.
Instrumental variables for IV models have to be provided as z_cols.
Cluster variables can directly be added via cluster_cols; they must be distinct from all variables in y_col, d_cols, x_cols and z_cols. The object exposes cluster_cols and cluster_vars properties for convenience.
The optional force_all_d_finite flag mirrors force_all_x_finite and controls how missing or infinite values in the treatment variables are handled, which is especially relevant for panel models.
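
The first two rules above can be sketched in plain Python. The helper names default_x_cols and covariates_per_treatment and the toy column names are hypothetical; they only illustrate the selection logic described here and are not part of the DoubleML API.

```python
def default_x_cols(columns, y_col, d_cols, z_cols=None):
    """Columns that are neither outcome, treatment, nor instrument."""
    used = {y_col, *d_cols, *(z_cols or [])}
    return [col for col in columns if col not in used]


def covariates_per_treatment(x_cols, d_cols, use_other_treat_as_covariate=True):
    """Covariates entering each treatment-variable-specific learning task."""
    tasks = {}
    for d_col in d_cols:
        others = [d for d in d_cols if d != d_col]
        extra = others if use_other_treat_as_covariate else []
        tasks[d_col] = list(x_cols) + extra
    return tasks


# Toy column layout (hypothetical names)
columns = ["y", "d1", "d2", "z", "x1", "x2"]
x_cols = default_x_cols(columns, y_col="y", d_cols=["d1", "d2"], z_cols=["z"])
tasks = covariates_per_treatment(x_cols, d_cols=["d1", "d2"])

print(x_cols)  # ['x1', 'x2']
print(tasks)   # {'d1': ['x1', 'x2', 'd2'], 'd2': ['x1', 'x2', 'd1']}
```

With use_other_treat_as_covariate=False, each task in this sketch would use only x1 and x2.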

2.1.2. DoubleMLData from arrays and matrices

To introduce the array interface, we generate a data set consisting of confounding variables X, an outcome variable y and a treatment variable d.

Note

In Python we use numpy.ndarray and the API reference can be found here doubleml.DoubleMLData.from_arrays().
In R we use the R base class matrix() and the API reference can be found here DoubleML::double_ml_data_from_matrix().

Python

In [7]: import numpy as np

# Generate data
In [8]: np.random.seed(3141)

In [9]: n_obs = 500

In [10]: n_vars = 100

In [11]: theta = 3

In [12]: X = np.random.normal(size=(n_obs, n_vars))

In [13]: d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))

In [14]: y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))

R

# Generate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(stats::rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
y = theta * d + X[, 1:3, drop = FALSE] %*% c(5, 5, 5)  + stats::rnorm(n_obs)
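
As a quick sanity check of the Python simulation above, the following sketch regresses y on d and the three relevant confounders by ordinary least squares. It uses only NumPy and does not call DoubleML. Because the true confounders are known in this simulation, the coefficient on d should be close to theta = 3.

```python
import numpy as np

# Same data-generating process as in the Python example above
np.random.seed(3141)
n_obs, n_vars, theta = 500, 100, 3
X = np.random.normal(size=(n_obs, n_vars))
d = X[:, :3] @ np.array([5, 5, 5]) + np.random.standard_normal(size=(n_obs,))
y = theta * d + X[:, :3] @ np.array([5, 5, 5]) + np.random.standard_normal(size=(n_obs,))

# Design matrix: intercept, treatment, and the three true confounders
design = np.column_stack([np.ones(n_obs), d, X[:, :3]])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
theta_hat = coef[1]  # coefficient on the treatment variable

print(f"OLS estimate of theta: {theta_hat:.3f}")
```

With 100 candidate covariates and no knowledge of which ones matter, this shortcut is not available in practice.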

To specify the data and the variables for the causal model from arrays we call

Python

In [15]: from doubleml import DoubleMLData

In [16]: obj_dml_data_sim = DoubleMLData.from_arrays(X, y, d)

In [17]: print(obj_dml_data_sim)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500
------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 102 entries, X1 to d
dtypes: float64(102)
memory usage: 398.6 KB

R

library(DoubleML)

obj_dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
obj_dml_data_sim

================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s): 
Selection variable: 
No. Observations: 500

In Python, cluster assignments can be supplied through the optional cluster_cols argument (or cluster_vars in the from_arrays method). In R, one has to create a DoubleMLClusterData object instead of DoubleMLData to use clustering.

Python

In [18]: cluster_vars = (np.arange(n_obs) // 5).reshape(-1, 1)

In [19]: obj_dml_data_sim_cluster = DoubleMLData.from_arrays(X, y, d, cluster_vars=cluster_vars)

In [20]: obj_dml_data_sim_cluster.cluster_cols
Out[20]: ['cluster_var']

In [21]: print(obj_dml_data_sim_cluster)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
Cluster variable(s): ['cluster_var']
Is cluster data: True
No. Observations: 500
------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 103 entries, X1 to cluster_var
dtypes: float64(102), int64(1)
memory usage: 402.5 KB
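
The integer division in np.arange(n_obs) // 5 assigns every five consecutive observations to the same cluster. A short NumPy-only check of the resulting layout:

```python
import numpy as np

n_obs = 500
# Column vector of cluster ids: 0,0,0,0,0,1,1,1,1,1,...
cluster_vars = (np.arange(n_obs) // 5).reshape(-1, 1)

cluster_ids, cluster_sizes = np.unique(cluster_vars, return_counts=True)

print(cluster_vars.shape)            # (500, 1)
print(len(cluster_ids))              # 100
print(set(cluster_sizes.tolist()))   # {5}
```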

2.2. Special Data Types

The DoubleMLData class is extended by the following classes to support special data types or allow for additional parameters.

2.2.1. DoubleMLDIDData

The DoubleMLDIDData class tailors DoubleMLData to difference-in-differences applications. It handles both panel settings and repeated cross-sections by tracking an optional time indicator.

2.2.1.1. Key arguments

t_col: column containing the time variable for repeated cross-sections. It must be distinct from y_col, d_cols, x_cols, z_cols and cluster_cols.
cluster_cols: optional cluster identifiers inherited from doubleml.DoubleMLData.
force_all_d_finite: controls how missing or infinite treatment values are handled. For standard DiD applications it defaults to True.

DoubleMLDIDData exposes additional helpers such as the t property and an extended from_arrays constructor that accepts the t array (and cluster_vars) alongside the standard covariates.
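
The requirement that t_col must not coincide with any other role can be pictured as a check that no column serves two roles. The sketch below is plain Python with a hypothetical helper check_disjoint_roles. DoubleML performs its own validation internally, and its error messages will differ.

```python
from itertools import combinations


def check_disjoint_roles(roles):
    """Raise ValueError if a column name is assigned to more than one role."""
    for (name_a, cols_a), (name_b, cols_b) in combinations(roles.items(), 2):
        overlap = set(cols_a) & set(cols_b)
        if overlap:
            raise ValueError(f"{sorted(overlap)} used as both {name_a} and {name_b}")


roles = {
    "y_col": ["y"],
    "d_cols": ["d"],
    "x_cols": ["Z1", "Z2", "Z3", "Z4"],
    "t_col": ["t"],
}
check_disjoint_roles(roles)  # passes silently

try:
    # Reusing a covariate as the time variable violates the rule
    check_disjoint_roles({**roles, "t_col": ["Z1"]})
except ValueError as err:
    message = str(err)
    print(message)  # ['Z1'] used as both x_cols and t_col
```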

2.2.1.2. Example usage

Python

In [1]: import doubleml as dml

In [2]: from doubleml.did.datasets import make_did_SZ2020

In [3]: df = make_did_SZ2020(n_obs=500, return_type="DataFrame")

In [4]: print(df.head())
         Z1        Z2        Z3        Z4           y    d
0 -0.044447 -0.088288 -0.538513 -0.166322  197.508756  1.0
1  0.275538  0.485468 -0.658612  0.366188  219.146214  0.0
2 -0.659799  1.107467  1.131928  0.689542  231.559712  1.0
3  1.000916 -0.471454 -2.471666 -1.128412  179.077920  1.0
4  0.425414 -0.079761  0.382024  0.179548  227.360965  1.0

In [5]: dml_data = dml.DoubleMLDIDData(
   ...:       df,
   ...:       y_col="y",
   ...:       d_cols="d",
   ...:   )
   ...: 

  # from arrays

2.2.2. DoubleMLPanelData

The DoubleMLPanelData class serves as data-backend for DiD models, as well as the DoubleMLPLPR model, and can be initialized from a dataframe. The class is a subclass of DoubleMLData and inherits all methods and attributes. Furthermore, it provides additional methods and attributes to handle panel data.

2.2.2.1. Key arguments

id_col: column with unique identifiers for each unit
t_col: column specifying the time period of each observation
static_panel: indicates whether the data correspond to a static panel (True, used for the DoubleMLPLPR model) or to staggered-adoption panel data (False, the default, used for DiD models)
datetime_unit: unit of the time periods (e.g. 'Y', 'M', 'D', 'h', 'm', 's')

Note

The t_col can contain float, int or datetime values.
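
The unit codes listed for datetime_unit coincide with NumPy's datetime64 unit codes. Assuming that this is the intended meaning, the sketch below shows what a monthly resolution implies for daily timestamps. It uses NumPy only and does not call DoubleML.

```python
import numpy as np

# Daily timestamps for three observations
t_daily = np.array(["2025-01-15", "2025-02-03", "2025-02-20"], dtype="datetime64[D]")

# Casting to monthly resolution ("M") truncates each date to its month
t_monthly = t_daily.astype("datetime64[M]")

print(t_monthly)                  # ['2025-01' '2025-02' '2025-02']
print(np.unique(t_monthly).size)  # 2
```

Note that the codes are case-sensitive in NumPy: 'M' denotes months and 'm' denotes minutes.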

2.2.2.2. Example usage

Python

In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.did.datasets import make_did_CS2021

In [4]: np.random.seed(42)

In [5]: df = make_did_CS2021(n_obs=500)

In [6]: dml_data = dml.data.DoubleMLPanelData(
   ...:     df,
   ...:     y_col="y",
   ...:     d_cols="d",
   ...:     id_col="id",
   ...:     t_col="t",
   ...:     x_cols=["Z1", "Z2", "Z3", "Z4"],
   ...:     datetime_unit="M"
   ...: )
   ...: 

In [7]: print(dml_data)
================== DoubleMLPanelData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
Id variable: id
Static panel data: False
No. Unique Ids: 500
No. Observations: 2500
------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Columns: 10 entries, id to Z4
dtypes: datetime64[s](2), float64(7), int64(1)
memory usage: 195.4 KB

Python

In [8]: import numpy as np

In [9]: import doubleml as dml

In [10]: from doubleml.plm.datasets import make_plpr_CP2025

In [11]: np.random.seed(42)

In [12]: df = make_plpr_CP2025(num_id=100, num_t=5, dim_x=5)

In [13]: dml_data = dml.data.DoubleMLPanelData(
   ....:     df,
   ....:     y_col="y",
   ....:     d_cols="d",
   ....:     id_col="id",
   ....:     t_col="time",
   ....:     x_cols=["x1", "x2", "x3", "x4", "x5"],
   ....:     static_panel=True
   ....: )
   ....: 

In [14]: print(dml_data)
================== DoubleMLPanelData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['x1', 'x2', 'x3', 'x4', 'x5']
Instrument variable(s): None
Time variable: time
Id variable: id
Static panel data: True
No. Unique Ids: 100
No. Observations: 500
------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 9 entries, id to x5
dtypes: float64(7), int64(2)
memory usage: 35.3 KB
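
The static panel example uses num_id=100 units observed over num_t=5 periods, which gives the 500 rows reported in the summary. The following NumPy sketch builds such a long-format index and checks that each (id, time) pair occurs exactly once. The variable names are illustrative, and the check is not a DoubleML function.

```python
import numpy as np

num_id, num_t = 100, 5
ids = np.repeat(np.arange(1, num_id + 1), num_t)   # 1,1,1,1,1,2,2,...
time = np.tile(np.arange(1, num_t + 1), num_id)    # 1,2,3,4,5,1,2,...

# One row per (id, time) combination
pairs = np.column_stack([ids, time])
n_unique_pairs = np.unique(pairs, axis=0).shape[0]

print(len(ids))                    # 500
print(np.unique(ids).size)         # 100
print(n_unique_pairs == len(ids))  # True
```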

2.2.3. DoubleMLRDDData

The DoubleMLRDDData class specialises DoubleMLData for regression discontinuity designs. In addition to the standard causal roles it tracks a mandatory running variable.

2.2.3.1. Key arguments

score_col: column with the running/score variable.
cluster_cols: optional cluster identifiers inherited from the base data class.
from_arrays: expects an additional score array alongside x, y and d.

DoubleMLRDDData ensures that the running variable is kept separate from the other feature sets and exposes the score property for convenient access.

2.2.3.2. Example usage

Python

In [1]: import doubleml as dml

In [2]: from doubleml.rdd.datasets import make_simple_rdd_data

In [3]: dict_rdd = make_simple_rdd_data(n_obs=500, return_type="DataFrame")

In [4]: dml_data = dml.DoubleMLRDDData.from_arrays(
   ...:   x=dict_rdd["X"],
   ...:   y=dict_rdd["Y"],
   ...:   d=dict_rdd["D"],
   ...:   score=dict_rdd["score"]
   ...:   )
   ...: 

In [5]: print(dml_data)
================== DoubleMLRDDData Object ==================
Score variable: score
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3']
Instrument variable(s): None
No. Observations: 500

2.2.4. DoubleMLSSMData

The DoubleMLSSMData class covers the sample selection model backend. It extends DoubleMLData with a dedicated selection indicator and inherits support for clustered data.

2.2.4.1. Key arguments

s_col: column containing the selection indicator.
cluster_cols: optional cluster identifiers.
from_arrays: expects an additional s array together with x, y and d.

The object exposes the s property and keeps the selection indicator separate from covariates and treatment variables.

2.2.4.2. Example usage

Python

In [1]: import doubleml as dml

In [2]: from doubleml.irm.datasets import make_ssm_data

In [3]: df = make_ssm_data(n_obs=500, return_type="DataFrame")

In [4]: dml_data = dml.DoubleMLSSMData(
   ...:     df,
   ...:     y_col="y",
   ...:     d_cols="d",
   ...:     s_col="s"
   ...: )
   ...: 

In [5]: x, y, d, _, s = make_ssm_data(n_obs=200, return_type="array")

In [6]: dml_data_arrays = dml.DoubleMLSSMData.from_arrays(x, y, d, s=s)

In [7]: print(dml_data)
================== DoubleMLSSMData Object ==================
Selection variable: s
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500