2. Data Backend

DoubleML generally provides interfaces to dataframes as well as arrays.

2.1. DoubleMLData

Both interfaces are demonstrated below using the Bonus data set from the Pennsylvania Reemployment Bonus experiment.

Note

In Python we use pandas.DataFrame and numpy.ndarray. The data can be fetched via doubleml.datasets.fetch_bonus().
In R we use data.table::data.table(), data.frame(), and matrix(). The data can be fetched via DoubleML::fetch_bonus().

Python

In [1]: from doubleml.datasets import fetch_bonus

# Load data
In [2]: df_bonus = fetch_bonus('DataFrame')

In [3]: df_bonus.head(5)
Out[3]: 
   index   abdt  tg  inuidur1  inuidur2  ...  lusd  husd  muld  dep1  dep2
0      0  10824   0  2.890372        18  ...     0     1     0   0.0   1.0
1      3  10824   0  0.000000         1  ...     1     0     0   0.0   0.0
2      4  10747   0  3.295837        27  ...     1     0     0   0.0   0.0
3     11  10607   1  2.197225         9  ...     0     0     1   0.0   0.0
4     12  10831   0  3.295837        27  ...     1     0     0   1.0   0.0

[5 rows x 26 columns]

R

library(DoubleML)

# Load data as data.table
dt_bonus = fetch_bonus(return_type = "data.table")
head(dt_bonus)

# Load data as data.frame
df_bonus = fetch_bonus(return_type = "data.frame")
head(df_bonus)

A data.table: 6 × 17
inuidur1	female	black	othrace	dep1	dep2	q2	q3	q4	q5	q6	agelt35	agegt54	durable	lusd	husd	tg
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
2.890372	0	0	0	0	1	0	0	0	1	0	0	0	0	0	1	0
0.000000	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0
3.295837	0	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0
2.197225	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	1
3.295837	0	0	0	1	0	0	0	0	1	0	0	1	1	1	0	0
3.295837	1	0	0	0	0	0	0	0	1	0	0	1	0	1	0	0

A data.frame: 6 × 17
	inuidur1	female	black	othrace	dep1	dep2	q2	q3	q4	q5	q6	agelt35	agegt54	durable	lusd	husd	tg
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	2.890372	0	0	0	0	1	0	0	0	1	0	0	0	0	0	1	0
4	0.000000	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0
5	3.295837	0	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0
12	2.197225	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	1
13	3.295837	0	0	0	1	0	0	0	0	1	0	0	1	1	1	0	0
14	3.295837	1	0	0	0	0	0	0	0	1	0	0	1	0	1	0	0

2.1.1. DoubleMLData from dataframes

The DoubleMLData class serves as the data backend and can be initialized from a dataframe by specifying the column y_col='inuidur1' serving as the outcome variable \(Y\), the column(s) d_cols='tg' serving as the treatment variable \(D\), and the columns x_cols specifying the confounders.

Note

In Python we use pandas.DataFrame; see the API reference for doubleml.DoubleMLData.
In R we use data.table::data.table(); see the API reference for DoubleML::DoubleMLData.
For initialization from the R base class data.frame(), see the API reference for DoubleML::double_ml_data_from_data_frame().

Python

In [4]: from doubleml import DoubleMLData

# Specify the data and the variables for the causal model
In [5]: obj_dml_data_bonus = DoubleMLData(df_bonus,
   ...:                                   y_col='inuidur1',
   ...:                                   d_cols='tg',
   ...:                                   x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
   ...:                                           'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
   ...:                                           'durable', 'lusd', 'husd'],
   ...:                                   use_other_treat_as_covariate=True)
   ...: 

In [6]: print(obj_dml_data_bonus)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Columns: 26 entries, index to dep2
dtypes: float64(3), int64(23)
memory usage: 1.0 MB

R

# Specify the data and the variables for the causal model

# From data.table object
obj_dml_data_bonus = DoubleMLData$new(dt_bonus,
                            y_col = "inuidur1",
                            d_cols = "tg",
                            x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                          "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                          "durable", "lusd", "husd"),
                            use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus

# From data.frame object
obj_dml_data_bonus_df = double_ml_data_from_data_frame(df_bonus,
                            y_col = "inuidur1",
                            d_cols = "tg",
                            x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                          "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                          "durable", "lusd", "husd"),
                            use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus_df

================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s): 
Selection variable: 
No. Observations: 5099

================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s): 
Selection variable: 
No. Observations: 5099

Comments on detailed specifications:

If x_cols is not specified, all variables (columns of the dataframe) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates.
In case of multiple treatment variables, the boolean use_other_treat_as_covariate indicates whether the other treatment variables should be added as covariates in each treatment-variable-specific learning task.
Instrumental variables for IV models have to be provided as z_cols.
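The first two rules above can be sketched in plain Python. This is only an illustration of the stated behaviour, not DoubleML's implementation; the helper names default_x_cols and covariates_for_treatment and the toy column names are hypothetical.

```python
def default_x_cols(columns, y_col, d_cols, z_cols=None):
    """Return the columns left over once outcome, treatments and instruments are removed."""
    d_cols = [d_cols] if isinstance(d_cols, str) else list(d_cols)
    if z_cols is None:
        z_cols = []
    elif isinstance(z_cols, str):
        z_cols = [z_cols]
    reserved = {y_col, *d_cols, *z_cols}
    return [c for c in columns if c not in reserved]


def covariates_for_treatment(x_cols, d_cols, treat, use_other_treat_as_covariate=True):
    """Return the covariates for the learning task of one treatment variable."""
    others = [d for d in d_cols if d != treat]
    if use_other_treat_as_covariate:
        return list(x_cols) + others
    return list(x_cols)


# Hypothetical frame with one outcome, two treatments, one instrument, two controls
columns = ["y", "d1", "d2", "z", "x1", "x2"]
x_cols = default_x_cols(columns, y_col="y", d_cols=["d1", "d2"], z_cols="z")
print(x_cols)  # ['x1', 'x2']
print(covariates_for_treatment(x_cols, ["d1", "d2"], "d1"))  # ['x1', 'x2', 'd2']
print(covariates_for_treatment(x_cols, ["d1", "d2"], "d1", False))  # ['x1', 'x2']
```

With a single treatment variable, as in the Bonus example, there are no other treatments to add, so the flag has no effect on the covariate set in this sketch.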

2.1.2. DoubleMLData from arrays and matrices

To introduce the array interface, we generate a data set consisting of confounding variables X, an outcome variable y, and a treatment variable d.
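Written out, the simulation code in this section draws \(n = 500\) observations of 100 covariates, of which only the first three enter the treatment and outcome equations. The indices \(i\), \(j\) and the error symbols \(u_i\), \(v_i\) are notation introduced here for illustration:

\[
\begin{aligned}
d_i &= 5\,(X_{i1} + X_{i2} + X_{i3}) + v_i, \\
y_i &= \theta\, d_i + 5\,(X_{i1} + X_{i2} + X_{i3}) + u_i,
\end{aligned}
\]

with \(\theta = 3\) and \(X_{ij}\), \(u_i\), \(v_i\) drawn independently from a standard normal distribution.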

Note

In Python we use numpy.ndarray; see the API reference for doubleml.DoubleMLData.from_arrays().
In R we use the R base class matrix(); see the API reference for DoubleML::double_ml_data_from_matrix().

Python

In [7]: import numpy as np

# Generate data
In [8]: np.random.seed(3141)

In [9]: n_obs = 500

In [10]: n_vars = 100

In [11]: theta = 3

In [12]: X = np.random.normal(size=(n_obs, n_vars))

In [13]: d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))

In [14]: y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))

R

# Generate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(stats::rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
y = theta * d + X[, 1:3, drop = FALSE] %*% c(5, 5, 5)  + stats::rnorm(n_obs)

To specify the data and the variables for the causal model from arrays, we call:

Python

In [15]: from doubleml import DoubleMLData

In [16]: obj_dml_data_sim = DoubleMLData.from_arrays(X, y, d)

In [17]: print(obj_dml_data_sim)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 102 entries, X1 to d
dtypes: float64(102)
memory usage: 398.6 KB

R

library(DoubleML)

obj_dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
obj_dml_data_sim

================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s): 
Selection variable: 
No. Observations: 500
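The printed summaries show covariates named X1 to X100 followed by y and d. As a rough sketch of how the two interfaces relate, the same layout can be assembled by hand with NumPy and pandas. This is not what from_arrays does internally; the variable names are illustrative, and the generator rng is an assumption that produces different draws than np.random.seed above.

```python
import numpy as np
import pandas as pd

# Same data-generating process as above, using a local random generator
rng = np.random.default_rng(3141)
n_obs, n_vars, theta = 500, 100, 3

X = rng.normal(size=(n_obs, n_vars))
d = X[:, :3] @ np.array([5, 5, 5]) + rng.standard_normal(n_obs)
y = theta * d + X[:, :3] @ np.array([5, 5, 5]) + rng.standard_normal(n_obs)

# Column names chosen to match the printed summary: X1..X100, then y and d
x_names = [f"X{j + 1}" for j in range(n_vars)]
df_sim = pd.DataFrame(np.column_stack((X, y, d)), columns=x_names + ["y", "d"])

print(df_sim.shape)  # (500, 102)
```

A frame built this way could then go through the dataframe interface from Section 2.1.1, for example DoubleMLData(df_sim, y_col='y', d_cols='d'), where omitting x_cols leaves X1 to X100 as covariates.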

2.2. Special Data Types

The DoubleMLData class is extended by the following classes to support special data types or allow for additional parameters.

2.2.1. DoubleMLPanelData

The DoubleMLPanelData class serves as the data backend for DiD models and can be initialized from a dataframe. The class is a subclass of DoubleMLData and inherits all of its methods and attributes. It also provides additional methods and attributes for handling panel data, configured through the following arguments:

id_col: column with unique identifiers for each unit
t_col: column specifying the time period of each observation
datetime_unit: unit of the time periods (e.g. 'Y', 'M', 'D', 'h', 'm', 's')

Note

The t_col can contain float, int or datetime values.
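The unit letters listed for datetime_unit coincide with NumPy's datetime64 unit codes. Assuming that is the intended meaning (the text above does not state it), the following sketch shows what a monthly unit implies for datetime values: dates within the same month collapse to the same period.

```python
import numpy as np

# Daily timestamps, two of them in the same calendar month
days = np.array(["2025-01-15", "2025-01-28", "2025-02-20"], dtype="datetime64[D]")

# Cast to monthly resolution ('M'); the day of the month is dropped
months = days.astype("datetime64[M]")
print(months)

# The first two observations now share a period; the third is one month later
same_period = months[0] == months[1]
gap = months[2] - months[0]
print(same_period, gap == np.timedelta64(1, "M"))
```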

Python

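# Imports needed to run this example on its own
In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.did.datasets import make_did_CS2021

# Generate panel data
In [4]: np.random.seed(42)

In [5]: df = make_did_CS2021(n_obs=500)

# Specify the data, the variables, and the panel structure
In [6]: dml_data = dml.data.DoubleMLPanelData(
   ...:     df,
   ...:     y_col="y",
   ...:     d_cols="d",
   ...:     id_col="id",
   ...:     t_col="t",
   ...:     x_cols=["Z1", "Z2", "Z3", "Z4"],
   ...:     datetime_unit="M"
   ...: )
   ...: 

In [7]: print(dml_data)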
================== DoubleMLPanelData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
Id variable: id
No. Unique Ids: 500
No. Observations: 2500

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Columns: 10 entries, id to Z4
dtypes: datetime64[s](2), float64(7), int64(1)
memory usage: 195.4 KB
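The summary reports 500 unique ids and 2500 observations, i.e. an average of five rows per unit in long format. The sketch below illustrates that layout with a small hypothetical frame (not the output of make_did_CS2021) and shows how the two counts, and a check for duplicate id/period rows, can be computed with pandas.

```python
import pandas as pd

# Toy long-format panel: 3 units, each observed in 2 periods (hypothetical values)
df_panel = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "t": pd.to_datetime(["2025-01-01", "2025-02-01"] * 3),
    "y": [0.5, 0.7, 1.1, 0.9, 0.2, 0.4],
})

n_ids = df_panel["id"].nunique()      # number of distinct units
n_periods = df_panel["t"].nunique()   # number of distinct time periods
n_rows = len(df_panel)                # number of observations

# Balanced panel: every unit appears exactly once in every period
is_balanced = n_rows == n_ids * n_periods
has_unique_rows = not df_panel.duplicated(subset=["id", "t"]).any()

print(n_ids, n_rows, is_balanced, has_unique_rows)  # 3 6 True True
```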