2. Data Backend
DoubleML generally provides interfaces to dataframes as well as arrays.
2.1. DoubleMLData
The usage of both interfaces is demonstrated in the following. We download the Bonus data set from the Pennsylvania Reemployment Bonus experiment.
Note
In Python we use pandas.DataFrame and numpy.ndarray. The data can be fetched via doubleml.datasets.fetch_bonus().
In R we use data.table::data.table(), data.frame(), and matrix(). The data can be fetched via DoubleML::fetch_bonus().
In [1]: from doubleml.datasets import fetch_bonus
# Load data
In [2]: df_bonus = fetch_bonus('DataFrame')
In [3]: df_bonus.head(5)
Out[3]:
index abdt tg inuidur1 inuidur2 ... lusd husd muld dep1 dep2
0 0 10824 0 2.890372 18 ... 0 1 0 0.0 1.0
1 3 10824 0 0.000000 1 ... 1 0 0 0.0 0.0
2 4 10747 0 3.295837 27 ... 1 0 0 0.0 0.0
3 11 10607 1 2.197225 9 ... 0 0 1 0.0 0.0
4 12 10831 0 3.295837 27 ... 1 0 0 1.0 0.0
[5 rows x 26 columns]
library(DoubleML)
# Load data as data.table
dt_bonus = fetch_bonus(return_type = "data.table")
head(dt_bonus)
# Load data as data.frame
df_bonus = fetch_bonus(return_type = "data.frame")
head(df_bonus)
| inuidur1 | female | black | othrace | dep1 | dep2 | q2 | q3 | q4 | q5 | q6 | agelt35 | agegt54 | durable | lusd | husd | tg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 2.890372 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3.295837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2.197225 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3.295837 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 3.295837 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| | inuidur1 | female | black | othrace | dep1 | dep2 | q2 | q3 | q4 | q5 | q6 | agelt35 | agegt54 | durable | lusd | husd | tg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 2.890372 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 3.295837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 12 | 2.197225 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 13 | 3.295837 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 14 | 3.295837 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
2.1.1. DoubleMLData from dataframes
The DoubleMLData class serves as the data backend and can be initialized from a dataframe by
specifying the column y_col='inuidur1' serving as the outcome variable \(Y\), the column(s) d_cols='tg'
serving as the treatment variable \(D\), and the columns x_cols specifying the confounders.
Note
In Python we use pandas.DataFrame; see the API reference for doubleml.DoubleMLData.
In R we use data.table::data.table(); see the API reference for DoubleML::DoubleMLData.
For initialization from the R base class data.frame(), see the API reference for DoubleML::double_ml_data_from_data_frame().
In [4]: from doubleml import DoubleMLData
# Specify the data and the variables for the causal model
In [5]: obj_dml_data_bonus = DoubleMLData(df_bonus,
...: y_col='inuidur1',
...: d_cols='tg',
...: x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
...: 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
...: 'durable', 'lusd', 'husd'],
...: use_other_treat_as_covariate=True)
...:
In [6]: print(obj_dml_data_bonus)
================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Columns: 26 entries, index to dep2
dtypes: float64(3), int64(23)
memory usage: 1.0 MB
# Specify the data and the variables for the causal model
# From data.table object
obj_dml_data_bonus = DoubleMLData$new(dt_bonus,
y_col = "inuidur1",
d_cols = "tg",
x_cols = c("female", "black", "othrace", "dep1", "dep2",
"q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
"durable", "lusd", "husd"),
use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus
# From data.frame object
obj_dml_data_bonus_df = double_ml_data_from_data_frame(df_bonus,
y_col = "inuidur1",
d_cols = "tg",
x_cols = c("female", "black", "othrace", "dep1", "dep2",
"q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
"durable", "lusd", "husd"),
use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus_df
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s):
Selection variable:
No. Observations: 5099
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s):
Selection variable:
No. Observations: 5099
Comments on detailed specifications:
- If x_cols is not specified, all variables (columns of the dataframe) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates.
- In case of multiple treatment variables, the boolean use_other_treat_as_covariate indicates whether the other treatment variables should be added as covariates in each treatment-variable-specific learning task.
- Instrumental variables for IV models have to be provided as z_cols.
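The covariate rules above can be sketched in plain Python. The helper names default_x_cols and covariates_per_treatment, as well as the extra columns tg2 and z1, are hypothetical and only illustrate the selection logic described here; this is not DoubleML's implementation, and the order in which other treatments are appended is an assumption.

```python
def default_x_cols(columns, y_col, d_cols, z_cols=None):
    """Hypothetical helper: columns left over after removing the outcome,
    treatment, and instrument columns (the rule described for x_cols=None)."""
    excluded = {y_col, *d_cols, *(z_cols or [])}
    return [c for c in columns if c not in excluded]


def covariates_per_treatment(x_cols, d_cols, use_other_treat_as_covariate=True):
    """Hypothetical helper: covariates used in each treatment-specific task."""
    result = {}
    for d in d_cols:
        # Optionally add the remaining treatment columns as covariates
        others = [o for o in d_cols if o != d] if use_other_treat_as_covariate else []
        result[d] = list(x_cols) + others
    return result


# Toy column set: 'tg2' and 'z1' are made-up names for illustration
columns = ["inuidur1", "tg", "tg2", "female", "black", "z1"]
x_cols = default_x_cols(columns, y_col="inuidur1", d_cols=["tg", "tg2"], z_cols=["z1"])
print(x_cols)  # ['female', 'black']
print(covariates_per_treatment(x_cols, ["tg", "tg2"]))
```

With use_other_treat_as_covariate=False, each treatment-specific task in this sketch would use only female and black as covariates.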
2.1.2. DoubleMLData from arrays and matrices
To introduce the array interface, we generate a data set consisting of confounding variables X, an outcome
variable y, and a treatment variable d.
Note
In Python we use numpy.ndarray; see the API reference for doubleml.DoubleMLData.from_arrays().
In R we use the R base class matrix(); see the API reference for DoubleML::double_ml_data_from_matrix().
In [7]: import numpy as np
# Generate data
In [8]: np.random.seed(3141)
In [9]: n_obs = 500
In [10]: n_vars = 100
In [11]: theta = 3
In [12]: X = np.random.normal(size=(n_obs, n_vars))
In [13]: d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
In [14]: y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
# Generate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(stats::rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
y = theta * d + X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
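Written as equations, the simulation code above corresponds to the following data generating process for observations \(i = 1, \ldots, 500\). The noise symbols \(u_i\) and \(v_i\) are notation introduced here for the two independent standard normal draws; they do not appear in the code.

\[
d_i = 5\,(x_{i,1} + x_{i,2} + x_{i,3}) + v_i, \qquad
y_i = \theta\, d_i + 5\,(x_{i,1} + x_{i,2} + x_{i,3}) + u_i, \qquad
\theta = 3,
\]

where \(x_{i,1}, \ldots, x_{i,100}\), \(u_i\), and \(v_i\) are independent standard normal variables. Only the first three of the 100 covariates enter the equations for \(d_i\) and \(y_i\).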
To specify the data and the variables for the causal model from arrays, we call:
In [15]: from doubleml import DoubleMLData
In [16]: obj_dml_data_sim = DoubleMLData.from_arrays(X, y, d)
In [17]: print(obj_dml_data_sim)
================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 102 entries, X1 to d
dtypes: float64(102)
memory usage: 398.6 KB
library(DoubleML)
obj_dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
obj_dml_data_sim
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s):
Selection variable:
No. Observations: 500
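Judging from the printed summaries, the array interface labels the covariate columns X1, ..., Xp and the outcome and treatment columns y and d. A minimal sketch of that naming scheme for a single treatment follows; array_column_names is a hypothetical helper, not a DoubleML function.

```python
def array_column_names(n_vars):
    """Hypothetical helper reproducing the labels seen in the summaries above
    for the single-treatment case."""
    x_cols = [f"X{i + 1}" for i in range(n_vars)]  # X1, ..., X{n_vars}
    return x_cols, "y", ["d"]


x_cols, y_col, d_cols = array_column_names(100)
columns = x_cols + [y_col] + d_cols
print(len(columns), columns[0], columns[-1])  # 102 X1 d
```

This matches the "Columns: 102 entries, X1 to d" line of the Python output: 100 covariates plus one outcome and one treatment column.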
2.2. Special Data Types
The DoubleMLData class is extended by the following classes to support special data types or allow for additional parameters.
2.2.1. DoubleMLPanelData
The DoubleMLPanelData class serves as the data backend for difference-in-differences (DiD) models and can be initialized from a dataframe.
The class is a subclass of DoubleMLData and inherits all of its methods and attributes.
Furthermore, it provides additional attributes to handle panel data:
- id_col: column with unique identifiers for each unit
- t_col: column specifying the time period of each observation
- datetime_unit: unit of the time periods (e.g. ‘Y’, ‘M’, ‘D’, ‘h’, ‘m’, ‘s’)
Note
The t_col can contain float, int or datetime values.
In [1]: import numpy as np
In [2]: import doubleml as dml
In [3]: from doubleml.did.datasets import make_did_CS2021
# Generate panel data
In [4]: np.random.seed(42)
In [5]: df = make_did_CS2021(n_obs=500)
# Specify the data, the id and time columns, and the variables for the causal model
In [6]: dml_data = dml.data.DoubleMLPanelData(
...: df,
...: y_col="y",
...: d_cols="d",
...: id_col="id",
...: t_col="t",
...: x_cols=["Z1", "Z2", "Z3", "Z4"],
...: datetime_unit="M"
...: )
...:
In [7]: print(dml_data)
================== DoubleMLPanelData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
Id variable: id
No. Unique Ids: 500
No. Observations: 2500
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Columns: 10 entries, id to Z4
dtypes: datetime64[s](2), float64(7), int64(1)
memory usage: 195.4 KB
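The summary reports 2500 observations for 500 unique ids, i.e. five rows per unit in long format. The sketch below builds a toy long-format panel in plain Python to show the roles of id_col and t_col. The records, the dates, and the check is_valid_panel are hypothetical and not part of DoubleML; uniqueness of each (id, time) pair is assumed here as a property one would typically expect of such data.

```python
from datetime import date


def is_valid_panel(rows, id_col, t_col):
    """Hypothetical check: each (id, time) pair appears at most once."""
    keys = [(r[id_col], r[t_col]) for r in rows]
    return len(keys) == len(set(keys))


# Five monthly periods (made-up dates) observed for each of 500 units
periods = [date(2025, month, 1) for month in range(1, 6)]
rows = [{"id": i, "t": t, "y": 0.0} for i in range(500) for t in periods]

n_ids = len({r["id"] for r in rows})
print(n_ids, len(rows))  # 500 2500
print(is_valid_panel(rows, "id", "t"))  # True
```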