2. Data Backend#
DoubleML generally provides interfaces to dataframes as well as arrays.
2.1. DoubleMLData#
The usage of both interfaces is demonstrated in the following. We download the Bonus data set from the Pennsylvania Reemployment Bonus experiment.
Note
In Python we use pandas.DataFrame and numpy.ndarray. The data can be fetched via doubleml.datasets.fetch_bonus().
In R we use data.table::data.table(), data.frame(), and matrix(). The data can be fetched via DoubleML::fetch_bonus().
In [1]: from doubleml.datasets import fetch_bonus
# Load data
In [2]: df_bonus = fetch_bonus('DataFrame')
In [3]: df_bonus.head(5)
Out[3]:
index abdt tg inuidur1 inuidur2 ... lusd husd muld dep1 dep2
0 0 10824 0 2.890372 18 ... 0 1 0 0.0 1.0
1 3 10824 0 0.000000 1 ... 1 0 0 0.0 0.0
2 4 10747 0 3.295837 27 ... 1 0 0 0.0 0.0
3 11 10607 1 2.197225 9 ... 0 0 1 0.0 0.0
4 12 10831 0 3.295837 27 ... 1 0 0 1.0 0.0
[5 rows x 26 columns]
library(DoubleML)
# Load data as data.table
dt_bonus = fetch_bonus(return_type = "data.table")
head(dt_bonus)
# Load data as data.frame
df_bonus = fetch_bonus(return_type = "data.frame")
head(df_bonus)
inuidur1 | female | black | othrace | dep1 | dep2 | q2 | q3 | q4 | q5 | q6 | agelt35 | agegt54 | durable | lusd | husd | tg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
2.890372 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3.295837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2.197225 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
3.295837 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
3.295837 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
inuidur1 | female | black | othrace | dep1 | dep2 | q2 | q3 | q4 | q5 | q6 | agelt35 | agegt54 | durable | lusd | husd | tg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | 2.890372 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 | 3.295837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
12 | 2.197225 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
13 | 3.295837 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
14 | 3.295837 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
2.1.1. DoubleMLData from dataframes#
The DoubleMLData class serves as data-backend and can be initialized from a dataframe by specifying the column y_col='inuidur1' serving as outcome variable \(Y\), the column(s) d_cols='tg' serving as treatment variable \(D\) and the columns x_cols specifying the confounders.
Note
In Python we use pandas.DataFrame and the API reference can be found here: doubleml.DoubleMLData.
In R we use data.table::data.table() and the API reference can be found here: DoubleML::DoubleMLData.
For initialization from the R base class data.frame() the API reference can be found here: DoubleML::double_ml_data_from_data_frame().
In [4]: from doubleml import DoubleMLData
# Specify the data and the variables for the causal model
In [5]: obj_dml_data_bonus = DoubleMLData(df_bonus,
...: y_col='inuidur1',
...: d_cols='tg',
...: x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
...: 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
...: 'durable', 'lusd', 'husd'],
...: use_other_treat_as_covariate=True)
...:
In [6]: print(obj_dml_data_bonus)
================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Columns: 26 entries, index to dep2
dtypes: float64(3), int64(23)
memory usage: 1.0 MB
# Specify the data and the variables for the causal model
# From data.table object
obj_dml_data_bonus = DoubleMLData$new(dt_bonus,
y_col = "inuidur1",
d_cols = "tg",
x_cols = c("female", "black", "othrace", "dep1", "dep2",
"q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
"durable", "lusd", "husd"),
use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus
# From data.frame object
obj_dml_data_bonus_df = double_ml_data_from_data_frame(df_bonus,
y_col = "inuidur1",
d_cols = "tg",
x_cols = c("female", "black", "othrace", "dep1", "dep2",
"q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
"durable", "lusd", "husd"),
use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus_df
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s):
Selection variable:
No. Observations: 5099
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s):
Selection variable:
No. Observations: 5099
Comments on detailed specifications:
- If x_cols is not specified, all variables (columns of the dataframe) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates.
- In case of multiple treatment variables, the boolean use_other_treat_as_covariate indicates whether the other treatment variables should be added as covariates in each treatment-variable-specific learning task.
- Instrumental variables for IV models have to be provided as z_cols.
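The rules above can be sketched in plain Python. This is only an illustration of the described behaviour, not DoubleML's implementation: the helper names resolve_x_cols and covariates_for_treatment are hypothetical, and the column names are made up for the example.

```python
from typing import List, Optional, Sequence


def resolve_x_cols(
    columns: Sequence[str],
    y_col: str,
    d_cols: Sequence[str],
    z_cols: Optional[Sequence[str]] = None,
    x_cols: Optional[Sequence[str]] = None,
) -> List[str]:
    """Hypothetical helper mirroring the default-covariate rule described above."""
    if x_cols is not None:
        # Explicitly specified covariates are used as given.
        return list(x_cols)
    # Otherwise: every column that is not outcome, treatment or instrument.
    reserved = {y_col, *d_cols, *(z_cols or [])}
    return [c for c in columns if c not in reserved]


def covariates_for_treatment(
    x_cols: Sequence[str],
    d_cols: Sequence[str],
    target: str,
    use_other_treat_as_covariate: bool = True,
) -> List[str]:
    """Hypothetical helper: covariates used in the learning task for one treatment."""
    others = [d for d in d_cols if d != target]
    if use_other_treat_as_covariate:
        return list(x_cols) + others
    return list(x_cols)


# Made-up column layout: one outcome, two treatments, one instrument, two covariates.
columns = ["y", "d1", "d2", "z", "x1", "x2"]
d_cols = ["d1", "d2"]

x_default = resolve_x_cols(columns, y_col="y", d_cols=d_cols, z_cols=["z"])
with_other = covariates_for_treatment(x_default, d_cols, target="d1")
without_other = covariates_for_treatment(
    x_default, d_cols, target="d1", use_other_treat_as_covariate=False
)

print(x_default)      # ['x1', 'x2']
print(with_other)     # ['x1', 'x2', 'd2']
print(without_other)  # ['x1', 'x2']
```

In the Bonus example, leaving x_cols unspecified would therefore also pull in columns such as index and abdt, which is why the covariates are listed explicitly there.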
2.1.2. DoubleMLData from arrays and matrices#
To introduce the array interface we generate a data set consisting of confounding variables X, an outcome variable y and a treatment variable d.
Note
In Python we use numpy.ndarray and the API reference can be found here: doubleml.DoubleMLData.from_arrays().
In R we use the R base class matrix() and the API reference can be found here: DoubleML::double_ml_data_from_matrix().
In [7]: import numpy as np
# Generate data
In [8]: np.random.seed(3141)
In [9]: n_obs = 500
In [10]: n_vars = 100
In [11]: theta = 3
In [12]: X = np.random.normal(size=(n_obs, n_vars))
In [13]: d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
In [14]: y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
# Generate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(stats::rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
y = theta * d + X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
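Read off the code above, the simulated data follow a simple linear data generating process with n = 500 observations and p = 100 covariates. The index notation and the error symbols \(u_i\), \(v_i\) are introduced here for illustration only:

```latex
\begin{aligned}
d_i &= 5\,(X_{i,1} + X_{i,2} + X_{i,3}) + v_i, \\
y_i &= \theta\, d_i + 5\,(X_{i,1} + X_{i,2} + X_{i,3}) + u_i, \qquad \theta = 3, \\
X_{i,j},\; u_i,\; v_i &\sim \mathcal{N}(0, 1) \quad \text{independently.}
\end{aligned}
```

Only the first three columns of X affect both d and y, so they act as confounders; the remaining 97 columns are pure noise.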
To specify the data and the variables for the causal model from arrays we call
In [15]: from doubleml import DoubleMLData
In [16]: obj_dml_data_sim = DoubleMLData.from_arrays(X, y, d)
In [17]: print(obj_dml_data_sim)
================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 102 entries, X1 to d
dtypes: float64(102)
memory usage: 398.6 KB
library(DoubleML)
obj_dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
obj_dml_data_sim
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s):
Selection variable:
No. Observations: 500
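The printed summaries suggest that the array interface stores the inputs in a single table whose columns are named X1, ..., X100, followed by y and d. A minimal NumPy sketch of that layout is given below; the naming scheme is inferred from the output above rather than taken from the library source, and the simulated values differ from the ones above because a different random generator is used.

```python
import numpy as np

# Simulate data with the same structure as above (values differ from the
# documented run because default_rng is used instead of np.random.seed).
rng = np.random.default_rng(3141)
n_obs, n_vars = 500, 100
theta = 3
X = rng.normal(size=(n_obs, n_vars))
d = X[:, :3] @ np.array([5, 5, 5]) + rng.standard_normal(n_obs)
y = theta * d + X[:, :3] @ np.array([5, 5, 5]) + rng.standard_normal(n_obs)

# Assumed layout, read off the printed summary: covariates first, then y, then d.
col_names = [f"X{j + 1}" for j in range(n_vars)] + ["y", "d"]
stacked = np.column_stack((X, y, d))

print(stacked.shape)                 # (500, 102)
print(col_names[0], col_names[-2:])  # X1 ['y', 'd']
```

The shape matches the "Columns: 102 entries, X1 to d" line in the Python output.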
2.2. Special Data Types#
The DoubleMLData class is extended by the following classes to support special data types or allow for additional parameters.
2.2.1. DoubleMLPanelData#
The DoubleMLPanelData class serves as data-backend for DiD models and can be initialized from a dataframe.
The class is a subclass of DoubleMLData and inherits all methods and attributes.
Furthermore, it provides additional methods and attributes to handle panel data:
- id_col: column with unique identifiers for each unit
- t_col: column to specify the time periods of the observation
- datetime_unit: unit of the time periods (e.g. 'Y', 'M', 'D', 'h', 'm', 's')
Note
The t_col can contain float, int or datetime values.
In [1]: import numpy as np
In [2]: import doubleml as dml
In [3]: from doubleml.did.datasets import make_did_CS2021
# Generate panel data in long format
In [4]: np.random.seed(42)
In [5]: df = make_did_CS2021(n_obs=500)
# Specify the data, the unit identifier and the time variable
In [6]: dml_data = dml.data.DoubleMLPanelData(
...: df,
...: y_col="y",
...: d_cols="d",
...: id_col="id",
...: t_col="t",
...: x_cols=["Z1", "Z2", "Z3", "Z4"],
...: datetime_unit="M"
...: )
...:
In [7]: print(dml_data)
================== DoubleMLPanelData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
Id variable: id
No. Observations: 500
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Columns: 10 entries, id to Z4
dtypes: datetime64[s](2), float64(7), int64(1)
memory usage: 195.4 KB
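In the summary above, "No. Observations" is 500 while the dataframe has 2500 rows. This is consistent with a long-format panel in which each unit appears once per time period and the summary counts units rather than rows. The following standard-library sketch illustrates that layout; the balanced panel with 5 periods and the integer time index are assumptions made to match the row count, not properties read from the data.

```python
from itertools import product

# Assumed: balanced panel, 500 units observed in 5 periods each.
n_units, n_periods = 500, 5

# Long format: one row per (unit, period), identified by the id and t columns.
rows = [
    {"id": unit, "t": period}
    for unit, period in product(range(n_units), range(1, n_periods + 1))
]

unit_ids = {row["id"] for row in rows}
periods = sorted({row["t"] for row in rows})

print(len(rows))      # 2500 rows in the long-format table
print(len(unit_ids))  # 500 distinct units
print(periods)        # [1, 2, 3, 4, 5]
```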