2. The data-backend DoubleMLData
DoubleML provides interfaces to dataframes as well as arrays. The usage of both interfaces is demonstrated in the following. We download the Bonus data set from the Pennsylvania Reemployment Bonus experiment.
Note
In Python we use pandas.DataFrame and numpy.ndarray. The data can be fetched via doubleml.datasets.fetch_bonus().
In R we use data.table::data.table(), data.frame(), and matrix(). The data can be fetched via DoubleML::fetch_bonus().
In [1]: from doubleml.datasets import fetch_bonus
# Load data
In [2]: df_bonus = fetch_bonus('DataFrame')
In [3]: df_bonus.head(5)
Out[3]:
index abdt tg inuidur1 inuidur2 ... lusd husd muld dep1 dep2
0 0 10824 0 2.890372 18 ... 0 1 0 0.0 1.0
1 3 10824 0 0.000000 1 ... 1 0 0 0.0 0.0
2 4 10747 0 3.295837 27 ... 1 0 0 0.0 0.0
3 11 10607 1 2.197225 9 ... 0 0 1 0.0 0.0
4 12 10831 0 3.295837 27 ... 1 0 0 1.0 0.0
[5 rows x 26 columns]
library(DoubleML)
# Load data as data.table
dt_bonus = fetch_bonus(return_type = "data.table")
head(dt_bonus)
# Load data as data.frame
df_bonus = fetch_bonus(return_type = "data.frame")
head(df_bonus)
inuidur1 | female | black | othrace | dep1 | dep2 | q2 | q3 | q4 | q5 | q6 | agelt35 | agegt54 | durable | lusd | husd | tg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
2.890372 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3.295837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2.197225 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
3.295837 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
3.295837 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
inuidur1 | female | black | othrace | dep1 | dep2 | q2 | q3 | q4 | q5 | q6 | agelt35 | agegt54 | durable | lusd | husd | tg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | 2.890372 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 | 3.295837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
12 | 2.197225 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
13 | 3.295837 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
14 | 3.295837 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
2.1. DoubleMLData from dataframes
The DoubleMLData class serves as data-backend and can be initialized from a dataframe by specifying the column y_col='inuidur1' serving as outcome variable \(Y\), the column(s) d_cols='tg' serving as treatment variable \(D\) and the columns x_cols specifying the confounders.
Note
In Python we use pandas.DataFrame and the API reference can be found here: doubleml.DoubleMLData.
In R we use data.table::data.table() and the API reference can be found here: DoubleML::DoubleMLData.
For initialization from the R base class data.frame() the API reference can be found here: DoubleML::double_ml_data_from_data_frame().
In [4]: from doubleml import DoubleMLData
# Specify the data and the variables for the causal model
In [5]: obj_dml_data_bonus = DoubleMLData(df_bonus,
...: y_col='inuidur1',
...: d_cols='tg',
...: x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
...: 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
...: 'durable', 'lusd', 'husd'],
...: use_other_treat_as_covariate=True)
...:
In [6]: print(obj_dml_data_bonus)
================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Columns: 26 entries, index to dep2
dtypes: float64(3), int64(23)
memory usage: 1.0 MB
# Specify the data and the variables for the causal model
# From data.table object
obj_dml_data_bonus = DoubleMLData$new(dt_bonus,
y_col = "inuidur1",
d_cols = "tg",
x_cols = c("female", "black", "othrace", "dep1", "dep2",
"q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
"durable", "lusd", "husd"),
use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus
# From data.frame object
obj_dml_data_bonus_df = double_ml_data_from_data_frame(df_bonus,
y_col = "inuidur1",
d_cols = "tg",
x_cols = c("female", "black", "othrace", "dep1", "dep2",
"q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
"durable", "lusd", "husd"),
use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus_df
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s):
No. Observations: 5099
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s):
No. Observations: 5099
Comments on detailed specifications:
- If x_cols is not specified, all variables (columns of the dataframe) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates.
- In case of multiple treatment variables, the boolean use_other_treat_as_covariate indicates whether the other treatment variables should be added as covariates in each treatment-variable-specific learning task.
- Instrumental variables for IV models have to be provided as z_cols.
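The covariate rules described in these comments can be sketched in plain Python. This is only an illustration of the described behavior, not DoubleML's implementation: the helper names resolve_covariates and covariates_per_treatment and the toy column names are assumptions made for this example.

```python
def _as_list(cols):
    """Accept a single column name or a list of names (None becomes an empty list)."""
    if cols is None:
        return []
    return [cols] if isinstance(cols, str) else list(cols)


def resolve_covariates(columns, y_col, d_cols, x_cols=None, z_cols=None):
    """Return the covariate columns following the default rule described above."""
    if x_cols is not None:
        # Explicitly specified covariates are used as given.
        return _as_list(x_cols)
    # Otherwise: every column that is not outcome, treatment, or instrument.
    reserved = {y_col, *_as_list(d_cols), *_as_list(z_cols)}
    return [c for c in columns if c not in reserved]


def covariates_per_treatment(x_cols, d_cols, use_other_treat_as_covariate=True):
    """Map each treatment variable to the covariates of its learning task."""
    d_cols = _as_list(d_cols)
    tasks = {}
    for d in d_cols:
        # Optionally add the remaining treatment variables as covariates.
        others = [o for o in d_cols if o != d] if use_other_treat_as_covariate else []
        tasks[d] = _as_list(x_cols) + others
    return tasks


# Hypothetical column layout: one outcome, two treatments, one instrument.
columns = ["y", "d1", "d2", "z", "x1", "x2"]
x_default = resolve_covariates(columns, y_col="y", d_cols=["d1", "d2"], z_cols="z")
tasks = covariates_per_treatment(x_default, ["d1", "d2"])
print(x_default)  # ['x1', 'x2']
print(tasks)      # {'d1': ['x1', 'x2', 'd2'], 'd2': ['x1', 'x2', 'd1']}
```

One consequence of the default rule is that identifier or bookkeeping columns in a dataframe would also be treated as covariates when x_cols is omitted, so listing x_cols explicitly, as in the Bonus example, is the safer choice.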
2.2. DoubleMLData from arrays and matrices
To introduce the array interface we generate a data set consisting of confounding variables X, an outcome variable y and a treatment variable d.
Note
In Python we use numpy.ndarray and the API reference can be found here: doubleml.DoubleMLData.from_arrays().
In R we use the R base class matrix() and the API reference can be found here: DoubleML::double_ml_data_from_matrix().
In [7]: import numpy as np
# Generate data
In [8]: np.random.seed(3141)
In [9]: n_obs = 500
In [10]: n_vars = 100
In [11]: theta = 3
In [12]: X = np.random.normal(size=(n_obs, n_vars))
In [13]: d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
In [14]: y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
# Generate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(stats::rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
y = theta * d + X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
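Read off the Python and R code above, the simulated data follow two generating equations. The index notation \(x_{i,j}\) and the noise symbols \(v_i\) and \(u_i\) are introduced here only for illustration and do not appear in the code.

```latex
\[
\begin{aligned}
d_i &= 5\,(x_{i,1} + x_{i,2} + x_{i,3}) + v_i, \\
y_i &= \theta\, d_i + 5\,(x_{i,1} + x_{i,2} + x_{i,3}) + u_i,
\end{aligned}
\qquad i = 1, \dots, 500,
\]
```

Here \(\theta = 3\), and \(x_{i,j}\) (for \(j = 1, \dots, 100\)), \(v_i\) and \(u_i\) are independent standard normal draws. Only the first three of the 100 columns of X enter either equation, and they enter both, which is what makes them confounders of the effect of d on y.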
To specify the data and the variables for the causal model from arrays we call
In [15]: from doubleml import DoubleMLData
In [16]: obj_dml_data_sim = DoubleMLData.from_arrays(X, y, d)
In [17]: print(obj_dml_data_sim)
================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500
------------------ DataFrame info ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 102 entries, X1 to d
dtypes: float64(102)
memory usage: 398.6 KB
obj_dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
obj_dml_data_sim
================= DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s):
No. Observations: 500
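Both printed summaries show that the array interface assigns generic column names: the covariates become X1 to X100, followed by y and d. The small sketch below reproduces that naming for the single-outcome, single-treatment case shown here. The helper default_column_names is hypothetical and only mirrors the printed output; it is not part of DoubleML.

```python
def default_column_names(n_vars):
    """Column names as they appear in the printed summary above
    (one outcome, one treatment, no instruments)."""
    # Array column j (zero-based) corresponds to covariate name X{j+1}.
    x_cols = [f"X{j + 1}" for j in range(n_vars)]
    return x_cols + ["y", "d"]


names = default_column_names(100)
print(len(names))             # 102, matching "Columns: 102 entries, X1 to d"
print(names[:3], names[-2:])  # ['X1', 'X2', 'X3'] ['y', 'd']
```

Because the names are one-based, the three confounding columns X[:, :3] of the simulated array appear as X1, X2 and X3 in the data object.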