2. The data-backend DoubleMLData#

DoubleML provides interfaces to dataframes as well as arrays. The usage of both interfaces is demonstrated in the following. We download the Bonus data set from the Pennsylvania Reemployment Bonus experiment.

Note

In [1]: from doubleml.datasets import fetch_bonus

# Load data
In [2]: df_bonus = fetch_bonus('DataFrame')

In [3]: df_bonus.head(5)
Out[3]: 
   index   abdt  tg  inuidur1  inuidur2  ...  lusd  husd  muld  dep1  dep2
0      0  10824   0  2.890372        18  ...     0     1     0   0.0   1.0
1      3  10824   0  0.000000         1  ...     1     0     0   0.0   0.0
2      4  10747   0  3.295837        27  ...     1     0     0   0.0   0.0
3     11  10607   1  2.197225         9  ...     0     0     1   0.0   0.0
4     12  10831   0  3.295837        27  ...     1     0     0   1.0   0.0

[5 rows x 26 columns]
library(DoubleML)

# Load data as data.table
dt_bonus = fetch_bonus(return_type = "data.table")
head(dt_bonus)

# Load data as data.frame
df_bonus = fetch_bonus(return_type = "data.frame")
head(df_bonus)
A data.table: 6 × 17
inuidur1femaleblackothracedep1dep2q2q3q4q5q6agelt35agegt54durablelusdhusdtg
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
2.8903720000100010000010
0.0000000000000010000100
3.2958370000000100000100
2.1972250000001000100001
3.2958370001000010011100
3.2958371000000010010100
A data.frame: 6 × 17
inuidur1femaleblackothracedep1dep2q2q3q4q5q6agelt35agegt54durablelusdhusdtg
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
12.8903720000100010000010
40.0000000000000010000100
53.2958370000000100000100
122.1972250000001000100001
133.2958370001000010011100
143.2958371000000010010100

2.1. DoubleMLData from dataframes#

The DoubleMLData class serves as data-backend and can be initialized from a dataframe by specifying the column y_col='inuidur1' serving as outcome variable \(Y\), the column(s) d_cols = 'tg' serving as treatment variable \(D\) and the columns x_cols specifying the confounders.

Note

In [4]: from doubleml import DoubleMLData

# Specify the data and the variables for the causal model
In [5]: obj_dml_data_bonus = DoubleMLData(df_bonus,
   ...:                                   y_col='inuidur1',
   ...:                                   d_cols='tg',
   ...:                                   x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
   ...:                                           'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
   ...:                                           'durable', 'lusd', 'husd'],
   ...:                                   use_other_treat_as_covariate=True)
   ...: 

In [6]: print(obj_dml_data_bonus)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Columns: 26 entries, index to dep2
dtypes: float64(3), int64(23)
memory usage: 1.0 MB
# Specify the data and the variables for the causal model

# From data.table object
obj_dml_data_bonus = DoubleMLData$new(dt_bonus,
                            y_col = "inuidur1",
                            d_cols = "tg",
                            x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                          "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                          "durable", "lusd", "husd"),
                            use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus

# From dat.frame object
obj_dml_data_bonus_df = double_ml_data_from_data_frame(df_bonus,
                            y_col = "inuidur1",
                            d_cols = "tg",
                            x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                          "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                          "durable", "lusd", "husd"),
                            use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus_df
================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s): 
No. Observations: 5099
================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s): 
No. Observations: 5099

Comments on detailed specifications:

  • If x_cols is not specified, all variables (columns of the dataframe) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates.

  • In case of multiple treatment variables, the boolean use_other_treat_as_covariate indicates whether the other treatment variables should be added as covariates in each treatment-variable-specific learning task.

  • Instrumental variables for IV models have to be provided as z_cols.

2.2. DoubleMLData from arrays and matrices#

To introduce the array interface we generate a data set consisting of confounding variables X, an outcome variable y and a treatment variable d

Note

In [7]: import numpy as np

# Generate data
In [8]: np.random.seed(3141)

In [9]: n_obs = 500

In [10]: n_vars = 100

In [11]: theta = 3

In [12]: X = np.random.normal(size=(n_obs, n_vars))

In [13]: d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))

In [14]: y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
# Generate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(stats::rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
y = theta * d + X[, 1:3, drop = FALSE] %*% c(5, 5, 5)  + stats::rnorm(n_obs)

To specify the data and the variables for the causal model from arrays we call

In [15]: from doubleml import DoubleMLData

In [16]: obj_dml_data_sim = DoubleMLData.from_arrays(X, y, d)

In [17]: print(obj_dml_data_sim)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 102 entries, X1 to d
dtypes: float64(102)
memory usage: 398.6 KB
obj_dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
obj_dml_data_sim
================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s): 
No. Observations: 500