2. Data Backend#

DoubleML generally provides interfaces to dataframes as well as arrays.

2.1. DoubleMLData#

Both interfaces are demonstrated below. We first download the Bonus data set from the Pennsylvania Reemployment Bonus experiment.

In [1]: from doubleml.datasets import fetch_bonus

# Load data
In [2]: df_bonus = fetch_bonus('DataFrame')

In [3]: df_bonus.head(5)
Out[3]: 
   index   abdt  tg  inuidur1  inuidur2  ...  lusd  husd  muld  dep1  dep2
0      0  10824   0  2.890372        18  ...     0     1     0   0.0   1.0
1      3  10824   0  0.000000         1  ...     1     0     0   0.0   0.0
2      4  10747   0  3.295837        27  ...     1     0     0   0.0   0.0
3     11  10607   1  2.197225         9  ...     0     0     1   0.0   0.0
4     12  10831   0  3.295837        27  ...     1     0     0   1.0   0.0

[5 rows x 26 columns]
library(DoubleML)

# Load data as data.table
dt_bonus = fetch_bonus(return_type = "data.table")
head(dt_bonus)

# Load data as data.frame
df_bonus = fetch_bonus(return_type = "data.frame")
head(df_bonus)
A data.table: 6 × 17 (all columns <dbl>)
inuidur1  female  black  othrace  dep1  dep2  q2  q3  q4  q5  q6  agelt35  agegt54  durable  lusd  husd  tg
2.890372       0      0        0     0     1   0   0   0   1   0        0        0        0     0     1   0
0.000000       0      0        0     0     0   0   0   0   1   0        0        0        0     1     0   0
3.295837       0      0        0     0     0   0   0   1   0   0        0        0        0     1     0   0
2.197225       0      0        0     0     0   0   1   0   0   0        1        0        0     0     0   1
3.295837       0      0        0     1     0   0   0   0   1   0        0        1        1     1     0   0
3.295837       1      0        0     0     0   0   0   0   1   0        0        1        0     1     0   0

A data.frame: 6 × 17 (all columns <dbl>; first column shows the row names)
    inuidur1  female  black  othrace  dep1  dep2  q2  q3  q4  q5  q6  agelt35  agegt54  durable  lusd  husd  tg
 1  2.890372       0      0        0     0     1   0   0   0   1   0        0        0        0     0     1   0
 4  0.000000       0      0        0     0     0   0   0   0   1   0        0        0        0     1     0   0
 5  3.295837       0      0        0     0     0   0   0   1   0   0        0        0        0     1     0   0
12  2.197225       0      0        0     0     0   0   1   0   0   0        1        0        0     0     0   1
13  3.295837       0      0        0     1     0   0   0   0   1   0        0        1        1     1     0   0
14  3.295837       1      0        0     0     0   0   0   0   1   0        0        1        0     1     0   0

2.1.1. DoubleMLData from dataframes#

The DoubleMLData class serves as the data backend. It can be initialized from a dataframe by specifying the column y_col='inuidur1' as the outcome variable \(Y\), the column(s) d_cols='tg' as the treatment variable \(D\), and the columns x_cols as the confounders.

In [4]: from doubleml import DoubleMLData

# Specify the data and the variables for the causal model
In [5]: obj_dml_data_bonus = DoubleMLData(df_bonus,
   ...:                                   y_col='inuidur1',
   ...:                                   d_cols='tg',
   ...:                                   x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
   ...:                                           'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
   ...:                                           'durable', 'lusd', 'husd'],
   ...:                                   use_other_treat_as_covariate=True)
   ...: 

In [6]: print(obj_dml_data_bonus)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Columns: 26 entries, index to dep2
dtypes: float64(3), int64(23)
memory usage: 1.0 MB
# Specify the data and the variables for the causal model

# From data.table object
obj_dml_data_bonus = DoubleMLData$new(dt_bonus,
                            y_col = "inuidur1",
                            d_cols = "tg",
                            x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                          "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                          "durable", "lusd", "husd"),
                            use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus

# From data.frame object
obj_dml_data_bonus_df = double_ml_data_from_data_frame(df_bonus,
                            y_col = "inuidur1",
                            d_cols = "tg",
                            x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                          "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
                                          "durable", "lusd", "husd"),
                            use_other_treat_as_covariate=TRUE)
obj_dml_data_bonus_df
================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s): 
Selection variable: 
No. Observations: 5099
================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): tg
Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
Instrument(s): 
Selection variable: 
No. Observations: 5099

Comments on detailed specifications:

  • If x_cols is not specified, all variables (columns of the dataframe) which are neither specified as outcome variable y_col, nor treatment variables d_cols, nor instrumental variables z_cols are used as covariates.

  • In case of multiple treatment variables, the boolean use_other_treat_as_covariate indicates whether the other treatment variables should be added as covariates in each treatment-variable-specific learning task.

  • Instrumental variables for IV models have to be provided as z_cols.
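The first two rules above can be sketched in plain Python. This is only an illustration of the described behavior, not DoubleML's implementation: the helper names as_list, default_x_cols and covariates_per_treatment and the toy column names are hypothetical.

```python
def as_list(cols):
    """Normalize None, a single column name, or a list of names to a list."""
    if cols is None:
        return []
    return [cols] if isinstance(cols, str) else list(cols)


def default_x_cols(columns, y_col, d_cols, z_cols=None):
    """Hypothetical helper: columns without an assigned role, in original order."""
    taken = {y_col, *as_list(d_cols), *as_list(z_cols)}
    return [c for c in columns if c not in taken]


def covariates_per_treatment(x_cols, d_cols, use_other_treat_as_covariate=True):
    """Hypothetical helper: covariates used in each treatment-specific learning task."""
    d_cols = as_list(d_cols)
    result = {}
    for d in d_cols:
        # Optionally append the remaining treatment variables as extra covariates.
        others = [o for o in d_cols if o != d] if use_other_treat_as_covariate else []
        result[d] = list(x_cols) + others
    return result


# Toy column layout (assumed for illustration only).
columns = ["y", "d1", "d2", "z", "x1", "x2"]
x_cols = default_x_cols(columns, y_col="y", d_cols=["d1", "d2"], z_cols="z")
print(x_cols)  # ['x1', 'x2']
print(covariates_per_treatment(x_cols, ["d1", "d2"]))
```

With use_other_treat_as_covariate=False, each treatment would be paired with x_cols only.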

2.1.2. DoubleMLData from arrays and matrices#

To introduce the array interface, we generate a data set consisting of confounding variables X, an outcome variable y and a treatment variable d.

In [7]: import numpy as np

# Generate data
In [8]: np.random.seed(3141)

In [9]: n_obs = 500

In [10]: n_vars = 100

In [11]: theta = 3

In [12]: X = np.random.normal(size=(n_obs, n_vars))

In [13]: d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))

In [14]: y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
# Generate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(stats::rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = X[, 1:3, drop = FALSE] %*% c(5, 5, 5) + stats::rnorm(n_obs)
y = theta * d + X[, 1:3, drop = FALSE] %*% c(5, 5, 5)  + stats::rnorm(n_obs)
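Written out, the code above simulates the following model. The equations are read off the code; the error symbols \(u_i\) and \(v_i\) and the index notation are introduced here for illustration.

\[
d_i = 5\,(X_{i1} + X_{i2} + X_{i3}) + v_i, \qquad
y_i = \theta\, d_i + 5\,(X_{i1} + X_{i2} + X_{i3}) + u_i, \qquad \theta = 3,
\]

for \(i = 1, \ldots, 500\), where all entries \(X_{ij}\) (\(j = 1, \ldots, 100\)) and the errors \(u_i\), \(v_i\) are independent standard normal draws. Only the first three of the 100 columns of X enter d and y, so these three act as confounders and the remaining 97 are pure noise.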

To specify the data and the variables for the causal model from arrays, we call:

In [15]: from doubleml import DoubleMLData

In [16]: obj_dml_data_sim = DoubleMLData.from_arrays(X, y, d)

In [17]: print(obj_dml_data_sim)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): None
No. Observations: 500

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 102 entries, X1 to d
dtypes: float64(102)
memory usage: 398.6 KB
library(DoubleML)

obj_dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
obj_dml_data_sim
================= DoubleMLData Object ==================


------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
Instrument(s): 
Selection variable: 
No. Observations: 500
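Judging from the printed summaries, the array interface assigns the default names X1, ..., X100 to the covariate columns and stores y and d as additional columns ("Columns: 102 entries, X1 to d"). A minimal NumPy sketch of that layout follows. It assumes a single treatment, and the helper name_and_stack is hypothetical rather than part of DoubleML.

```python
import numpy as np


def name_and_stack(X, y, d):
    """Hypothetical helper: stack X, y and d column-wise and build default names."""
    X = np.asarray(X)
    y = np.asarray(y).reshape(-1, 1)  # accept 1-D vectors or n x 1 matrices
    d = np.asarray(d).reshape(-1, 1)  # single treatment variable assumed
    names = [f"X{j + 1}" for j in range(X.shape[1])] + ["y", "d"]
    data = np.hstack([X, y, d])
    return names, data


# Same data-generating process as above, drawn with NumPy's Generator API
# (so the numbers differ from the np.random.seed-based example).
rng = np.random.default_rng(3141)
X = rng.normal(size=(500, 100))
d = X[:, :3] @ np.array([5, 5, 5]) + rng.standard_normal(500)
y = 3 * d + X[:, :3] @ np.array([5, 5, 5]) + rng.standard_normal(500)

names, data = name_and_stack(X, y, d)
print(names[:3], names[-2:], data.shape)
```

Passing data and names to pandas.DataFrame(data, columns=names) would give a dataframe with the same column layout as the one summarized in the output above.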

2.2. Special Data Types#

The DoubleMLData class is extended by the following classes to support special data types or allow for additional parameters.

2.2.1. DoubleMLPanelData#

The DoubleMLPanelData class serves as the data backend for DiD models and can be initialized from a dataframe. The class is a subclass of DoubleMLData and inherits all of its methods and attributes. In addition, it takes the following arguments to handle panel data:

  • id_col: column with a unique identifier for each unit

  • t_col: column specifying the time period of each observation

  • datetime_unit: unit of the time periods (e.g. 'Y', 'M', 'D', 'h', 'm', 's')

Note

The t_col can contain float, int or datetime values.

In [1]: import numpy as np

In [2]: import doubleml as dml

In [3]: from doubleml.did.datasets import make_did_CS2021

In [4]: np.random.seed(42)

In [5]: df = make_did_CS2021(n_obs=500)

In [6]: dml_data = dml.data.DoubleMLPanelData(
   ...:     df,
   ...:     y_col="y",
   ...:     d_cols="d",
   ...:     id_col="id",
   ...:     t_col="t",
   ...:     x_cols=["Z1", "Z2", "Z3", "Z4"],
   ...:     datetime_unit="M"
   ...: )
   ...: 

In [7]: print(dml_data)
================== DoubleMLPanelData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['Z1', 'Z2', 'Z3', 'Z4']
Instrument variable(s): None
Time variable: t
Id variable: id
No. Observations: 500

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Columns: 10 entries, id to Z4
dtypes: datetime64[s](2), float64(7), int64(1)
memory usage: 195.4 KB
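In the output above, the summary reports 500 observations while the dataframe has 2500 rows. This is consistent with a long-format panel in which each of the 500 units is observed in 5 time periods, so the summary appears to count units rather than rows. The following standard-library sketch illustrates the long format on a toy scale; the unit ids, dates and variable names are made up for illustration.

```python
from datetime import date
from itertools import product

# Toy long-format panel: one row per (unit, period) combination.
unit_ids = range(4)
periods = [date(2025, month, 1) for month in range(1, 4)]  # monthly periods
rows = [{"id": i, "t": t} for i, t in product(unit_ids, periods)]

n_rows = len(rows)
n_units = len({row["id"] for row in rows})
n_periods = len({row["t"] for row in rows})

# A balanced panel has exactly one row per unit and period.
is_balanced = n_rows == n_units * n_periods
print(n_rows, n_units, n_periods, is_balanced)  # 12 4 3 True
```

Here id plays the role of id_col and t the role of t_col. Applied to the data above, the same count gives 500 units × 5 periods = 2500 rows.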