Python: Hyperparameter Tuning with Optuna#

This notebook explains how to use the tune_ml_models() method to tune hyperparameters using the Optuna package.

In this example, we will focus on the DoubleMLAPO model to estimate average potential outcomes (APOs) in an interactive regression model (see DoubleMLIRM).

The goal is to estimate the average potential outcome

\[\theta_0 =\mathbb{E}[Y(d)]\]

for a given treatment level \(d\) and a discrete-valued treatment \(D\).

For a more detailed description of the DoubleMLAPO model, see Average Potential Outcome Model or Example Gallery.
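
For orientation, the two nuisance functions estimated below by the learners ml_g and ml_m are the outcome regression and the treatment-level propensity score,

\[g_0(d, X) = \mathbb{E}[Y \mid D = d, X], \qquad m_{0,d}(X) = \mathbb{P}(D = d \mid X),\]

and, up to the exact notation of the linked documentation, the APO is identified via the doubly robust score

\[\theta_0 = \mathbb{E}\left[g_0(d, X) + \frac{1\{D = d\}}{m_{0,d}(X)} \big(Y - g_0(d, X)\big)\right].\]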

Remark: The untuned settings and hyperparameter spaces are mainly chosen to showcase the tuning possibilities; they are not a blueprint for standard tuning.

[1]:
import optuna
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.io import show

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression
from sklearn.ensemble import StackingRegressor, StackingClassifier
from lightgbm import LGBMRegressor, LGBMClassifier

from doubleml.data import DoubleMLData
from doubleml.irm import DoubleMLAPO, DoubleMLAPOS
from doubleml.irm.datasets import make_irm_data_discrete_treatments

palette = sns.color_palette("colorblind")

import warnings
warnings.filterwarnings("ignore")

Data Generating Process (DGP)#

First, let us generate data according to the make_irm_data_discrete_treatments data generating process. The process generates data with a continuous treatment variable and contains the true individual treatment effects (ITEs) relative to the option of not being treated.

The continuous treatment variable is then discretized into multiple treatment levels based on quantiles. Using the oracle ITEs enables a comparison to the true APOs and average treatment effects (ATEs) for the different levels of the treatment variable.

Remark: The average potential outcome model does not require an underlying continuous treatment variable. The model will work identically if the treatment variable is discrete by design.

[2]:
# Parameters
n_obs = 500
n_levels = 3
treatment_lvl = 1.0

np.random.seed(42)
data_apo = make_irm_data_discrete_treatments(n_obs=n_obs, n_levels=n_levels, linear=False)

y0 = data_apo['oracle_values']['y0']
cont_d = data_apo['oracle_values']['cont_d']
ite = data_apo['oracle_values']['ite']
d = data_apo['d']
potential_level = data_apo['oracle_values']['potential_level']
level_bounds = data_apo['oracle_values']['level_bounds']

average_ites = np.full(n_levels + 1, np.nan)
apos = np.full(n_levels + 1, np.nan)
mid_points = np.full(n_levels, np.nan)

for i in range(n_levels + 1):
    average_ites[i] = np.mean(ite[d == i]) * (i > 0)
    apos[i] = np.mean(y0) + average_ites[i]

print(f"Average Individual effects in each group:\n{np.round(average_ites,2)}\n")
print(f"Average Potential Outcomes in each group:\n{np.round(apos,2)}\n")
print(f"Levels and their counts:\n{np.unique(d, return_counts=True)}")
Average Individual effects in each group:
[ 0.    3.44  9.32 10.49]

Average Potential Outcomes in each group:
[210.05 213.49 219.38 220.54]

Levels and their counts:
(array([0., 1., 2., 3.]), array([171, 110, 109, 110]))

As for all DoubleML models, we specify a DoubleMLData object to handle the data.

[3]:
y = data_apo['y']
x = data_apo['x']
d = data_apo['d']
df_apo = pd.DataFrame(
    np.column_stack((y, d, x)),
    columns=['y', 'd'] + ['x' + str(i) for i in range(data_apo['x'].shape[1])]
)

dml_data = DoubleMLData(df_apo, 'y', 'd')
print(dml_data)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['d']
Covariates: ['x0', 'x1', 'x2', 'x3', 'x4']
Instrument variable(s): None
No. Observations: 500

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 7 entries, y to x4
dtypes: float64(7)
memory usage: 27.5 KB

Basic Tuning Example#

First, we will look at a very basic tuning example without much customization.

Define Nuisance Learners#

For our example, we will choose LightGBM learners, which are a typical non-parametric choice.

[4]:
ml_g = LGBMRegressor(random_state=314, verbose=-1)
ml_m = LGBMClassifier(random_state=314, verbose=-1)

Untuned Model#

Now let us take a look at the standard workflow, focusing on a single treatment level and using default hyperparameters.

[5]:
dml_obj_untuned = DoubleMLAPO(
    dml_data,
    ml_g,
    ml_m,
    treatment_level=treatment_lvl,
)

dml_obj_untuned.fit()
dml_obj_untuned.summary
[5]:
coef std err t P>|t| 2.5 % 97.5 %
d 211.232659 15.657431 13.490888 1.769555e-41 180.544657 241.920661

Hyperparameter Tuning#

Now, let us take a look at the basic hyperparameter tuning. We will initialize a separate model to compare the results.

[6]:
dml_obj_tuned = DoubleMLAPO(
    dml_data,
    ml_g,
    ml_m,
    treatment_level=treatment_lvl,
)

The required input for tuning is a dictionary of parameter spaces for the hyperparameters of each learner that should be tuned. This dictionary should include a callable for each learner you want to tune (a subset is also possible, e.g. tuning only ml_g).

Each parameter space should be a callable that suggests the search space via a trial object.

Generally, the hyperparameter structure should follow the definitions in Optuna, but instead of a full objective, only the hyperparameter suggestions have to be specified as a callable. The DoubleML object then constructs an objective for each learner using the supplied parameter space.
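
To make this division of labor concrete, here is a minimal sketch of a standalone Optuna workflow (hypothetical, not part of the DoubleML tuning itself): a plain Optuna objective bundles parameter suggestion, model fitting and scoring, whereas with tune_ml_models() only the parameter-suggesting callable is supplied and the cross-validated nuisance objective is constructed internally.

import optuna
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMRegressor

# A full Optuna objective: suggest parameters, fit the model, return a score.
def objective(trial):
    params = {
        'n_estimators': 100,
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.1, log=True),
    }
    model = LGBMRegressor(**params, random_state=314, verbose=-1)
    # Cross-validated negative MSE; Optuna maximizes this below.
    return cross_val_score(model, x, np.ravel(y), scoring='neg_mean_squared_error').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)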

To keep this example relatively fast and simple, we keep n_estimators fixed and only tune a small number of other hyperparameters.

[7]:
# parameter space for the outcome regression tuning
def ml_g_params(trial):
    return {
        'n_estimators': 100,
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.1, log=True),
        'max_depth': 5,
        'min_child_samples': trial.suggest_int('min_child_samples', 20, 50, step=10),
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-2, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-2, 10.0, log=True),
    }

# parameter space for the propensity score tuning
def ml_m_params(trial):
    return {
        'n_estimators': 100,
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.1, log=True),
        'max_depth': 5,
        'min_child_samples': trial.suggest_int('min_child_samples', 20, 50, step=10),
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-2, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-2, 10.0, log=True),
    }

param_space = {
    'ml_g': ml_g_params,
    'ml_m': ml_m_params
}

To tune the hyperparameters, call tune_ml_models() with the ml_param_space argument. Further, you can define the number of trials and other Optuna options via the optuna_settings argument.

[8]:
optuna_settings = {
    'n_trials': 200,
    'show_progress_bar': True,
    'verbosity': optuna.logging.WARNING,  # Suppress Optuna logs
}

dml_obj_tuned.tune_ml_models(
    ml_param_space=param_space,
    optuna_settings=optuna_settings,
)
[8]:
<doubleml.irm.apo.DoubleMLAPO at 0x7f029668ba10>

By default, the model will set the best hyperparameters automatically (identical hyperparameters for each fold), and you can directly call the fit() method afterwards.

[9]:
dml_obj_tuned.fit()
dml_obj_tuned.summary
[9]:
coef std err t P>|t| 2.5 % 97.5 %
d 215.273433 2.485 86.62916 0.0 210.402924 220.143943

Remark: Even if the initialization and tuning only require the learners ml_g and ml_m, the models in the irm submodule generally copy the learner for ml_g and fit different response surfaces for the treatment and control (or non-treatment) groups. These different learners are tuned separately but with the same parameter space. To see which parameter spaces can be tuned, take a look at the params_names property.

In this example, we specified the parameter spaces for ml_m and ml_g, but three sets of hyperparameters were actually tuned, i.e. ml_m, ml_g_d_lvl0 and ml_g_d_lvl1 (two response surfaces for the outcome).

[10]:
dml_obj_tuned.params_names
[10]:
['ml_g_d_lvl0', 'ml_g_d_lvl1', 'ml_m']
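
These names are also what you would use to assign externally tuned hyperparameters by hand. A minimal sketch, assuming the standard set_ml_nuisance_params() interface of DoubleML models (a plain dict applies the same parameters to all folds; 'd' is the treatment variable):

# Hypothetical manual assignment for one of the listed nuisance learners.
dml_obj_tuned.set_ml_nuisance_params(
    'ml_g_d_lvl1', 'd',
    {'learning_rate': 0.05, 'min_child_samples': 20}
)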

Each hyperparameter combination is set for each fold.

[11]:
dml_obj_tuned.params
[11]:
{'ml_g_d_lvl0': {'d': [[{'learning_rate': 0.09801463544687879,
     'min_child_samples': 20,
     'lambda_l1': 3.1247297280098136,
     'lambda_l2': 0.1676705013704926},
    {'learning_rate': 0.09801463544687879,
     'min_child_samples': 20,
     'lambda_l1': 3.1247297280098136,
     'lambda_l2': 0.1676705013704926},
    {'learning_rate': 0.09801463544687879,
     'min_child_samples': 20,
     'lambda_l1': 3.1247297280098136,
     'lambda_l2': 0.1676705013704926},
    {'learning_rate': 0.09801463544687879,
     'min_child_samples': 20,
     'lambda_l1': 3.1247297280098136,
     'lambda_l2': 0.1676705013704926},
    {'learning_rate': 0.09801463544687879,
     'min_child_samples': 20,
     'lambda_l1': 3.1247297280098136,
     'lambda_l2': 0.1676705013704926}]]},
 'ml_g_d_lvl1': {'d': [[{'learning_rate': 0.09957868943595276,
     'min_child_samples': 20,
     'lambda_l1': 1.0312190390696285,
     'lambda_l2': 0.058903541934281406},
    {'learning_rate': 0.09957868943595276,
     'min_child_samples': 20,
     'lambda_l1': 1.0312190390696285,
     'lambda_l2': 0.058903541934281406},
    {'learning_rate': 0.09957868943595276,
     'min_child_samples': 20,
     'lambda_l1': 1.0312190390696285,
     'lambda_l2': 0.058903541934281406},
    {'learning_rate': 0.09957868943595276,
     'min_child_samples': 20,
     'lambda_l1': 1.0312190390696285,
     'lambda_l2': 0.058903541934281406},
    {'learning_rate': 0.09957868943595276,
     'min_child_samples': 20,
     'lambda_l1': 1.0312190390696285,
     'lambda_l2': 0.058903541934281406}]]},
 'ml_m': {'d': [[{'learning_rate': 0.0732109605604835,
     'min_child_samples': 30,
     'lambda_l1': 9.590467244372398,
     'lambda_l2': 0.078138007883929},
    {'learning_rate': 0.0732109605604835,
     'min_child_samples': 30,
     'lambda_l1': 9.590467244372398,
     'lambda_l2': 0.078138007883929},
    {'learning_rate': 0.0732109605604835,
     'min_child_samples': 30,
     'lambda_l1': 9.590467244372398,
     'lambda_l2': 0.078138007883929},
    {'learning_rate': 0.0732109605604835,
     'min_child_samples': 30,
     'lambda_l1': 9.590467244372398,
     'lambda_l2': 0.078138007883929},
    {'learning_rate': 0.0732109605604835,
     'min_child_samples': 30,
     'lambda_l1': 9.590467244372398,
     'lambda_l2': 0.078138007883929}]]}}

Comparison#

Let us compare the results of both models. Looking at the predictive performance of the learners, the main difference can be observed in the log loss of the propensity score ml_m.

[12]:
dml_obj_untuned.evaluate_learners()
[12]:
{'ml_g_d_lvl0': array([[14.49934441]]),
 'ml_g_d_lvl1': array([[23.50828321]]),
 'ml_m': array([[0.44349955]])}
[13]:
dml_obj_tuned.evaluate_learners()
[13]:
{'ml_g_d_lvl0': array([[14.73579164]]),
 'ml_g_d_lvl1': array([[23.25821083]]),
 'ml_m': array([[0.4101895]])}

As a result, the standard error is reduced and the confidence interval is much tighter.

[14]:
ci_untuned = dml_obj_untuned.confint()
ci_tuned = dml_obj_tuned.confint()

# Create comparison dataframe
comparison_data = {
    'Model': ['Untuned', 'Tuned'],
    'theta': [dml_obj_untuned.coef[0], dml_obj_tuned.coef[0]],
    'se': [dml_obj_untuned.se[0], dml_obj_tuned.se[0]],
    'ci_lower': [ci_untuned.iloc[0, 0], ci_tuned.iloc[0, 0]],
    'ci_upper': [ci_untuned.iloc[0, 1], ci_tuned.iloc[0, 1]]
}
df_comparison = pd.DataFrame(comparison_data)

print(f"\nTrue APO at treatment level {treatment_lvl}: {apos[int(treatment_lvl)]:.4f}\n")
print(df_comparison.to_string(index=False))

plt.figure(figsize=(10, 6))
plt.errorbar(0, df_comparison.loc[0, 'theta'],
             yerr=[[df_comparison.loc[0, 'theta'] - df_comparison.loc[0, 'ci_lower']],
                   [df_comparison.loc[0, 'ci_upper'] - df_comparison.loc[0, 'theta']]],
             fmt='o', capsize=5, capthick=2, ecolor=palette[0], color=palette[0],
             label='Untuned', markersize=10, zorder=2)
plt.errorbar(1, df_comparison.loc[1, 'theta'],
             yerr=[[df_comparison.loc[1, 'theta'] - df_comparison.loc[1, 'ci_lower']],
                   [df_comparison.loc[1, 'ci_upper'] - df_comparison.loc[1, 'theta']]],
             fmt='o', capsize=5, capthick=2, ecolor=palette[1], color=palette[1],
             label='Tuned', markersize=10, zorder=2)
plt.axhline(y=apos[int(treatment_lvl)], color=palette[4], linestyle='--',
            linewidth=2, label='True APO', zorder=1)

plt.title(f'Estimated APO Coefficients with and without Tuning for Treatment Level {treatment_lvl}')
plt.ylabel('Coefficient Value')
plt.xticks([0, 1], ['Untuned', 'Tuned'])
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

True APO at treatment level 1.0: 213.4904

  Model      theta        se   ci_lower   ci_upper
Untuned 211.232659 15.657431 180.544657 241.920661
  Tuned 215.273433  2.485000 210.402924 220.143943

Detailed Hyperparameter Tuning Guide#

In this section, we explore tuning options in more detail and employ a more complicated learning pipeline.

Define Nuisance Learners#

For our example, we will use Pipelines that wrap a StackingRegressor and a StackingClassifier.

[15]:
base_regressors = [
    ('linear_regression', LinearRegression()),
    ('lgbm', LGBMRegressor(random_state=42, verbose=-1))
]

stacking_regressor = StackingRegressor(
    estimators=base_regressors,
    final_estimator=Ridge()
)

ml_g_pipeline = Pipeline([
    ('scaler', RobustScaler()),
    ('stacking', stacking_regressor)
])
ml_g_pipeline
[15]:
Pipeline(steps=[('scaler', RobustScaler()),
                ('stacking',
                 StackingRegressor(estimators=[('linear_regression',
                                                LinearRegression()),
                                               ('lgbm',
                                                LGBMRegressor(random_state=42,
                                                              verbose=-1))],
                                   final_estimator=Ridge()))])
[16]:
base_classifiers = [
    ('logistic_regression', LogisticRegression(max_iter=1000, random_state=42)),
    ('lgbm', LGBMClassifier(random_state=42, verbose=-1))
]

stacking_classifier = StackingClassifier(
    estimators=base_classifiers,
    final_estimator=LogisticRegression(),
)

ml_m_pipeline = Pipeline([
    ('scaler', RobustScaler()),
    ('stacking', stacking_classifier)
])
ml_m_pipeline
[16]:
Pipeline(steps=[('scaler', RobustScaler()),
                ('stacking',
                 StackingClassifier(estimators=[('logistic_regression',
                                                 LogisticRegression(max_iter=1000,
                                                                    random_state=42)),
                                                ('lgbm',
                                                 LGBMClassifier(random_state=42,
                                                                verbose=-1))],
                                    final_estimator=LogisticRegression()))])

Untuned Model with Pipeline#

Now let us take a look at the standard workflow, focusing on a single treatment level and using default hyperparameters.

[17]:
dml_obj_untuned_pipeline = DoubleMLAPO(
    dml_data,
    ml_g_pipeline,
    ml_m_pipeline,
    treatment_level=treatment_lvl,
)

dml_obj_untuned_pipeline.fit()
dml_obj_untuned_pipeline.summary
[17]:
coef std err t P>|t| 2.5 % 97.5 %
d 213.512885 2.335125 91.435306 0.0 208.936124 218.089647

Hyperparameter Tuning with Pipelines#

Now, let us take a look at a more complex tuning setup. Again, we will initialize a separate model to compare the results.

[18]:
dml_obj_tuned_pipeline = DoubleMLAPO(
    dml_data,
    ml_g_pipeline,
    ml_m_pipeline,
    treatment_level=treatment_lvl,
)

As before, the tuning input is a dictionary of parameter spaces for the hyperparameters of each learner that should be tuned. This dictionary should include a callable for each learner you want to tune (a subset is also possible, e.g. tuning only ml_g).

Since we now have a much more complicated learner, the tuning inputs have to be passed correctly into the pipeline using scikit-learn's '<step>__<parameter>' naming convention (see the helper snippet below).
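
If you are unsure about the exact keys, scikit-learn's get_params() lists all valid names for a pipeline:

# List the tunable LightGBM parameter names inside the regression pipeline.
print([k for k in ml_g_pipeline.get_params() if 'lgbm' in k])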

[19]:
# parameter space for the outcome regression tuning
def ml_g_params_pipeline(trial):
    return {
        'stacking__lgbm__n_estimators': 100,
        'stacking__lgbm__learning_rate': trial.suggest_float('stacking__lgbm__learning_rate', 0.001, 0.1, log=True),
        'stacking__lgbm__max_depth': 5,
        'stacking__lgbm__min_child_samples': trial.suggest_int('stacking__lgbm__min_child_samples', 20, 50, step=10),
        'stacking__lgbm__lambda_l1': trial.suggest_float('stacking__lgbm__lambda_l1', 1e-3, 10.0, log=True),
        'stacking__lgbm__lambda_l2': trial.suggest_float('stacking__lgbm__lambda_l2', 1e-3, 10.0, log=True),
        'stacking__final_estimator__alpha': trial.suggest_float('stacking__final_estimator__alpha', 0.001, 10.0, log=True),
    }

# parameter space for the propensity score tuning
def ml_m_params_pipeline(trial):
    return {
        'stacking__lgbm__n_estimators': 100,
        'stacking__lgbm__learning_rate': trial.suggest_float('stacking__lgbm__learning_rate', 0.001, 0.1, log=True),
        'stacking__lgbm__max_depth': 5,
        'stacking__lgbm__min_child_samples': trial.suggest_int('stacking__lgbm__min_child_samples', 20, 50, step=10),
        'stacking__lgbm__lambda_l1': trial.suggest_float('stacking__lgbm__lambda_l1', 1e-3, 10.0, log=True),
        'stacking__lgbm__lambda_l2': trial.suggest_float('stacking__lgbm__lambda_l2', 1e-3, 10.0, log=True),
        'stacking__final_estimator__C': trial.suggest_float('stacking__final_estimator__C', 0.01, 100.0, log=True),
        'stacking__final_estimator__max_iter': 1000,
    }

param_space_pipeline = {
    'ml_g': ml_g_params_pipeline,
    'ml_m': ml_m_params_pipeline
}

As before, we can pass the arguments for Optuna via optuna_settings. For the possible options, please take a look at the Optuna Documentation. For each learner, you can pass local settings, which override the global settings.

Here, we will reduce the number of trials for ml_g, as it already performed quite well before. In principle, we could also use different samplers, but generally we recommend using the TPESampler, which is the default.

[20]:
optuna_settings_pipeline = {
    'n_trials': 200,
    'show_progress_bar': True,
    'verbosity': optuna.logging.WARNING,  # Suppress Optuna logs
    'ml_g': {
        'n_trials': 100
    }
}
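
In principle, other Optuna components could be configured in a similar way, e.g. a different sampler. This is a hedged sketch only: whether optuna_settings forwards a sampler entry to optuna.create_study() should be verified against the tune_ml_models() documentation.

# Hypothetical: request a seeded RandomSampler instead of the default TPESampler.
optuna_settings_random = {
    'n_trials': 200,
    'sampler': optuna.samplers.RandomSampler(seed=42),
}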

As before, we can tune the hyperparameters via the tune_ml_models() method. If we want to inspect the optuna.study results, we can return all tuning results via the return_tune_res argument.

We will have a detailed look at the returned results later in the notebook.

[21]:
tuning_results = dml_obj_tuned_pipeline.tune_ml_models(
    ml_param_space=param_space_pipeline,
    optuna_settings=optuna_settings_pipeline,
    return_tune_res=True,
)
[22]:
dml_obj_tuned_pipeline.fit()
dml_obj_tuned_pipeline.summary
[22]:
coef std err t P>|t| 2.5 % 97.5 %
d 213.86343 2.276774 93.932651 0.0 209.401035 218.325825

Remark: All settings (optuna_settings and ml_param_space) can also be set at the params_names level instead of the learner_names level, i.e. ml_g_d_lvl1 instead of ml_g. Generally, more specific settings override more general settings, as sketched below.
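
A hedged sketch of this, reusing the callables defined above (the more specific key overrides the general 'ml_g' entry for the treated response surface):

# Hypothetical params_names-level specification: a separate search space for
# ml_g_d_lvl1 only; 'ml_g' still covers ml_g_d_lvl0.
def ml_g_lvl1_params(trial):
    return {
        'stacking__lgbm__learning_rate': trial.suggest_float('stacking__lgbm__learning_rate', 0.001, 0.3, log=True),
        'stacking__final_estimator__alpha': trial.suggest_float('stacking__final_estimator__alpha', 0.001, 10.0, log=True),
    }

param_space_specific = {
    'ml_g': ml_g_params_pipeline,     # general: both response surfaces
    'ml_g_d_lvl1': ml_g_lvl1_params,  # specific: overrides 'ml_g' for this surface
    'ml_m': ml_m_params_pipeline,
}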

Comparison#

[23]:
ci_untuned = dml_obj_untuned.confint()
ci_tuned = dml_obj_tuned.confint()
ci_untuned_pipeline = dml_obj_untuned_pipeline.confint()
ci_tuned_pipeline = dml_obj_tuned_pipeline.confint()

# Create comparison dataframe
comparison_data = {
    'Model': ['Untuned', 'Tuned', 'Untuned Pipeline', 'Tuned Pipeline'],
    'theta': [dml_obj_untuned.coef[0], dml_obj_tuned.coef[0], dml_obj_untuned_pipeline.coef[0], dml_obj_tuned_pipeline.coef[0]],
    'se': [dml_obj_untuned.se[0], dml_obj_tuned.se[0], dml_obj_untuned_pipeline.se[0], dml_obj_tuned_pipeline.se[0]],
    'ci_lower': [ci_untuned.iloc[0, 0], ci_tuned.iloc[0, 0],
                 ci_untuned_pipeline.iloc[0, 0], ci_tuned_pipeline.iloc[0, 0]],
    'ci_upper': [ci_untuned.iloc[0, 1], ci_tuned.iloc[0, 1],
                 ci_untuned_pipeline.iloc[0, 1], ci_tuned_pipeline.iloc[0, 1]]
}
df_comparison = pd.DataFrame(comparison_data)

print(f"\nTrue APO at treatment level {treatment_lvl}: {apos[int(treatment_lvl)]:.4f}\n")
print(df_comparison.to_string(index=False))

# Create plot with all 4 models
plt.figure(figsize=(12, 6))

for i in range(len(df_comparison)):
    plt.errorbar(i, df_comparison.loc[i, 'theta'],
                 yerr=[[df_comparison.loc[i, 'theta'] - df_comparison.loc[i, 'ci_lower']],
                       [df_comparison.loc[i, 'ci_upper'] - df_comparison.loc[i, 'theta']]],
                 fmt='o', capsize=5, capthick=2, ecolor=palette[i], color=palette[i],
                 label=df_comparison.loc[i, 'Model'], markersize=10, zorder=2)

plt.axhline(y=apos[int(treatment_lvl)], color=palette[4], linestyle='--',
            linewidth=2, label='True APO', zorder=1)

plt.title('Estimated APO Coefficients: Comparison Across All Models')
plt.ylabel('Coefficient Value')
plt.xticks(range(4), df_comparison['Model'], rotation=15, ha='right')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

True APO at treatment level 1.0: 213.4904

           Model      theta        se   ci_lower   ci_upper
         Untuned 211.232659 15.657431 180.544657 241.920661
           Tuned 215.273433  2.485000 210.402924 220.143943
Untuned Pipeline 213.512885  2.335125 208.936124 218.089647
  Tuned Pipeline 213.863430  2.276774 209.401035 218.325825

Detailed Tuning Result Analysis#

The tune_ml_models() method creates several Optuna Studies, which can be inspected in detail via the returned results.

The results are a list of dictionaries, which contain a corresponding DMLOptunaResult for each treatment variable at the params_names level, i.e. a separate DMLOptunaResult is constructed for each Optuna Study object.

[24]:
# Optuna results for the single treatment
print(tuning_results[0].keys())
dict_keys(['ml_g_d_lvl0', 'ml_g_d_lvl1', 'ml_m'])

In this example, we take a more detailed look at the tuning of ml_m.

[25]:
print(tuning_results[0]['ml_m'])
================== DMLOptunaResult ==================
Learner name: ml_m
Params name: ml_m
Tuned: True
Best score: -0.5210215656500876
Scoring method: neg_log_loss

------------------ Best parameters    ------------------
{'stacking__final_estimator__C': 99.9143065172164,
 'stacking__lgbm__lambda_l1': 9.771463014326052,
 'stacking__lgbm__lambda_l2': 0.0013978426220758982,
 'stacking__lgbm__learning_rate': 0.0011563701553192595,
 'stacking__lgbm__min_child_samples': 30}

As we have access to the saved Optuna Study object, it is possible to inspect all trials and hyperparameter combinations.

[26]:
ml_m_study = tuning_results[0]['ml_m'].study
ml_m_study.trials_dataframe()
[26]:
number value datetime_start datetime_complete duration params_stacking__final_estimator__C params_stacking__lgbm__lambda_l1 params_stacking__lgbm__lambda_l2 params_stacking__lgbm__learning_rate params_stacking__lgbm__min_child_samples state
0 0 -0.528553 2025-11-26 15:53:57.489511 2025-11-26 15:53:57.890548 0 days 00:00:00.401037 0.037301 0.498575 0.001738 0.060706 30 COMPLETE
1 1 -0.523453 2025-11-26 15:53:57.891547 2025-11-26 15:53:58.267444 0 days 00:00:00.375897 24.451158 2.007596 0.013849 0.065453 50 COMPLETE
2 2 -0.528991 2025-11-26 15:53:58.268406 2025-11-26 15:53:58.600445 0 days 00:00:00.332039 0.014914 1.540299 0.242604 0.030858 40 COMPLETE
3 3 -0.521994 2025-11-26 15:53:58.601158 2025-11-26 15:53:58.929636 0 days 00:00:00.328478 43.646318 7.867187 0.004868 0.010726 20 COMPLETE
4 4 -0.528075 2025-11-26 15:53:58.930497 2025-11-26 15:53:59.343117 0 days 00:00:00.412620 0.235419 0.636639 6.249970 0.014593 30 COMPLETE
... ... ... ... ... ... ... ... ... ... ... ...
195 195 -0.521644 2025-11-26 15:55:14.129881 2025-11-26 15:55:14.449172 0 days 00:00:00.319291 97.760868 7.085697 0.001097 0.001008 30 COMPLETE
196 196 -0.521990 2025-11-26 15:55:14.449715 2025-11-26 15:55:14.810118 0 days 00:00:00.360403 99.447375 6.005734 0.001353 0.001181 30 COMPLETE
197 197 -0.521118 2025-11-26 15:55:14.810960 2025-11-26 15:55:15.111526 0 days 00:00:00.300566 99.675451 7.589184 0.001779 0.001080 30 COMPLETE
198 198 -0.525316 2025-11-26 15:55:15.112415 2025-11-26 15:55:15.543287 0 days 00:00:00.430872 99.982456 0.016786 0.001159 0.001184 30 COMPLETE
199 199 -0.521044 2025-11-26 15:55:15.544058 2025-11-26 15:55:15.826366 0 days 00:00:00.282308 80.453189 9.930357 0.001576 0.001073 30 COMPLETE

200 rows × 11 columns
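
For instance, the best value and the corresponding hyperparameters are available through the standard Optuna Study API:

# Best objective value and the corresponding hyperparameters.
print(ml_m_study.best_value)
print(ml_m_study.best_params)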

Additionally, we can use all of Optuna's visualization options.

[27]:
fig = optuna.visualization.plot_optimization_history(ml_m_study)
show(fig)
[28]:
fig = optuna.visualization.plot_parallel_coordinate(ml_m_study)
show(fig)
[29]:
fig = optuna.visualization.plot_param_importances(ml_m_study)
show(fig)

DoubleMLAPOS Tuning Example#

We will repeat the tuning procedure for all treatment levels with the DoubleMLAPOS object. Combined DoubleML objects, such as DoubleMLAPOS or DoubleMLDIDMulti, simply pass the tuning arguments to the underlying submodels.

[30]:
treatment_lvls = np.unique(d)
print(treatment_lvls)
[0. 1. 2. 3.]

Consequently, we are tuning \(3\) learners for each of the \(4\) treatment levels, i.e. \(12\) Optuna studies in total.

Untuned Model#

Again, let’s start with the untuned model. We will focus just on the boosted trees to highlight the ideas.

[31]:
dml_apos_untuned = DoubleMLAPOS(
    dml_data,
    ml_g,
    ml_m,
    treatment_levels=treatment_lvls,
)
dml_apos_untuned.fit()
dml_apos_untuned.summary
[31]:
coef std err t P>|t| 2.5 % 97.5 %
0.0 199.244627 9.335496 21.342693 0.0 180.947391 217.541863
1.0 206.488551 10.468016 19.725662 0.0 185.971617 227.005485
2.0 210.440867 9.128061 23.054279 0.0 192.550196 228.331538
3.0 236.296077 19.486692 12.126023 0.0 198.102863 274.489292

Hyperparameter Tuning#

Let’s initialize a second DoubleMLAPOS object.

[32]:
dml_apos_tuned = DoubleMLAPOS(
    dml_data,
    ml_g,
    ml_m,
    treatment_levels=treatment_lvls,
)

Again, we can directly call the tune_ml_models() method for tuning. This will take some time as each submodel is tuned.

[33]:
dml_apos_tuned.tune_ml_models(
    ml_param_space=param_space,
    optuna_settings=optuna_settings_pipeline,
)
[33]:
<doubleml.irm.apos.DoubleMLAPOS at 0x7f028a1bb4a0>

Afterwards, all hyperparameters are set for each submodel and we can proceed as usual.
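
For instance (assuming the modellist attribute of DoubleMLAPOS, which holds the underlying DoubleMLAPO submodels), the tuned nuisance names of a single submodel can be inspected as before:

# Tuned parameter names of the submodel for the first treatment level.
print(dml_apos_tuned.modellist[0].params_names)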

[34]:
dml_apos_tuned.fit()
dml_apos_tuned.summary
[34]:
coef std err t P>|t| 2.5 % 97.5 %
0.0 211.422244 2.233931 94.641355 0.0 207.043820 215.800668
1.0 215.370889 2.581675 83.422933 0.0 210.310899 220.430878
2.0 219.233573 2.243180 97.733404 0.0 214.837022 223.630124
3.0 215.927956 2.442976 88.387268 0.0 211.139811 220.716100
[35]:
dml_apos_tuned.causal_contrast(reference_levels=[0]).summary
[35]:
coef std err t P>|t| 2.5 % 97.5 %
1.0 vs 0.0 3.948644 2.615180 1.509894 0.131071 -1.177015 9.074303
2.0 vs 0.0 7.811329 2.375106 3.288833 0.001006 3.156205 12.466452
3.0 vs 0.0 4.505712 2.579080 1.747023 0.080633 -0.549192 9.560616

Finally, let’s compare the estimates including confidence intervals.

[36]:
# --- Plot APOs and 95% CIs for all treatment levels (tuned vs. untuned) ---

plt.figure(figsize=(12, 7))
palette = sns.color_palette("colorblind")

# Collect results for each treatment level from both models
treatment_levels_plot = np.unique(d)
n_levels = len(treatment_levels_plot)

# Prepare DataFrame for plotting
df_apo_plot = pd.DataFrame({
    'treatment_level': np.tile(treatment_levels_plot, 2),
    'APO': np.concatenate([dml_apos_untuned.coef, dml_apos_tuned.coef]),
    'ci_lower': np.concatenate([dml_apos_untuned.confint().iloc[:, 0], dml_apos_tuned.confint().iloc[:, 0]]),
    'ci_upper': np.concatenate([dml_apos_untuned.confint().iloc[:, 1], dml_apos_tuned.confint().iloc[:, 1]]),
    'Model': ['Untuned'] * n_levels + ['Tuned'] * n_levels
})

jitter_strength = 0.12
models = df_apo_plot['Model'].unique()
n_models = len(models)

for i, model in enumerate(models):
    df = df_apo_plot[df_apo_plot['Model'] == model]
    jitter = (i - (n_models - 1) / 2) * jitter_strength
    x_jittered = df['treatment_level'] + jitter
    plt.errorbar(
        x_jittered,
        df['APO'],
        yerr=[df['APO'] - df['ci_lower'], df['ci_upper'] - df['APO']],
        fmt='o',
        capsize=5,
        capthick=2,
        ecolor=palette[i % len(palette)],
        color=palette[i % len(palette)],
        label=f"{model} APO ±95% CI",
        zorder=2
    )

# Add true APOs as horizontal lines
x_range = plt.xlim()
total_width = x_range[1] - x_range[0]
for i, level in enumerate(treatment_levels_plot):
    line_width = 0.6
    x_center = level
    x_start = x_center - line_width/2
    x_end = x_center + line_width/2
    xmin_rel = max(0, (x_start - x_range[0]) / total_width)
    xmax_rel = min(1, (x_end - x_range[0]) / total_width)
    plt.axhline(y=apos[int(level)], color='red', linestyle='-', alpha=0.7,
                xmin=xmin_rel, xmax=xmax_rel,
                linewidth=3, label='True APO' if i == 0 else "")

plt.title('Estimated APO and 95% Confidence Interval by Treatment Level\n(Tuned vs. Untuned)')
plt.xlabel('Treatment Level')
plt.ylabel('APO')
plt.xticks(treatment_levels_plot)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()