SelfDiscovery#

(Diagram: self-join partitioning of the training dataset)

Apply DataRobot automated feature discovery to a single training dataset.

Partitions the original training dataset into a primary and a secondary dataset connected via a self-join. This allows feature discovery to automatically explore synthetic transformations and aggregations of the features.

In the OTV case, the primary dataset includes the target variable, the join key(s), and the date feature. In the non-OTV case, the primary dataset also includes the original features.

The secondary dataset includes the join keys, the date feature (if applicable), and all non-target features. This dataset is generated dynamically, and a new AI Catalog entry is created for it.
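Conceptually, the split resembles the following sketch. This is only an illustration of which columns land in each dataset; the actual partition is performed automatically by DataRobot, and the column names are hypothetical:

import pandas as pd

df = pd.read_csv('training_data.csv')  # hypothetical input dataset

keys = ['Product', 'Location']          # join key(s)
date_col = 'Date'                       # date feature (OTV case)
target = 'Quantity_Used_Next_7_Days'    # target variable

# Primary dataset (OTV case): target, join key(s), and date feature
primary = df[keys + [date_col, target]]

# Secondary dataset: join keys, date feature, and all non-target features
secondary = df.drop(columns=[target])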

Orchestration of autopilot is delegated to the provided base model.

Problem types that do not support feature discovery

  • AutoTS

  • Clustering

Using with non-panel data

For non-OTV problems, it is recommended to use an external holdout to eliminate any possible target leakage resulting from the self-join (see the sketch under Data Partitioning below). For these problems, the DataRobot Feature Discovery process may also exhibit long runtimes.

Usage#

Train#

import pandas as pd
import datarobotx as drx

# Load data
df = pd.read_csv('https://s3.amazonaws.com/datarobot_public/datasets/ppe_next_7_days.csv')

# Define lookback windows
feature_windows = [(-5, 0, 'DAY'), (-3, 0, 'DAY'), (-1, 0, 'WEEK')]

# Define base model and partitioning method
base_model = drx.AutoMLModel(cv_method='datetime')

# Set model to use Feature Discovery on dataset
model = drx.SelfDiscoveryModel(base_model, feature_windows=feature_windows)

# Train model
model.fit(
    df,
    target='Quantity_Used_Next_7_Days',
    keys=['Product', 'Location'],
    kia_features=['Total_Patients'],
    datetime_partition_column='Date',
)

Retrieve derived features#

sd_df = model.get_derived_features()
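
The derived features can then be inspected directly; assuming they are returned as a pandas DataFrame, for example:

# List the engineered feature names and preview their values
print(sd_df.columns.tolist())
print(sd_df.head())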

Retrieve sql for derived features#

sql = model.get_derived_sql()
print(sql)
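
The SQL can also be persisted for review, or for re-creating the derived features in your own database (the file name is illustrative):

# Save the generated feature engineering SQL for later review
with open('derived_features.sql', 'w') as f:
    f.write(sql)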

Predict#

preds = model.predict(df)
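
Assuming the predictions are returned as a pandas DataFrame aligned row-for-row with the scoring data, they can be joined back to the identifying columns:

# Attach predictions to the join keys and date for downstream use
scored = pd.concat(
    [df[['Product', 'Location', 'Date']].reset_index(drop=True),
     preds.reset_index(drop=True)],
    axis=1,
)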

Caveats#

SelfDiscovery models rely on dynamically creating separate AI Catalog entries from a single dataset. For that reason, there are a few caveats to consider when using a SelfDiscovery model.

Data Partitioning#

For non-time-aware SelfDiscovery models, it’s recommended to use an external holdout to validate prediction performance to reduce the likelihood of undetected target leakage.

This holdout should be a dataset that was not used in the original feature discovery fitting process.
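
A minimal sketch of this workflow, assuming a simple random holdout (parameters mirror the training example above and are illustrative):

import pandas as pd
import datarobotx as drx

df = pd.read_csv('training_data.csv')  # hypothetical input dataset

# Reserve an external holdout that never enters the self-join
holdout_df = df.sample(frac=0.2, random_state=42)
train_df = df.drop(holdout_df.index)

base_model = drx.AutoMLModel()
model = drx.SelfDiscoveryModel(base_model)
model.fit(train_df, target='Quantity_Used_Next_7_Days',
          keys=['Product', 'Location'])

# Validate on data that never participated in feature discovery
preds = model.predict(holdout_df)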

Deployment#

For feature discovery models, DataRobot uses the secondary dataset to generate the derived features at prediction time. This means that the AI Catalog entry for the secondary dataset must be up to date at prediction time in order to generate informative features.

For SelfDiscovery models, this poses a challenge: the secondary dataset is generated dynamically at training time, and at prediction time the dataset would either need to be replaced or updated in some way.

For this reason, SelfDiscovery models are not currently supported for deployment.

API Reference#

SelfDiscoveryModel(base_model[, feature_windows])

Self-join feature discovery orchestrator.