FeatureDiscovery#

Model and discover features across multiple datasets.

Motivation#

DataRobot Feature Discovery performs automatic feature engineering across multiple datasets, deriving new features that can improve a model's predictive power. drx provides helpers for getting started quickly with Feature Discovery, including:

Automatic upload and registration of pandas DataFrames to AI Catalog
Succinct syntax for specifying primary-secondary joins (including time-aware joins)

For complex schemas (e.g., joining secondary datasets to tertiary datasets), drx can be used in tandem with the official DataRobot Python SDK.

Problem types that do not support feature discovery

AutoTS
Clustering

Usage#

Running feature discovery with drx#

For many feature discovery problems, each secondary dataset is joined directly to the primary dataset, i.e. the one containing the target variable. In these cases, drx can be used to quickly define the relationships and begin modeling. The example below fits a model by defining relationships between pandas DataFrames. AI Catalog IDs of previously uploaded datasets can also be passed in place of DataFrames, both when defining relationships and when fitting the model.

import datarobotx as drx
import pandas as pd
# Lending Club Datasets
target_path = "https://s3.amazonaws.com/datarobot_public_datasets/drx/Lending+Club+Target.csv"
profile_path = "https://s3.amazonaws.com/datarobot_public_datasets/drx/Lending+Club+Profile.csv"
transactions_path = "https://s3.amazonaws.com/datarobot_public_datasets/drx/Lending+Club+Transactions.csv"

# Read into dataframes
target_df = pd.read_csv(target_path)
profile_df = pd.read_csv(profile_path)
transactions_df = pd.read_csv(transactions_path)

base_model = drx.AutoMLModel(name='drx FD')
fd_model = drx.FeatureDiscoveryModel(base_model)

# Set lookback windows for transactions dataset 
windows = [(-14, -7, 'DAY'), (-7, 0, 'DAY')]
relationships = []

# Define relationship between transactions dataset and target dataset
relationships.append(drx.Relationship(
    transactions_df, 
    keys='CustomerID', 
    temporal_key='Date', 
    feature_windows=windows,
    dataset_name='transactions'
))

# Define relationship between profile dataset and target dataset
relationships.append(drx.Relationship(
    profile_df, 
    keys='CustomerID',
    dataset_name='profile'
))

# Set prediction point and begin modeling
fd_model.fit(
    target_df, 
    target='BadLoan',
    feature_engineering_prediction_point='date',
    relationships_configuration=relationships
)
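
To make the lookback windows above concrete, the sketch below converts each `(start, end, unit)` tuple into a calendar range relative to a prediction point. It uses only the standard library. The helper name `window_bounds`, the example prediction date, and the restriction to 'DAY' and 'WEEK' units are illustrative assumptions and not part of drx; whether DataRobot treats the endpoints as inclusive or exclusive is not covered here.

```python
from datetime import datetime, timedelta

# Hypothetical helper (not part of drx): maps a window tuple to datetimes.
_UNIT_TO_DAYS = {"DAY": 1, "WEEK": 7}  # assumption: only these units are handled


def window_bounds(prediction_point, window):
    """Return (start, end) datetimes for a (start, end, unit) window."""
    start, end, unit = window
    days = _UNIT_TO_DAYS[unit]
    return (
        prediction_point + timedelta(days=start * days),
        prediction_point + timedelta(days=end * days),
    )


windows = [(-14, -7, "DAY"), (-7, 0, "DAY")]
prediction_point = datetime(2024, 3, 15)  # illustrative date only

for w in windows:
    lo, hi = window_bounds(prediction_point, w)
    print(w, "->", lo.date(), "to", hi.date())
```

Under these assumptions, the two windows cover the week before last and the most recent week relative to each row's prediction point.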

Non-time-aware feature discovery

For feature discovery without temporal considerations, the temporal_key and feature_engineering_prediction_point arguments can be omitted.

Retrieve derived features#

fd_df = fd_model.get_derived_features()

Retrieve SQL for derived features#

sql = fd_model.get_derived_sql()
print(sql)

Predict#

# df: a DataFrame of rows to score (not defined in the example above)
preds = fd_model.predict(df)

Using drx with the Python SDK#

For more complex schemas, where a tertiary dataset is joined to a secondary dataset, the DataRobot Python SDK can be used in tandem with drx. The example below uses the same datasets as above but adds a secondary-tertiary join for illustrative purposes.

import datarobot as dr

# Create AI Catalog entries for each dataset
target_catalog_entry = dr.Dataset.create_from_url(target_path)
profile_catalog_entry = dr.Dataset.create_from_url(profile_path)
transactions_catalog_entry = dr.Dataset.create_from_url(transactions_path)

# Create secondary dataset definitions 
dataset_definitions = [
    {
        'identifier': 'profile',
        'catalogId': profile_catalog_entry.id,
        'catalogVersionId': profile_catalog_entry.version_id,
        'snapshotPolicy': 'latest',
    },
    {
        'identifier': 'transactions',
        'catalogId': transactions_catalog_entry.id,
        'catalogVersionId': transactions_catalog_entry.version_id,
        'primaryTemporalKey': 'Date',
    },
]

relationships = [
    {
        'dataset2Identifier': 'transactions',
        'dataset1Keys': ['CustomerID'],
        'dataset2Keys': ['CustomerID'],
        'feature_derivation_windows': [
            {'start': -2, 'end': 0, 'unit': 'WEEK'},
            {'start': -7, 'end': -3, 'unit': 'DAY'},
        ]
    },
    {
        'dataset1Identifier': 'transactions',
        'dataset2Identifier': 'profile',
        'dataset1Keys': ['CustomerID'],
        'dataset2Keys': ['CustomerID'],
    }
]
relationship_config = dr.RelationshipsConfiguration.create(
    dataset_definitions=dataset_definitions,
    relationships=relationships
)
print('relationship config id: '+ relationship_config.id)
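
As a quick sanity check on a relationship list like the one above, the sketch below separates joins to the primary dataset from joins between non-primary datasets. It assumes, as the example does, that an entry without a 'dataset1Identifier' key joins to the primary dataset. The helper name `split_relationships` is hypothetical and not part of drx or the DataRobot SDK.

```python
def split_relationships(relationships):
    """Split SDK-style relationship dicts into (primary_joins, nested_joins).

    Assumption: an entry with no 'dataset1Identifier' joins to the primary
    dataset; an entry with one joins two non-primary datasets.
    """
    primary, nested = [], []
    for rel in relationships:
        if rel.get("dataset1Identifier") is None:
            primary.append(rel["dataset2Identifier"])
        else:
            nested.append((rel["dataset1Identifier"], rel["dataset2Identifier"]))
    return primary, nested


# Same shape as the relationships list above, with windows left out for brevity.
example = [
    {"dataset2Identifier": "transactions",
     "dataset1Keys": ["CustomerID"], "dataset2Keys": ["CustomerID"]},
    {"dataset1Identifier": "transactions", "dataset2Identifier": "profile",
     "dataset1Keys": ["CustomerID"], "dataset2Keys": ["CustomerID"]},
]

primary, nested = split_relationships(example)
print("joined to primary:", primary)
print("joins between non-primary datasets:", nested)
```

If the second list comes back empty, every dataset joins directly to the primary dataset, which is the case the drx `Relationship` helper is described as covering.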

Once the relationships are configured, pass the relationship configuration ID to a drx FeatureDiscoveryModel as usual.

import datarobotx as drx

model = drx.AutoMLModel(name='drx FD from SDK')
fd_model = drx.FeatureDiscoveryModel(model)

fd_model.fit(
    X=target_catalog_entry.id,
    target='BadLoan',
    relationships_configuration_id=relationship_config.id,
    feature_engineering_prediction_point='date',
)
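
The drx example on this page writes windows as `(start, end, unit)` tuples, while the SDK example writes them as dicts with `start`, `end`, and `unit` keys. Below is a minimal sketch of converting the first form into the second, assuming the two notations carry the same meaning (this page does not state that explicitly). `to_sdk_windows` is a hypothetical helper name, and the start-before-end check is an added sanity check rather than documented behavior.

```python
def to_sdk_windows(windows):
    """Convert (start, end, unit) tuples to {'start', 'end', 'unit'} dicts."""
    converted = []
    for start, end, unit in windows:
        # Assumption: a window must start strictly before it ends.
        if start >= end:
            raise ValueError(f"window start {start} must be before end {end}")
        converted.append({"start": start, "end": end, "unit": unit})
    return converted


drx_windows = [(-14, -7, "DAY"), (-7, 0, "DAY")]
print(to_sdk_windows(drx_windows))
```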

API Reference#

`FeatureDiscoveryModel`(base_model[, remove_udfs])	Feature discovery orchestrator.
`Relationship`(X, keys[, temporal_key, ...])	Secondary dataset relationship definition.