FeatureDiscovery#
Model and discover features across multiple datasets.
Motivation#
DataRobot feature discovery performs automatic feature engineering across multiple
datasets, discovering new features and augmenting the predictive power of a model.
drx provides helpers for getting started quickly with Feature Discovery, including:
Automatic upload and registration of pandas DataFrames to AI Catalog
Succinct syntax for specifying primary-secondary joins (including time-aware joins)
For complex schemas (e.g., joining secondary datasets to tertiary datasets), drx
can be used in tandem with the official DataRobot Python SDK.
Problem types that do not support feature discovery
AutoTS
Clustering
Usage#
Running feature discovery with drx#
For many feature discovery problems, each secondary dataset is joined directly to the primary dataset, i.e. the dataset containing the target variable. In these cases, drx can be used to quickly define these relationships and begin modeling. In the example below, we fit a model by defining relationships between DataFrames. Note that AI Catalog IDs for previously uploaded data can also be passed in place of DataFrames when defining relationships or fitting the model.
import datarobotx as drx
import pandas as pd
# Lending Club Datasets
target_path = "https://s3.amazonaws.com/datarobot_public_datasets/drx/Lending+Club+Target.csv"
profile_path = "https://s3.amazonaws.com/datarobot_public_datasets/drx/Lending+Club+Profile.csv"
transactions_path = "https://s3.amazonaws.com/datarobot_public_datasets/drx/Lending+Club+Transactions.csv"
# Read into dataframes
target_df = pd.read_csv(target_path)
profile_df = pd.read_csv(profile_path)
transactions_df = pd.read_csv(transactions_path)
base_model = drx.AutoMLModel(name='drx FD')
fd_model = drx.FeatureDiscoveryModel(base_model)
# Set lookback windows for transactions dataset
windows = [(-14, -7, 'DAY'), (-7, 0, 'DAY')]
relationships = []
# Define relationship between transactions dataset and target dataset
relationships.append(drx.Relationship(
    transactions_df,
    keys='CustomerID',
    temporal_key='Date',
    feature_windows=windows,
    dataset_name='transactions'
))
# Define relationship between profile dataset and target dataset
relationships.append(drx.Relationship(
    profile_df,
    keys='CustomerID',
    dataset_name='profile'
))
# Set prediction point and begin modeling
fd_model.fit(
    target_df,
    target='BadLoan',
    feature_engineering_prediction_point='date',
    relationships_configuration=relationships
)
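To make the lookback windows concrete, the following sketch shows how a tuple such as (-14, -7, 'DAY') could translate into a date range relative to a prediction point. It uses only the standard library. The helper names `window_bounds` and `count_in_window`, the sample dates, and the half-open interval convention (lower bound inclusive, upper bound exclusive) are assumptions made for illustration and are not taken from drx.

```python
from datetime import datetime, timedelta

# Only the units that appear on this page; a hypothetical mapping for illustration.
UNIT_STEP = {"DAY": timedelta(days=1), "WEEK": timedelta(weeks=1)}


def window_bounds(prediction_point, start, end, unit="DAY"):
    """Return the (lower, upper) datetimes of a lookback window."""
    step = UNIT_STEP[unit]
    return prediction_point + start * step, prediction_point + end * step


def count_in_window(timestamps, prediction_point, window):
    """Count timestamps with lower <= t < upper (assumed convention)."""
    lower, upper = window_bounds(prediction_point, *window)
    return sum(lower <= t < upper for t in timestamps)


# Made-up transaction dates for a single customer.
transactions = [datetime(2020, 1, 3), datetime(2020, 1, 9), datetime(2020, 1, 14)]
prediction_point = datetime(2020, 1, 15)
windows = [(-14, -7, "DAY"), (-7, 0, "DAY")]

counts = [count_in_window(transactions, prediction_point, w) for w in windows]
print(counts)
```

With these sample dates, the first window spans January 1 to January 8 and the second spans January 8 to January 15, so each window aggregates a different slice of the customer's history.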
Non-time-aware feature discovery
For feature discovery without temporal considerations, the temporal_key and feature_engineering_prediction_point arguments can be omitted.
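A minimal sketch of the non-time-aware case, reusing only the calls from the example above. The wrapper function `fit_non_time_aware` and the model name are hypothetical; the import is placed inside the function so the sketch can be defined without contacting DataRobot, and running it still requires drx and valid credentials.

```python
def fit_non_time_aware(target_df, profile_df):
    """Fit a feature discovery model with one join and no temporal arguments."""
    # Imported lazily: nothing is sent to DataRobot until the function is called.
    import datarobotx as drx

    base_model = drx.AutoMLModel(name="drx FD (non-time-aware)")
    fd_model = drx.FeatureDiscoveryModel(base_model)

    # No temporal_key and no feature_windows on the relationship.
    relationships = [
        drx.Relationship(profile_df, keys="CustomerID", dataset_name="profile")
    ]

    # No feature_engineering_prediction_point in the fit call.
    fd_model.fit(
        target_df,
        target="BadLoan",
        relationships_configuration=relationships,
    )
    return fd_model
```

Calling `fit_non_time_aware(target_df, profile_df)` with the DataFrames loaded earlier would start modeling with a plain key-based join.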
Retrieve derived features#
fd_df = fd_model.get_derived_features()
Retrieve SQL for derived features#
sql = fd_model.get_derived_sql()
print(sql)
Predict#
# df holds the primary-dataset rows to score, e.g. a DataFrame shaped like target_df
preds = fd_model.predict(df)
Using drx with the Python SDK#
For more complex schemas, where a tertiary dataset is joined to a secondary dataset, the DataRobot Python SDK can be used in tandem with drx. In the example below, we use the same datasets as above but add a secondary-tertiary join for illustrative purposes.
import datarobot as dr
# Create AI Catalog entries for each dataset
target_catalog_entry = dr.Dataset.create_from_url(target_path)
profile_catalog_entry = dr.Dataset.create_from_url(profile_path)
transactions_catalog_entry = dr.Dataset.create_from_url(transactions_path)
# Create secondary dataset definitions
dataset_definitions = [
    {
        'identifier': 'profile',
        'catalogId': profile_catalog_entry.id,
        'catalogVersionId': profile_catalog_entry.version_id,
        'snapshotPolicy': 'latest',
    },
    {
        'identifier': 'transactions',
        'catalogId': transactions_catalog_entry.id,
        'catalogVersionId': transactions_catalog_entry.version_id,
        'primaryTemporalKey': 'Date',
    },
]
relationships = [
    # Primary dataset -> transactions (time-aware)
    {
        'dataset2Identifier': 'transactions',
        'dataset1Keys': ['CustomerID'],
        'dataset2Keys': ['CustomerID'],
        'feature_derivation_windows': [
            {'start': -2, 'end': 0, 'unit': 'WEEK'},
            {'start': -7, 'end': -3, 'unit': 'DAY'},
        ]
    },
    # transactions -> profile (secondary-tertiary join)
    {
        'dataset1Identifier': 'transactions',
        'dataset2Identifier': 'profile',
        'dataset1Keys': ['CustomerID'],
        'dataset2Keys': ['CustomerID'],
    }
]
relationship_config = dr.RelationshipsConfiguration.create(
    dataset_definitions=dataset_definitions,
    relationships=relationships
)
print('relationship config id: ' + relationship_config.id)
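Since the dataset definitions and relationships are plain dictionaries, a quick local consistency check before calling `create` can catch mistyped identifiers early. The following is a minimal sketch, assuming the camelCase keys used above; `check_relationships` is a hypothetical helper and is not part of drx or the DataRobot SDK.

```python
def check_relationships(dataset_definitions, relationships):
    """Return a list of problems found; an empty list means none were detected."""
    known = {d["identifier"] for d in dataset_definitions}
    problems = []
    for i, rel in enumerate(relationships):
        right = rel.get("dataset2Identifier")
        if right not in known:
            problems.append(f"relationship {i}: unknown dataset2Identifier {right!r}")
        # In the example above, dataset1Identifier is left out when the left side
        # of the join is the primary dataset, so a missing value is accepted here.
        left = rel.get("dataset1Identifier")
        if left is not None and left not in known:
            problems.append(f"relationship {i}: unknown dataset1Identifier {left!r}")
        if len(rel.get("dataset1Keys", [])) != len(rel.get("dataset2Keys", [])):
            problems.append(f"relationship {i}: key lists differ in length")
    return problems


# Stand-in data mirroring the shape of the configuration above.
definitions = [{"identifier": "profile"}, {"identifier": "transactions"}]
good = [
    {"dataset2Identifier": "transactions",
     "dataset1Keys": ["CustomerID"], "dataset2Keys": ["CustomerID"]},
    {"dataset1Identifier": "transactions", "dataset2Identifier": "profile",
     "dataset1Keys": ["CustomerID"], "dataset2Keys": ["CustomerID"]},
]
bad = [{"dataset2Identifier": "profiles",
        "dataset1Keys": ["CustomerID"], "dataset2Keys": []}]

print(check_relationships(definitions, good))
print(check_relationships(definitions, bad))
```

The check only covers identifiers and key counts; column names and window settings are still validated by DataRobot itself.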
Once we have our relationships configured, we pass the relationship configuration ID to a drx FeatureDiscoveryModel as normal.
import datarobotx as drx
model = drx.AutoMLModel(name='drx FD from SDK')
fd_model = drx.FeatureDiscoveryModel(model)
fd_model.fit(
    X=target_catalog_entry.id,
    target='BadLoan',
    relationships_configuration_id=relationship_config.id,
    feature_engineering_prediction_point='date',
)
API Reference#
| FeatureDiscoveryModel | Feature discovery orchestrator. |
| Relationship | Secondary dataset relationship definition. |