Spark Ingest#
Spark is used by many organizations to work with and transform large datasets. The SparkIngestModel is a simple wrapper designed to make it easier to use arbitrarily large Spark datasets with DataRobot.
Usage#
Simply instantiate (and optionally configure) a drx base model, wrap the model, and pass a Spark dataframe to fit().
import datarobotx as drx

base = drx.AutoMLModel()
model = drx.SparkIngestModel(base)
model.fit(my_spark_df, target='my_target')
The SparkIngestModel wrapper performs the following steps:
1. If needed, downsample within the Spark cluster so the data fits within DataRobot's ingest limit
2. Upload the resulting downsampled Spark dataframe to AI Catalog
3. Create a new DataRobot project from the new AI Catalog entry
4. Orchestrate Autopilot as normal, using whatever parameters the base model was configured with (and any additionally specified fit-time keyword arguments)
The desired sampling strategy and the name for the intermediate AI Catalog dataset can also be specified explicitly.
base = drx.AutoTSModel()
model = drx.SparkIngestModel(base, dataset_name='My downsampled TS data', sampling_strategy='most_recent')
model.fit(spark_df, target='my_target', datetime_partition_column='date')
It is strongly recommended that you enable the ‘uploading of data files in stages’ feature flag. This will allow a multi-part upload of the downsampled dataset to AI Catalog. In the absence of this feature flag, large HTTP uploads to DataRobot can intermittently fail.
Training weights, required fit-time arguments, ingest limit
The sampling strategies ‘smart’ and ‘smart_zero_inflated’ require a target variable to be specified at fit-time. The sampling strategy ‘most_recent’ requires a ‘datetime_partition_column’ to be specified at fit-time.
If ‘smart’ or ‘smart_zero_inflated’ is specified as the sampling strategy, the resulting sampling weights are stored in the column ‘dr_sample_weights’ and Autopilot will automatically be configured to use these weights.
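For example, a minimal sketch of smart downsampling (the dataframe and target names are illustrative):

base = drx.AutoMLModel()
model = drx.SparkIngestModel(base, sampling_strategy='smart')
# 'smart' requires a target at fit-time; sampling weights are written to the
# 'dr_sample_weights' column and Autopilot is automatically configured to use them
model.fit(my_spark_df, target='my_target')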
The default assumed DataRobot ingest limit is 5GB. To change this behavior, set the private property _max_dr_ingest on drx.Context() to a new value in bytes.
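For example, a minimal sketch of overriding the assumed limit (the 10GB value is illustrative, and the property is private so its name may change):

drx.Context()._max_dr_ingest = 10 * 1024**3  # assume a 10GB ingest limit, expressed in bytes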
Users who want to work directly with downsampled Spark dataframes, or to upload other Spark dataframes to AI Catalog, can use the downsample_spark and spark_to_ai_catalog helper functions.
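A hedged sketch of calling the helpers directly is shown below; the keyword arguments are assumptions modeled on the wrapper parameters above, so consult the API reference for the exact signatures:

# Hypothetical keyword arguments; check the API reference below for the exact signatures
sampled_df = drx.downsample_spark(my_spark_df, target='my_target', sampling_strategy='smart')
dataset = drx.spark_to_ai_catalog(sampled_df, dataset_name='My downsampled data')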
API Reference#
| SparkIngestModel.fit | Train on a Spark dataframe. |
| downsample_spark | Downsample Spark dataframe for DataRobot. |
| spark_to_ai_catalog | Upload Spark dataframe to AI Catalog. |