Spark Ingest#


Spark is used by many organizations to work with and transform large datasets. The SparkIngestModel is a simple wrapper designed to make it easier to use arbitrarily large Spark datasets with DataRobot.


Simply instantiate (and optionally configure) a drx base model, wrap the model, and pass a Spark dataframe to fit().

base = drx.AutoMLModel()
model = drx.SparkIngestModel(base)
model.fit(spark_df, target='my_target')

The SparkIngestModel wrapper performs the following steps:

  1. If needed, downsample in the Spark cluster so data can fit in DataRobot

  2. Upload the resulting downsampled Spark dataframe to AI Catalog

  3. Create a new DataRobot project from the new AI Catalog entry

  4. Orchestrate Autopilot as normal using whatever parameters the base model was configured with (and any additionally specified fit-time keyword arguments)
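The size check behind step 1 can be sketched as a simple ratio (illustrative only; drx estimates the serialized size internally, and the function name here is hypothetical):

```python
def downsample_fraction(estimated_bytes: int, max_ingest_bytes: int = 5 * 1024**3) -> float:
    """Return the fraction of rows to keep so the upload fits the
    assumed DataRobot ingest limit; 1.0 means no downsampling is needed."""
    return min(1.0, max_ingest_bytes / estimated_bytes)
```

For example, a dataframe estimated at 10GB against the default 5GB limit yields a fraction of 0.5, i.e. roughly half the rows are retained.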

The desired sampling strategy and the name for the intermediate AI Catalog dataset can also be specified explicitly.

base = drx.AutoTSModel()
model = drx.SparkIngestModel(base, dataset_name='My downsampled TS data',
                             sampling_strategy='most_recent')
model.fit(spark_df, target='my_target', datetime_partition_column='date')

It is strongly recommended that you enable the ‘uploading of data files in stages’ feature flag. This will allow a multi-part upload of the downsampled dataset to AI Catalog. In the absence of this feature flag, large HTTP uploads to DataRobot can intermittently fail.

Training weights, required fit-time arguments, ingest limit

  • The sampling strategies ‘smart’ and ‘smart_zero_inflated’ require a target variable to be specified at fit-time. The sampling strategy ‘most_recent’ requires a ‘datetime_partition_column’ to be specified at fit-time.

  • If ‘smart’ or ‘smart_zero_inflated’ is specified as the sampling strategy, the resulting sampling weights are stored in the column ‘dr_sample_weights’ and Autopilot will automatically be configured to use these weights.

  • The default assumed DataRobot ingest limit is 5GB. To change this limit, set the private property _max_dr_ingest on drx.Context() to a new value in bytes.
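For example, to raise the assumed limit to 10GB (the property takes a value in bytes; shown as a sketch, since _max_dr_ingest is a private attribute and may change):

```python
GB = 1024**3          # bytes per (binary) gigabyte
new_limit = 10 * GB   # 10737418240 bytes

# Applied to the drx context (requires a configured drx installation):
# drx.Context()._max_dr_ingest = new_limit
```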

Users who want to work directly with downsampled Spark dataframes, or to upload other Spark dataframes to AI Catalog, can use the downsample_spark and spark_to_ai_catalog helper functions.
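A minimal sketch of using these helpers directly, without training a model (assumes drx is installed and configured, and that spark_df is a live PySpark dataframe; the import alias and function name below are assumptions):

```python
def extract_to_catalog(spark_df, name: str):
    """Sketch: downsample a Spark dataframe in-cluster, then push the
    result to AI Catalog without creating a project or training a model."""
    import datarobotx as drx  # assumed import path; deferred so the sketch imports cleanly

    small_df = drx.downsample_spark(spark_df)       # shrink to fit the ingest limit
    return drx.spark_to_ai_catalog(small_df, name)  # upload and return the catalog entry
```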

API Reference#

SparkIngestModel(base_model[, dataset_name, ...])

Train on a Spark dataframe.

downsample_spark(spark_df[, ...])

Downsample Spark dataframe for DataRobot.

spark_to_ai_catalog(spark_df, name[, max_rows])

Upload Spark dataframe to AI Catalog.