downsample_spark

datarobotx.downsample_spark(spark_df, sampling_strategy='uniform', target=None, datetime_partition_column=None)

Downsample Spark dataframe for DataRobot.

The row limit is estimated statistically for performance and is not guaranteed to match the ingest limit precisely. The private drx.Context() property _max_dr_ingest specifies the DataRobot ingest limit in bytes used for the row limit estimate.

When using this function standalone, training weights produced by smart sampling must be passed manually into downstream model fitting operations, as in the sketch below.
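
For example, a standalone call might look like the following minimal sketch. The SparkSession setup, the toy data, and the ‘ts’ and ‘churn’ columns are hypothetical; only downsample_spark itself comes from this library:

    from pyspark.sql import SparkSession, functions as F
    import datarobotx as drx

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data: a timestamp feature 'ts' and a binary target 'churn'
    spark_df = spark.createDataFrame(
        [('2024-01-01', 0), ('2024-02-01', 1), ('2024-03-01', 0)],
        schema='ts STRING, churn INT',
    ).withColumn('ts', F.to_timestamp('ts'))

    # Smart sampling requires the target; weights land in 'dr_sample_weights'
    sampled_df, max_rows = drx.downsample_spark(
        spark_df,
        sampling_strategy='smart',
        target='churn',
    )

    # Weights are not applied automatically: if sampling occurred, pass the
    # 'dr_sample_weights' column manually into downstream model fitting
    if 'dr_sample_weights' in sampled_df.columns:
        sampled_df.select('dr_sample_weights').show()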

Parameters:
  • spark_df (pyspark.sql.DataFrame) – Data to be ingested into DataRobot AI Catalog

  • sampling_strategy ({'uniform', 'most_recent', 'smart', 'smart_zero_inflated'}) – Downsampling strategy to apply if sampling is needed to meet the ingest limit. With smart sampling, training weights are calculated and stored in the column ‘dr_sample_weights’

  • target (str, optional) – Target column name; required with smart sampling

  • datetime_partition_column (str, optional) – Primary date feature column name; required with ‘most_recent’ sampling
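
By contrast, a ‘most_recent’ call supplies the primary date column rather than a target. A sketch reusing the hypothetical spark_df from above:

    # 'most_recent' needs the date column; 'ts' is the hypothetical
    # timestamp feature defined in the earlier sketch
    recent_df, max_rows = drx.downsample_spark(
        spark_df,
        sampling_strategy='most_recent',
        datetime_partition_column='ts',
    )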

Return type:

Tuple[DataFrame, int]

Returns:

  • df (pyspark.sql.DataFrame) – Downsampled dataframe

  • max_rows (int) – Maximum number of rows that should be retained from the dataframe; to avoid OOM on the driver, this limit should be enforced by whatever mechanism streams the data out of the Spark cluster

Notes

‘uniform’ samples uniformly at random from the provided dataframe

‘most_recent’ samples after ordering the data by the ‘datetime_partition_column’

‘smart’ samples in a way that attempts to preserve as many minority target examples as possible

‘smart_zero_inflated’ performs smart sampling, but treats all non-zero values as the same class
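
Conceptually, the strategies behave roughly like the plain-PySpark sketch below. This is illustrative only, not the library’s implementation; spark_df, max_rows, ‘ts’, and ‘churn’ carry over from the hypothetical example above:

    from pyspark.sql import functions as F

    fraction = min(1.0, max_rows / spark_df.count())  # approximate keep rate

    # 'uniform': Bernoulli sampling; sample() makes no exact row-count guarantee
    uniform_df = spark_df.sample(fraction=fraction, seed=42)

    # 'most_recent': order newest-first; the row cap itself is applied later,
    # while streaming at upload time
    most_recent_df = spark_df.orderBy(F.col('ts').desc())

    # 'smart': sample each target class at its own rate so minority rows
    # survive, recording inverse sampling rates as training weights
    rates = {0: fraction, 1: 1.0}  # hypothetical per-class rates
    smart_df = spark_df.sampleBy('churn', fractions=rates, seed=42).withColumn(
        'dr_sample_weights',
        F.when(F.col('churn') == 1, F.lit(1.0)).otherwise(F.lit(1.0 / fraction)),
    )

    # 'smart_zero_inflated' would do the same, but with every non-zero
    # target value treated as a single class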

The resulting dataframe is not directly row-limited by Spark because Spark’s limit() collapses the data to a single partition and can OOM the driver. Instead, the limit is enforced by the uploading function while iterating over Spark partitions at upload time.

In most cases ‘most_recent’ is the only sampling strategy for which this limit needs to be enforced at upload time; however, ‘uniform’ may occasionally require row limit enforcement as well, because Spark’s sample() does not provide row count guarantees.
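
A sketch of how a caller could honor max_rows while streaming rows off the cluster; toLocalIterator() pulls one partition to the driver at a time, avoiding the single-partition collapse of limit(). This is illustrative, not the library’s uploader:

    import itertools

    # Reusing sampled_df and max_rows from the earlier sketch: islice caps
    # the stream at max_rows without materializing the whole dataframe
    kept = 0
    for row in itertools.islice(sampled_df.toLocalIterator(), max_rows):
        kept += 1  # in practice, serialize the row into the upload payload
    print(f'streamed {kept} rows (cap was {max_rows})')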