SparkIngestModel
- class datarobotx.SparkIngestModel(base_model, dataset_name=None, sampling_strategy='uniform')
Train on a Spark dataframe.
Ingests a Spark dataframe into DataRobot for model training, downsampling if needed.
An AI Catalog entry is created automatically for the ingested data, and Autopilot is then orchestrated as normal.
- Parameters:
base_model (AutopilotModel or IntraProjectModel) – Base model for orchestrating Autopilot after the Spark data is ingested. Clustering and AutoTS are not supported.
dataset_name (str) – Name for the automatically-created AI Catalog entry containing the ingested data from Spark
sampling_strategy ({'uniform', 'most_recent', 'smart', 'smart_zero_inflated'}, default='uniform') –
Downsampling strategy to use if sampling is needed to meet the ingest limit. When using smart sampling, training weights are calculated, stored in the column ‘dr_sample_weights’, and used automatically at fit-time.
‘smart’ sampling requires a target variable to be passed at fit-time, and ‘most_recent’ sampling requires a datetime_partition_column at fit-time.
Notes
‘uniform’ samples uniformly at random from the provided dataframe
‘most_recent’ samples after ordering the data by the ‘datetime_partition_column’
‘smart’ samples attempting to preserve as many minority target examples as possible
‘smart_zero_inflated’ performs smart sampling, but treats all non-zero values as the same class
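The four strategies described in the notes above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the function names, the `limit` and `class_of` parameters, and the weight formula (the inverse of each group's sampling rate) are assumptions made for illustration. Only the column name ‘dr_sample_weights’ comes from the description above.

```python
import random
from collections import Counter


def uniform_sample(rows, limit, seed=0):
    """'uniform': draw up to `limit` rows uniformly at random."""
    if len(rows) <= limit:
        return list(rows)
    return random.Random(seed).sample(rows, limit)


def most_recent_sample(rows, limit, datetime_col):
    """'most_recent': order by the datetime column, keep the newest rows."""
    ordered = sorted(rows, key=lambda r: r[datetime_col])
    return ordered[max(len(ordered) - limit, 0):]


def smart_sample(rows, limit, target, class_of=lambda v: v, seed=0):
    """'smart': keep minority rows first, then fill up with majority rows.

    Every kept row gets a weight of (rows in its group) / (rows kept from
    its group). This formula is an assumption, not the library's.
    """
    counts = Counter(class_of(r[target]) for r in rows)
    majority = counts.most_common(1)[0][0]
    minority_rows = [r for r in rows if class_of(r[target]) != majority]
    majority_rows = [r for r in rows if class_of(r[target]) == majority]

    keep_minority = min(len(minority_rows), limit)
    keep_majority = min(len(majority_rows), limit - keep_minority)

    rng = random.Random(seed)
    sampled = []
    for group, k in ((minority_rows, keep_minority), (majority_rows, keep_majority)):
        if k == 0:
            continue
        weight = len(group) / k
        for row in rng.sample(group, k):
            sampled.append({**row, "dr_sample_weights": weight})
    return sampled


def smart_zero_inflated_sample(rows, limit, target, seed=0):
    """'smart_zero_inflated': smart sampling with all non-zero targets as one class."""
    return smart_sample(rows, limit, target, class_of=lambda v: v != 0, seed=seed)
```

With these weights, the weighted row count of the sample equals the size of the original data whenever both groups are represented, which is why weights of this kind are used at fit-time to offset the downsampling.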
Inherited attributes:
base_model – Base model used for fitting.
dr_model – DataRobot Python client datarobot.Model object for the present champion.
dr_project – DataRobot Python client datarobot.Project object.
Methods:
fit(X, *args, **kwargs) – Fit model from a Spark dataframe.
Inherited methods:
deploy([wait_for_autopilot, name]) – Deploy the model into ML Ops.
get_params() – Retrieve configuration parameters for the intra-project model.
predict(X[, wait_for_autopilot]) – Make batch predictions using the present champion.
predict_proba(X[, wait_for_autopilot]) – Calculate class probabilities using the present champion.
set_params(**kwargs) – Set configuration parameters for the intra-project model.
share(emails) – Share a project with other users.
- property base_model: ModelOperator
Base model used for fitting.
- Returns:
Base model instance
- Return type:
AutopilotModel or IntraProjectModel
- deploy(wait_for_autopilot=False, name=None)
Deploy the model into ML Ops.
- Parameters:
wait_for_autopilot (bool, optional, default=False) – If True, wait for Autopilot to complete before deploying the model. In non-notebook environments, fit() will always block until complete.
name (str, optional, default=None) – Name for the deployment. If None, a name will be generated.
- Returns:
Deployment – Resulting ML Ops deployment
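A short usage sketch for deploy(), assuming `model` is an already-fitted SparkIngestModel. The helper name `deploy_champion` is hypothetical. The two keyword arguments are the ones documented for deploy().

```python
def deploy_champion(model, name=None):
    """Deploy the present champion of a fitted model into ML Ops.

    `model` is assumed to be a fitted SparkIngestModel (or any drx model
    exposing the deploy() method documented here).
    """
    # wait_for_autopilot=True blocks until Autopilot has finished, so the
    # deployed champion is the final one rather than an interim leader.
    return model.deploy(wait_for_autopilot=True, name=name)
```

Passing name=None leaves the deployment name to be generated, as described for the name parameter.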
- property dr_model: Model
DataRobot Python client datarobot.Model object for the present champion.
- Returns:
datarobot.Model object associated with this drx model
- Return type:
datarobot.Model
- property dr_project: Project
DataRobot Python client datarobot.Project object.
- Returns:
datarobot.Project object associated with this drx model
- Return type:
datarobot.Project
- fit(X, *args, **kwargs)
Fit model from a Spark dataframe.
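A minimal sketch of constructing and fitting the model, assuming the datarobotx package is installed and DataRobot credentials are configured. The helper name `train_from_spark`, the default dataset name, and the `target` keyword passed to fit() are illustrative assumptions. The constructor arguments are the ones listed in the class signature.

```python
def train_from_spark(spark_df, target, dataset_name="spark_ingest_example"):
    """Ingest a Spark dataframe into DataRobot and start Autopilot on it."""
    # Imported lazily so this sketch can be defined without the package installed.
    import datarobotx as drx

    # Base model that orchestrates Autopilot once the data has been ingested.
    base_model = drx.AutopilotModel()

    model = drx.SparkIngestModel(
        base_model,
        dataset_name=dataset_name,   # name of the AI Catalog entry to create
        sampling_strategy="smart",   # 'smart' requires a target at fit-time
    )

    # Assumption: the target is forwarded to the base model as a keyword argument.
    model.fit(spark_df, target=target)
    return model
```

With ‘most_recent’ sampling, a datetime_partition_column would need to be supplied at fit-time instead of relying on the target alone.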
- get_params()
Retrieve configuration parameters for the intra-project model.
- Returns:
config – Configuration object containing the parameters for the intra-project model
- Return type:
Notes
Access configuration parameters for the underlying base model by calling get_params() on the base_model attribute
- predict(X, wait_for_autopilot=False, **kwargs)
Make batch predictions using the present champion.
Predictions are calculated asynchronously: the call returns a DataFrame immediately and populates it with data once predictions are complete.
Predictions are made within the project containing the model using modeling workers. For real-time predictions, first deploy the model.
- Parameters:
X (pandas.DataFrame or str) – Dataset to be scored. The target column can be included or omitted. If str, can be an AI Catalog dataset ID or name (if unambiguous).
wait_for_autopilot (bool, optional, default=False) – If True, wait for Autopilot to complete before making predictions. In non-notebook environments, fit() will always block until complete.
**kwargs (Any) – Other keyword arguments to pass to the _predict function
- Returns:
Resulting predictions (contained in the column ‘predictions’). Returned immediately and updated automatically when results are complete.
- Return type:
- predict_proba(X, wait_for_autopilot=False, **kwargs)
Calculate class probabilities using the present champion.
Only available for classifier and clustering models.
- Parameters:
X (pandas.DataFrame or str) – Dataset to be scored. The target column can be included or omitted. If str, can be an AI Catalog dataset ID or name (if unambiguous).
wait_for_autopilot (bool, optional, default=False) – If True, wait for Autopilot to complete before making predictions. In non-notebook environments, fit() will always block until complete.
**kwargs (Any) – Other keyword arguments to pass to the _predict function
- Returns:
Resulting predictions; probabilities for each label are contained in the column ‘class_{label}’; returned immediately, updated automatically when results are completed.
- Return type:
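Because each label's probability arrives in a column named ‘class_{label}’, picking the most likely label for a row is a matter of stripping that prefix. The sketch below works on a plain dict standing in for one row of the result. The helper name `most_likely_label` is hypothetical and not part of the library.

```python
def most_likely_label(row, prefix="class_"):
    """Return (label, probability) for the highest-probability class column.

    `row` maps column names to values, e.g. one record of the
    predict_proba() result converted to a dict.
    """
    probs = {
        column[len(prefix):]: value
        for column, value in row.items()
        if column.startswith(prefix)
    }
    if not probs:
        raise ValueError(f"no columns starting with {prefix!r} found")
    label = max(probs, key=probs.get)
    return label, probs[label]
```

For a pandas.DataFrame result, rows in this shape can be obtained with `df.to_dict("records")` once the predictions have completed.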
- set_params(**kwargs)
Set configuration parameters for the intra-project model.
- Parameters:
**kwargs (Any) – Configuration parameters to be set or updated for this model.
- Returns:
self – IntraProjectModel instance
- Return type:
IntraProjectModel
Notes
Configuration parameters for the underlying base model can be set by calling set_params() on the base_model attribute