Data Dictionary Generation
Infer a data dictionary using a DataRobot project
Motivation
In order to integrate LLMs with ML workflows, natural language context on the use case must often be provided to the LLM.
While it is not a significant burden to provide a succinct description of an ML use case, it can be considerably more cumbersome to provide a description for each field in the dataset(s).
To bootstrap this process, we can use DataRobot and LLMs to infer an initial data dictionary from the use case description and the exploratory data analysis (EDA) outputs from DataRobot (DR).
Potential use cases
Aid a data scientist's initial understanding of a new or unfamiliar dataset
Use as an input into other LLM-ML hybrid workflows:
Extracting natural language insights and summaries from a data science project
Target auto-detection and/or suggestion
Auto-suggestion and orchestration of feature engineering and data enrichment
Data cleaning and data problem auto-detection
Relationship auto-detection
Usage
This helper is provided as a langchain.chains.base.Chain in which previous definitions are included in each successive definition prompt to encourage consistency.
A DataRobot project id can optionally be provided to enhance the dictionary generation prompts with DataRobot EDA outputs. If a project id is provided, a list of features to include in the data dictionary is not required.
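The idea of carrying earlier definitions into each later prompt can be illustrated with a small standalone sketch. This is not the DataDictChain implementation: the prompt wording, the function names build_prompt and generate_dictionary, and the fake_llm stand-in are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def build_prompt(context: str, feature: str, previous: Dict[str, str]) -> str:
    """Assemble one feature-definition prompt that repeats earlier definitions."""
    lines = [f"Use case: {context}"]
    if previous:
        lines.append("Definitions so far:")
        lines.extend(f"- {name}: {text}" for name, text in previous.items())
    lines.append(f"Define the feature '{feature}' consistently with the above.")
    return "\n".join(lines)


def generate_dictionary(
    context: str, features: List[str], complete: Callable[[str], str]
) -> Dict[str, str]:
    """Define features one at a time, feeding prior definitions forward."""
    definitions: Dict[str, str] = {}
    for feature in features:
        prompt = build_prompt(context, feature, definitions)
        definitions[feature] = complete(prompt).strip()
    return definitions


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; echoes the requested feature."""
    feature = prompt.rsplit("'", 2)[1]
    return f"Placeholder definition of {feature}"


result = generate_dictionary(
    "Predicting hospital readmissions", ["age", "race"], fake_llm
)
```

Because each prompt repeats what has already been defined, a real LLM is nudged toward reusing the same terminology and level of detail across features, at the cost of prompts that grow with the number of features.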
Warning
LLM completions are not always correct and often require a human in the loop for validation; downstream applications of the auto-generated data dictionary should take this into account.
import json
import os

import langchain
from datarobotx.llm import DataDictChain

use_case_context = "Predicting hospital readmissions"
dr_project_id = "XXX"  # placeholder DataRobot project id
os.environ["OPENAI_API_KEY"] = "XXX"  # placeholder API key

llm = langchain.llms.OpenAI(model_name="gpt-3.5-turbo")
chain = DataDictChain(as_json=True, llm=llm)

# synchronous completion; with as_json=True the result is a JSON string
outputs = chain(inputs=dict(project_id=dr_project_id, context=use_case_context))
data_dict = json.loads(outputs["data_dict"])

# async completion (must run inside an async context, e.g. a notebook cell)
outputs_from_async = await chain.acall(inputs=dict(project_id=dr_project_id, context=use_case_context))

# only define specific features
selected_outputs = chain(inputs=dict(project_id=dr_project_id, context=use_case_context, features="age, race"))
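Since the generated definitions need human validation, it can help to triage the JSON output before anyone reads it in full. The following is a minimal sketch, assuming the JSON string decodes to a flat mapping of feature name to definition text; that structure, the helper name features_needing_review, and the min_length threshold are assumptions, not documented behavior of DataDictChain.

```python
import json
from typing import Dict, Iterable, List


def features_needing_review(
    data_dict_json: str, expected_features: Iterable[str], min_length: int = 15
) -> List[str]:
    """Return features whose definition is missing, empty, or very short."""
    # Assumption: the JSON string decodes to {feature_name: definition}.
    parsed: Dict[str, str] = json.loads(data_dict_json)
    flagged = []
    for feature in expected_features:
        definition = str(parsed.get(feature) or "").strip()
        if len(definition) < min_length:
            flagged.append(feature)
    return flagged


# Hypothetical output, used only to exercise the helper.
sample = json.dumps(
    {"age": "Age of the patient in years at admission.", "race": ""}
)
flagged = features_needing_review(sample, ["age", "race", "gender"])
```

A check like this only catches obviously incomplete entries; a definition that is fluent but wrong still has to be caught by a reviewer who knows the data.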
Notes
The chain has an optional as_json constructor argument that governs whether the output is returned as a natural language string or a JSON string; the default is False.
Similar to langchain.chains.llm.LLMChain, the chain has a verbose constructor argument that determines whether langchain verbose output is used during chain execution; the default is False.
The def_feature_chain constructor argument can optionally specify a custom langchain.chains.llm.LLMChain to be used for completing an individual feature definition; the expected inputs to this chain are context (use case context) and feature (feature name + optional DR EDA summary).
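A custom per-feature chain might be assembled along the following lines. Only the two input names, context and feature, come from the note above; the template wording and the helper name build_def_feature_chain are illustrative assumptions, and langchain is imported lazily so the prompt can be previewed without it installed.

```python
# Template text is illustrative; only the {context} and {feature} inputs
# come from the notes above.
DEFINITION_TEMPLATE = (
    "You are documenting a dataset for this use case: {context}\n"
    "Write a one-sentence definition for the following feature.\n"
    "{feature}\n"
    "Definition:"
)


def build_def_feature_chain(llm):
    """Wrap the template in an LLMChain (langchain is imported lazily)."""
    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate

    prompt = PromptTemplate(
        input_variables=["context", "feature"], template=DEFINITION_TEMPLATE
    )
    return LLMChain(llm=llm, prompt=prompt)


# Previewing the rendered prompt needs no LLM and no langchain install.
preview = DEFINITION_TEMPLATE.format(
    context="Predicting hospital readmissions", feature="age (numeric)"
)
```

The resulting chain would then be passed as DataDictChain(llm=llm, def_feature_chain=build_def_feature_chain(llm)). Whether the default prompt's behavior of including previous definitions still applies with a custom chain is not stated here and should be checked against the API reference.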
API Reference
| DataDictChain | Generate a data dictionary using an LLM. |