Data Dictionary Generation


Infer a data dictionary using a DataRobot project


In order to integrate LLMs with ML workflows, natural language context on the use case must often be provided to the LLM.

Writing a succinct description of an ML use case is not a significant burden, but describing each field in the dataset(s) can be considerably more cumbersome.

To bootstrap this process, DataRobot and an LLM can be used together to infer an initial data dictionary from the use case description and the exploratory data analysis (EDA) outputs of a DataRobot project.

Potential use cases

  1. Aid a data scientist's initial understanding of a new or unfamiliar dataset

  2. Use as an input into other LLM-ML hybrid workflows:

    • Extracting natural language insights and summaries from a data science project

    • Target auto-detection and/or suggestion

    • Auto-suggesting and orchestrating feature engineering and data enrichment

    • Data cleaning and data problem auto-detection

    • Relationship auto-detection


This helper is provided as a langchain.chains.base.Chain. The definitions already generated are included in each successive definition prompt to encourage consistency across features.
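The idea of feeding earlier definitions into each later prompt can be illustrated with a small standalone sketch. This is not the DataDictChain implementation: the prompt wording, the function names, and the `fake_llm` stand-in are all hypothetical, and a real chain would call an actual LLM instead.

```python
from typing import Callable, Dict, List


def build_prompt(context: str, feature: str, previous: Dict[str, str]) -> str:
    """Assemble one definition prompt that repeats all earlier definitions."""
    lines = [f"Use case: {context}"]
    if previous:
        lines.append("Definitions so far:")
        lines.extend(f"- {name}: {text}" for name, text in previous.items())
    lines.append(f"Define the feature '{feature}' consistently with the above.")
    return "\n".join(lines)


def define_features(
    context: str, features: List[str], complete: Callable[[str], str]
) -> Dict[str, str]:
    """Define features one at a time, carrying prior definitions forward."""
    definitions: Dict[str, str] = {}
    for feature in features:
        prompt = build_prompt(context, feature, definitions)
        definitions[feature] = complete(prompt).strip()
    return definitions


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM; reports how many definitions it saw."""
    seen = prompt.count("\n- ")
    return f"placeholder definition (saw {seen} earlier definitions)"


result = define_features(
    "Predicting hospital readmissions", ["age", "race"], fake_llm
)
for name, text in result.items():
    print(f"{name}: {text}")
```

Because each prompt grows with the number of features already defined, prompt length scales with the size of the dictionary; that is the cost of the consistency this approach aims for.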

A DataRobot project ID can optionally be provided to enhance the dictionary generation prompts with DataRobot EDA outputs. When a project ID is provided, a list of features to include in the data dictionary is not required.


LLM completions are not always correct and often require a human in the loop for validation; downstream applications of the auto-generated data dictionary should account for this.

import json
import langchain
import os
from datarobotx.llm import DataDictChain

use_case_context = "Predicting hospital readmissions"
dr_project_id = "XXX"
os.environ["OPENAI_API_KEY"] = "XXX"

llm = langchain.llms.OpenAI(model_name="gpt-3.5-turbo")
chain = DataDictChain(as_json=True, llm=llm)
outputs = chain(inputs=dict(project_id=dr_project_id, context=use_case_context))
data_dict = json.loads(outputs['data_dict'])

# async completion (top-level await requires a notebook or another running event loop)
outputs_from_async = await chain.acall(inputs=dict(project_id=dr_project_id, context=use_case_context))

# only define specific features
selected_outputs = chain(inputs=dict(project_id=dr_project_id, context=use_case_context, features='age, race'))
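Since completions need human validation, it may help to triage the parsed dictionary before anyone reads it. The sketch below is one possible approach and makes a loud assumption: that the JSON string decodes to a flat object mapping each feature name to its definition. The actual output layout is not specified here, so adjust accordingly. The `review_queue` helper and the sample data are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def review_queue(
    data_dict_json: str, expected_features: Iterable[str]
) -> Dict[str, List[str]]:
    """Sort features into buckets a human reviewer can work through.

    Assumes (not confirmed by the library) a flat JSON object of
    feature name -> definition text.
    """
    parsed = json.loads(data_dict_json)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object mapping feature to definition")
    expected = list(expected_features)
    missing = [f for f in expected if f not in parsed]
    empty = [f for f in expected if f in parsed and not str(parsed[f]).strip()]
    unexpected = [f for f in parsed if f not in expected]
    return {"missing": missing, "empty": empty, "unexpected": unexpected}


# Made-up sample standing in for outputs['data_dict'].
sample = json.dumps({"age": "Patient age bracket", "race": "", "zip": "Postal code"})
report = review_queue(sample, ["age", "race", "gender"])
print(report)
```

A check like this only catches structural gaps; whether a definition is actually correct still requires a reviewer who knows the data.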


  • The chain has an optional as_json constructor argument that governs whether the output is returned as a natural language string or a JSON string; the default is False

  • Similar to langchain.chains.llm.LLMChain, the chain has a verbose constructor argument that determines whether langchain verbose output is used during chain execution; the default is False

  • The optional def_feature_chain constructor argument specifies a custom langchain.chains.llm.LLMChain to use for completing an individual feature definition; the expected inputs to this chain are context (the use case context) and feature (the feature name plus an optional DataRobot EDA summary)
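When writing a custom prompt for def_feature_chain, a quick sanity check is that the template exposes exactly the two expected inputs, context and feature. The sketch below does this with the standard library only; the template wording and helper names are hypothetical, and wrapping the template in a langchain prompt and LLMChain before passing it to DataDictChain is left out.

```python
from string import Formatter
from typing import Set

# Input names the custom chain is expected to accept (from the text above).
EXPECTED_INPUTS = {"context", "feature"}

# Hypothetical prompt wording, for illustration only.
CUSTOM_TEMPLATE = (
    "You are documenting a dataset for this use case: {context}\n"
    "Write a one-sentence definition for the following feature.\n"
    "{feature}\n"
    "Definition:"
)


def template_inputs(template: str) -> Set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_template(template: str) -> None:
    """Raise if the template's placeholders differ from the expected inputs."""
    found = template_inputs(template)
    if found != EXPECTED_INPUTS:
        raise ValueError(
            f"template inputs {sorted(found)} != {sorted(EXPECTED_INPUTS)}"
        )


check_template(CUSTOM_TEMPLATE)
rendered = CUSTOM_TEMPLATE.format(
    context="Predicting hospital readmissions",
    feature="age (categorical)",
)
print(rendered)
```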

API Reference


Generate a data dictionary using an LLM.