Enrich#

../_images/enrich.png

Enrich structured datasets with completions from LLMs, chains, or tools

Motivation#

For many predictive modeling problems, enriching the training dataset with additional, relevant features can yield significant improvements in model performance.

Large language models have capabilities that may help accelerate these data enrichment workflows for predictive modeling:

  • Breadth of general knowledge: Many LLMs can answer questions on an extremely broad range of topics and may also be able to perform basic reasoning and planning tasks

  • Natural language interface: LLMs can be instructed in natural language, which further shortens the time from human intent to results

In the past, integrating general knowledge into a predictive modeling use case might have been prohibitively costly or cumbersome (e.g. manual labeling, or writing scraping and parsing boilerplate). LLMs offer an opportunity to greatly accelerate experimentation with these workflows.

Potential use cases#

  1. Retrieving additional or related context using a feature or combination of features

  2. Cleaning inconsistently formatted or specified features

  3. Applying transformations that require common sense or general knowledge (e.g. simple labeling)

  4. Extracting entities/values from text features or classifying them

Usage#

This helper is designed to accelerate and enhance experimental enrichment workflows beyond what direct use of an LLM API offers. Default capabilities include:

  • Caching duplicate enrichment completions

  • Progress updates when performing many row-level completions

  • Automatically mapping pandas row or column values into a templated question

  • ML-oriented default contextual prompts and chains:

    • Attempting to infer and instruct around an appropriate completion type: numeric, categorical, date, or free-text

    • Including prior completions in successive prompts to encourage consistency (e.g. of date formatting and categorical levels)

  • Customizable and LLM-agnostic: interoperates with custom LangChain Chains, Tools, LLMs

import os

import pandas as pd
from langchain.llms import OpenAI
from datarobotx.llm import enrich

# Placeholder: replace 'XXX' with a real OpenAI API key
os.environ['OPENAI_API_KEY'] = 'XXX'
llm = OpenAI(model_name='text-davinci-003')

df = pd.read_csv('https://s3.amazonaws.com/datarobot_public_datasets/'
                 '10K_2007_to_2011_Lending_Club_Loans_v2_mod_80.csv')

# Work on a copy of the first five rows
df_test = df[:5].copy(deep=True)
# {emp_title} in the question is filled from each row's emp_title value
df_test['f500_or_gov'] = df_test.apply(
    enrich('Is "{emp_title}" a Fortune 500 company or '
           'large government organization (Y/N)?', llm),
    axis=1)
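
To make the caching, templating, and prior-completion behaviors listed above more concrete, here is a minimal sketch in plain pandas. It is not the datarobotx implementation: `make_enricher`, `fake_llm`, the `n_prior` parameter, and the "Q:/A:" prompt layout are all hypothetical names and choices made for illustration, and the stand-in "LLM" is a simple function so the example runs offline.

```python
import pandas as pd


def make_enricher(template, complete, n_prior=3):
    """Return a row-wise function suitable for DataFrame.apply(..., axis=1).

    `complete` is any callable mapping a prompt string to a completion
    string (a hypothetical stand-in for a real LLM call).
    """
    cache = {}    # formatted question -> completion
    history = []  # (question, completion) pairs, oldest first

    def enrich_row(row):
        # Fill the template placeholders from the row's column values.
        question = template.format(**row.to_dict())
        # Identical questions are answered from the cache, not the LLM.
        if question in cache:
            return cache[question]
        # Prepend recent completions to encourage consistent formatting.
        recent = history[-n_prior:] if n_prior else []
        examples = "".join(f"Q: {q}\nA: {a}\n" for q, a in recent)
        completion = complete(f"{examples}Q: {question}\nA:")
        cache[question] = completion
        history.append((question, completion))
        return completion

    return enrich_row


prompts_sent = []  # records every prompt that reaches the stand-in LLM


def fake_llm(prompt):
    """Toy completion function (assumption): answers from a keyword match."""
    prompts_sent.append(prompt)
    current_question = prompt.splitlines()[-2]  # the line before the final "A:"
    return "Y" if "Army" in current_question else "N"


df = pd.DataFrame({"emp_title": ["US Army", "Acme Bakery", "US Army"]})
question = 'Is "{emp_title}" a large government organization (Y/N)?'
df["gov"] = df.apply(make_enricher(question, fake_llm), axis=1)
print(df)
print("LLM calls:", len(prompts_sent))
```

The third row repeats the first row's question, so only two prompts reach the stand-in model. Keying the cache on the fully formatted question, rather than on the row, means rows that differ only in unused columns still share a completion.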

Warning

Using LLM completions as inputs to a predictive model introduces multiple potential complications, including (but not limited to):

  • Target leakage

  • Label and/or feature noise (e.g. incorrect completions and hallucinations)

  • Deployment complexity

For these reasons, we presently encourage using these capabilities primarily as an experimental mechanism for feature discovery.

Embeddings#

Using a similar syntax, dataframes can be augmented with embeddings from LangChain embedding models. By default, all-MiniLM-L6-v2 from HuggingFace SentenceTransformerEmbeddings is used because of its generally fast inference and strong performance in sentence embedding use cases. Benchmarks for other embedding models are shown on the HuggingFace embeddings leaderboard. Additional supported embeddings are listed in the LangChain docs.

from langchain.embeddings import OpenAIEmbeddings
from datarobotx.llm import embed

embedding_llm = OpenAIEmbeddings(model="text-embedding-ada-002")
# df_test is the five-row dataframe from the enrich example above
df_with_embeddings = df_test.join(
    df_test.apply(embed('Applicant works at {emp_title} and needs a loan for {title}',
                        embedding_llm),
                  axis=1))
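
The join in the example above suggests that applying `embed` row-wise yields one set of embedding columns per row; that reading is an inference, not something this page confirms. The sketch below shows the same apply-then-join pattern in plain pandas under that assumption. `make_embedder`, `toy_embedding`, and the `emb_` column prefix are hypothetical, and the hash-based vector is a placeholder with no semantic meaning, used only so the example runs without a model.

```python
import hashlib

import pandas as pd


def make_embedder(template, embed_text, prefix="emb_"):
    """Return a row-wise function that embeds a templated sentence.

    `embed_text` is any callable mapping a string to a list of floats
    (a hypothetical stand-in for a real embedding model).
    """
    def embed_row(row):
        vector = embed_text(template.format(**row.to_dict()))
        # Returning a Series makes DataFrame.apply expand it into columns.
        names = [f"{prefix}{i}" for i in range(len(vector))]
        return pd.Series(vector, index=names)

    return embed_row


def toy_embedding(text, dim=4):
    """Deterministic placeholder vector from a hash; NOT a real embedding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


df = pd.DataFrame({
    "emp_title": ["US Army", "Acme Bakery"],
    "title": ["car", "home repairs"],
})
template = "Applicant works at {emp_title} and needs a loan for {title}"
embedded = df.apply(make_embedder(template, toy_embedding), axis=1)
df_with_embeddings = df.join(embedded)  # aligns on the shared row index
print(df_with_embeddings.columns.tolist())
```

Because both frames share the original row index, `join` attaches each row's embedding columns to the right source row without any explicit key.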

API Reference#

llm.enrich(question, using[, default_cache, ...])

Enrich structured data with completions from an LLM or chain.