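How to handle high-cardinality categoricals when doing query analysis
You may want to do query analysis to create a filter on a categorical column. One difficulty is that you usually need to specify the exact categorical value, so the LLM has to generate that value exactly. With only a few valid values, this is relatively easy to do with prompting. With a large number of valid values it becomes harder: the values may not fit in the LLM context, or, if they do fit, there may be too many for the LLM to attend to properly.
In this notebook we take a look at how to approach this.
Setup
Install dependencies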
%pip install -qU langchain langchain-community langchain-openai faker langchain-chroma
Note: you may need to restart the kernel to use updated packages.
Set environment variables
In this example we'll use OpenAI
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass()
# Optional, uncomment to trace runs with LangSmith. Sign up here: https://smith.langchain.com.
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()
Set up data
We will generate a bunch of fake names
from faker import Faker
fake = Faker()
names = [fake.name() for _ in range(10000)]
Let's look at some of the names
names[0]
'Jacob Adams'
names[567]
'Eric Acevedo'
Query analysis
We can now set up a baseline query analysis
from pydantic import BaseModel, Field, model_validator
class Search(BaseModel):
    query: str
    author: str
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
system = """Generate a relevant search query for a library system"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(Search)
query_analyzer = {"question": RunnablePassthrough()} | prompt | structured_llm
We can see that if we spell the name exactly right, it knows how to handle it
query_analyzer.invoke("what are books about aliens by Jesse Knight")
Search(query='aliens', author='Jesse Knight')
The issue is that the value you want to filter on may NOT be spelled exactly right
query_analyzer.invoke("what are books about aliens by jess knight")
Search(query='aliens', author='Jess Knight')
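To see why the near miss matters, here is a minimal sketch of an exact-match filter. The `books` records and the `filter_by_author` helper are hypothetical and only illustrate the failure mode; they are not part of the notebook's pipeline.

```python
# Hypothetical catalog records, for illustration only.
books = [
    {"title": "First Contact", "author": "Jesse Knight"},
    {"title": "Garden Basics", "author": "Eric Acevedo"},
]


def filter_by_author(records, author):
    """Return the records whose author matches `author` exactly."""
    return [r for r in records if r["author"] == author]


# The exact spelling finds the book; the near miss finds nothing.
print(filter_by_author(books, "Jesse Knight"))
print(filter_by_author(books, "Jess Knight"))
```

Because the comparison is exact, a single missing letter in the generated value is enough to return an empty result.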
Add in all values
One way around this is to add ALL possible values to the prompt. That will generally guide the query in the right direction
system = """Generate a relevant search query for a library system.
`author` attribute MUST be one of:
{authors}
Do NOT hallucinate author name!"""
base_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
prompt = base_prompt.partial(authors=", ".join(names))
query_analyzer_all = {"question": RunnablePassthrough()} | prompt | structured_llm
However... if the list of categoricals is long enough, it may error!
try:
    res = query_analyzer_all.invoke("what are books about aliens by jess knight")
except Exception as e:
    print(e)
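Whether the full list fits depends on the model's context window, so it can help to estimate the prompt size before sending the request. The sketch below uses a rough rule of thumb of about four characters per token, which is an assumption and not a real tokenizer. The `sample_names` list is a stand-in for the Faker-generated names, and `estimate_tokens` is a hypothetical helper.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4-chars-per-token ratio is an assumed heuristic."""
    return int(len(text) / chars_per_token)


# Stand-in for the 10,000 Faker-generated names used above.
sample_names = ["Jesse Knight", "Eric Acevedo", "Jacob Adams", "Jennifer Knight"] * 2500

joined = ", ".join(sample_names)
print(len(sample_names), "names ->", estimate_tokens(joined), "estimated tokens")
```

For an accurate count, a model-specific tokenizer would be needed. The point here is only that the list alone already reaches tens of thousands of tokens.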
We can try to use a longer context window... but with so much information in there, it is not guaranteed to pick the right value reliably
llm_long = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
structured_llm_long = llm_long.with_structured_output(Search)
query_analyzer_all = {"question": RunnablePassthrough()} | prompt | structured_llm_long
query_analyzer_all.invoke("what are books about aliens by jess knight")
Search(query='aliens', author='jess knight')
Find all relevant values
Instead, we can create an index over the relevant values and then query it for the N most relevant values:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_texts(names, embeddings, collection_name="author_names")
def select_names(question):
    _docs = vectorstore.similarity_search(question, k=10)
    _names = [d.page_content for d in _docs]
    return ", ".join(_names)
create_prompt = {
    "question": RunnablePassthrough(),
    "authors": select_names,
} | base_prompt
query_analyzer_select = create_prompt | structured_llm
create_prompt.invoke("what are books by jess knight")
ChatPromptValue(messages=[SystemMessage(content='Generate a relevant search query for a library system.\n\n`author` attribute MUST be one of:\n\nJennifer Knight, Jill Knight, John Knight, Dr. Jeffrey Knight, Christopher Knight, Andrea Knight, Brandy Knight, Jennifer Keller, Becky Chambers, Sarah Knapp\n\nDo NOT hallucinate author name!'), HumanMessage(content='what are books by jess knight')])
query_analyzer_select.invoke("what are books about aliens by jess knight")
Search(query='books about aliens', author='Jennifer Knight')
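Even with a shortlist in the prompt, nothing forces the model to pick from it, so it can be worth checking the result before using it as a filter. A minimal sketch, assuming a hypothetical `check_author` helper and a plain Python set of valid names:

```python
def check_author(author: str, valid_names: set) -> str:
    """Return `author` unchanged if it is a known value, otherwise raise."""
    if author not in valid_names:
        raise ValueError(f"Unknown author: {author!r}")
    return author


# Small stand-in for the full set of valid author names.
valid_names = {"Jennifer Knight", "Jill Knight", "John Knight"}
print(check_author("Jennifer Knight", valid_names))
```

A check like this turns a silent mismatch into an explicit error that the caller can handle, for example by retrying or by falling back to a similarity lookup.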
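Replace after selection
Another method is to let the LLM fill in whatever value, and then convert that value to a valid value. This can actually be done with the Pydantic class itself!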
class Search(BaseModel):
    query: str
    author: str

    @model_validator(mode="before")
    @classmethod
    def correct_author(cls, values: dict) -> dict:
        # Replace whatever the LLM produced with the closest known author name.
        author = values["author"]
        closest_valid_author = vectorstore.similarity_search(author, k=1)[
            0
        ].page_content
        values["author"] = closest_valid_author
        return values
system = """Generate a relevant search query for a library system"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
corrective_structure_llm = llm.with_structured_output(Search)
corrective_query_analyzer = (
    {"question": RunnablePassthrough()} | prompt | corrective_structure_llm
)
corrective_query_analyzer.invoke("what are books about aliens by jes knight")
Search(query='aliens', author='John Knight')
# TODO: show trigram similarity
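The TODO above mentions trigram similarity. As a rough illustration of the idea, and not the notebook's own implementation, the sketch below scores candidates by the Jaccard overlap of their character trigrams. The helper names (`trigrams`, `trigram_similarity`, `closest_name`) and the padding scheme are assumptions made for this example.

```python
def trigrams(text: str) -> set:
    """Return the set of character trigrams of `text`, lowercased and padded."""
    # Pad with two leading spaces and one trailing space (an assumed convention)
    # so that the start and end of the string also contribute trigrams.
    padded = f"  {text.lower()} "
    return {padded[i : i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the trigram sets of two strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def closest_name(value: str, candidates: list) -> str:
    """Return the candidate with the highest trigram similarity to `value`."""
    return max(candidates, key=lambda name: trigram_similarity(value, name))


candidates = ["John Knight", "Jesse Knight", "Jennifer Knight"]
print(closest_name("jes knight", candidates))
```

Unlike the embedding lookup, this compares spelling only, needs no API calls, and could stand in for the `similarity_search` call inside a validator when the errors are mostly typos.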