OpenAI metadata tagger
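It is often useful to tag ingested documents with structured metadata, such as a document's title, tone, or length, so that later similarity searches can be more targeted. For large numbers of documents, however, doing this tagging by hand is tedious.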
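The OpenAIMetadataTagger document transformer automates this process by extracting metadata from each document according to a schema you provide. Under the hood it uses a configurable chain powered by OpenAI Functions, so if you pass a custom LLM instance, it must be an OpenAI model that supports functions.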
Note: this document transformer works best with complete documents, so run it on whole documents first, before any splitting or other processing.
For example, say you want to index a set of movie reviews. You could initialize the document transformer with a valid JSON Schema object as follows:
from langchain_community.document_transformers.openai_functions import (
    create_metadata_tagger,
)
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
schema = {
    "properties": {
        "movie_title": {"type": "string"},
        "critic": {"type": "string"},
        "tone": {"type": "string", "enum": ["positive", "negative"]},
        "rating": {
            "type": "integer",
            "description": "The number of stars the critic rated the movie",
        },
    },
    "required": ["movie_title", "critic", "tone"],
}
# Must be an OpenAI model that supports functions
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
document_transformer = create_metadata_tagger(metadata_schema=schema, llm=llm)
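To make the schema's constraints concrete, here is a minimal sketch, using only the standard library, of checking a metadata dictionary against it. `check_metadata` and `JSON_TYPES` are hypothetical names used for illustration and are not part of LangChain. The sketch covers only the `required`, `type`, and `enum` keywords used above and is not a full JSON Schema validator.

```python
# Hypothetical helper for illustration only; not part of LangChain.
JSON_TYPES = {"string": str, "integer": int}


def check_metadata(metadata: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the dict conforms."""
    problems = []
    # Every field listed under "required" must be present.
    for key in schema.get("required", []):
        if key not in metadata:
            problems.append(f"missing required field: {key}")
    # Present fields must match their declared type and enum, if any.
    for key, spec in schema["properties"].items():
        if key not in metadata:
            continue
        value = metadata[key]
        expected = JSON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected is not None and (
            not isinstance(value, expected) or isinstance(value, bool)
        ):
            problems.append(f"{key}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key}: {value!r} is not an allowed value")
    return problems


schema = {
    "properties": {
        "movie_title": {"type": "string"},
        "critic": {"type": "string"},
        "tone": {"type": "string", "enum": ["positive", "negative"]},
        "rating": {"type": "integer"},
    },
    "required": ["movie_title", "critic", "tone"],
}

good = {"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive", "rating": 4}
bad = {"movie_title": "The Godfather", "tone": "mixed"}

print(check_metadata(good, schema))  # prints []
print(check_metadata(bad, schema))
```

Note that `rating` is not listed under `required`, so a document without a star rating would still conform.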
With the transformer initialized, pass it a list of documents, and it will extract metadata from their contents:
original_documents = [
    Document(
        page_content="Review of The Bee Movie\nBy Roger Ebert\n\nThis is the greatest movie ever made. 4 out of 5 stars."
    ),
    Document(
        page_content="Review of The Godfather\nBy Anonymous\n\nThis movie was super boring. 1 out of 5 stars.",
        metadata={"reliable": False},
    ),
]
enhanced_documents = document_transformer.transform_documents(original_documents)
import json
print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)
Review of The Bee Movie
By Roger Ebert
This is the greatest movie ever made. 4 out of 5 stars.
{"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive", "rating": 4}
---------------
Review of The Godfather
By Anonymous
This movie was super boring. 1 out of 5 stars.
{"movie_title": "The Godfather", "critic": "Anonymous", "tone": "negative", "rating": 1, "reliable": false}
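The new documents can then be further processed by a text splitter before being loaded into a vector store. Extracted fields will not overwrite existing metadata.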
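As a rough illustration of those two points, the sketch below shows one way "existing metadata wins" can be expressed as a dictionary merge, and how a splitter can carry the tagged metadata onto every chunk. It is dependency-free and makes several assumptions: `Doc` is a stand-in for `Document`, and `merge_metadata` and `split_with_metadata` are hypothetical helpers, not LangChain functions. A real text splitter would split on separators rather than at fixed widths.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (assumption)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_metadata(extracted: dict, existing: dict) -> dict:
    # In a dict merge the later keys win, so existing metadata takes precedence.
    return {**extracted, **existing}


def split_with_metadata(doc: Doc, chunk_size: int) -> list:
    # Naive fixed-width splitter; each chunk gets its own copy of the metadata.
    return [
        Doc(doc.page_content[i : i + chunk_size], dict(doc.metadata))
        for i in range(0, len(doc.page_content), chunk_size)
    ]


existing = {"reliable": False, "critic": "Staff"}
extracted = {"movie_title": "The Godfather", "critic": "Anonymous", "tone": "negative"}

tagged = Doc(
    "This movie was super boring. 1 out of 5 stars.",
    merge_metadata(extracted, existing),
)
chunks = split_with_metadata(tagged, chunk_size=20)

print(tagged.metadata["critic"])  # prints Staff: the pre-existing value is kept
for chunk in chunks:
    print(repr(chunk.page_content), chunk.metadata["movie_title"])
```

Tagging before splitting means the model sees the whole review once, and every chunk inherits the same title, critic, and tone.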
You can also initialize the document transformer with a Pydantic schema:
from typing import Literal
from pydantic import BaseModel, Field
class Properties(BaseModel):
    movie_title: str
    critic: str
    tone: Literal["positive", "negative"]
    rating: int = Field(description="Rating out of 5 stars")
document_transformer = create_metadata_tagger(Properties, llm)
enhanced_documents = document_transformer.transform_documents(original_documents)
print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)
Review of The Bee Movie
By Roger Ebert
This is the greatest movie ever made. 4 out of 5 stars.
{"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive", "rating": 4}
---------------
Review of The Godfather
By Anonymous
This movie was super boring. 1 out of 5 stars.
{"movie_title": "The Godfather", "critic": "Anonymous", "tone": "negative", "rating": 1, "reliable": false}
Customization
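You can pass standard LLMChain arguments to the underlying tagging chain through the document transformer constructor. For example, if you want the LLM to focus on specific details in the input documents, or to extract metadata in a particular style, you can pass in a custom prompt: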
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template(
    """Extract relevant information from the following text.
Anonymous critics are actually Roger Ebert.
{input}
"""
)
document_transformer = create_metadata_tagger(schema, llm, prompt=prompt)
enhanced_documents = document_transformer.transform_documents(original_documents)
print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)
Review of The Bee Movie
By Roger Ebert
This is the greatest movie ever made. 4 out of 5 stars.
{"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive", "rating": 4}
---------------
Review of The Godfather
By Anonymous
This movie was super boring. 1 out of 5 stars.
{"movie_title": "The Godfather", "critic": "Roger Ebert", "tone": "negative", "rating": 1, "reliable": false}
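To see why the custom prompt changes the extracted critic, it may help to look at the text the model receives. The sketch below is a simplification under stated assumptions: it assumes each document's `page_content` is substituted for the `{input}` placeholder, and it uses plain `str.format` in place of `ChatPromptTemplate`. `TEMPLATE` and `render` are hypothetical names used only for illustration.

```python
# Same wording as the custom prompt above, kept as a plain string.
TEMPLATE = """Extract relevant information from the following text.
Anonymous critics are actually Roger Ebert.

{input}
"""


def render(template: str, page_content: str) -> str:
    # str.format stands in for the prompt template's own formatting step.
    return template.format(input=page_content)


rendered = render(
    TEMPLATE,
    "Review of The Godfather\nBy Anonymous\n\nThis movie was super boring. 1 out of 5 stars.",
)
print(rendered)
```

The extra instruction sits ahead of the review text in the same prompt, which is consistent with the critic for the second review being reported as Roger Ebert in the output above.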