OpenAI metadata tagger

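It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for a more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.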

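The OpenAIMetadataTagger document transformer automates this process by extracting metadata from each provided document according to a provided schema. It uses a configurable OpenAI Functions-powered chain under the hood, so if you pass a custom LLM instance, it must be an OpenAI model with functions support.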

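Note: This document transformer works best with complete documents, so run it on whole documents first, before doing any other splitting or processing.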
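To illustrate why the order matters, here is a minimal, library-free sketch. It assumes documents are plain dicts and uses a hypothetical `split_document` helper (not part of LangChain) that copies the parent's metadata onto every chunk. Under that assumption, tagging the whole document first lets each later chunk carry the same fields.

```python
from copy import deepcopy


def split_document(doc: dict, chunk_size: int) -> list[dict]:
    """Hypothetical splitter: cut the text into fixed-size chunks and
    copy the parent's metadata onto each chunk."""
    text = doc["page_content"]
    return [
        {
            "page_content": text[start : start + chunk_size],
            "metadata": deepcopy(doc["metadata"]),
        }
        for start in range(0, len(text), chunk_size)
    ]


# A document that has already been tagged as a whole.
tagged = {
    "page_content": "Review of The Bee Movie\nBy Roger Ebert\n\nThis is the greatest movie ever made.",
    "metadata": {"movie_title": "The Bee Movie", "critic": "Roger Ebert"},
}

chunks = split_document(tagged, chunk_size=40)

# Every chunk keeps the title and critic, even chunks whose own text
# no longer mentions them.
for chunk in chunks:
    print(chunk["metadata"])
```

Had the text been split first, the second chunk would contain neither the title nor the critic's name, so there would be nothing for the tagger to extract from it.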

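For example, let's say you wanted to index a set of movie reviews. You could initialize the document transformer with a valid JSON Schema object as follows: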

from langchain_community.document_transformers.openai_functions import (
    create_metadata_tagger,
)
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI

schema = {
    "properties": {
        "movie_title": {"type": "string"},
        "critic": {"type": "string"},
        "tone": {"type": "string", "enum": ["positive", "negative"]},
        "rating": {
            "type": "integer",
            "description": "The number of stars the critic rated the movie",
        },
    },
    "required": ["movie_title", "critic", "tone"],
}

# Must be an OpenAI model that supports functions
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")

document_transformer = create_metadata_tagger(metadata_schema=schema, llm=llm)
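As a sanity check on what the schema asks for, the sketch below validates a metadata dict against the `required` and `enum` constraints of a schema shaped like the one above. The `check_metadata` helper is hypothetical and covers only these two keywords. It is not how the tagger itself enforces the schema, and a full validator would be needed for the rest of JSON Schema.

```python
schema = {
    "properties": {
        "movie_title": {"type": "string"},
        "critic": {"type": "string"},
        "tone": {"type": "string", "enum": ["positive", "negative"]},
        "rating": {"type": "integer"},
    },
    "required": ["movie_title", "critic", "tone"],
}


def check_metadata(metadata: dict, schema: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty if none)."""
    problems = []
    # Every field listed under "required" must be present.
    for name in schema.get("required", []):
        if name not in metadata:
            problems.append(f"missing required field: {name}")
    # Fields with an "enum" constraint must hold one of the listed values.
    for name, spec in schema["properties"].items():
        if name in metadata and "enum" in spec and metadata[name] not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems


ok = {"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive"}
bad = {"movie_title": "The Godfather", "tone": "mixed"}

print(check_metadata(ok, schema))
print(check_metadata(bad, schema))
```

Note that `rating` is not in `required`, so a review with no star count still passes this check.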

You can then simply pass the document transformer a list of documents, and it will extract metadata from the contents:

original_documents = [
    Document(
        page_content="Review of The Bee Movie\nBy Roger Ebert\n\nThis is the greatest movie ever made. 4 out of 5 stars."
    ),
    Document(
        page_content="Review of The Godfather\nBy Anonymous\n\nThis movie was super boring. 1 out of 5 stars.",
        metadata={"reliable": False},
    ),
]

enhanced_documents = document_transformer.transform_documents(original_documents)

import json

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)
Review of The Bee Movie
By Roger Ebert

This is the greatest movie ever made. 4 out of 5 stars.

{"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive", "rating": 4}

---------------

Review of The Godfather
By Anonymous

This movie was super boring. 1 out of 5 stars.

{"movie_title": "The Godfather", "critic": "Anonymous", "tone": "negative", "rating": 1, "reliable": false}

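The new documents can then be further processed by a text splitter before being loaded into a vector store. Extracted fields will not overwrite existing metadata.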
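The no-overwrite behavior can be pictured as a dict merge in which the document's existing metadata is applied last. This is a sketch of the described behavior using plain dicts, not the library's code; `merge_metadata` and the sample values are hypothetical.

```python
def merge_metadata(extracted: dict, existing: dict) -> dict:
    """Hypothetical helper: combine both dicts, letting existing keys win."""
    # Later entries in a dict literal override earlier ones, so unpacking
    # `existing` second preserves whatever the document already had.
    return {**extracted, **existing}


extracted = {"movie_title": "The Godfather", "critic": "Anonymous", "tone": "negative"}
existing = {"reliable": False, "critic": "Staff writer"}

merged = merge_metadata(extracted, existing)
print(merged)
```

Here the pre-existing `critic` value survives, while `movie_title` and `tone` are added because the document had no such keys.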

You can also initialize the document transformer with a Pydantic schema:

from typing import Literal

from pydantic import BaseModel, Field


class Properties(BaseModel):
    movie_title: str
    critic: str
    tone: Literal["positive", "negative"]
    rating: int = Field(description="Rating out of 5 stars")


document_transformer = create_metadata_tagger(Properties, llm)
enhanced_documents = document_transformer.transform_documents(original_documents)

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)
Review of The Bee Movie
By Roger Ebert

This is the greatest movie ever made. 4 out of 5 stars.

{"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive", "rating": 4}

---------------

Review of The Godfather
By Anonymous

This movie was super boring. 1 out of 5 stars.

{"movie_title": "The Godfather", "critic": "Anonymous", "tone": "negative", "rating": 1, "reliable": false}

Customization

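You can pass the standard LLMChain arguments to the underlying tagging chain through the document transformer constructor. For example, if you wanted to ask the LLM to focus on specific details in the input documents, or to extract metadata in a certain style, you could pass in a custom prompt: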

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    """Extract relevant information from the following text.
Anonymous critics are actually Roger Ebert.

{input}
"""
)

document_transformer = create_metadata_tagger(schema, llm, prompt=prompt)
enhanced_documents = document_transformer.transform_documents(original_documents)

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)
Review of The Bee Movie
By Roger Ebert

This is the greatest movie ever made. 4 out of 5 stars.

{"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive", "rating": 4}

---------------

Review of The Godfather
By Anonymous

This movie was super boring. 1 out of 5 stars.

{"movie_title": "The Godfather", "critic": "Roger Ebert", "tone": "negative", "rating": 1, "reliable": false}
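To see roughly what the model receives, the sketch below fills the `{input}` placeholder with a document's text using plain `str.format`. It assumes the chain substitutes each document's `page_content` for `{input}`; the real `ChatPromptTemplate` would additionally wrap the result in a chat message, which is not shown here.

```python
template = """Extract relevant information from the following text.
Anonymous critics are actually Roger Ebert.

{input}
"""

page_content = (
    "Review of The Godfather\nBy Anonymous\n\n"
    "This movie was super boring. 1 out of 5 stars."
)

# Assumption: {input} is replaced by the document's page_content.
rendered = template.format(input=page_content)
print(rendered)
```

Because the extra instruction sits in the same prompt as the review text, the model applies it during extraction, which is consistent with the second review's critic being reported as Roger Ebert.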
