How to handle long text when doing extraction

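When working with files such as PDFs, you are likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

1. Change LLM: Choose a different LLM that supports a larger context window.
2. Brute force: Chunk the document and extract content from each chunk.
3. RAG: Chunk the document, index the chunks, and extract content only from the subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you are designing!

This guide demonstrates how to implement strategies 2 and 3.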
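Before picking a strategy, it can help to estimate whether the text fits in the context window at all. The sketch below is only a rough illustration and not part of LangChain: the helper names are hypothetical, and the figure of about 4 characters per token is an assumed rule of thumb for English text. For real decisions, count tokens with your model's own tokenizer.

```python
import math


def fits_in_context(
    text: str, context_window_tokens: int, chars_per_token: float = 4.0
) -> bool:
    """Roughly estimate whether `text` fits in a context window.

    `chars_per_token` is an assumed average, not a property of any real tokenizer.
    """
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= context_window_tokens


def estimated_chunk_count(
    text: str, chunk_size_tokens: int, chars_per_token: float = 4.0
) -> int:
    """Rough number of chunks needed if the text is split without overlap."""
    estimated_tokens = math.ceil(len(text) / chars_per_token)
    return max(1, math.ceil(estimated_tokens / chunk_size_tokens))
```

If the estimate is far above the window, strategies 2 and 3 become relevant; if it is close to the limit, measure with a real tokenizer before deciding.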

Setup

First, install the dependencies needed for this guide:

%pip install -qU langchain-community lxml faiss-cpu langchain-openai

Note: you may need to restart the kernel to use updated packages.

Now we need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document.

import re

import requests
from langchain_community.document_loaders import BSHTMLLoader

# Download the content
response = requests.get("https://en.wikipedia.org/wiki/Car")
# Write it to a file
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
# Load it with an HTML parser
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
# Clean up code
# Replace consecutive new lines with a single new line
document.page_content = re.sub("\n\n+", "\n", document.page_content)

API Reference: BSHTMLLoader

print(len(document.page_content))

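Define the schema

Following the extraction tutorial, we will use Pydantic to define the schema of the information we want to extract. In this case, we will extract a list of "key developments" (e.g., important historical events) that include a year and a description.

Note that we also include an evidence key and instruct the model to provide, verbatim, the relevant sentences from the article. This allows us to compare the extraction results with (the model's reconstruction of) text from the original document.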

from typing import List, Optional

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from pydantic import BaseModel, Field


class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat in verbatim the sentence(s) from which the year and description information were extracted",
    )


class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]


# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic development in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)

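API Reference: ChatPromptTemplate | MessagesPlaceholder

Create an extractor

Let's select an LLM. Because we are using tool calling, we need a model that supports the tool-calling feature. See this table for available LLMs.

Select chat model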

pip install -qU "langchain[google-genai]"

import getpass
import os

if not os.environ.get("GOOGLE_API_KEY"):
  os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")

extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    include_raw=False,
)

Brute force approach

Split the document into chunks such that each chunk fits into the context window of the LLM.

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    # Controls the size of each chunk
    chunk_size=2000,
    # Controls overlap between chunks
    chunk_overlap=20,
)

texts = text_splitter.split_text(document.page_content)

API Reference: TokenTextSplitter
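To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sliding-window sketch. It is not how TokenTextSplitter is implemented: the function name is hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split a token list into windows of `chunk_size` tokens.

    Consecutive windows share `chunk_overlap` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return chunks


# Words stand in for tokens here (an assumption made for readability).
words = "a b c d e f g h i j".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so a larger overlap means more chunks and more repeated text sent to the model.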

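Use batch functionality to run the extraction in parallel across each chunk!

Tip

You can often use .batch() to parallelize the extractions! .batch uses a thread pool under the hood to help you parallelize workloads.

If your model is exposed via an API, this will likely speed up your extraction flow!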

# Limit just to the first 3 chunks
# so the code can be re-run quickly
first_few = texts[:3]

extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max concurrency!
)
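The thread-pool behavior mentioned in the tip can be approximated with the standard library. This is a sketch of the general idea, not LangChain's actual code: run_in_parallel and fake_extract are hypothetical names, and fake_extract stands in for a call to an API-backed extractor.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_parallel(
    fn: Callable[[T], R], inputs: List[T], max_concurrency: int = 5
) -> List[R]:
    """Apply `fn` to every input using at most `max_concurrency` threads.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fn, inputs))


def fake_extract(chunk: str) -> int:
    """Hypothetical stand-in for an extractor call; it just counts words."""
    return len(chunk.split())


results = run_in_parallel(
    fake_extract, ["one two", "three", "four five six"], max_concurrency=2
)
```

Threads help here because each call spends most of its time waiting on the network. Capping concurrency keeps you within the provider's rate limits.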

Merge results

After extracting data from across the chunks, we need to merge the extractions together.

key_developments = []

for extraction in extractions:
    key_developments.extend(extraction.key_developments)

key_developments[:10]

[KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769, while the Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769, while the Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1886, description='Carl Benz invented the modern car, a practical, marketable automobile for everyday use, and patented his Benz Patent-Motorwagen.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when the German inventor Carl Benz patented his Benz Patent-Motorwagen.'),
 KeyDevelopment(year=1901, description='The Oldsmobile Curved Dash became the first mass-produced car.', evidence='The 1901 Oldsmobile Curved Dash and the 1908 Ford Model T, both American cars, are widely considered the first mass-produced[3][4] and mass-affordable[5][6][7] cars, respectively.'),
 KeyDevelopment(year=1908, description='The Ford Model T became the first mass-affordable car.', evidence='The 1901 Oldsmobile Curved Dash and the 1908 Ford Model T, both American cars, are widely considered the first mass-produced[3][4] and mass-affordable[5][6][7] cars, respectively.'),
 KeyDevelopment(year=1885, description='Carl Benz built the original Benz Patent-Motorwagen, the first modern car.', evidence='The original Benz Patent-Motorwagen, the first modern car, built in 1885 and awarded the patent for the concept'),
 KeyDevelopment(year=1881, description='Gustave Trouvé demonstrated a three-wheeled car powered by electricity.', evidence='In November 1881, French inventor Gustave Trouvé demonstrated a three-wheeled car powered by electricity at the International Exposition of Electricity.'),
 KeyDevelopment(year=1888, description="Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.", evidence="In August 1888, Bertha Benz, the wife and business partner of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention."),
 KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),
 KeyDevelopment(year=1897, description='The first motor car in central Europe and one of the first factory-made cars in the world was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra), the Präsident automobil.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.')]
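Because each result carries an evidence string that is supposed to be quoted verbatim, one simple sanity check is to confirm that the quote really occurs in the source text. The helper below is a hypothetical sketch. It assumes that collapsing whitespace and ignoring case is an acceptable notion of "verbatim" for your data.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def evidence_in_source(evidence: str, source: str) -> bool:
    """Return True if the quoted evidence occurs in the source text.

    Empty evidence is treated as unverified.
    """
    normalized_evidence = normalize(evidence)
    return bool(normalized_evidence) and normalized_evidence in normalize(source)


source_text = "Carl Benz patented\nhis Benz Patent-Motorwagen in 1886."
ok = evidence_in_source("Carl Benz patented his Benz Patent-Motorwagen", source_text)
bad = evidence_in_source("Carl Benz invented the airplane", source_text)
```

In this guide, the same check could be applied to each item in key_developments against document.page_content. Items that fail are candidates for manual review, not necessarily errors.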

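RAG-based approach

Another simple idea is to chunk the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.

Note

It can be difficult to identify which chunks are relevant.

For example, in the car article we are using here, most of the article contains key development information. So by using RAG, we will likely be throwing out a lot of relevant information.

We suggest experimenting with your use case to determine whether this approach works.

To implement the RAG-based approach:

1. Chunk your document(s) and index them (e.g., in a vector store);
2. Prepend the extractor chain with a retrieval step that uses the vector store.

Here is a simple example that relies on the FAISS vector store.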

from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # Only extract from first document

API Reference: FAISS | Document | RunnableLambda | OpenAIEmbeddings | CharacterTextSplitter

In this case, the RAG extractor looks only at the top document.

rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch content of top doc
} | extractor

results = rag_extractor.invoke("Key developments associated with cars")

for key_development in results.key_developments:
    print(key_development)

year=2006 description='Car-sharing services in the US experienced double-digit growth in revenue and membership.' evidence='in the US, some car-sharing services have experienced double-digit growth in revenue and membership growth between 2006 and 2007.'
year=2020 description='56 million cars were manufactured worldwide, with China producing the most.' evidence='In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year. The automotive industry in China produces by far the most (20 million in 2020).'
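To see what the retrieval step does without calling an embedding service, here is a toy sketch that ranks chunks by bag-of-words cosine similarity. This is a deliberately simplified stand-in for the FAISS and OpenAIEmbeddings setup above, not an equivalent of it. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric terms in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k_chunks(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query, best first."""
    query_vector = bag_of_words(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, bag_of_words(chunk)),
        reverse=True,
    )
    return ranked[:k]


sample_chunks = [
    "Carl Benz patented the first modern car in 1886.",
    "Bananas are rich in potassium.",
    "Car sharing grew quickly after 2006.",
]
best = top_k_chunks("history of the modern car", sample_chunks, k=1)
```

With k=1, everything outside the single best-matching chunk is discarded. This is why the RAG run above returned only two developments from a long article.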

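Common issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
LLMs can make up data. If you are looking for a single fact across a large text and using a brute force approach, you may end up getting more made-up data.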
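One way to handle the duplicates caused by chunk overlap is to key each extracted item on its year and its normalized evidence text. The sketch below is a heuristic under stated assumptions: Development is a hypothetical stand-in for the guide's KeyDevelopment model, and treating "same year plus same evidence" as a duplicate is a choice you may need to adjust.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Development:
    """Hypothetical stand-in for the KeyDevelopment schema."""

    year: int
    description: str
    evidence: str


def deduplicate(items: List[Development]) -> List[Development]:
    """Keep the first item for each (year, normalized evidence) pair."""
    seen: Set[Tuple[int, str]] = set()
    unique: List[Development] = []
    for item in items:
        # Normalize whitespace and case so trivially different quotes match.
        key = (item.year, " ".join(item.evidence.split()).lower())
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


items = [
    Development(1886, "Benz patented the car", "Benz patented  his car in 1886."),
    Development(1886, "Carl Benz patents a car", "benz patented his car in 1886."),
    Development(1908, "Ford Model T", "The 1908 Ford Model T"),
]
unique_items = deduplicate(items)
```

Keying on evidence instead of description tolerates the model paraphrasing the same event differently in two overlapping chunks. Two events from different years that share a sentence of evidence stay distinct.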
