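How to split HTML

Splitting HTML documents into manageable chunks is essential for many text processing tasks, such as natural language processing and search indexing. This guide covers three text splitters provided by LangChain that you can use to split HTML content: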

HTMLHeaderTextSplitter
HTMLSectionSplitter
HTMLSemanticPreservingSplitter

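Each of these splitters has its own features and use cases. This guide explains how they differ, why you might choose one over the others, and how to use each effectively.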

%pip install -qU langchain-text-splitters

Overview of the splitters

HTMLHeaderTextSplitter

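Info

Useful when you want to preserve the hierarchical structure of a document based on its headers.

Description: Splits HTML text based on header tags (e.g., `<h1>`, `<h2>`, etc.) and adds metadata for each header relevant to any given chunk.

Capabilities:

Splits text at the HTML element level.
Preserves context-rich information encoded in document structures.
Can return chunks element by element or combine elements with the same metadata.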

HTMLSectionSplitter

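Info

Useful when you want to split HTML documents into larger sections, such as `section`, `div`, or custom-defined sections.

Description: Similar to HTMLHeaderTextSplitter, but focused on splitting HTML into sections based on specified tags.

Capabilities:

Uses XSLT transformations to detect and split sections.
Internally uses `RecursiveCharacterTextSplitter` for large sections.
Considers font sizes to determine sections.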

HTMLSemanticPreservingSplitter

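Info

Ideal when you need to ensure that structured elements are not split across chunks, so that their contextual relevance is preserved.

Description: Splits HTML content into manageable chunks while preserving the semantic structure of important elements such as tables, lists, and other HTML components.

Capabilities:

Preserves tables, lists, and other specified HTML elements.
Allows custom handlers to be defined for specific HTML tags.
Ensures that the semantic meaning of the document is maintained.
Includes built-in normalization and stopword removal.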

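Choosing the right splitter

Use `HTMLHeaderTextSplitter` when: you need to split an HTML document based on its header hierarchy and keep metadata about the headers.
Use `HTMLSectionSplitter` when: you need to split the document into larger, more general sections, possibly based on custom tags or font sizes.
Use `HTMLSemanticPreservingSplitter` when: you need to split the document into chunks while preserving semantic elements such as tables and lists, so that they are not split and their context is maintained.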

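Feature	HTMLHeaderTextSplitter	HTMLSectionSplitter	HTMLSemanticPreservingSplitter
Splits based on headers	Yes	Yes	Yes
Preserves semantic elements (tables, lists)	No	No	Yes
Adds metadata for headers	Yes	Yes	Yes
Custom handlers for HTML tags	No	No	Yes
Preserves media (images, videos)	No	No	Yes
Considers font sizes	No	Yes	No
Uses XSLT transformations	No	Yes	No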

Example HTML document

Let's use the following HTML document as an example:

html_string = """
<!DOCTYPE html>
  <html lang='en'>
  <head>
    <meta charset='UTF-8'>
    <meta name='viewport' content='width=device-width, initial-scale=1.0'>
    <title>Fancy Example HTML Page</title>
  </head>
  <body>
    <h1>Main Title</h1>
    <p>This is an introductory paragraph with some basic content.</p>
    
    <h2>Section 1: Introduction</h2>
    <p>This section introduces the topic. Below is a list:</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
      <li>Third item with <strong>bold text</strong> and <a href='#'>a link</a></li>
    </ul>
    
    <h3>Subsection 1.1: Details</h3>
    <p>This subsection provides additional details. Here's a table:</p>
    <table border='1'>
      <thead>
        <tr>
          <th>Header 1</th>
          <th>Header 2</th>
          <th>Header 3</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Row 1, Cell 1</td>
          <td>Row 1, Cell 2</td>
          <td>Row 1, Cell 3</td>
        </tr>
        <tr>
          <td>Row 2, Cell 1</td>
          <td>Row 2, Cell 2</td>
          <td>Row 2, Cell 3</td>
        </tr>
      </tbody>
    </table>
    
    <h2>Section 2: Media Content</h2>
    <p>This section contains an image and a video:</p>
      <img src='example_image_link.mp4' alt='Example Image'>
      <video controls width='250' src='example_video_link.mp4' type='video/mp4'>
      Your browser does not support the video tag.
    </video>

    <h2>Section 3: Code Example</h2>
    <p>This section contains a code block:</p>
    <pre><code data-lang="html">
    &lt;div&gt;
      &lt;p&gt;This is a paragraph inside a div.&lt;/p&gt;
    &lt;/div&gt;
    </code></pre>

    <h2>Conclusion</h2>
    <p>This is the conclusion of the document.</p>
  </body>
  </html>
"""

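Using HTMLHeaderTextSplitter

HTMLHeaderTextSplitter is a "structure-aware" text splitter that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.

It is analogous to the MarkdownHeaderTextSplitter for Markdown files.

To specify which headers to split on, pass `headers_to_split_on` when instantiating `HTMLHeaderTextSplitter`, as shown below.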

from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

API Reference: HTMLHeaderTextSplitter

[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a list:  \nFirst item Second item Third item with bold text and a link'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction', 'Header 3': 'Subsection 1.1: Details'}, page_content="This subsection provides additional details. Here's a table:"),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block:'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]

To return each element together with its associated headers, specify `return_each_element=True` when instantiating `HTMLHeaderTextSplitter`:

html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on,
    return_each_element=True,
)
html_header_splits_elements = html_splitter.split_text(html_string)

Compare this with the result above, where elements are aggregated by their headers:

for element in html_header_splits[:2]:
    print(element)

page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}
page_content='This section introduces the topic. Below is a list:  
First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}

Now each element is returned as a distinct `Document`:

for element in html_header_splits_elements[:3]:
    print(element)

page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}
page_content='This section introduces the topic. Below is a list:' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
page_content='First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
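To make the idea of "relevant" header metadata more concrete, here is a minimal sketch built only on Python's standard `html.parser`. It is not how `HTMLHeaderTextSplitter` is implemented; the `HeaderTracker` class and the sample markup are hypothetical, and the sketch assumes header tags contain plain text with no nested inline elements. It shows one plausible rule: a new header replaces itself and every lower-level header, and each text element receives a copy of the headers currently in scope.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Toy splitter (hypothetical): emits (metadata, text) pairs per text element."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata key, e.g. {"h1": "Header 1"}
        self.header_names = dict(headers_to_split_on)
        # Header tags ordered from highest level to lowest level
        self.levels = [tag for tag, _ in headers_to_split_on]
        self.active = {}            # headers currently in scope
        self.current_header = None  # header tag being read, if any
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.current_header = tag

    def handle_endtag(self, tag):
        if tag == self.current_header:
            self.current_header = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        if self.current_header is not None:
            # A new header drops itself and all lower-level headers from scope.
            idx = self.levels.index(self.current_header)
            for lower in self.levels[idx:]:
                self.active.pop(self.header_names[lower], None)
            self.active[self.header_names[self.current_header]] = text
        else:
            # Ordinary text: attach a copy of the headers in scope.
            self.chunks.append((dict(self.active), text))


sample = """
<h1>Main Title</h1>
<p>Intro paragraph.</p>
<h2>Section 1</h2>
<p>Section text.</p>
<h3>Subsection 1.1</h3>
<p>Details.</p>
<h2>Section 2</h2>
<p>More text.</p>
"""

tracker = HeaderTracker([("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")])
tracker.feed(sample)
for metadata, text in tracker.chunks:
    print(metadata, text)
```

Note how the chunk under "Section 2" no longer carries "Header 3": starting a new `h2` clears the lower-level header that belonged to the previous section.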

How to split from a URL or HTML file:

To read directly from a URL, pass the URL string to the `split_text_from_url` method.

Similarly, a local HTML file can be passed to the `split_text_from_file` method.

url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# for local file use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)

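How to constrain chunk sizes:

`HTMLHeaderTextSplitter`, which splits based on HTML headers, can be composed with another splitter that constrains splits based on character length, such as `RecursiveCharacterTextSplitter`.

This can be done using the `.split_documents` method of the second splitter: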

from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]

API Reference: RecursiveCharacterTextSplitter

[Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berry’s paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth'),
 Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.'),
 Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='This account of Gödel’s discovery was told to Hao Wang very much after the fact; but in Gödel’s contemporary correspondence with Bernays and Zermelo, essentially the same description of his path to the theorems is given. (See Gödel 2003a and Gödel 2003b respectively.) From those accounts we see that the undefinability of truth in arithmetic, a result credited to Tarski, was likely obtained in some form by Gödel by 1931. But he neither publicized nor published the result; the biases logicians'),
 Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='result; the biases logicians had expressed at the time concerning the notion of truth, biases which came vehemently to the fore when Tarski announced his results on the undefinability of truth in formal systems 1935, may have served as a deterrent to Gödel’s publication of that theorem.'),
 Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'}, page_content='We now describe the proof of the two theorems, formulating Gödel’s results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödel’s notation.')]
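The output above shows that every sub-chunk keeps the header metadata of the chunk it came from. The sketch below illustrates that behaviour with plain tuples and a deliberately simple fixed-window splitter. It is a stand-in under stated assumptions, not the separator-based algorithm of `RecursiveCharacterTextSplitter`; the function names and the sample data are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that overlap (simplified stand-in)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return pieces


def split_documents(docs, chunk_size, chunk_overlap):
    """docs is a list of (metadata, text) pairs; each piece inherits a copy of the metadata."""
    out = []
    for metadata, text in docs:
        for piece in split_with_overlap(text, chunk_size, chunk_overlap):
            out.append((dict(metadata), piece))
    return out


# Hypothetical header-level chunks: one long, one short.
docs = [
    ({"Header 1": "Title", "Header 2": "Section A"}, "a" * 120),
    ({"Header 1": "Title", "Header 2": "Section B"}, "b" * 40),
]

splits = split_documents(docs, chunk_size=50, chunk_overlap=10)
for metadata, piece in splits:
    print(metadata["Header 2"], len(piece))
```

The long chunk is cut into several pieces that all report "Section A", while the short chunk passes through unchanged.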

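Limitations

There can be considerable structural variation from one HTML document to another, and while `HTMLHeaderTextSplitter` attempts to attach all "relevant" headers to any given chunk, it can sometimes miss certain headers. For example, the algorithm assumes an informational hierarchy in which headers are always at nodes "above" the associated text, that is, prior siblings, ancestors, and combinations thereof. In the following news article (as of the writing of this document), the top-level headline is tagged "h1" but sits in a different subtree from the text elements we would expect it to be "above". As a result, the "h1" element and its associated text do not show up in the chunk metadata (although, where applicable, we do see "h2" and its associated text).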

url = "https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
print(html_header_splits[1].page_content[:500])

No two El Niño winters are the same, but many have temperature and precipitation trends in common.  
Average conditions during an El Niño winter across the continental US.  
One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA.  
Because the jet stream is essentially a river of air that storms flow through, they c
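Since a header can silently go missing from the metadata, it may be worth checking the result before indexing it. Below is a minimal sketch of such a check. The helper name `chunks_missing_header` is hypothetical, and the sample uses `SimpleNamespace` stand-ins that only mimic the `.metadata` and `.page_content` attributes visible in the `Document` outputs in this guide.

```python
from types import SimpleNamespace


def chunks_missing_header(docs, header_key):
    """Return the indices of chunks whose metadata has no entry for header_key."""
    return [i for i, doc in enumerate(docs) if header_key not in doc.metadata]


# Stand-in objects (assumption): same attribute names as the Document outputs above.
docs = [
    SimpleNamespace(metadata={}, page_content="Text with no header attached."),
    SimpleNamespace(metadata={"Header 2": "Some section"}, page_content="Section text."),
    SimpleNamespace(metadata={"Header 1": "Title", "Header 2": "Other"}, page_content="More text."),
]

print(chunks_missing_header(docs, "Header 1"))  # [0, 1]
```

If most chunks turn out to lack a header you expected, that is a hint that the page structure does not match the hierarchy the splitter assumes.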

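Using HTMLSectionSplitter

Similar in concept to HTMLHeaderTextSplitter, `HTMLSectionSplitter` is a "structure-aware" text splitter that splits text at the element level and adds metadata for each header "relevant" to any given chunk. It lets you split HTML by sections.

It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures.

Use `xslt_path` to provide an absolute path to an XSLT file that transforms the HTML so that sections can be detected based on the provided tags. The default is the `converting_to_header.xslt` file in the `data_connection/document_transformers` directory. The transformation converts the HTML into a format or layout in which sections are easier to detect. For example, a `span` can be converted into a header tag based on its font size so that it is detected as a section.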
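As an illustration of the font-size idea only, the sketch below promotes `span` elements with a large inline `font-size` to header tags using a regular expression. This is not the XSLT that ships with the library: the function name, the pixel thresholds, and the regex approach are all assumptions chosen for brevity, and a regex like this will not handle nested spans or unusual attribute layouts.

```python
import re

# Matches <span style="... font-size: NNpx ...">text</span> (simplified, assumption).
SPAN_RE = re.compile(
    r'<span\s+style="[^"]*font-size:\s*(\d+)px[^"]*"\s*>(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)


def promote_large_spans(html, thresholds=((24, "h1"), (18, "h2"))):
    """Rewrite large-font spans as header tags; thresholds are hypothetical."""

    def replace(match):
        size = int(match.group(1))
        for min_px, tag in thresholds:
            if size >= min_px:
                return f"<{tag}>{match.group(2)}</{tag}>"
        return match.group(0)  # small fonts are left untouched

    return SPAN_RE.sub(replace, html)


html = (
    '<span style="font-size: 26px">Title</span>'
    "<p>Body text.</p>"
    '<span style="font-size: 12px">small print</span>'
)
print(promote_large_spans(html))
```

After a transformation of this kind, a header-based splitter can treat the promoted `span` as the start of a section.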

How to split HTML strings:

from langchain_text_splitters import HTMLSectionSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

API Reference: HTMLSectionSplitter

[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title \n This is an introductory paragraph with some basic content.'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content="Section 1: Introduction \n This section introduces the topic. Below is a list: \n \n First item \n Second item \n Third item with  bold text  and  a link \n \n \n Subsection 1.1: Details \n This subsection provides additional details. Here's a table: \n \n \n \n Header 1 \n Header 2 \n Header 3 \n \n \n \n \n Row 1, Cell 1 \n Row 1, Cell 2 \n Row 1, Cell 3 \n \n \n Row 2, Cell 1 \n Row 2, Cell 2 \n Row 2, Cell 3"),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content \n This section contains an image and a video: \n \n \n      Your browser does not support the video tag.'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example \n This section contains a code block: \n \n    <div>\n      <p>This is a paragraph inside a div.</p>\n    </div>'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion \n This is the conclusion of the document.')]

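How to constrain chunk sizes:

`HTMLSectionSplitter` can be used with other text splitters as part of a chunking pipeline. Internally, it uses `RecursiveCharacterTextSplitter` when a section is larger than the chunk size. It also considers the font size of the text, using a font-size threshold to decide whether a piece of text counts as a section.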

from langchain_text_splitters import RecursiveCharacterTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLSectionSplitter(headers_to_split_on)

html_header_splits = html_splitter.split_text(html_string)

chunk_size = 50
chunk_overlap = 5
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits

API Reference: RecursiveCharacterTextSplitter

[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title'),
 Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some'),
 Document(metadata={'Header 1': 'Main Title'}, page_content='some basic content.'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='Section 1: Introduction'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='is a list:'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='First item \n Second item'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='Third item with  bold text  and  a link'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Subsection 1.1: Details'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='This subsection provides additional details.'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content="Here's a table:"),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Header 1 \n Header 2 \n Header 3'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 1, Cell 1 \n Row 1, Cell 2'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 1, Cell 3 \n \n \n Row 2, Cell 1'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 2, Cell 2 \n Row 2, Cell 3'),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content'),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Your browser does not support the video'),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='tag.'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: \n \n    <div>'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='<p>This is a paragraph inside a div.</p>'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='</div>'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]

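Using HTMLSemanticPreservingSplitter

`HTMLSemanticPreservingSplitter` is designed to split HTML content into manageable chunks while preserving the semantic structure of important elements such as tables, lists, and other HTML components. This ensures that such elements are not split across chunks, which would otherwise cause a loss of contextual relevance (for example, table headers or list headers).

At its core, this splitter is designed to create contextually relevant chunks. General recursive splitting with `HTMLHeaderTextSplitter` can cause tables, lists, and other structured elements to be split in the middle, losing significant context and producing poor chunks.

`HTMLSemanticPreservingSplitter` is well suited to HTML content that includes structured elements such as tables and lists, especially when it is critical to keep those elements intact. Its ability to define custom handlers for specific HTML tags also makes it a versatile tool for processing complex HTML documents.

Important: `max_chunk_size` is not a definite maximum size of a chunk. The maximum size is calculated while the preserved content is not part of the chunk, to ensure that it is not split. When the preserved data is added back into the chunk, the chunk size may exceed `max_chunk_size`. This is crucial for maintaining the structure of the original document.

Info

Notes:

We have defined a custom handler to reformat the contents of code blocks.
We have defined a deny list for specific HTML elements, so that they and their contents are decomposed during preprocessing.
We have intentionally set a small chunk size to demonstrate that preserved elements are not split.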

# BeautifulSoup is required to use the custom handlers
from bs4 import Tag
from langchain_text_splitters import HTMLSemanticPreservingSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]


def code_handler(element: Tag) -> str:
    data_lang = element.get("data-lang")
    code_format = f"<code:{data_lang}>{element.get_text()}</code>"

    return code_format


splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    separators=["\n\n", "\n", ". ", "! ", "? "],
    max_chunk_size=50,
    preserve_images=True,
    preserve_videos=True,
    elements_to_preserve=["table", "ul", "ol", "code"],
    denylist_tags=["script", "style", "head"],
    custom_handlers={"code": code_handler},
)

documents = splitter.split_text(html_string)
documents

API Reference: HTMLSemanticPreservingSplitter

[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3"),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video: ![image:example_image_link.mp4](example_image_link.mp4) ![video:example_video_link.mp4](example_video_link.mp4)'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
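One way to picture why a chunk can exceed `max_chunk_size` is a placeholder scheme: preserved elements are swapped for short tokens, the text is split against the size limit, and the original elements are restored afterwards. The sketch below follows that idea with the standard library only. It is an illustration under assumptions, not the library's implementation; the function name, the `@@N@@` placeholder format, and the word-packing strategy are hypothetical.

```python
import re


def split_preserving(text, preserved_pattern, max_chunk_size):
    """Pack words into chunks while preserved spans are masked by placeholders."""
    preserved = []

    def stash(match):
        # Remember the preserved span and leave a short token in its place.
        preserved.append(match.group(0))
        return f"@@{len(preserved) - 1}@@"

    masked = re.sub(preserved_pattern, stash, text, flags=re.DOTALL)

    # Greedy word packing, measured on the masked text.
    chunks, current = [], ""
    for word in masked.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)

    # Restore the preserved spans; a chunk may now be longer than the limit.
    def restore(chunk):
        return re.sub(r"@@(\d+)@@", lambda m: preserved[int(m.group(1))], chunk)

    return [restore(chunk) for chunk in chunks]


text = (
    "Intro sentence before the table. "
    "<table>Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50</table> "
    "Closing sentence after the table."
)

chunks = split_preserving(text, r"<table>.*?</table>", max_chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The chunk containing the table ends up longer than 40 characters because the limit was applied while the table was represented by a short placeholder, which matches the behaviour described in the note on `max_chunk_size`.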

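Preserving tables and lists

In this example, we demonstrate how `HTMLSemanticPreservingSplitter` preserves a table and a large list within an HTML document. The chunk size is set to 50 characters to show how the splitter keeps these elements from being split, even when they exceed the defined maximum chunk size.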

from langchain_text_splitters import HTMLSemanticPreservingSplitter

html_string = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Section 1</h1>
            <p>This section contains an important table and list that should not be split across chunks.</p>
            <table>
                <tr>
                    <th>Item</th>
                    <th>Quantity</th>
                    <th>Price</th>
                </tr>
                <tr>
                    <td>Apples</td>
                    <td>10</td>
                    <td>$1.00</td>
                </tr>
                <tr>
                    <td>Oranges</td>
                    <td>5</td>
                    <td>$0.50</td>
                </tr>
                <tr>
                    <td>Bananas</td>
                    <td>50</td>
                    <td>$1.50</td>
                </tr>
            </table>
            <h2>Subsection 1.1</h2>
            <p>Additional text in subsection 1.1 that is separated from the table and list.</p>
            <p>Here is a detailed list:</p>
            <ul>
                <li>Item 1: Description of item 1, which is quite detailed and important.</li>
                <li>Item 2: Description of item 2, which also contains significant information.</li>
                <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
            </ul>
        </div>
    </body>
</html>
"""

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    max_chunk_size=50,
    elements_to_preserve=["table", "ul"],
)

documents = splitter.split_text(html_string)
print(documents)

API Reference: HTMLSemanticPreservingSplitter

[Document(metadata={'Header 1': 'Section 1'}, page_content='This section contains an important table and list'), Document(metadata={'Header 1': 'Section 1'}, page_content='that should not be split across chunks.'), Document(metadata={'Header 1': 'Section 1'}, page_content='Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='Additional text in subsection 1.1 that is'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='separated from the table and list. Here is a'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content="detailed list: Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]

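Explanation

In this example, `HTMLSemanticPreservingSplitter` ensures that the entire table and the unordered list (`<ul>`) are kept intact within their respective chunks, even though the chunk size is set to 50 characters.

This is especially important when dealing with data tables or lists, where splitting the content could lead to loss of context or confusion. The resulting `Document` objects retain the full structure of these elements, so the contextual relevance of the information is maintained.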

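Using a custom handler

`HTMLSemanticPreservingSplitter` lets you define custom handlers for specific HTML elements. Some platforms have custom HTML tags that `BeautifulSoup` does not parse natively; in such cases, you can use custom handlers to add the formatting logic.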

This is particularly useful for elements that need special processing, such as `<iframe>` tags or specific 'data-' elements. In this example, we create a custom handler for `iframe` tags that converts them into Markdown-like links.

def custom_iframe_extractor(iframe_tag):
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"


splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    max_chunk_size=50,
    separators=["\n\n", "\n", ". "],
    elements_to_preserve=["table", "ul", "ol"],
    custom_handlers={"iframe": custom_iframe_extractor},
)

html_string = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Section with Iframe</h1>
            <iframe src="https://example.com/embed"></iframe>
            Some text after the iframe.
            <ul>
                <li>Item 1: Description of item 1, which is quite detailed and important.</li>
                <li>Item 2: Description of item 2, which also contains significant information.</li>
                <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
            </ul>
        </div>
    </body>
</html>
"""

documents = splitter.split_text(html_string)
print(documents)

[Document(metadata={'Header 1': 'Section with Iframe'}, page_content='[iframe:https://example.com/embed](https://example.com/embed) Some text after the iframe'), Document(metadata={'Header 1': 'Section with Iframe'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]

Explanation

In this example, we defined a custom handler for `iframe` tags that converts them into Markdown-like links. When the splitter processes the HTML content, it uses this custom handler to transform the `iframe` tags while preserving other elements such as tables and lists. The resulting `Document` objects show how the iframe is handled according to the custom logic you provided.

Important: when preserving items such as links, be careful not to include `.` in your separators, and do not leave the separators blank. `RecursiveCharacterTextSplitter` splits on full stops, which would cut links in half. Provide a separator list that uses `. ` (a period followed by a space) instead.

Using a custom handler to analyze an image with an LLM

With custom handlers, we can also override the default processing for any element. A good example is inserting a semantic analysis of an image in the document directly into the chunking flow.

Since our function is called when the tag is discovered, we can override the `<img>` tag and turn off `preserve_images` to insert any content we want embedded in our chunks.

"""This example assumes you have helper methods `load_image_from_url` and an LLM agent `llm` that can process image data."""

from langchain.agents import AgentExecutor

# This example needs to be replaced with your own agent
llm = AgentExecutor(...)


# This method is a placeholder for loading image data from a URL and is not implemented here
def load_image_from_url(image_url: str) -> bytes:
    # Assuming this method fetches the image data from the URL
    return b"image_data"


html_string = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Section with Image and Link</h1>
            <img src="https://example.com/image.jpg" alt="An example image" />
            Some text after the image.
            <ul>
                <li>Item 1: Description of item 1, which is quite detailed and important.</li>
                <li>Item 2: Description of item 2, which also contains significant information.</li>
                <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
            </ul>
        </div>
    </body>
</html>
"""


def custom_image_handler(img_tag) -> str:
    img_src = img_tag.get("src", "")
    img_alt = img_tag.get("alt", "No alt text provided")

    image_data = load_image_from_url(img_src)
    semantic_meaning = llm.invoke(image_data)

    markdown_text = f"[Image Alt Text: {img_alt} | Image Source: {img_src} | Image Semantic Meaning: {semantic_meaning}]"

    return markdown_text


splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    max_chunk_size=50,
    separators=["\n\n", "\n", ". "],
    elements_to_preserve=["ul"],
    preserve_images=False,
    custom_handlers={"img": custom_image_handler},
)

documents = splitter.split_text(html_string)
print(documents)

API Reference: AgentExecutor

[Document(metadata={'Header 1': 'Section with Image and Link'}, page_content='[Image Alt Text: An example image | Image Source: https://example.com/image.jpg | Image Semantic Meaning: semantic-meaning] Some text after the image'), Document(metadata={'Header 1': 'Section with Image and Link'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]

Explanation:

With a custom handler written to extract specific fields from the `<img>` element in the HTML, we can further process the data with our agent and insert the results directly into our chunk. It is important to make sure `preserve_images` is set to `False`; otherwise, the default processing of `<img>` fields takes place.
target="_blank" rel="noopener noreferrer" class="footer__link-item">主页<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></li><li class="footer__item"><a href="https://blog.langchain.ac.cn" target="_blank" rel="noopener noreferrer" class="footer__link-item">博客<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></li><li class="footer__item"><a href="https://www.youtube.com/@LangChain" target="_blank" rel="noopener noreferrer" class="footer__link-item">YouTube<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></li></ul></div></div><div class="footer__bottom text--center"><div class="footer__copyright">版权所有 © 2025 LangChain, Inc.</div></div></div></footer></div> </body></html>
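To make the shape of such a handler concrete, here is a minimal sketch that reproduces the bracketed image text seen in the output above. It rests on a few assumptions: `describe_image` is a hypothetical stand-in for the agent or LLM call and simply returns a fixed label, and a plain dict stands in for the parsed `<img>` element so the sketch runs without extra dependencies. The handler only calls `.get(key, default)`, which a BeautifulSoup `Tag` also supports.

```python
from typing import Any


def describe_image(src: str, alt: str) -> str:
    """Hypothetical stand-in for the agent/LLM call; returns a fixed label."""
    return "semantic-meaning"


def custom_image_handler(img_tag: Any) -> str:
    """Build the replacement text for a single <img> element.

    Only `.get(key, default)` is used, so any mapping-like object works here.
    """
    src = img_tag.get("src", "")
    alt = img_tag.get("alt", "No alt text provided")
    meaning = describe_image(src, alt)
    return (
        f"[Image Alt Text: {alt} | Image Source: {src} | "
        f"Image Semantic Meaning: {meaning}]"
    )


# A dict stands in for the parsed <img> element (an assumption for this sketch).
fake_img = {"src": "https://example.com/image.jpg", "alt": "An example image"}
result = custom_image_handler(fake_img)
print(result)
```

In real use, a handler like this would presumably be registered with `HTMLSemanticPreservingSplitter` through its custom handlers option for the `img` tag, with `preserve_images` set to `False` as noted above.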
