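How to load PDFs
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.
This guide covers how to load PDF documents into the LangChain Document format that we use downstream.
Text in PDFs is typically represented via text boxes. PDFs may also contain images. A PDF parser might do some combination of the following:
- Agglomerate text boxes into lines, paragraphs, and other structures via heuristics or ML inference;
- Run OCR on images to detect the text they contain;
- Classify text as belonging to paragraphs, lists, tables, or other structures;
- Structure text into table rows and columns, or key-value pairs.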
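To make the first item in the list above concrete, here is a minimal sketch of a line-grouping heuristic. It is not how any particular parser works: the `TextBox` class, the `group_into_lines` function, and the `y_tolerance` threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical stand-in for a positioned text fragment on a PDF page."""

    text: str
    x: float  # left edge of the box
    y: float  # baseline, measured from the top of the page


def group_into_lines(boxes: list[TextBox], y_tolerance: float = 2.0) -> list[str]:
    """Merge boxes whose baselines lie within y_tolerance of a line's first box."""
    lines: list[list[TextBox]] = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if lines and abs(lines[-1][0].y - box.y) <= y_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within each line, read the boxes from left to right.
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x)) for line in lines]


boxes = [
    TextBox("Parser", x=60, y=10.5),
    TextBox("Layout", x=10, y=10.0),
    TextBox("Abstract", x=10, y=30.0),
]
print(group_into_lines(boxes))
```

Real parsers typically combine several such signals (font size, spacing, reading order) or replace the heuristic with a learned model.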
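LangChain integrates with many PDF parsers. Some are simple and relatively low-level; others support OCR and image processing, or perform advanced document layout analysis. The right choice depends on your needs. Below we enumerate the possibilities.
We will demonstrate these approaches on a sample file.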
file_path = (
    "../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf"
)
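Many modern large language models (LLMs) support inference over multimodal inputs (e.g., images). In some applications, such as question answering over PDFs with complex layouts, diagrams, or scans, it may be advantageous to skip PDF parsing, convert each PDF page to an image, and pass the images directly to a model. We demonstrate an example of this in the Use of multimodal models section below.
Simple and fast text extraction
If you are looking for a simple string representation of the text embedded in a PDF, the method below is appropriate. It returns a list of Document objects, one per page, each holding the page's text as a single string in its page_content attribute. It does not parse text in images or scanned PDF pages. Under the hood it uses the pypdf Python library.
LangChain document loaders implement lazy_load and its async variant, alazy_load, which return iterators of Document objects. We will use these below.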
%pip install -qU pypdf
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(file_path)
pages = []
async for page in loader.alazy_load():
    pages.append(page)
print(f"{pages[0].metadata}\n")
print(pages[0].page_content)
{'source': '../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf', 'page': 0}
LayoutParser : A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1( �), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1Allen Institute for AI
[email protected]
2Brown University
ruochen [email protected]
3Harvard University
{melissadell,jacob carlson }@fas.harvard.edu
4University of Washington
[email protected]
5University of Waterloo
[email protected]
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model configurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
efforts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing and computer
vision, none of them are optimized for challenges in the domain of DIA.
This represents a major gap in the existing toolkit, as DIA is central to
academic research across a wide range of disciplines in the social sciences
and humanities. This paper introduces LayoutParser , an open-source
library for streamlining the usage of DL in DIA research and applica-
tions. The core LayoutParser library comes with a set of simple and
intuitive interfaces for applying and customizing DL models for layout de-
tection, character recognition, and many other document processing tasks.
To promote extensibility, LayoutParser also incorporates a community
platform for sharing both pre-trained models and full document digiti-
zation pipelines. We demonstrate that LayoutParser is helpful for both
lightweight and large-scale digitization pipelines in real-word use cases.
The library is publicly available at https://layout-parser.github.io .
Keywords: Document Image Analysis ·Deep Learning ·Layout Analysis
·Character Recognition ·Open Source library ·Toolkit.
1 Introduction
Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of
document image analysis (DIA) tasks including document image classification [ 11,arXiv:2103.15348v2 [cs.CV] 21 Jun 2021
Note that the metadata of each document stores the corresponding page number.
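The loading loop above uses a top-level async for, which works in a notebook but is not valid in a plain Python script. The sketch below illustrates the lazy-loading pattern and one way to drive the async variant with asyncio.run. `ToyLoader` and the simplified `Document` dataclass are hypothetical stand-ins written for this example; they are not LangChain's real classes.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Iterator


@dataclass
class Document:
    """Simplified stand-in for LangChain's Document (not the real class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyLoader:
    """Hypothetical loader that yields one Document per page of text."""

    def __init__(self, page_texts: list[str]):
        self.page_texts = page_texts

    def lazy_load(self) -> Iterator[Document]:
        # A generator: each page is produced only when the caller asks for it.
        for number, text in enumerate(self.page_texts):
            yield Document(page_content=text, metadata={"page": number})

    async def alazy_load(self) -> AsyncIterator[Document]:
        # Async variant with the same one-at-a-time behaviour.
        for doc in self.lazy_load():
            yield doc


async def collect(loader: ToyLoader) -> list[Document]:
    """Gather every document from the async iterator into a list."""
    return [doc async for doc in loader.alazy_load()]


toy_loader = ToyLoader(["first page", "second page"])
sync_pages = list(toy_loader.lazy_load())
async_pages = asyncio.run(collect(toy_loader))
print(sync_pages[0].metadata, async_pages[1].page_content)
```

Because pages are yielded one at a time, a caller can stop early or process each page as it arrives instead of holding the whole file in memory.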
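Vector search over PDFs
Once we have loaded a PDF into LangChain Document objects, we can index them in the usual way (e.g., for a RAG application). Below we use OpenAI embeddings, although any LangChain embeddings model will suffice.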
%pip install -qU langchain-openai
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
vector_store = InMemoryVectorStore.from_documents(pages, OpenAIEmbeddings())
docs = vector_store.similarity_search("What is LayoutParser?", k=2)
for doc in docs:
    print(f'Page {doc.metadata["page"]}: {doc.page_content[:300]}\n')
Page 13: 14 Z. Shen et al.
6 Conclusion
LayoutParser provides a comprehensive toolkit for deep learning-based document
image analysis. The off-the-shelf library is easy to install, and can be used to
build flexible and accurate pipelines for processing documents with complicated
structures. It also supports hi
Page 0: LayoutParser : A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1( �), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1Allen Institute for AI
[email protected]
2Brown University
ruochen [email protected]
3Harvard University
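For intuition about what a similarity search does, here is a toy sketch that ranks pages by cosine similarity. It is an assumption-laden illustration, not the implementation of InMemoryVectorStore: the bag-of-words `embed` function stands in for a real embedding model, and the function names and page texts are made up for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real model returns a dense vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_similarity_search(pages: dict[int, str], query: str, k: int = 2) -> list[int]:
    """Return the k page numbers whose text is most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(pages, key=lambda n: cosine(embed(pages[n]), query_vec), reverse=True)
    return ranked[:k]


pages_text = {
    0: "layoutparser is a unified toolkit for document image analysis",
    1: "we thank the anonymous reviewers",
    2: "layoutparser provides a comprehensive toolkit",
}
print(toy_similarity_search(pages_text, "what is layoutparser", k=2))
```

A learned embedding model captures meaning beyond exact word overlap, which is why the real search above can match a question to the conclusion page.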
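Layout analysis and extraction of text from images
If you need a more granular segmentation of text (e.g., into distinct paragraphs, titles, tables, or other structures) or need to extract text from images, the method below is appropriate. It returns a list of Document objects, where each object represents one structure on the page. The Document's metadata stores the page number and other information related to the object (e.g., for a table object, it might store the table's rows and columns).
Under the hood it uses the langchain-unstructured library. See the integration docs for more information about using Unstructured with LangChain.
Unstructured supports multiple parameters for PDF parsing:
- strategy (e.g., "fast" or "hi_res");
- API or local processing. You will need an API key to use the API.
The hi_res strategy provides support for document layout analysis and OCR. We demonstrate it below via the API. See the local parsing section below for considerations when running locally.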
%pip install -qU langchain-unstructured
import getpass
import os
if "UNSTRUCTURED_API_KEY" not in os.environ:
    os.environ["UNSTRUCTURED_API_KEY"] = getpass.getpass("Unstructured API Key:")
Unstructured API Key: ········
As before, we initialize a loader and load documents lazily:
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
    file_path=file_path,
    strategy="hi_res",
    partition_via_api=True,
    coordinates=True,
)
docs = []
for doc in loader.lazy_load():
    docs.append(doc)
INFO: Preparing to split document for partition.
INFO: Starting page number set to 1
INFO: Allow failed set to 0
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 16 (16 total)
INFO: Determined optimal split size of 4 pages.
INFO: Partitioning 4 files with 4 page(s) each.
INFO: Partitioning set #1 (pages 1-4).
INFO: Partitioning set #2 (pages 5-8).
INFO: Partitioning set #3 (pages 9-12).
INFO: Partitioning set #4 (pages 13-16).
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
Here we recover 171 distinct structures from the 16-page document:
print(len(docs))
171
We can use the document metadata to recover content from a single page:
first_page_docs = [doc for doc in docs if doc.metadata.get("page_number") == 1]
for doc in first_page_docs:
    print(doc.page_content)
LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
1 2 0 2 n u J 1 2 ] V C . s c [ 2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a
Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®
1 Allen Institute for AI [email protected] 2 Brown University ruochen [email protected] 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington [email protected] 5 University of Waterloo [email protected]
Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.
Keywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library · Toolkit.
1 Introduction
Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classification [11,
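The same metadata filter extends naturally to grouping every element by page. The sketch below assumes metadata dictionaries with the page_number and category keys used above; the records themselves are made-up stand-ins rather than real loader output.

```python
from collections import defaultdict

# Hypothetical element records that mimic the metadata keys used above.
sample_elements = [
    {"page_number": 1, "category": "Title", "text": "LayoutParser: A Unified Toolkit"},
    {"page_number": 1, "category": "NarrativeText", "text": "Abstract. Recent advances"},
    {"page_number": 5, "category": "Table", "text": "Dataset Base Model Large Model Notes"},
]

# Map each page number to the list of element texts found on that page.
texts_by_page: dict[int, list[str]] = defaultdict(list)
for element in sample_elements:
    texts_by_page[element["page_number"]].append(element["text"])

for page_number in sorted(texts_by_page):
    print(page_number, len(texts_by_page[page_number]))
```

Grouping once avoids rescanning the full element list each time a different page is needed.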
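Extracting tables and other structures
Each Document we load represents a structure, such as a title, paragraph, or table.
Some structures may be of special interest for indexing or question-answering tasks. These structures may be:
- Classified for easy identification;
- Parsed into a more structured representation.
Below, we identify and extract a table. The helper code that follows renders a page and outlines each detected segment: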
%pip install -qU matplotlib PyMuPDF pillow
import fitz
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from PIL import Image
def plot_pdf_with_boxes(pdf_page, segments):
    pix = pdf_page.get_pixmap()
    pil_image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    fig, ax = plt.subplots(1, figsize=(10, 10))
    ax.imshow(pil_image)
    categories = set()
    category_to_color = {
        "Title": "orchid",
        "Image": "forestgreen",
        "Table": "tomato",
    }
    for segment in segments:
        points = segment["coordinates"]["points"]
        layout_width = segment["coordinates"]["layout_width"]
        layout_height = segment["coordinates"]["layout_height"]
        # Scale layout coordinates to the pixel size of the rendered page.
        scaled_points = [
            (x * pix.width / layout_width, y * pix.height / layout_height)
            for x, y in points
        ]
        box_color = category_to_color.get(segment["category"], "deepskyblue")
        categories.add(segment["category"])
        rect = patches.Polygon(
            scaled_points, linewidth=1, edgecolor=box_color, facecolor="none"
        )
        ax.add_patch(rect)
    # Make legend
    legend_handles = [patches.Patch(color="deepskyblue", label="Text")]
    for category in ["Title", "Image", "Table"]:
        if category in categories:
            legend_handles.append(
                patches.Patch(color=category_to_color[category], label=category)
            )
    ax.axis("off")
    ax.legend(handles=legend_handles, loc="upper right")
    plt.tight_layout()
    plt.show()
def render_page(doc_list: list, page_number: int, print_text=True) -> None:
    # page_number is 1-based; PyMuPDF pages are 0-based.
    pdf_page = fitz.open(file_path).load_page(page_number - 1)
    page_docs = [
        doc for doc in doc_list if doc.metadata.get("page_number") == page_number
    ]
    segments = [doc.metadata for doc in page_docs]
    plot_pdf_with_boxes(pdf_page, segments)
    if print_text:
        for doc in page_docs:
            print(f"{doc.page_content}\n")
render_page(docs, 5)
LayoutParser: A Unified Toolkit for DL-Based DIA
5
Table 1: Current layout detection models in the LayoutParser model zoo
Dataset Base Model1 Large Model Notes PubLayNet [38] PRImA [3] Newspaper [17] TableBank [18] HJDataset [31] F / M M F F F / M M - - F - Layouts of modern scientific documents Layouts of scanned modern magazines and scientific reports Layouts of scanned US newspapers from the 20th century Table region on modern scientific and business document Layouts of history Japanese documents
1 For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (F) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months.
layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility functions for the visualization and storage of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training. We now provide detailed descriptions for each component.
3.1 Layout Detection Models
In LayoutParser, a layout model takes a document image as an input and generates a list of rectangular boxes for the target content regions. Different from traditional methods, it relies on deep convolutional neural networks rather than manually curated rules to identify content regions. It is formulated as an object detection problem and state-of-the-art models like Faster R-CNN [28] and Mask R-CNN [12] are used. This yields prediction results of high accuracy and makes it possible to build a concise, generalized interface for layout detection. LayoutParser, built upon Detectron2 [35], provides a minimal API that can perform layout detection with only four lines of code in Python:
1 import layoutparser as lp 2 image = cv2 . imread ( " image_file " ) # load images 3 model = lp . De t e c tro n2 Lay outM odel ( " lp :// PubLayNet / f as t er _ r c nn _ R _ 50 _ F P N_ 3 x / config " ) 4 5 layout = model . detect ( image )
LayoutParser provides a wealth of pre-trained model weights using various datasets covering different languages, time periods, and document types. Due to domain shift [7], the prediction performance can notably drop when models are ap- plied to target samples that are significantly different from the training dataset. As document structures and layouts vary greatly in different domains, it is important to select models trained on a dataset similar to the test samples. A semantic syntax is used for initializing the model weights in LayoutParser, using both the dataset name and model name lp://<dataset-name>/<model-architecture-name>.
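Note that although the table text is collapsed into a single string in the document's content, the metadata contains a representation of its rows and columns: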
from IPython.display import HTML, display
segments = [
    doc.metadata
    for doc in docs
    if doc.metadata.get("page_number") == 5 and doc.metadata.get("category") == "Table"
]
display(HTML(segments[0]["text_as_html"]))
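Table 1. LUllclll 1ayoul actCCLloll 1110AdCs 111 L1C LayoOulralsel 1110U4cl 200 | | |
---|---|---|
Dataset | Base Model | Notes |
PubLayNet [38] | F/M | Layouts of modern scientific documents |
PRImA | M | Layouts of scanned modern magazines and scientific reports |
Newspaper | F | Layouts of scanned US newspapers from the 20th century |
TableBank [18] | F | Table region on modern scientific and business document |
HJDataset | F/M | Layouts of history Japanese documents |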
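If the rows and columns are needed as Python data rather than rendered HTML, the markup can be parsed with the standard library. The sketch below is one possible approach under stated assumptions: `TableRows` is a hypothetical helper, and `sample_html` is made-up markup, since the exact shape of the real text_as_html string is not shown here.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> in an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")  # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Assumed sample markup; the real text_as_html string may be shaped differently.
sample_html = (
    "<table><tr><th>Dataset</th><th>Notes</th></tr>"
    "<tr><td>PRImA</td><td>Layouts of scanned modern magazines</td></tr></table>"
)
table_parser = TableRows()
table_parser.feed(sample_html)
print(table_parser.rows)
```

With the real data, the input would be segments[0]["text_as_html"] from the cell above.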
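Extracting text from specific sections
Structures may have parent-child relationships. For example, a paragraph might belong to a section with a title. If a section is of particular interest (e.g., for indexing), we can isolate the corresponding Document objects.
Below, we extract all text associated with the document's "Conclusion" section: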
render_page(docs, 14, print_text=False)
conclusion_docs = []
parent_id = -1
for doc in docs:
    if doc.metadata["category"] == "Title" and "Conclusion" in doc.page_content:
        parent_id = doc.metadata["element_id"]
    if doc.metadata.get("parent_id") == parent_id:
        conclusion_docs.append(doc)
for doc in conclusion_docs:
    print(doc.page_content)
LayoutParser provides a comprehensive toolkit for deep learning-based document image analysis. The off-the-shelf library is easy to install, and can be used to build flexible and accurate pipelines for processing documents with complicated structures. It also supports high-level customization and enables easy labeling and training of DL models on unique document image datasets. The LayoutParser community platform facilitates sharing DL models and DIA pipelines, inviting discussion and promoting code reproducibility and reusability. The LayoutParser team is committed to keeping the library updated continuously and bringing the most recent advances in DL-based DIA, such as multi-modal document modeling [37, 36, 9] (an upcoming priority), to a diverse audience of end-users.
Acknowledgements We thank the anonymous reviewers for their comments and suggestions. This project is supported in part by NSF Grant OIA-2033558 and funding from the Harvard Data Science Initiative and Harvard Catalyst. Zejiang Shen thanks Doug Downey for suggestions.
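The same parent-child idea can be generalized to collect the text under every title at once. The sketch below assumes the element_id, parent_id, and category metadata keys used above; the records, ids, and texts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical metadata records; the ids and texts are made up for illustration.
section_elements = [
    {"element_id": "t1", "parent_id": None, "category": "Title", "text": "6 Conclusion"},
    {"element_id": "p1", "parent_id": "t1", "category": "NarrativeText", "text": "LayoutParser provides a toolkit."},
    {"element_id": "t2", "parent_id": None, "category": "Title", "text": "Acknowledgements"},
    {"element_id": "p2", "parent_id": "t2", "category": "NarrativeText", "text": "We thank the reviewers."},
]

# Look up each title's text by its element id.
titles = {e["element_id"]: e["text"] for e in section_elements if e["category"] == "Title"}

# Attach every child element to the title it points to.
sections: dict[str, list[str]] = defaultdict(list)
for e in section_elements:
    if e["parent_id"] in titles:
        sections[titles[e["parent_id"]]].append(e["text"])

print(dict(sections))
```

Elements whose parent is not a title (or that have no parent) are simply left out of the mapping in this sketch.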
Extracting text from images
OCR is run on images, which makes it possible to extract the text they contain:
render_page(docs, 11)