
VectorizeRetriever

This notebook shows how to use the LangChain Vectorize retriever.

Vectorize helps you build AI apps faster and with less hassle. It automates data extraction, finds the best vectorization strategy using RAG evaluation, and lets you quickly deploy real-time RAG pipelines for your unstructured data. Your vector search indexes stay up-to-date, and it integrates with your existing vector database, so you maintain full control of your data. Vectorize handles the heavy lifting, freeing you to focus on building robust AI solutions without getting bogged down by data management.

Setup

In the following steps, we'll set up the Vectorize environment and create a RAG pipeline.

Create a Vectorize account and get your access token

1. Sign up for a free Vectorize account here
2. Generate an access token in the Access Tokens section
3. Get your organization ID: from the browser URL, extract the UUID after /organization/
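As an illustration of step 3 (the URL below is made up, not a real organization), the UUID can be pulled out of the address-bar URL with a small regular expression:

```python
import re

# Illustrative dashboard URL; use the one from your own browser's address bar
url = "https://platform.vectorize.io/organization/12345678-abcd-1234-abcd-1234567890ab/pipelines"

# A UUID is 36 characters of hex digits and dashes, right after /organization/
match = re.search(r"/organization/([0-9a-f-]{36})", url)
org_id = match.group(1) if match else None
print(org_id)  # 12345678-abcd-1234-abcd-1234567890ab
```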

Configure the token and organization ID

import getpass

VECTORIZE_ORG_ID = getpass.getpass("Enter Vectorize organization ID: ")
VECTORIZE_API_TOKEN = getpass.getpass("Enter Vectorize API Token: ")

Installation

This retriever lives in the langchain-vectorize package.

!pip install -qU langchain-vectorize

Download a PDF file

!wget "https://raw.githubusercontent.com/vectorize-io/vectorize-clients/refs/tags/python-0.1.3/tests/python/tests/research.pdf"

Initialize the Vectorize client

import vectorize_client as v

api = v.ApiClient(v.Configuration(access_token=VECTORIZE_API_TOKEN))

Create a file upload source connector

import json
import os

import urllib3

connectors_api = v.ConnectorsApi(api)
response = connectors_api.create_source_connector(
    VECTORIZE_ORG_ID, [{"type": "FILE_UPLOAD", "name": "From API"}]
)
source_connector_id = response.connectors[0].id

Upload the PDF file

file_path = "research.pdf"

http = urllib3.PoolManager()
uploads_api = v.UploadsApi(api)
metadata = {"created-from-api": True}

upload_response = uploads_api.start_file_upload_to_connector(
    VECTORIZE_ORG_ID,
    source_connector_id,
    v.StartFileUploadToConnectorRequest(
        name=file_path.split("/")[-1],
        content_type="application/pdf",
        # add additional metadata that will be stored along with each chunk in the vector database
        metadata=json.dumps(metadata),
    ),
)

with open(file_path, "rb") as f:
    response = http.request(
        "PUT",
        upload_response.upload_url,
        body=f,
        headers={
            "Content-Type": "application/pdf",
            "Content-Length": str(os.path.getsize(file_path)),
        },
    )

if response.status != 200:
    print("Upload failed: ", response.data)
else:
    print("Upload successful")
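The metadata dict above is passed to StartFileUploadToConnectorRequest as a JSON string, hence the json.dumps call. A quick sanity check of what that stored string looks like (note Python's True becomes JSON's true):

```python
import json

metadata = {"created-from-api": True}
payload = json.dumps(metadata)
print(payload)  # {"created-from-api": true}

# Round-tripping recovers the original Python types
assert json.loads(payload) == metadata
```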

Connect to the AI platform and vector database

ai_platforms = connectors_api.get_ai_platform_connectors(VECTORIZE_ORG_ID)
builtin_ai_platform = [
    c.id for c in ai_platforms.ai_platform_connectors if c.type == "VECTORIZE"
][0]

vector_databases = connectors_api.get_destination_connectors(VECTORIZE_ORG_ID)
builtin_vector_db = [
    c.id for c in vector_databases.destination_connectors if c.type == "VECTORIZE"
][0]

Configure and deploy the pipeline

pipelines = v.PipelinesApi(api)
response = pipelines.create_pipeline(
    VECTORIZE_ORG_ID,
    v.PipelineConfigurationSchema(
        source_connectors=[
            v.SourceConnectorSchema(
                id=source_connector_id, type="FILE_UPLOAD", config={}
            )
        ],
        destination_connector=v.DestinationConnectorSchema(
            id=builtin_vector_db, type="VECTORIZE", config={}
        ),
        ai_platform=v.AIPlatformSchema(
            id=builtin_ai_platform, type="VECTORIZE", config={}
        ),
        pipeline_name="My Pipeline From API",
        schedule=v.ScheduleSchema(type="manual"),
    ),
)
pipeline_id = response.data.id

Configure tracing (optional)

If you want automated tracing of individual queries, you can also set your LangSmith API key by uncommenting below:

# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Instantiation

from langchain_vectorize.retrievers import VectorizeRetriever

retriever = VectorizeRetriever(
    api_token=VECTORIZE_API_TOKEN,
    organization=VECTORIZE_ORG_ID,
    pipeline_id=pipeline_id,
)

Usage

query = "Apple Shareholders equity"
retriever.invoke(query, num_results=2)

Use within a chain

Like other retrievers, the VectorizeRetriever can be incorporated into LLM applications via chains.

We will need an LLM or chat model:

pip install -qU "langchain[google-genai]"

import getpass
import os

if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
chain.invoke("...")
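To see what format_docs feeds into the prompt's {context} slot, here is a minimal sketch using stand-in objects that only have a page_content attribute (real calls receive LangChain Document instances from the retriever):

```python
from types import SimpleNamespace


def format_docs(docs):
    # Join the retrieved chunks into one context string, separated by blank lines
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for retrieved Document objects
fake_docs = [
    SimpleNamespace(page_content="First retrieved chunk."),
    SimpleNamespace(page_content="Second retrieved chunk."),
]
context = format_docs(fake_docs)
print(context)
# First retrieved chunk.
#
# Second retrieved chunk.
```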

API reference

For detailed documentation of all VectorizeRetriever features and configurations, head to the API reference.