Vectara self-querying

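Vectara is the trusted AI assistant and agent platform, focused on enterprise readiness for mission-critical applications. Vectara's serverless RAG-as-a-service provides all the components of RAG behind an easy-to-use API, including:

A way to extract text from files (PDF, PPT, DOCX, etc.).
ML-based chunking that provides state-of-the-art performance.
The Boomerang embeddings model.
Its own internal vector database, where text chunks and embedding vectors are stored.
A query service that automatically encodes the query into an embedding and retrieves the most relevant text segments, with support for Hybrid Search and multiple reranking options such as the multi-lingual relevance reranker, MMR, and the UDF reranker.
An LLM that creates a generative summary, with citations, based on the retrieved documents (context).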

For more information, see the Vectara documentation.

This notebook shows how to use Vectara as a SelfQueryRetriever.

Setup

To use the VectaraVectorStore, you first need to install the partner package.

!uv pip install -U pip && uv pip install -qU langchain-vectara

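Getting Started

To get started, use the following steps:

If you don't already have an account, sign up for a free Vectara trial.
Within your account you can create one or more corpora. Each corpus represents an area that stores text data ingested from input documents. To create a corpus, use the "Create Corpus" button, then give the corpus a name and a description. Optionally, you can define filtering attributes and apply some advanced options. If you click on the corpus you created, you can see its name and corpus ID at the top.
Next, you need to create an API key to access the corpus. Click the "Access Control" tab in the corpus view, then the "Create API Key" button. Give your key a name, and choose whether it should be query-only or query+index. Click "Create" and you now have an active API key. Keep this key confidential.

To use LangChain with Vectara, you need these two values: corpus_key and api_key. You can provide VECTARA_API_KEY to LangChain in two ways:

Include this variable in your environment: VECTARA_API_KEY.

For example, you can set this variable using os.environ and getpass as follows: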

import os
import getpass

os.environ["VECTARA_API_KEY"] = getpass.getpass("Vectara API Key:")

Pass it to the Vectara vector store constructor:

vectara = Vectara(
    vectara_api_key=vectara_api_key
)

In this notebook we assume these values are provided in the environment.
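
Since the rest of the notebook relies on environment variables, it can help to fail early when one is missing. The following is a minimal sketch, not part of langchain-vectara: the helper name `load_vectara_credentials` is hypothetical, and it assumes the two variable names used in this notebook, VECTARA_API_KEY and VECTARA_CORPUS_KEY.

```python
import os


def load_vectara_credentials(env=None):
    """Return (api_key, corpus_key) from the environment, failing fast if either is unset.

    Hypothetical helper for this notebook; `env` defaults to os.environ and
    can be replaced by a plain dict for testing.
    """
    env = os.environ if env is None else env
    required = ("VECTARA_API_KEY", "VECTARA_CORPUS_KEY")
    # Treat empty strings the same as unset variables.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["VECTARA_API_KEY"], env["VECTARA_CORPUS_KEY"]
```

Calling `load_vectara_credentials()` before constructing the vector store gives a clear error message instead of a failed request later on.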

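Connecting to Vectara from LangChain

In this example, we assume that you have created an account and a corpus, and added your VECTARA_CORPUS_KEY and VECTARA_API_KEY (created with permissions for both indexing and querying) as environment variables.

We further assume the corpus has 4 fields defined as filterable metadata attributes: year, director, rating, and genre.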

import os

from langchain_core.documents import Document

# Placeholder values: replace them with your own API key and corpus key.
os.environ["VECTARA_API_KEY"] = "VECTARA_API_KEY"
os.environ["VECTARA_CORPUS_KEY"] = "VECTARA_CORPUS_KEY"

from langchain_vectara import Vectara


Dataset

We first define an example dataset of movies and upload it to the corpus, along with the metadata:

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "rating": 9.9,
            "director": "Andrei Tarkovsky",
            "genre": "science fiction",
        },
    ),
]

corpus_key = os.getenv("VECTARA_CORPUS_KEY")
vectara = Vectara()
for doc in docs:
    vectara.add_texts(
        [doc.page_content], corpus_key=corpus_key, doc_metadata=doc.metadata
    )
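
Because filtering only works on the attributes defined in the corpus, it may be worth checking each document's metadata before uploading. The sketch below is a local sanity check only and does not call Vectara. The helper `check_metadata` and the `ALLOWED_FIELDS` mapping are hypothetical, and the field types are assumptions inferred from the example dataset above rather than from the corpus definition.

```python
# Assumed types for the four filterable attributes (inferred from the sample data).
ALLOWED_FIELDS = {
    "year": int,
    "director": str,
    "rating": (int, float),
    "genre": str,
}


def check_metadata(metadata):
    """Return a list of problems found in one metadata dict (empty list if none)."""
    problems = []
    for key, value in metadata.items():
        expected = ALLOWED_FIELDS.get(key)
        if expected is None:
            problems.append(f"unknown field: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
    return problems
```

For instance, running `check_metadata(doc.metadata)` over `docs` before the upload loop would flag a misspelled key such as "ratings" before it reaches the corpus.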

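Vectara self-querying

You don't need LangChain's self-query machinery here: enabling intelligent_query_rewriting on the Vectara platform achieves the same result. Vectara's Intelligent Query Rewriting option improves search precision by automatically generating metadata filter expressions from natural language queries. The feature analyzes the user query, extracts relevant metadata filters, and rephrases the query to focus on the core information need. For more details, see the Vectara documentation.

Intelligent query rewriting can be enabled on a per-query basis by setting the intelligent_query_rewriting parameter to True in VectaraQueryConfig.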

from langchain_vectara.vectorstores import (
    CorpusConfig,
    SearchConfig,
    VectaraQueryConfig,
)

config = VectaraQueryConfig(
    search=SearchConfig(corpora=[CorpusConfig(corpus_key=corpus_key)]),
    generation=None,
    intelligent_query_rewriting=True,
)
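
To make the idea of an extracted metadata filter more concrete, the sketch below applies a hand-written set of conditions to plain dictionaries locally. It does not call Vectara and does not reproduce its filter syntax: the `(field, operator, value)` tuples and the `matches` helper are hypothetical stand-ins for what query rewriting might extract from a query such as "What's a highly rated (above 8.5) science fiction film?".

```python
import operator

# Hypothetical, hand-written representation of extracted filter conditions.
# Vectara generates and evaluates its real filter expressions server-side.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def matches(metadata, conditions):
    """True if metadata satisfies every condition; a missing field never matches."""
    return all(
        field in metadata and OPS[op](metadata[field], value)
        for field, op, value in conditions
    )


# A subset of the example dataset's metadata.
movies = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 1979, "rating": 9.9, "director": "Andrei Tarkovsky", "genre": "science fiction"},
]

# "rated above 8.5" AND "science fiction"
conditions = [("rating", ">", 8.5), ("genre", "=", "science fiction")]
selected = [m for m in movies if matches(m, conditions)]
```

Here `selected` contains only the 1979 entry: the 1993 film fails the rating condition, and the 2006 film has no genre field.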

Queries

Now we can try it out using the vectara_query method!

# This example only specifies a relevant query
vectara.vectara_query("What are movies about scientists", config)

[(Document(metadata={'year': 1995, 'genre': 'animated', 'source': 'langchain'}, page_content='Toys come alive and have a blast doing so'),
  0.4141285717487335),
 (Document(metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'source': 'langchain'}, page_content='Three men walk into the Zone, three men walk out of the Zone'),
  0.4046250879764557),
 (Document(metadata={'year': 2010, 'director': 'Christopher Nolan', 'rating': 8.2, 'source': 'langchain'}, page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...'),
  0.227469339966774),
 (Document(metadata={'year': 2019, 'director': 'Greta Gerwig', 'rating': 8.3, 'source': 'langchain'}, page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them'),
  0.19208428263664246),
 (Document(metadata={'year': 1993, 'rating': 7.7, 'genre': 'science fiction', 'source': 'langchain'}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
  0.1902722418308258),
 (Document(metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6, 'source': 'langchain'}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea'),
  0.08151976019144058)]

# This example only specifies a filter
vectara.vectara_query("I want to watch a movie rated higher than 8.5", config)

[(Document(metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6, 'source': 'langchain'}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea'),
  0.34279149770736694),
 (Document(metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'source': 'langchain'}, page_content='Three men walk into the Zone, three men walk out of the Zone'),
  0.242923304438591)]

# This example specifies a query and a filter
vectara.vectara_query("Has Greta Gerwig directed any movies about women", config)

[(Document(metadata={'year': 2019, 'director': 'Greta Gerwig', 'rating': 8.3, 'source': 'langchain'}, page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them'),
  0.10141132771968842)]

# This example specifies a composite filter
vectara.vectara_query("What's a highly rated (above 8.5) science fiction film?", config)

[(Document(metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'source': 'langchain'}, page_content='Three men walk into the Zone, three men walk out of the Zone'),
  0.9508692026138306)]

# This example specifies a query and composite filter
vectara.vectara_query(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated",
    config,
)

[(Document(metadata={'year': 1995, 'genre': 'animated', 'source': 'langchain'}, page_content='Toys come alive and have a blast doing so'),
  0.7290377616882324),
 (Document(metadata={'year': 1993, 'rating': 7.7, 'genre': 'science fiction', 'source': 'langchain'}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
  0.4838160574436188)]
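
The results above are lists of (document, score) pairs, so they can be post-processed with ordinary Python. The following is a minimal sketch under stated assumptions: `top_hits` is a hypothetical helper, the 0.5 cutoff is arbitrary, and `SimpleNamespace` objects stand in for the returned documents so the example runs without a Vectara account.

```python
from types import SimpleNamespace


def top_hits(results, min_score=0.5):
    """Keep (document, score) pairs with score >= min_score, highest score first."""
    kept = [(doc, score) for doc, score in results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stand-ins exposing the same attributes as the results shown above.
results = [
    (
        SimpleNamespace(
            page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
            metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
        ),
        0.4838160574436188,
    ),
    (
        SimpleNamespace(
            page_content="Toys come alive and have a blast doing so",
            metadata={"year": 1995, "genre": "animated"},
        ),
        0.7290377616882324,
    ),
]

for doc, score in top_hits(results):
    print(f"{score:.2f}  {doc.page_content}")
```

Whether a fixed cutoff is meaningful depends on the reranker and query, so treat the threshold as something to tune rather than a recommended value.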
