
Vectara

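Vectara provides a trusted generative AI platform that lets organizations quickly create ChatGPT-like experiences (AI assistants) grounded in the data, documents, and knowledge they own. Technically, it is Retrieval-Augmented Generation (RAG) as a service.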

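Vectara's serverless RAG-as-a-service provides all the components of RAG behind an easy-to-use API, including:

  1. A way to extract text from files (PDF, PPT, DOCX, etc.).
  2. ML-based chunking that provides state-of-the-art performance.
  3. The Boomerang embedding model.
  4. Its own internal vector database, where text chunks and embedding vectors are stored.
  5. A query service that automatically encodes the query into an embedding and retrieves the most relevant text segments (including support for hybrid search and MMR).
  6. A large language model (LLM) that creates a generative summary based on the retrieved documents (context), including citations.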

For more information on how to use the API, see the Vectara API documentation.

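This notebook shows how to use the basic retrieval functionality when using Vectara purely as a vector store (without summarization), including similarity_search and similarity_search_with_score, as well as LangChain's as_retriever functionality.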

To use this integration, install langchain-community with pip install -qU langchain-community.

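Getting started

To get started, follow these steps:

  1. If you don't already have one, sign up for a free Vectara account. Once you have completed the sign-up, you will have a Vectara customer ID. You can find your customer ID by clicking on your name at the top right of the Vectara console window.
  2. Within your account you can create one or more corpora. Each corpus represents an area that stores text data ingested from input documents. To create a corpus, use the "Create Corpus" button, then provide a name and a description for the corpus. Optionally, you can define filtering attributes and apply some advanced options. If you click on the corpus you created, you can see its name and corpus ID at the top.
  3. Next, create an API key to access the corpus. Click the "Access Control" tab in the corpus view, then the "Create API Key" button. Give your key a name and choose whether it should allow query only or query plus indexing. Click "Create" and you now have an active API key. Keep this key confidential.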

To use LangChain with Vectara, you need three values: the customer ID, the corpus ID, and the API key. You can provide them to LangChain in two ways:

  1. Include these three variables in your environment: VECTARA_CUSTOMER_ID, VECTARA_CORPUS_ID, and VECTARA_API_KEY.

    For example, you can set these variables using os.environ and getpass as follows:

import os
import getpass

os.environ["VECTARA_CUSTOMER_ID"] = getpass.getpass("Vectara Customer ID:")
os.environ["VECTARA_CORPUS_ID"] = getpass.getpass("Vectara Corpus ID:")
os.environ["VECTARA_API_KEY"] = getpass.getpass("Vectara API Key:")
  2. Pass them as arguments to the Vectara vector store constructor:
vectara = Vectara(
    vectara_customer_id=vectara_customer_id,
    vectara_corpus_id=vectara_corpus_id,
    vectara_api_key=vectara_api_key,
)

In this notebook we assume they are provided in the environment.

import os

os.environ["VECTARA_API_KEY"] = "<YOUR_VECTARA_API_KEY>"
os.environ["VECTARA_CORPUS_ID"] = "<YOUR_VECTARA_CORPUS_ID>"
os.environ["VECTARA_CUSTOMER_ID"] = "<YOUR_VECTARA_CUSTOMER_ID>"
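
Before constructing the vector store, it can help to confirm that all three variables are actually set. The following is a minimal sketch using only the standard library. The helper name missing_vectara_env is hypothetical (it is not part of LangChain or Vectara), and treating values that still start with "<YOUR_" as unset is an assumption based on the placeholders used in this notebook.

```python
import os

REQUIRED_VARS = ("VECTARA_CUSTOMER_ID", "VECTARA_CORPUS_ID", "VECTARA_API_KEY")


def missing_vectara_env(env=None):
    """Return the required variable names that are unset, empty, or placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        # Assumption: "<YOUR_...>" placeholders were never replaced with real values.
        if not value or value.startswith("<YOUR_"):
            missing.append(name)
    return missing


missing = missing_vectara_env()
if missing:
    print("Missing Vectara settings:", ", ".join(missing))
```

Passing a plain dict as env makes the check easy to exercise without touching the real environment.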

from langchain_community.vectorstores import Vectara
from langchain_community.vectorstores.vectara import (
    RerankConfig,
    SummaryConfig,
    VectaraQueryConfig,
)

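First, we load the State of the Union text into Vectara.

Note that we use the from_files interface, which does not require any local processing or chunking. Vectara receives the file content and performs all the necessary preprocessing, chunking, and embedding of the file into its knowledge store.

Here it is used with a .txt file, but the same works for many other file types.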

vectara = Vectara.from_files(["state_of_the_union.txt"])

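Basic Vectara RAG (retrieval-augmented generation)

We now create a VectaraQueryConfig object to control the retrieval and summarization options:

  • We enable summarization, specifying that the LLM should use the top 7 matching chunks and respond in English.
  • We enable MMR (maximal marginal relevance) during retrieval, with a diversity bias of 0.2.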
  • We want the top 10 results, with hybrid search configured with a value of 0.005 (the lambda_val argument).
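
To give some intuition for the hybrid search value in the last bullet, here is a minimal sketch of one common way to blend a semantic (neural) score with a keyword (lexical) score. The linear interpolation and the function name hybrid_score are assumptions for illustration only; this page does not state the exact formula Vectara uses internally.

```python
def hybrid_score(neural, lexical, lambda_val):
    """Blend two relevance scores; 0 means purely neural, 1 means purely lexical."""
    if not 0.0 <= lambda_val <= 1.0:
        raise ValueError("lambda_val must be between 0 and 1")
    # Assumed linear interpolation between the two scores.
    return (1.0 - lambda_val) * neural + lambda_val * lexical


# A small lambda_val such as the 0.005 passed in this notebook's code keeps the
# ranking almost entirely semantic, with a slight nudge from exact term matches.
blended = hybrid_score(neural=0.8, lexical=0.2, lambda_val=0.005)
```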

With this configuration, we use the as_rag method to create a LangChain Runnable object that encapsulates the full Vectara RAG pipeline:

summary_config = SummaryConfig(is_enabled=True, max_results=7, response_lang="eng")
rerank_config = RerankConfig(reranker="mmr", rerank_k=50, mmr_diversity_bias=0.2)
config = VectaraQueryConfig(
    k=10, lambda_val=0.005, rerank_config=rerank_config, summary_config=summary_config
)

query_str = "what did Biden say?"

rag = vectara.as_rag(config)
rag.invoke(query_str)["answer"]
"Biden addressed various topics in his statements. He highlighted the need to confront Putin by building a coalition of nations[1]. He also expressed commitment to investigating the impact of burn pits on soldiers' health, including his son's case[2]. Additionally, Biden outlined a plan to fight inflation by cutting prescription drug costs[3]. He emphasized the importance of continuing to combat COVID-19 and not just accepting living with it[4]. Furthermore, he discussed measures to weaken Russia economically and target Russian oligarchs[6]. Biden also advocated for passing the Equality Act to support LGBTQ+ Americans and condemned state laws targeting transgender individuals[7]."

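The MMR reranking enabled through RerankConfig can be illustrated with a short, self-contained sketch. This is the textbook greedy MMR procedure over toy vectors, not Vectara's implementation. The function names and the way diversity_bias enters the score (0 means relevance only, higher values penalize redundancy more) are assumptions for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(query_vec, doc_vecs, k, diversity_bias):
    """Greedily pick k document indices, trading relevance against redundancy."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return (1 - diversity_bias) * relevance[i] - diversity_bias * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [1.0, -0.8]]  # docs 0 and 1 are near-duplicates
by_relevance = mmr_rerank(query, docs, k=3, diversity_bias=0.0)
diversified = mmr_rerank(query, docs, k=3, diversity_bias=0.5)
```

With no diversity bias the near-duplicate documents stay adjacent in the ranking. With a larger bias, the less similar third document is promoted ahead of the near-duplicate.
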
We can also use the streaming interface like this:

output = {}
curr_key = None
for chunk in rag.stream(query_str):
    for key in chunk:
        if key not in output:
            output[key] = chunk[key]
        else:
            output[key] += chunk[key]
        if key == "answer":
            print(chunk[key], end="", flush=True)
        curr_key = key
Biden addressed various topics in his statements. He highlighted the importance of building coalitions to confront global challenges [1]. He also expressed commitment to investigating the impact of burn pits on soldiers' health, including his son's case [2, 4]. Additionally, Biden outlined his plan to combat inflation by cutting prescription drug costs and reducing the deficit, with support from Nobel laureates and business leaders [3]. He emphasized the ongoing fight against COVID-19 and the need to continue combating the virus [5]. Furthermore, Biden discussed measures taken to weaken Russia's economic and military strength, targeting Russian oligarchs and corrupt leaders [6]. He also advocated for passing the Equality Act to support LGBTQ+ Americans and address discriminatory state laws [7].
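
The accumulation loop above can be wrapped in a small reusable function and tried offline. The sketch below uses a hypothetical list of chunks in place of rag.stream(query_str). The helper name collect_stream and the exact keys and contents of the sample chunks are assumptions, and real streamed chunks may look different.

```python
def collect_stream(chunks, echo_key="answer"):
    """Merge streamed dict chunks by key, printing pieces of echo_key as they arrive."""
    output = {}
    for chunk in chunks:
        for key, value in chunk.items():
            if key in output:
                output[key] += value  # works for any values that support +
            else:
                output[key] = value
            if key == echo_key:
                print(value, end="", flush=True)
    return output


# Hypothetical stand-in for rag.stream(query_str).
fake_stream = [
    {"question": "what did Biden say?"},
    {"answer": "Biden addressed "},
    {"answer": "various topics."},
]
result = collect_stream(fake_stream)
```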

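Hallucination detection and Factual Consistency Score

Vectara created HHEM, an open-source model that can be used to evaluate RAG responses for factual consistency.

As part of Vectara RAG, the "Factual Consistency Score" (FCS), an improved version of the open-source HHEM, is available through the API. It is automatically included in the output of the RAG pipeline: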

summary_config = SummaryConfig(is_enabled=True, max_results=5, response_lang="eng")
rerank_config = RerankConfig(reranker="mmr", rerank_k=50, mmr_diversity_bias=0.1)
config = VectaraQueryConfig(
    k=10, lambda_val=0.005, rerank_config=rerank_config, summary_config=summary_config
)

rag = vectara.as_rag(config)
resp = rag.invoke(query_str)
print(resp["answer"])
print(f"Vectara FCS = {resp['fcs']}")
Biden addressed various topics in his statements. He highlighted the need to confront Putin by building a coalition of nations[1]. He also expressed his commitment to investigating the impact of burn pits on soldiers' health, referencing his son's experience[2]. Additionally, Biden discussed his plan to fight inflation by cutting prescription drug costs and garnering support from Nobel laureates and business leaders[4]. Furthermore, he emphasized the importance of continuing to combat COVID-19 and not merely accepting living with the virus[5]. Biden's remarks encompassed international relations, healthcare challenges faced by soldiers, economic strategies, and the ongoing battle against the pandemic.
Vectara FCS = 0.41796625
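
One way to act on the score is to flag answers that fall below a cutoff for review. This is only a sketch: the helper name needs_review and the 0.5 threshold are assumptions chosen for illustration, not values recommended on this page. The "answer" and "fcs" keys mirror the response shown above.

```python
def needs_review(resp, threshold=0.5):
    """Return True when the response has no FCS or its FCS is below the threshold."""
    fcs = resp.get("fcs")
    return fcs is None or fcs < threshold


# Sample response shaped like the RAG output above.
sample = {"answer": "Biden addressed various topics...", "fcs": 0.41796625}
if needs_review(sample):
    print(f"Low factual consistency ({sample['fcs']:.2f}); consider verifying the answer.")
```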

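Vectara as a LangChain retriever

The Vectara component can also be used purely as a retriever.

In this case, it behaves just like any other LangChain retriever. The main use for this mode is semantic search, so here we disable summarization: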

config.summary_config.is_enabled = False
config.k = 3
retriever = vectara.as_retriever(config=config)
retriever.invoke(query_str)
[Document(page_content='He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. We were ready.  Here is what we did. We prepared extensively and carefully. We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin.', metadata={'lang': 'eng', 'section': '1', 'offset': '2160', 'len': '36', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='When they came home, many of the world’s fittest and best trained warriors were never the same. Dizziness. \n\nA cancer that would put them in a flag-draped coffin. I know. \n\nOne of those soldiers was my son Major Beau Biden. We don’t know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. But I’m committed to finding out everything we can.', metadata={'lang': 'eng', 'section': '1', 'offset': '34652', 'len': '60', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='But cancer from prolonged exposure to burn pits ravaged Heath’s lungs and body. Danielle says Heath was a fighter to the very end. He didn’t know how to stop fighting, and neither did she. Through her pain she found purpose to demand we do better. Tonight, Danielle—we are.', metadata={'lang': 'eng', 'section': '1', 'offset': '35442', 'len': '57', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'})]

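For backwards compatibility, you can also enable summarization with a retriever, in which case the summary is added as an additional Document object: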

config.summary_config.is_enabled = True
config.k = 3
retriever = vectara.as_retriever(config=config)
retriever.invoke(query_str)
[Document(page_content='He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. We were ready.  Here is what we did. We prepared extensively and carefully. We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin.', metadata={'lang': 'eng', 'section': '1', 'offset': '2160', 'len': '36', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='When they came home, many of the world’s fittest and best trained warriors were never the same. Dizziness. \n\nA cancer that would put them in a flag-draped coffin. I know. \n\nOne of those soldiers was my son Major Beau Biden. We don’t know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. But I’m committed to finding out everything we can.', metadata={'lang': 'eng', 'section': '1', 'offset': '34652', 'len': '60', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='But cancer from prolonged exposure to burn pits ravaged Heath’s lungs and body. Danielle says Heath was a fighter to the very end. He didn’t know how to stop fighting, and neither did she. Through her pain she found purpose to demand we do better. Tonight, Danielle—we are.', metadata={'lang': 'eng', 'section': '1', 'offset': '35442', 'len': '57', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content="Biden discussed various topics in his statements. He highlighted the importance of unity and preparation to confront challenges, such as building coalitions to address global issues [1]. Additionally, he shared personal stories about the impact of health issues on soldiers, including his son's experience with brain cancer possibly linked to burn pits [2]. Biden also outlined his plans to combat inflation by cutting prescription drug costs and emphasized the ongoing efforts to combat COVID-19, rejecting the idea of merely living with the virus [4, 5]. Overall, Biden's messages revolved around unity, healthcare challenges faced by soldiers, economic plans, and the ongoing fight against COVID-19.", metadata={'summary': True, 'fcs': 0.54751414})]
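
Because the summary arrives as one more document, downstream code may want to separate it from the retrieved passages. The sketch below relies on the 'summary': True metadata flag visible in the output above. The Doc class is a minimal stand-in for LangChain's Document so the example runs on its own, and split_summary is a hypothetical helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_summary(documents):
    """Return (passages, summary_doc); summary_doc is None if no summary is present."""
    passages = [d for d in documents if not d.metadata.get("summary")]
    summaries = [d for d in documents if d.metadata.get("summary")]
    return passages, (summaries[0] if summaries else None)


docs = [
    Doc("He thought the West and NATO wouldn't respond.", {"source": "vectara"}),
    Doc("Biden discussed various topics.", {"summary": True, "fcs": 0.54751414}),
]
passages, summary = split_summary(docs)
```

Selecting by the metadata flag avoids depending on the summary's position in the returned list.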

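Advanced LangChain query preprocessing with Vectara

Vectara's "RAG as a service" does much of the heavy lifting in creating question-answering or chatbot chains. The LangChain integration adds the option of using further capabilities such as query preprocessing with SelfQueryRetriever or MultiQueryRetriever. Let's look at an example using MultiQueryRetriever.

Because MultiQueryRetriever (MQR) uses an LLM, we have to set one up. Here we choose ChatOpenAI: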

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)
mqr = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)


def get_summary(documents):
    return documents[-1].page_content


(mqr | get_summary).invoke(query_str)
"Biden's statement highlighted his efforts to unite freedom-loving nations against Putin's aggression, sharing information in advance to counter Russian lies and hold Putin accountable[1]. Additionally, he emphasized his commitment to military families, like Danielle Robinson, and outlined plans for more affordable housing, Pre-K for 3- and 4-year-olds, and ensuring no additional taxes for those earning less than $400,000 a year[2][3]. The statement also touched on the readiness of the West and NATO to respond to Putin's actions, showcasing extensive preparation and coalition-building efforts[4]. Heath Robinson's story, a combat medic who succumbed to cancer from burn pits, was used to illustrate the resilience and fight for better conditions[5]."

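Conceptually, multi-query retrieval rewrites the question into several variants, runs retrieval for each, and merges the results without duplicates. The following is a simplified, self-contained sketch of that idea, not LangChain's implementation. The function multi_query_retrieve, the hard-coded variants (standing in for the LLM), and the toy index are all hypothetical.

```python
def multi_query_retrieve(query, generate_variants, retrieve):
    """Retrieve for each query variant and return the de-duplicated union, in order."""
    seen = set()
    merged = []
    for variant in generate_variants(query):
        for doc in retrieve(variant):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Toy stand-ins: a fixed list of rewrites instead of an LLM, and a dict as the index.
toy_index = {
    "biden speech": ["passage A", "passage B"],
    "biden remarks": ["passage B", "passage C"],
}
results = multi_query_retrieve(
    "what did Biden say?",
    generate_variants=lambda q: ["biden speech", "biden remarks"],
    retrieve=lambda q: toy_index.get(q, []),
)
```
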