Couchbase

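Couchbase is an award-winning distributed NoSQL cloud database that delivers versatility, performance, scalability, and financial value for cloud, mobile, AI, and edge computing applications. Couchbase embraces AI with coding assistance for developers and vector search for their applications.

Vector Search is part of the Full Text Search Service (Search Service) in Couchbase.

This tutorial explains how to use Vector Search in Couchbase. You can work with either Couchbase Capella or a self-managed Couchbase Server.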

Setup

To access the CouchbaseVectorStore, you first need to install the langchain-couchbase partner package:

pip install -qU langchain-couchbase

Credentials

Head over to the Couchbase website and create a new connection, making sure to save your database username and password.

import getpass

COUCHBASE_CONNECTION_STRING = getpass.getpass(
    "Enter the connection string for the Couchbase cluster: "
)
DB_USERNAME = getpass.getpass("Enter the username for the Couchbase cluster: ")
DB_PASSWORD = getpass.getpass("Enter the password for the Couchbase cluster: ")
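For non-interactive runs (scripts, scheduled jobs), prompting with getpass is inconvenient. The following is a minimal sketch, assuming you choose to keep the values in environment variables. The helper name `read_secret` and the variable name `DEMO_COUCHBASE_CONNECTION_STRING` are hypothetical; langchain-couchbase does not look for them on its own.

```python
import getpass
import os


def read_secret(env_var: str, prompt: str) -> str:
    """Return the value of env_var if it is set and non-empty, otherwise prompt."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass.getpass(prompt)


# Hypothetical variable, set here only so the example runs without prompting.
os.environ["DEMO_COUCHBASE_CONNECTION_STRING"] = "couchbase://localhost"

connection_string = read_secret(
    "DEMO_COUCHBASE_CONNECTION_STRING",
    "Enter the connection string for the Couchbase cluster: ",
)
print(connection_string)
```

The same helper could be reused for the username and password, falling back to the interactive prompt whenever a variable is missing.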

If you want best-in-class automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:

# import os
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

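Initialization

Before instantiating the vector store, we need to create a connection.

Create Couchbase Connection Object

We first create a connection to the Couchbase cluster and then pass the cluster object to the vector store.

Here, we connect using the username and password from above. You can also connect to your cluster in any other supported way.

For more information on connecting to the Couchbase cluster, please check the documentation.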

from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)

# Wait until the cluster is ready for use.
cluster.wait_until_ready(timedelta(seconds=5))

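We will now set the names of the bucket, scope, and collection in the Couchbase cluster that we want to use for vector search.

In this example, we use the default scope and collection.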

BUCKET_NAME = "langchain_bucket"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "langchain-test-index"

For details on how to create a search index with support for vector fields, please refer to the documentation.
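To give a feel for what such an index contains, the sketch below builds an index definition as a Python dict and prints it as JSON. The layout mirrors the Search index JSON format, but every key here is an assumption to verify against the Couchbase documentation before importing anything. `build_index_definition` is a hypothetical helper, and the field names `text` and `embedding` are chosen to match the keys used later in this tutorial.

```python
import json


def build_index_definition(
    index_name: str,
    bucket_name: str,
    dims: int,
    text_key: str = "text",
    embedding_key: str = "embedding",
) -> dict:
    """Build a Search index definition with one text field and one vector field."""
    return {
        "name": index_name,
        "type": "fulltext-index",
        "sourceType": "gocbcore",
        "sourceName": bucket_name,
        "params": {
            "mapping": {
                "default_mapping": {
                    "enabled": True,
                    "dynamic": True,
                    "properties": {
                        # Index everything under "metadata" dynamically.
                        "metadata": {"enabled": True, "dynamic": True},
                        embedding_key: {
                            "enabled": True,
                            "dynamic": False,
                            "fields": [
                                {
                                    "name": embedding_key,
                                    "type": "vector",
                                    "dims": dims,  # must match the embedding model
                                    "similarity": "dot_product",
                                    "index": True,
                                }
                            ],
                        },
                        text_key: {
                            "enabled": True,
                            "dynamic": False,
                            "fields": [
                                {"name": text_key, "type": "text", "index": True, "store": True}
                            ],
                        },
                    },
                },
                "index_dynamic": True,
                "store_dynamic": True,
            }
        },
    }


# text-embedding-3-large returns 3072-dimensional vectors.
definition = build_index_definition("langchain-test-index", "langchain_bucket", dims=3072)
print(json.dumps(definition, indent=2))
```

The important constraint is that `dims` equals the length of the vectors produced by the embedding model you pick; a mismatch means the vectors cannot be indexed as intended.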

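Simple Instantiation

Below, we create the vector store object with the cluster information and the search index name.

The store needs an embedding model. The three setups that follow (OpenAI, HuggingFace, and fake embeddings for testing) are alternatives; run only one of them.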

pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

pip install -qU langchain-huggingface

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model="sentence-transformers/all-mpnet-base-v2")

pip install -qU langchain-core

from langchain_core.embeddings import FakeEmbeddings

embeddings = FakeEmbeddings(size=4096)

from langchain_couchbase.vectorstores import CouchbaseVectorStore

vector_store = CouchbaseVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=SEARCH_INDEX_NAME,
)

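Specify the Text and Embedding Fields

You can optionally specify the text and embedding fields for the documents with the text_key and embedding_key parameters.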

vector_store_specific = CouchbaseVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=SEARCH_INDEX_NAME,
    text_key="text",
    embedding_key="embedding",
)
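To make the role of the two keys concrete, the sketch below shows the JSON shape a stored document would take. It assumes the store writes the text, the vector, and a `metadata` object side by side in one document. `to_stored_document` is a hypothetical helper for illustration and is not part of langchain-couchbase.

```python
def to_stored_document(
    text: str,
    vector: list[float],
    metadata: dict,
    text_key: str = "text",
    embedding_key: str = "embedding",
) -> dict:
    """Lay out one document the way the search index is expected to see it."""
    return {text_key: text, embedding_key: vector, "metadata": metadata}


doc = to_stored_document(
    "The weather forecast for tomorrow is cloudy.",
    [0.1, 0.2, 0.3],  # toy vector; real embeddings are much longer
    {"source": "news"},
    text_key="content",
    embedding_key="vector",
)
print(doc)
```

Whatever names you choose need to match the field names defined in the search index, otherwise the index has nothing to search on.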

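Manage vector store

Once you have created your vector store, you can interact with it by adding and deleting items.

Add items to vector store

We can add items to the vector store with the add_documents function.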

from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)
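Random uuid4 IDs mean that re-running the ingestion produces a fresh ID, and therefore a second copy, for every document. One alternative, sketched here with the standard library only, is to derive the ID from the content with uuid5 so that the same text always maps to the same ID. This assumes that writing a document under an existing ID overwrites it, which is worth verifying for your setup. `content_id` is a hypothetical helper.

```python
from uuid import NAMESPACE_URL, uuid5


def content_id(page_content: str, source: str = "") -> str:
    """Derive a stable ID from a document's text and source."""
    return str(uuid5(NAMESPACE_URL, f"{source}:{page_content}"))


first = content_id("The top 10 soccer players in the world right now.", "website")
second = content_id("The top 10 soccer players in the world right now.", "website")
other = content_id("The top 10 soccer players in the world right now.", "news")
print(first == second, first == other)  # True False
```

A list built with `[content_id(d.page_content, d.metadata.get("source", "")) for d in documents]` could then be passed as `ids` in place of the random UUIDs.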


Delete items from vector store

vector_store.delete(ids=[uuids[-1]])

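Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

Query directly

Similarity search

A simple similarity search can be performed as follows: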

results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

Similarity search with score

You can also fetch the scores for the results by calling the similarity_search_with_score method.

results = vector_store.similarity_search_with_score("Will it be hot tomorrow?", k=1)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
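The score expresses how close the query embedding is to each stored embedding. As a rough illustration of what ranking by a dot-product score means, here is a self-contained toy version with hand-written 3-dimensional vectors. It is not how Couchbase computes or normalizes its scores, and the names `dot` and `top_k` are made up for this sketch.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query: list[float], corpus: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Return the k (text, score) pairs with the highest dot product."""
    scored = [(text, dot(query, vec)) for text, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings": each text points along one axis.
corpus = {
    "weather": [1.0, 0.0, 0.0],
    "finance": [0.0, 1.0, 0.0],
    "movies": [0.0, 0.0, 1.0],
}
results = top_k([0.5, 0.25, 0.0], corpus, k=2)
for text, score in results:
    print(f"* [SIM={score:.3f}] {text}")
```

The query vector leans mostly toward the first axis, so "weather" ranks first with a score of 0.500 and "finance" second with 0.250.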

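Specifying Fields to Return

You can specify the fields to return from the document with the fields parameter of the search methods. These fields are returned as part of the metadata object of the returned Document. You can fetch any field that is stored in the search index. The text_key of the document is returned as part of the document's page_content.

If you do not specify any fields to fetch, all fields stored in the index are returned.

If you want to fetch a field inside the metadata, you need to address it with dot notation.

For example, to fetch the source field in the metadata, specify metadata.source.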

query = "What did I eat for breakfast today?"
results = vector_store.similarity_search(query, fields=["metadata.source"])
print(results[0])
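The dot notation is simply a path through the nested document. A small standard-library sketch of that lookup, with a hypothetical helper `get_by_path` and a hand-written sample document:

```python
def get_by_path(document: dict, path: str):
    """Follow a dotted path such as "metadata.source" through nested dicts."""
    current = document
    for key in path.split("."):
        current = current[key]
    return current


stored = {
    "text": "I had pancakes for breakfast.",
    "metadata": {"source": "tweet"},
}
print(get_by_path(stored, "metadata.source"))  # tweet
```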

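Hybrid Queries

Couchbase lets you run hybrid searches by combining vector search results with searches on the non-vector fields of the document, such as the metadata object.

The results are based on the combination of the vector search results and the results of the searches supported by the Search Service. The scores of the component searches are added up to give the total score of each result.

To perform hybrid searches, pass the optional search_options parameter, which is accepted by all the similarity searches.
The different search and query possibilities for search_options are described in the Couchbase Search documentation.

Create Diverse Metadata for Hybrid Search

To simulate hybrid search, let us create some synthetic metadata for a set of documents. We add three fields to the metadata of every document: date (a year from 2010 through 2019), rating (between 1 and 5), and author (either John Doe or Jane Doe).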

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Adding metadata to documents
for i, doc in enumerate(docs):
    doc.metadata["date"] = f"{range(2010, 2020)[i % 10]}-01-01"
    doc.metadata["rating"] = range(1, 6)[i % 5]
    doc.metadata["author"] = ["John Doe", "Jane Doe"][i % 2]

vector_store.add_documents(docs)

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(query)
print(results[0].metadata)


Query by Exact Value

We can search for exact matches on a textual field such as the author in the metadata object.

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(
    query,
    search_options={"query": {"field": "metadata.author", "match": "John Doe"}},
    fields=["metadata.author"],
)
print(results[0])

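Query by Partial Match

We can search for partial matches by specifying a fuzziness for the search. This is useful when you want to match slight variations or misspellings of a search term.

Here, "Jae" is close (fuzziness of 1) to "Jane".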

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(
    query,
    search_options={
        "query": {"field": "metadata.author", "match": "Jae", "fuzziness": 1}
    },
    fields=["metadata.author"],
)
print(results[0])
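Fuzziness is an edit distance: the number of single-character insertions, deletions, or substitutions needed to turn one term into another. The sketch below is a plain-Python Levenshtein distance, included only to show why "Jae" matches "Jane" at fuzziness 1 but would not match "John". It compares lowercased terms on the assumption that the field is analyzed case-insensitively, and it is not the Search Service's own implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("jae", "jane"))  # 1
print(levenshtein("jae", "john"))  # 3
```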

Query by Date Range

We can search for documents that fall within a date range on a date field such as metadata.date.

query = "Any mention about independence?"
results = vector_store.similarity_search(
    query,
    search_options={
        "query": {
            "start": "2016-12-31",
            "end": "2017-01-02",
            "inclusive_start": True,
            "inclusive_end": False,
            "field": "metadata.date",
        }
    },
)
print(results[0])
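With inclusive_start set to True and inclusive_end set to False, the range is half-open: the start date matches and the end date does not. Since the generated dates all fall on 1 January, only 2017-01-01 lies inside this range. The following standard-library sketch reproduces that boundary logic locally; `in_date_range` is a hypothetical helper, not a Couchbase API.

```python
from datetime import date


def in_date_range(
    value: str,
    start: str,
    end: str,
    inclusive_start: bool = True,
    inclusive_end: bool = False,
) -> bool:
    """Check an ISO date string against a range with configurable endpoints."""
    day, lower, upper = (date.fromisoformat(s) for s in (value, start, end))
    after_start = day >= lower if inclusive_start else day > lower
    before_end = day <= upper if inclusive_end else day < upper
    return after_start and before_end


for candidate in ("2016-01-01", "2017-01-01", "2017-01-02"):
    print(candidate, in_date_range(candidate, "2016-12-31", "2017-01-02"))
# 2016-01-01 False
# 2017-01-01 True
# 2017-01-02 False
```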

Query by Numeric Range

We can search for documents whose numeric field, such as metadata.rating, falls within a given range.

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
    query,
    search_options={
        "query": {
            "min": 3,
            "max": 5,
            "inclusive_min": True,
            "inclusive_max": True,
            "field": "metadata.rating",
        }
    },
)
print(results[0])

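Combining Multiple Search Queries

Different search queries can be combined using the AND (conjuncts) or OR (disjuncts) operators.

In this example, we look for documents with a rating between 3 and 4 and a date between 2016-12-31 and 2017-01-02.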

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
    query,
    search_options={
        "query": {
            "conjuncts": [
                {"min": 3, "max": 4, "inclusive_max": True, "field": "metadata.rating"},
                {"start": "2016-12-31", "end": "2017-01-02", "field": "metadata.date"},
            ]
        }
    },
)
print(results[0])
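The example above uses conjuncts (AND). The OR form has the same shape with a disjuncts key. Because search_options is a plain dict, small builder functions can keep nested queries readable. The helpers below (`numeric_range`, `date_range`, `all_of`, `any_of`) are hypothetical names for this sketch and only assemble dictionaries; they do not talk to Couchbase.

```python
def numeric_range(
    field: str, minimum: float, maximum: float, inclusive_max: bool = True
) -> dict:
    """Numeric range sub-query on one field."""
    return {"field": field, "min": minimum, "max": maximum, "inclusive_max": inclusive_max}


def date_range(field: str, start: str, end: str) -> dict:
    """Date range sub-query on one field."""
    return {"field": field, "start": start, "end": end}


def all_of(*queries: dict) -> dict:
    """AND: every sub-query must match."""
    return {"query": {"conjuncts": list(queries)}}


def any_of(*queries: dict) -> dict:
    """OR: at least one sub-query must match."""
    return {"query": {"disjuncts": list(queries)}}


search_options = any_of(
    numeric_range("metadata.rating", 4, 5),
    date_range("metadata.date", "2016-12-31", "2017-01-02"),
)
print(search_options)
```

The resulting dict could be passed as `search_options=search_options` to any of the similarity search methods, in the same way as the hand-written dictionaries.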

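Other Queries

Similarly, you can use any of the supported query methods, such as Geo Distance, Polygon Search, Wildcard, and Regular Expressions, in the search_options parameter. Please refer to the documentation for more details on the available query methods and their syntax.

Query by turning into retriever

You can also transform the vector store into a retriever for easier use in your chains.

Here is how to transform your vector store into a retriever and then invoke the retriever with a simple query and filter.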

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 1, "score_threshold": 0.5},
)
retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})
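With search_type="similarity_score_threshold", the retriever returns at most k results and drops those that score below score_threshold. The toy function below mimics that filtering step on hand-written (text, score) pairs. It illustrates the idea only; it is not LangChain's implementation, and the scores are made up.

```python
def apply_score_threshold(
    scored: list[tuple[str, float]], k: int, score_threshold: float
) -> list[str]:
    """Take the k best-scoring texts, then drop those below the threshold."""
    best = sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
    return [text for text, score in best if score >= score_threshold]


scored = [
    ("The stock market is down 500 points.", 0.41),
    ("Robbers broke into the city bank.", 0.82),
    ("The top 10 soccer players.", 0.12),
]
kept = apply_score_threshold(scored, k=2, score_threshold=0.5)
print(kept)  # ['Robbers broke into the city bank.']
```

A consequence worth keeping in mind is that a high threshold can return fewer than k documents, or none at all.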

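Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the RAG guides in the LangChain documentation.

Frequently Asked Questions

Question: Should I create the search index before creating the CouchbaseVectorStore object?

Yes, currently you need to create the search index before creating the CouchbaseVectorStore object.

Question: I am not seeing all the fields that I specified in my search results.

In Couchbase, we can only return fields that are stored in the search index. Please make sure that the field you are trying to access in the search results is part of the search index.

One way to handle this is to index and store a document's fields dynamically in the index.

In Capella, go to "Advanced Mode"; then, under the "General Settings" chevron, check "[X] Store Dynamic Fields" or "[X] Index Dynamic Fields".
In Couchbase Server, in the Index Editor (not the Quick Editor), under the "Advanced" chevron, check "[X] Store Dynamic Fields" or "[X] Index Dynamic Fields".

Note that these options increase the size of the index.

For more details on dynamic mappings, please refer to the documentation.

Question: I am unable to see the metadata object in my search results.

This is most likely because the metadata field in the document is not indexed and/or stored by the Couchbase Search index. To index the metadata field in the document, you need to add it to the index as a child mapping.

If you choose to map all the fields in the mapping, you will be able to search by all metadata fields. Alternatively, to optimize the index, you can select only the specific fields inside the metadata object to be indexed. Refer to the docs to learn more about indexing child mappings.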


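API reference

For detailed documentation of all CouchbaseVectorStore features and configurations, head to the API reference: https://python.langchain.ac.cn/v0.2/api_reference/couchbase/vectorstores/langchain_couchbase.vectorstores.CouchbaseVectorStore.html

Related

Vector store conceptual guide
Vector store how-to guides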
