CloudflareVectorizeVectorStore

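This notebook covers how to get started with the CloudflareVectorize vector store.

Setup

This Python package is a wrapper around Cloudflare's REST API. To interact with the API, you need to provide an API token with the appropriate privileges.

You can create and manage API tokens here: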

https://dash.cloudflare.com/YOUR-ACCT-NUMBER/api-tokens

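Credentials

CloudflareVectorize depends on WorkersAI (if you want to use it for embeddings) and D1 (if you are using it to store and retrieve raw values).

While you can create a single api_token with Edit privileges on all needed resources (WorkersAI, Vectorize and D1), you may want to follow the principle of "least privilege access" and create a separate API token for each service.

Note: These service-specific tokens (if provided) take precedence over the global token. You can provide them instead of a global token.

You can also set these parameters as environment variables.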

import os

from dotenv import load_dotenv

load_dotenv(".env")

cf_acct_id = os.getenv("CF_ACCOUNT_ID")

# single "globally scoped" token with WorkersAI, Vectorize & D1
api_token = os.getenv("CF_API_TOKEN")

# OR, separate tokens with access to each service
cf_vectorize_token = os.getenv("CF_VECTORIZE_API_TOKEN")
cf_d1_token = os.getenv("CF_D1_API_TOKEN")
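
The precedence rule described above (a service-specific token, when provided, wins over the global token) can be sketched in a few lines. This is only an illustration: `resolve_token` is a hypothetical helper, not part of `langchain_cloudflare`, and the library's own resolution logic may differ in detail.

```python
from typing import Optional


def resolve_token(
    service_token: Optional[str], global_token: Optional[str]
) -> Optional[str]:
    """Return the token to use for one service (hypothetical helper).

    A non-empty service-specific token takes precedence; otherwise
    fall back to the global token. Returns None if neither is set.
    """
    return service_token or global_token or None


# A dedicated Vectorize token overrides the global one,
# while D1 falls back to the global token.
vectorize_token = resolve_token("vectorize-only-token", "global-token")
d1_token = resolve_token(None, "global-token")
print(vectorize_token, d1_token)
```

With this rule, you can start with one broad token and narrow it down service by service without changing the calling code.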

Initialization

import asyncio
import json
import uuid
import warnings

from langchain_cloudflare.embeddings import (
    CloudflareWorkersAIEmbeddings,
)
from langchain_cloudflare.vectorstores import (
    CloudflareVectorize,
)
from langchain_community.document_loaders import WikipediaLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

warnings.filterwarnings("ignore")

API Reference: WikipediaLoader | Document | RecursiveCharacterTextSplitter

# name your vectorize index
vectorize_index_name = f"test-langchain-{uuid.uuid4().hex}"

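Embeddings

To store embeddings and run semantic search and retrieval, you must first embed your raw values as vectors. Specify an embedding model that is available on WorkersAI: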

https://developers.cloudflare.com/workers-ai/models/

MODEL_WORKERSAI = "@cf/baai/bge-large-en-v1.5"

cf_ai_token = os.getenv(
    "CF_AI_API_TOKEN"
)  # needed if you want to use workersAI for embeddings

embedder = CloudflareWorkersAIEmbeddings(
    account_id=cf_acct_id, api_token=cf_ai_token, model_name=MODEL_WORKERSAI
)

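Raw Values with D1

Vectorize only stores embeddings, metadata and namespaces. If you want to store and retrieve raw values, you must use D1, Cloudflare's SQL database.

You can create a database and retrieve its ID here:

https://dash.cloudflare.com/YOUR-ACCT-NUMBER/workers/d1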

# provide the id of your D1 Database
d1_database_id = os.getenv("CF_D1_DATABASE_ID")

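CloudflareVectorize Class

Now we can create the CloudflareVectorize instance. The constructor accepts:

The embedding instance from earlier
The account ID
A global API token for all services (WorkersAI, Vectorize, D1); omitted in the example below because service-specific tokens are supplied
Individual API tokens for each service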

cfVect = CloudflareVectorize(
    embedding=embedder,
    account_id=cf_acct_id,
    d1_api_token=cf_d1_token,  # (Optional if using global token)
    vectorize_api_token=cf_vectorize_token,  # (Optional if using global token)
    d1_database_id=d1_database_id,  # (Optional if not using D1)
)

Cleanup

Before we get started, let's delete any test-langchain* indexes left over from this walkthrough.

# depending on your notebook environment you might need to include:
# import nest_asyncio
# nest_asyncio.apply()

arr_indexes = cfVect.list_indexes()
arr_indexes = [x for x in arr_indexes if "test-langchain" in x.get("name")]
arr_async_requests = [
    cfVect.adelete_index(index_name=x.get("name")) for x in arr_indexes
]
await asyncio.gather(*arr_async_requests);

Gotchas

A D1 database ID is provided, but neither a "global" api_token nor a d1_api_token

try:
    cfVect = CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        # api_token=api_token, # (Optional if using service-specific token)
        ai_api_token=cf_ai_token,  # (Optional if using global token)
        # d1_api_token=cf_d1_token,  # (Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  # (Optional if using global token)
        d1_database_id=d1_database_id,  # (Optional if not using D1)
    )
except Exception as e:
    print(str(e))

`d1_database_id` provided, but no global `api_token` provided and no `d1_api_token` provided.
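
The error above comes from a credential check performed when the instance is constructed. Here is a minimal sketch of that kind of check, assuming the rule is exactly the one stated in the message (a D1 database ID needs either a global token or a D1 token). `check_d1_credentials` is a hypothetical name and not the library's actual code.

```python
from typing import Optional


def check_d1_credentials(
    d1_database_id: Optional[str],
    api_token: Optional[str] = None,
    d1_api_token: Optional[str] = None,
) -> None:
    """Raise ValueError if D1 is requested without a usable token (hypothetical)."""
    if d1_database_id and not (api_token or d1_api_token):
        raise ValueError(
            "`d1_database_id` provided, but no global `api_token` provided "
            "and no `d1_api_token` provided."
        )


try:
    check_d1_credentials("my-d1-database-id")  # placeholder ID, no tokens
except ValueError as e:
    error_message = str(e)
    print(error_message)
```

Failing early like this surfaces a misconfiguration at construction time rather than on the first D1 read or write.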

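Manage vector store

Creating an Index

Let's start by creating an index, first deleting it if it already exists. If the index does not exist, the delete call raises an error from Cloudflare, which the example catches and prints.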

%%capture

try:
    cfVect.delete_index(index_name=vectorize_index_name, wait=True)
except Exception as e:
    print(e)

r = cfVect.create_index(
    index_name=vectorize_index_name, description="A Test Vectorize Index", wait=True
)
print(r)

{'created_on': '2025-05-13T05:38:04.487284Z', 'modified_on': '2025-05-13T05:38:04.487284Z', 'name': 'test-langchain-5c177bb404f74d438c916462ad73d27a', 'description': 'A Test Vectorize Index', 'config': {'dimensions': 1024, 'metric': 'cosine'}}

Listing Indexes

Now we can list the indexes on our account.

indexes = cfVect.list_indexes()
indexes = [x for x in indexes if "test-langchain" in x.get("name")]
print(indexes)

[{'created_on': '2025-05-13T05:38:04.487284Z', 'modified_on': '2025-05-13T05:38:04.487284Z', 'name': 'test-langchain-5c177bb404f74d438c916462ad73d27a', 'description': 'A Test Vectorize Index', 'config': {'dimensions': 1024, 'metric': 'cosine'}}]

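Get Index Info

We can also fetch a specific index and retrieve more granular information about it.

This call returns a processedUpToMutation value, which can be used to track the status of operations such as creating indexes and adding or deleting records.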

r = cfVect.get_index_info(index_name=vectorize_index_name)
print(r)

{'dimensions': 1024, 'vectorCount': 0}
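
One way to use processedUpToMutation is to poll the index info until it reports the mutation ID returned by a write. The sketch below assumes that the info dictionary exposes a `processedUpToMutation` key and that equality with your `mutationId` means the write has been applied. Both are assumptions about the response shape, and a missing key is treated as "not processed yet". `wait_for_mutation` and `fetch_info` are hypothetical names; the fake fetcher stands in for a real `get_index_info` call.

```python
import time
from typing import Callable, Dict


def wait_for_mutation(
    fetch_info: Callable[[], Dict],
    mutation_id: str,
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.5,
) -> bool:
    """Poll until the index reports `mutation_id` as processed, or time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        info = fetch_info()
        if info.get("processedUpToMutation") == mutation_id:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Fake fetcher standing in for a real index-info call:
# the mutation shows up as processed on the third poll.
_responses = iter(
    [
        {"dimensions": 1024, "vectorCount": 0},
        {"dimensions": 1024, "vectorCount": 0, "processedUpToMutation": "older-id"},
        {"dimensions": 1024, "vectorCount": 55, "processedUpToMutation": "my-mutation-id"},
    ]
)
done = wait_for_mutation(
    lambda: next(_responses), "my-mutation-id", poll_interval_s=0.01
)
print(done)
```

An equality-only check can miss the target if a later mutation has already been processed by the time you poll, so treat this as a starting point rather than a robust criterion. The `wait=True` parameter used throughout this notebook handles the waiting for you.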

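Adding Metadata Indexes

It is common to assist retrieval by supplying metadata filters in queries. In Vectorize, this requires first creating a "metadata index" on your Vectorize index. For this example, we create one on the section field of our documents.

Reference: https://developers.cloudflare.com/vectorize/reference/metadata-filtering/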

r = cfVect.create_metadata_index(
    property_name="section",
    index_type="string",
    index_name=vectorize_index_name,
    wait=True,
)
print(r)

{'mutationId': '7fc5f849-4d35-420c-bb3f-b950a79e48b6'}

Listing Metadata Indexes

r = cfVect.list_metadata_indexes(index_name=vectorize_index_name)
print(r)

[{'propertyName': 'section', 'indexType': 'String'}]

Adding Documents

For this example, we will use LangChain's Wikipedia loader to pull an article about Cloudflare. We will store it in Vectorize and query its contents later.

docs = WikipediaLoader(query="Cloudflare", load_max_docs=2).load()

We will then create some simple chunks, each tagged with metadata naming the article section it belongs to.

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([docs[0].page_content])

running_section = ""
for idx, text in enumerate(texts):
    if text.page_content.startswith("="):
        running_section = text.page_content
        running_section = running_section.replace("=", "").strip()
    else:
        if running_section == "":
            text.metadata = {"section": "Introduction"}
        else:
            text.metadata = {"section": running_section}

print(len(texts))
print(texts[0], "\n\n", texts[-1])

55
page_content='Cloudflare, Inc., is an American company that provides content delivery network services,' metadata={'section': 'Introduction'} 

 page_content='attacks, Cloudflare ended up being attacked as well; Google and other companies eventually' metadata={'section': 'DDoS mitigation'}

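Now we will add the documents to our Vectorize index.

Note: Adding embeddings to Vectorize happens asynchronously, so there is a small delay between adding embeddings and being able to query them. By default, add_documents has a wait=True parameter, which waits for the operation to complete before returning a response. If you do not want the program to wait for the embeddings to become available, set wait=False.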

r = cfVect.add_documents(index_name=vectorize_index_name, documents=texts, wait=True)

print(json.dumps(r)[:300])

["433a614a-2253-4c54-951f-0e40379a52c4", "608a9cb6-ab71-4e5c-8831-ebedeb9749e8", "40a0eead-a781-46a7-a6a3-1940ec57c9ae", "64081e01-12d1-4760-9b3c-84ee1e4ba199", "af465fb9-9e3b-49a6-b00a-6a9eec4fc623", "2898e362-b667-46ab-ac20-651d8e13f2bf", "a2c0095b-2cbc-4724-bbcb-86cd702bfa84", "cc659763-37cb-42cb

Query vector store

We will run some searches on our embeddings. We can specify the search query and the number of top results we want with k.

query_documents = cfVect.similarity_search(
    index_name=vectorize_index_name, query="Workers AI", k=100, return_metadata="none"
)

print(f"{len(query_documents)} results:\n{query_documents[:3]}")

55 results:
[Document(id='24405ae0-c125-4177-a1c2-8b1849c13ab7', metadata={}, page_content="In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within"), Document(id='ca33b19e-4e28-4e1b-8ed7-94f133dbf8a7', metadata={}, page_content='based on queries by leveraging Workers AI.Cloudflare announced plans in September 2024 to launch a'), Document(id='14602058-73fe-4307-a1c2-95956d6392ad', metadata={}, page_content='=== Artificial intelligence ===')]

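Output

If you want to return metadata, pass return_metadata="all" or return_metadata="indexed". The default is "all".

If you want to return the embedding values, pass return_values=True. The default is False. Embeddings are returned in the metadata field under the special _values key.

Note: return_metadata="none" together with return_values=True returns only the _values field in metadata.

Note: If you return metadata or values, the results are limited to the top 20.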

https://developers.cloudflare.com/vectorize/platform/limits/

query_documents = cfVect.similarity_search(
    index_name=vectorize_index_name,
    query="Workers AI",
    return_values=True,
    return_metadata="all",
    k=100,
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")

20 results:
page_content='In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within' metadata={'section': 'Artificial intelligence', '_values': [0.014350891, 0.0053482056, -0.022354126, 0.002948761, 0.010406494, -0.016067505, -0.002029419, -0.023513794, 0.020141602, 0.023742676, 0.01361084, 0.003019333, 0.02748108, -0.023162842, 0.008979797, -0.029373169, -0.03643799, -0.03842163, -0.004463196, 0.021255493, 0.02192688, -0.005947113, -0.060272217, -0.055389404, -0.031188965
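
The effect of the top-20 note can be seen in the two searches above: with return_metadata="none" and k=100, all 55 chunks came back, while requesting metadata and values returned 20. A small sketch of that rule, using only the limit stated in this section; `expected_result_cap` is a hypothetical helper for reasoning about result counts and does not call the API.

```python
MAX_RESULTS_WITH_PAYLOAD = 20  # limit stated in the note above


def expected_result_cap(
    k: int, return_metadata: str = "all", return_values: bool = False
) -> int:
    """Upper bound on results for a query, per the rule described above."""
    returns_payload = return_metadata != "none" or return_values
    return min(k, MAX_RESULTS_WITH_PAYLOAD) if returns_payload else k


print(expected_result_cap(100, return_metadata="none"))  # no payload requested
print(expected_result_cap(100, return_metadata="all", return_values=True))
```

The actual number of results can be lower than this bound when the index holds fewer matching vectors.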

If you'd like the similarity scores to be returned, you can use similarity_search_with_score.

query_documents = cfVect.similarity_search_with_score(
    index_name=vectorize_index_name,
    query="Workers AI",
    k=100,
    return_metadata="all",
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")

20 results:
(Document(id='24405ae0-c125-4177-a1c2-8b1849c13ab7', metadata={'section': 'Artificial intelligence'}, page_content="In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within"), 0.7851709)
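
Since similarity_search_with_score returns (document, score) pairs, a common follow-up is to drop weak matches. The sketch below assumes that a higher score means a closer match, which fits the 'cosine' metric shown in the index configuration earlier. `filter_by_score` is a hypothetical helper, and plain strings stand in for Document objects so the example runs without any Cloudflare access.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def filter_by_score(
    results: List[Tuple[T, float]], min_score: float
) -> List[Tuple[T, float]]:
    """Keep only (document, score) pairs whose score is at least `min_score`."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Stand-in results: strings instead of Document objects,
# with scores copied from outputs shown in this notebook.
sample = [
    ("chunk about Workers AI", 0.7851709),
    ("chunk about headquarters", 0.60426825),
    ("chunk about CDN services", 0.50490546),
]
strong = filter_by_score(sample, min_score=0.6)
print(len(strong))
```

A suitable threshold depends on the embedding model and the data, so the 0.6 used here is only a placeholder.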

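Including D1 for "Raw Values"

All of the add and search methods on CloudflareVectorize support an include_d1 parameter (default=True).

It controls whether raw values are stored and retrieved.

If you do not want to use D1 for this, set include_d1=False. Searches will then return documents with an empty page_content field.

Note: Your D1 table name MUST MATCH your Vectorize index name! If you run create_index with include_d1=True, or on an instance constructed with a D1 database (d1_database_id=...), this D1 table will be created along with your Vectorize index.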

query_documents = cfVect.similarity_search_with_score(
    index_name=vectorize_index_name,
    query="california",
    k=100,
    return_metadata="all",
    include_d1=False,
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")

20 results:
(Document(id='64081e01-12d1-4760-9b3c-84ee1e4ba199', metadata={'section': 'Introduction'}, page_content=''), 0.60426825)
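
To make the table-per-index convention concrete, the sketch below uses a local in-memory SQLite database as a stand-in for D1 (which is built on SQLite) and looks up raw text by vector ID in a table named after the index. The schema (`id`, `text` columns) and the `fetch_raw_texts` helper are assumptions made for illustration; the table the library actually creates may be laid out differently.

```python
import sqlite3
from typing import Dict, Iterable


def fetch_raw_texts(
    conn: sqlite3.Connection, index_name: str, ids: Iterable[str]
) -> Dict[str, str]:
    """Look up raw text by vector ID in a table named after the index."""
    id_list = list(ids)
    if not id_list:
        return {}
    # Index names contain hyphens, so the table identifier must be quoted.
    table = '"' + index_name.replace('"', '""') + '"'
    placeholders = ", ".join("?" for _ in id_list)
    rows = conn.execute(
        f"SELECT id, text FROM {table} WHERE id IN ({placeholders})", id_list
    ).fetchall()
    return {row[0]: row[1] for row in rows}


# Local SQLite database standing in for D1; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "test-langchain-demo" (id TEXT PRIMARY KEY, text TEXT)')
conn.execute(
    'INSERT INTO "test-langchain-demo" VALUES (?, ?)',
    ("doc-1", "Cloudflare's headquarters are in San Francisco, California."),
)
texts_by_id = fetch_raw_texts(conn, "test-langchain-demo", ["doc-1", "missing-id"])
print(texts_by_id)
```

IDs with no stored text are simply absent from the result, which corresponds to the empty page_content seen when raw values are not retrieved.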

Query by turning into retriever

You can also transform the vector store into a retriever for easier use in your chains.

retriever = cfVect.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1, "index_name": vectorize_index_name},
)
# invoke() replaces the deprecated get_relevant_documents()
r = retriever.invoke("california")

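Searching with Metadata Filtering

As mentioned before, Vectorize supports filtered search on indexed metadata fields. Here is an example where we search for chunks whose indexed section metadata field equals Introduction.

More information on searching metadata fields is available here: https://developers.cloudflare.com/vectorize/reference/metadata-filtering/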

query_documents = cfVect.similarity_search_with_score(
    index_name=vectorize_index_name,
    query="California",
    k=100,
    md_filter={"section": "Introduction"},
    return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents[:3])}")

6 results:
 - [(Document(id='64081e01-12d1-4760-9b3c-84ee1e4ba199', metadata={'section': 'Introduction'}, page_content="and other services. Cloudflare's headquarters are in San Francisco, California. According to"), 0.60426825), (Document(id='608a9cb6-ab71-4e5c-8831-ebedeb9749e8', metadata={'section': 'Introduction'}, page_content='network services, cybersecurity, DDoS mitigation, wide area network services, reverse proxies,'), 0.52082914), (Document(id='433a614a-2253-4c54-951f-0e40379a52c4', metadata={'section': 'Introduction'}, page_content='Cloudflare, Inc., is an American company that provides content delivery network services,'), 0.50490546)]

You can do more sophisticated filtering as well:

https://developers.cloudflare.com/vectorize/reference/metadata-filtering/#valid-filter-examples

query_documents = cfVect.similarity_search_with_score(
    index_name=vectorize_index_name,
    query="California",
    k=100,
    md_filter={"section": {"$ne": "Introduction"}},
    return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents[:3])}")

20 results:
 - [(Document(id='daeb7891-ec00-4c09-aa73-fc8e9a226ca8', metadata={}, page_content='== Products =='), 0.56540567), (Document(id='8c91ed93-d306-4cf9-ad1e-157e90a01ddf', metadata={'section': 'History'}, page_content='Since at least 2017, Cloudflare has been using a wall of lava lamps in their San Francisco'), 0.5604333), (Document(id='1400609f-0937-4700-acde-6e770d2dbd12', metadata={'section': 'History'}, page_content='their San Francisco headquarters as a source of randomness for encryption keys, alongside double'), 0.55573463)]

query_documents = cfVect.similarity_search_with_score(
    index_name=vectorize_index_name,
    query="DNS",
    k=100,
    md_filter={"section": {"$in": ["Products", "History"]}},
    return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents)}")

20 results:
 - [(Document(id='253a0987-1118-4ab2-a444-b8a50f0b4a63', metadata={'section': 'Products'}, page_content='protocols such as DNS over HTTPS, SMTP, and HTTP/2 with support for HTTP/2 Server Push. As of 2023,'), 0.7205538), (Document(id='112b61d1-6c34-41d6-a22f-7871bf1cca7b', metadata={'section': 'Products'}, page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content'), 0.58178145), (Document(id='36929a30-32a9-482a-add7-6c109bbf8f82', metadata={'section': 'Products'}, page_content='and a content distribution network to serve content across its network of servers. It supports'), 0.5797795), (Document(id='485ac8dc-c2ad-443a-90fc-8be9e004eaee', metadata={'section': 'History'}, page_content='the New York Stock Exchange under the stock ticker NET. It opened for public trading on September'), 0.5678468), (Document(id='1c7581d5-0b06-45d6-874c-554907f4f7f7', metadata={'section': 'Products'}, page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge'), 0.55722594), (Document(id='f2fd02ac-3bab-4565-a6e2-14d9963e8fd9', metadata={'section': 'History'}, page_content='Cloudflare has acquired web-services and security companies, including StopTheHacker (February'), 0.5558441), (Document(id='1315a8ff-6509-4350-ae84-21e11da282b3', metadata={'section': 'Products'}, page_content='Push. As of 2023, Cloudflare handles an average of 45 million HTTP requests per second.'), 0.55429655), (Document(id='f5b0c9d0-89c2-43ec-a9b7-5a5b376a5a85', metadata={'section': 'Products'}, page_content='It supports transport layer protocols TCP, UDP, QUIC, and many application layer protocols such as'), 0.54969466), (Document(id='cc659763-37cb-42cb-bf09-465df1b5bc1a', metadata={'section': 'History'}, page_content='Cloudflare was founded in July 2009 by Matthew Prince, Lee Holloway, and Michelle Zatlyn. 
Prince'), 0.54691005), (Document(id='b467348b-9a9b-4bf1-9104-27570891c9e4', metadata={'section': 'History'}, page_content='2019, Cloudflare submitted its S-1 filing for an initial public offering on the New York Stock'), 0.533554), (Document(id='7966591b-ff56-4346-aca8-341daece01fc', metadata={'section': 'History'}, page_content='Networks (March 2024), BastionZero (May 2024), and Kivera (October 2024).'), 0.53296596), (Document(id='c7657276-c631-4331-98ec-af308387ea49', metadata={'section': 'Products'}, page_content='Verizon’s October 2024 outage.'), 0.53137076), (Document(id='9418e10c-426b-45fa-a1a4-672074310890', metadata={'section': 'Products'}, page_content='Cloudflare also provides analysis and reports on large-scale outages, including Verizon’s October'), 0.53107977), (Document(id='db5507e2-0103-4275-a9f8-466f977255c0', metadata={'section': 'History'}, page_content='a product of Unspam Technologies that served as some inspiration for the basis of Cloudflare. From'), 0.528889), (Document(id='9d840318-be0e-4cf7-8a60-eaab50d45c9e', metadata={'section': 'History'}, page_content='of Cloudflare. From 2009, the company was venture-capital funded. On August 15, 2019, Cloudflare'), 0.52717584), (Document(id='db9137cc-051b-4b20-8d49-8a32bb2b99a7', metadata={'section': 'History'}, page_content='(December 2021), Vectrix (February 2022), Area 1 Security (February 2022), Nefeli Networks (March'), 0.52209044), (Document(id='dfaffd2f-4492-444d-accf-180b1f841463', metadata={'section': 'Products'}, page_content='As of 2024, Cloudflare servers are powered by AMD EPYC 9684X processors.'), 0.5169676), (Document(id='65bbd754-22d1-435a-860a-9259f6cf7dea', metadata={'section': 'History'}, page_content='(February 2014), CryptoSeal (June 2014), Eager Platform Co. 
(December 2016), Neumob (November'), 0.5132974), (Document(id='1400609f-0937-4700-acde-6e770d2dbd12', metadata={'section': 'History'}, page_content='their San Francisco headquarters as a source of randomness for encryption keys, alongside double'), 0.50999177), (Document(id='b77cef8b-1140-4d92-891b-0048ea70ae3a', metadata={'section': 'History'}, page_content='Neumob (November 2017), S2 Systems (January 2020), Linc (December 2020), Zaraz (December 2021),'), 0.5092492)]
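
The filter shapes used above (a bare value for equality, `$ne`, and `$in`) can be mirrored locally to reason about which records a filter would match. This is a sketch of the semantics as they appear in the outputs of this notebook, not Vectorize's implementation: filtering happens server-side, `matches_filter` is a hypothetical helper, and only the three forms used here are covered. Note that the `$ne` query above returned a heading chunk with empty metadata, which the sketch reproduces by treating a missing field as "not equal".

```python
from typing import Any, Dict


def matches_filter(metadata: Dict[str, Any], md_filter: Dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies every condition in `md_filter`."""
    for field, condition in md_filter.items():
        value = metadata.get(field)  # None when the field is missing
        if not isinstance(condition, dict):
            # A bare value is shorthand for equality.
            condition = {"$eq": condition}
        for op, operand in condition.items():
            if op == "$eq":
                ok = value == operand
            elif op == "$ne":
                ok = value != operand
            elif op == "$in":
                ok = value in operand
            else:
                raise ValueError(f"operator not covered by this sketch: {op}")
            if not ok:
                return False
    return True


print(matches_filter({"section": "Introduction"}, {"section": "Introduction"}))
print(matches_filter({}, {"section": {"$ne": "Introduction"}}))
print(matches_filter({"section": "History"}, {"section": {"$in": ["Products", "History"]}}))
```

A local check like this can be handy in unit tests, for example to confirm that the metadata you attach to chunks will be selectable by the filters you plan to send.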

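Search by Namespace

We can also search for vectors by namespace. To do so, pass a namespaces list (one entry per document) when adding the documents to the vector store.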

https://developers.cloudflare.com/vectorize/reference/metadata-filtering/#namespace-versus-metadata-filtering

namespace_name = f"test-namespace-{uuid.uuid4().hex[:8]}"

new_documents = [
    Document(
        page_content="This is a new namespace specific document!",
        metadata={"section": "Namespace Test1"},
    ),
    Document(
        page_content="This is another namespace specific document!",
        metadata={"section": "Namespace Test2"},
    ),
]

r = cfVect.add_documents(
    index_name=vectorize_index_name,
    documents=new_documents,
    namespaces=[namespace_name] * len(new_documents),
    wait=True,
)

query_documents = cfVect.similarity_search(
    index_name=vectorize_index_name,
    query="California",
    namespace=namespace_name,
)

print(f"{len(query_documents)} results:\n - {str(query_documents)}")

2 results:
 - [Document(id='65c4f7f4-aa4f-46b4-85ba-c90ea18dc7ed', metadata={'section': 'Namespace Test2', '_namespace': 'test-namespace-9cc13b96'}, page_content='This is another namespace specific document!'), Document(id='96350f98-7053-41c7-b6bb-5acdd3ab67bd', metadata={'section': 'Namespace Test1', '_namespace': 'test-namespace-9cc13b96'}, page_content='This is a new namespace specific document!')]

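Search by IDs

We can also retrieve specific records by their IDs. To do so, we need to set the Vectorize index name on the index_name attribute of the vector store instance.

This returns _namespace and _values along with the other metadata.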

sample_ids = [x.id for x in query_documents]

cfVect.index_name = vectorize_index_name

query_documents = cfVect.get_by_ids(
    sample_ids,
)
print(str(query_documents[:3])[:500])

[Document(id='65c4f7f4-aa4f-46b4-85ba-c90ea18dc7ed', metadata={'section': 'Namespace Test2', '_namespace': 'test-namespace-9cc13b96', '_values': [-0.0005841255, 0.014480591, 0.040771484, 0.005218506, 0.015579224, 0.0007543564, -0.005138397, -0.022720337, 0.021835327, 0.038970947, 0.017456055, 0.022705078, 0.013450623, -0.015686035, -0.019119263, -0.01512146, -0.017471313, -0.007183075, -0.054382324, -0.01914978, 0.0005302429, 0.018600464, -0.083740234, -0.006462097, 0.0005598068, 0.024230957, -0

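The namespace is included in metadata under the _namespace field, alongside your other metadata (if you requested it via return_metadata).

Note: You cannot set the _namespace or _values fields in metadata, because they are reserved. They are stripped out during insertion.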
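
Because `_namespace` and `_values` come back inside metadata but are dropped on insert, it can be useful to remove them yourself before re-adding a retrieved document, so that what you send matches what is stored. A minimal sketch; `strip_reserved_fields` is a hypothetical helper, not a library function.

```python
from typing import Any, Dict

RESERVED_METADATA_FIELDS = {"_namespace", "_values"}


def strip_reserved_fields(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `metadata` without the reserved fields."""
    return {k: v for k, v in metadata.items() if k not in RESERVED_METADATA_FIELDS}


cleaned = strip_reserved_fields(
    {"section": "Introduction", "_namespace": "my-namespace", "_values": [0.1, 0.2]}
)
print(cleaned)
```

The function returns a new dictionary, so the metadata on the retrieved document is left untouched.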

Upserts

Vectorize supports upserts, which you can perform by setting upsert=True.

query_documents[0].page_content = "Updated: " + query_documents[0].page_content
print(query_documents[0].page_content)

Updated: This is another namespace specific document!

new_document_id = "12345678910"
new_document = Document(
    id=new_document_id,
    page_content="This is a new document!",
    metadata={"section": "Introduction"},
)

r = cfVect.add_documents(
    index_name=vectorize_index_name,
    documents=[new_document, query_documents[0]],
    upsert=True,
    wait=True,
)

query_documents_updated = cfVect.get_by_ids([new_document_id, query_documents[0].id])

print(str(query_documents_updated[0])[:500])
print(query_documents_updated[0].page_content)
print(query_documents_updated[1].page_content)

page_content='This is a new document!' metadata={'section': 'Introduction', '_namespace': None, '_values': [-0.007522583, 0.0023021698, 0.009963989, 0.031051636, -0.021316528, 0.0048103333, 0.026046753, 0.01348114, 0.026306152, 0.040374756, 0.03225708, 0.007423401, 0.031021118, -0.007347107, -0.034179688, 0.002111435, -0.027191162, -0.020950317, -0.021636963, -0.0030593872, -0.04977417, 0.018859863, -0.08062744, -0.027679443, 0.012512207, 0.0053634644, 0.008079529, -0.010528564, 0.07312012, 0.02
This is a new document!
Updated: This is another namespace specific document!
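
The difference between a plain insert and an upsert can be illustrated with an in-memory dictionary. The sketch assumes that a plain insert leaves records with existing IDs untouched while an upsert overwrites them; check the Vectorize documentation for the exact server-side behavior. `write_records` is a hypothetical function and the dictionary is only a stand-in for the index.

```python
from typing import Dict, List


def write_records(
    store: Dict[str, str], records: Dict[str, str], upsert: bool = False
) -> List[str]:
    """Write id -> text records into `store`; return the IDs actually written."""
    written = []
    for record_id, text in records.items():
        if record_id in store and not upsert:
            continue  # plain insert: keep the existing record
        store[record_id] = text
        written.append(record_id)
    return written


store = {"a": "This is another namespace specific document!"}
updates = {
    "a": "Updated: This is another namespace specific document!",
    "12345678910": "This is a new document!",
}

inserted = write_records(dict(store), updates)  # works on a copy of `store`
upserted = write_records(store, updates, upsert=True)  # mutates `store`
print(inserted, upserted)
```

In the upsert case both IDs are written, which matches the example above where the new document and the updated document are both readable afterwards.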

Deleting Records

We can also delete records by their IDs.

r = cfVect.delete(index_name=vectorize_index_name, ids=sample_ids, wait=True)
print(r)

True

We can then confirm the deletion:

query_documents = cfVect.get_by_ids(sample_ids)
assert len(query_documents) == 0

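Creating from Documents

LangChain stipulates that all vector stores must have a from_documents method to instantiate a new vector store from documents. This is more streamlined than the separate create and add steps shown above.

You can do that as shown here: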

vectorize_index_name = "test-langchain-from-docs"

cfVect = CloudflareVectorize.from_documents(
    account_id=cf_acct_id,
    index_name=vectorize_index_name,
    documents=texts,
    embedding=embedder,
    d1_database_id=d1_database_id,
    d1_api_token=cf_d1_token,
    vectorize_api_token=cf_vectorize_token,
    wait=True,
)

# query for documents
query_documents = cfVect.similarity_search(
    index_name=vectorize_index_name,
    query="Edge Computing",
)

print(f"{len(query_documents)} results:\n{str(query_documents[0])[:300]}")

20 results:
page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content' metadata={'section': 'Products'}

Async Examples

This section shows some async examples.

Creating Indexes

vectorize_index_name1 = f"test-langchain-{uuid.uuid4().hex}"
vectorize_index_name2 = f"test-langchain-{uuid.uuid4().hex}"

# depending on your notebook environment you might need to include these:
# import nest_asyncio
# nest_asyncio.apply()

async_requests = [
    cfVect.acreate_index(index_name=vectorize_index_name1),
    cfVect.acreate_index(index_name=vectorize_index_name2),
]

res = await asyncio.gather(*async_requests);
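
The cells in this section use top-level await, which works in Jupyter but not in a plain Python script. Below is a minimal sketch of the same fan-out pattern for a script, with a placeholder coroutine in place of the real API call: `fake_create_index` is hypothetical and merely sleeps, so the example runs without any Cloudflare access.

```python
import asyncio
from typing import List


async def fake_create_index(index_name: str, delay_s: float) -> dict:
    """Placeholder for an async API call; just sleeps and echoes the name."""
    await asyncio.sleep(delay_s)
    return {"name": index_name}


async def main() -> List[dict]:
    requests = [
        fake_create_index("test-langchain-a", delay_s=0.02),
        fake_create_index("test-langchain-b", delay_s=0.01),
    ]
    # gather runs the coroutines concurrently and returns results in the
    # order the awaitables were passed, not in completion order.
    return await asyncio.gather(*requests)


results = asyncio.run(main())
print([r["name"] for r in results])
```

Because results keep the request order, `results[0]` always belongs to the first request, which is what the later examples rely on when they index into `async_results`.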

Creating Metadata Indexes

async_requests = [
    cfVect.acreate_metadata_index(
        property_name="section",
        index_type="string",
        index_name=vectorize_index_name1,
        wait=True,
    ),
    cfVect.acreate_metadata_index(
        property_name="section",
        index_type="string",
        index_name=vectorize_index_name2,
        wait=True,
    ),
]

await asyncio.gather(*async_requests);

Adding Documents

async_requests = [
    cfVect.aadd_documents(index_name=vectorize_index_name1, documents=texts, wait=True),
    cfVect.aadd_documents(index_name=vectorize_index_name2, documents=texts, wait=True),
]

await asyncio.gather(*async_requests);

Querying/Search

async_requests = [
    cfVect.asimilarity_search(index_name=vectorize_index_name1, query="Workers AI"),
    cfVect.asimilarity_search(index_name=vectorize_index_name2, query="Edge Computing"),
]

async_results = await asyncio.gather(*async_requests);

print(f"{len(async_results[0])} results:\n{str(async_results[0][0])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")

20 results:
page_content='In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within'
20 results:
page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content'

Returning Metadata/Values

async_requests = [
    cfVect.asimilarity_search(
        index_name=vectorize_index_name1,
        query="California",
        return_values=True,
        return_metadata="all",
    ),
    cfVect.asimilarity_search(
        index_name=vectorize_index_name2,
        query="California",
        return_values=True,
        return_metadata="all",
    ),
]

async_results = await asyncio.gather(*async_requests);

print(f"{len(async_results[0])} results:\n{str(async_results[0][0])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")

20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.031219482, -0.018295288, -0.006000519, 0.017532349, 0.016403198, -0.029922485, -0.007133484, 0.004447937, 0.04559326, -0.011405945, 0.034820
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.031219482, -0.018295288, -0.006000519, 0.017532349, 0.016403198, -0.029922485, -0.007133484, 0.004447937, 0.04559326, -0.011405945, 0.034820

Searching with Metadata Filtering

async_requests = [
    cfVect.asimilarity_search(
        index_name=vectorize_index_name1,
        query="Cloudflare services",
        k=2,
        md_filter={"section": "Products"},
        return_metadata="all",
        # return_values=True
    ),
    cfVect.asimilarity_search(
        index_name=vectorize_index_name2,
        query="Cloudflare services",
        k=2,
        md_filter={"section": "Products"},
        return_metadata="all",
        # return_values=True
    ),
]

async_results = await asyncio.gather(*async_requests);

print(f"{len(async_results[0])} results:\n{str(async_results[0][-1])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")

9 results:
page_content='It supports transport layer protocols TCP, UDP, QUIC, and many application layer protocols such as' metadata={'section': 'Products'}
9 results:
page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge' metadata={'section': 'Products'}

Cleanup

Let's finish by deleting all of the indexes we created in this notebook.

arr_indexes = cfVect.list_indexes()
arr_indexes = [x for x in arr_indexes if "test-langchain" in x.get("name")]

arr_async_requests = [
    cfVect.adelete_index(index_name=x.get("name")) for x in arr_indexes
]
await asyncio.gather(*arr_async_requests);

API Reference

https://developers.cloudflare.com/api/resources/vectorize/

https://developers.cloudflare.com/vectorize/
