CloudflareVectorizeVectorStore
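This notebook covers how to get started with the CloudflareVectorize vector store.
Setup
This Python package is a wrapper around Cloudflare's REST API. To interact with the API, you need to provide an API token with the appropriate privileges.
You can create and manage API tokens here: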
https://dash.cloudflare.com/YOUR-ACCT-NUMBER/api-tokens
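Credentials
CloudflareVectorize depends on WorkersAI (if you want to use it for embeddings) and D1 (if you are using it to store and retrieve raw values).
While you can create a single api_token with Edit privileges to all needed resources (WorkersAI, Vectorize, and D1), you may want to follow the principle of "least privilege access" and create separate API tokens for each service.
Note: These service-specific tokens (if provided) take precedence over a global token. You can provide them instead of a global token.
You can also set these parameters as environment variables.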
import os
from dotenv import load_dotenv
load_dotenv(".env")
cf_acct_id = os.getenv("CF_ACCOUNT_ID")
# single "globally scoped" token with WorkersAI, Vectorize & D1
api_token = os.getenv("CF_API_TOKEN")
# OR, separate tokens with access to each service
cf_vectorize_token = os.getenv("CF_VECTORIZE_API_TOKEN")
cf_d1_token = os.getenv("CF_D1_API_TOKEN")
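The precedence rule in the note above can be pictured with a small sketch. The helper resolve_token is hypothetical and is not part of langchain_cloudflare; it only illustrates "service-specific token first, global token as fallback", under the assumption that a missing token is represented as None or an empty string.

```python
from typing import Optional


def resolve_token(service_token: Optional[str], global_token: Optional[str]) -> str:
    """Hypothetical helper: prefer the service-specific token, else the global one."""
    if service_token:
        return service_token
    if global_token:
        return global_token
    raise ValueError("no service-specific token and no global api_token provided")


# A D1-specific token wins over the global token when both are set.
print(resolve_token("d1-token", "global-token"))
# With no service-specific token, the global token is used.
print(resolve_token(None, "global-token"))
```

Passing neither token raises an error in this sketch, which mirrors the kind of configuration mistake worth catching before any request is sent.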
Initialization
import asyncio
import json
import uuid
import warnings
from langchain_cloudflare.embeddings import (
CloudflareWorkersAIEmbeddings,
)
from langchain_cloudflare.vectorstores import (
CloudflareVectorize,
)
from langchain_community.document_loaders import WikipediaLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
warnings.filterwarnings("ignore")
# name your vectorize index
vectorize_index_name = f"test-langchain-{uuid.uuid4().hex}"
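Embeddings
To store embeddings and perform semantic search and retrieval, you must embed your raw values as vectors. Specify an embedding model available on WorkersAI: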
https://developers.cloudflare.com/workers-ai/models/
MODEL_WORKERSAI = "@cf/baai/bge-large-en-v1.5"
cf_ai_token = os.getenv(
"CF_AI_API_TOKEN"
) # needed if you want to use workersAI for embeddings
embedder = CloudflareWorkersAIEmbeddings(
account_id=cf_acct_id, api_token=cf_ai_token, model_name=MODEL_WORKERSAI
)
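Raw Values with D1
Vectorize only stores embeddings, metadata, and namespaces. If you want to store and retrieve raw values, you must leverage Cloudflare's SQL database D1.
You can create a database and retrieve its ID here:
https://dash.cloudflare.com/YOUR-ACCT-NUMBER/workers/d1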
# provide the id of your D1 Database
d1_database_id = os.getenv("CF_D1_DATABASE_ID")
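CloudflareVectorize Class
Now we can create the CloudflareVectorize instance. Here we pass:
- the embedding instance from earlier
- the account ID
- a global API token for all services (WorkersAI, Vectorize, D1)
- individual API tokens for each service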
cfVect = CloudflareVectorize(
embedding=embedder,
account_id=cf_acct_id,
d1_api_token=cf_d1_token, # (Optional if using global token)
vectorize_api_token=cf_vectorize_token, # (Optional if using global token)
d1_database_id=d1_database_id, # (Optional if not using D1)
)
Cleanup
Before we get started, let's delete any test-langchain* indexes we have for this walkthrough.
# depending on your notebook environment you might need to include:
# import nest_asyncio
# nest_asyncio.apply()
arr_indexes = cfVect.list_indexes()
arr_indexes = [x for x in arr_indexes if "test-langchain" in x.get("name")]
arr_async_requests = [
cfVect.adelete_index(index_name=x.get("name")) for x in arr_indexes
]
await asyncio.gather(*arr_async_requests);
Gotchas
A D1 database ID is provided, but no "global" api_token and no d1_api_token:
try:
    cfVect = CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        # api_token=api_token, # (Optional if using service-specific token)
        ai_api_token=cf_ai_token,  # (Optional if using global token)
        # d1_api_token=cf_d1_token, # (Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  # (Optional if using global token)
        d1_database_id=d1_database_id,  # (Optional if not using D1)
    )
except Exception as e:
    print(str(e))
`d1_database_id` provided, but no global `api_token` provided and no `d1_api_token` provided.
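Manage vector store
Creating an Index
Let's start by creating an index, deleting it first if it already exists. If the index does not exist, Cloudflare returns an error, which is why the delete call is wrapped in try/except.
%%capture
try:
    cfVect.delete_index(index_name=vectorize_index_name, wait=True)
except Exception as e:
    print(e)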
r = cfVect.create_index(
index_name=vectorize_index_name, description="A Test Vectorize Index", wait=True
)
print(r)
{'created_on': '2025-05-13T05:38:04.487284Z', 'modified_on': '2025-05-13T05:38:04.487284Z', 'name': 'test-langchain-5c177bb404f74d438c916462ad73d27a', 'description': 'A Test Vectorize Index', 'config': {'dimensions': 1024, 'metric': 'cosine'}}
Listing Indexes
Now, we can list the indexes on our account.
indexes = cfVect.list_indexes()
indexes = [x for x in indexes if "test-langchain" in x.get("name")]
print(indexes)
[{'created_on': '2025-05-13T05:38:04.487284Z', 'modified_on': '2025-05-13T05:38:04.487284Z', 'name': 'test-langchain-5c177bb404f74d438c916462ad73d27a', 'description': 'A Test Vectorize Index', 'config': {'dimensions': 1024, 'metric': 'cosine'}}]
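Get Index Info
We can also get certain indexes and retrieve more granular information about an index.
This call returns a processedUpToMutation, which can be used to track the status of operations such as creating indexes and adding or deleting records.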
r = cfVect.get_index_info(index_name=vectorize_index_name)
print(r)
{'dimensions': 1024, 'vectorCount': 0}
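Adding Metadata Indexes
It is common to assist retrieval by supplying metadata filters in queries. In Vectorize, this is accomplished by first creating a "metadata index" on your Vectorize index. In our example, we will create one on the section field of our documents.
Reference: https://developers.cloudflare.com/vectorize/reference/metadata-filtering/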
r = cfVect.create_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name,
wait=True,
)
print(r)
{'mutationId': '7fc5f849-4d35-420c-bb3f-b950a79e48b6'}
List Metadata Indexes
r = cfVect.list_metadata_indexes(index_name=vectorize_index_name)
print(r)
[{'propertyName': 'section', 'indexType': 'String'}]
Adding Documents
For this example, we will use LangChain's Wikipedia loader to pull an article about Cloudflare. We will store this article in Vectorize and query its contents later.
docs = WikipediaLoader(query="Cloudflare", load_max_docs=2).load()
We will then create some simple chunks with metadata based on the section each chunk belongs to.
text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size=100,
chunk_overlap=20,
length_function=len,
is_separator_regex=False,
)
texts = text_splitter.create_documents([docs[0].page_content])
running_section = ""
for idx, text in enumerate(texts):
    if text.page_content.startswith("="):
        running_section = text.page_content
        running_section = running_section.replace("=", "").strip()
    else:
        if running_section == "":
            text.metadata = {"section": "Introduction"}
        else:
            text.metadata = {"section": running_section}
print(len(texts))
print(texts[0], "\n\n", texts[-1])
55
page_content='Cloudflare, Inc., is an American company that provides content delivery network services,' metadata={'section': 'Introduction'}
page_content='attacks, Cloudflare ended up being attacked as well; Google and other companies eventually' metadata={'section': 'DDoS mitigation'}
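The section-tagging loop above can be tried without downloading anything. The sketch below applies the same rule to a few hand-written strings; tag_sections is a hypothetical helper that returns plain tuples instead of mutating Document objects. Note that heading chunks update the running section but receive no metadata themselves, which is why heading chunks show metadata={} in query results.

```python
from typing import Dict, List, Tuple


def tag_sections(chunks: List[str]) -> List[Tuple[str, Dict[str, str]]]:
    """Assign each chunk the most recent wiki-style heading as its section."""
    running_section = ""
    tagged = []
    for chunk in chunks:
        if chunk.startswith("="):
            # Heading chunk: remember the section name; the heading gets no metadata.
            running_section = chunk.replace("=", "").strip()
            tagged.append((chunk, {}))
        else:
            # Chunks before the first heading fall under "Introduction".
            tagged.append((chunk, {"section": running_section or "Introduction"}))
    return tagged


sample = [
    "Cloudflare, Inc., is an American company",
    "== History ==",
    "Cloudflare was founded in July 2009",
]
for text, metadata in tag_sections(sample):
    print(metadata, text)
```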
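Now we will add documents to our Vectorize index.
Note: Adding embeddings to Vectorize happens asynchronously, meaning there will be a small delay between adding the embeddings and being able to query them. By default, add_documents has a wait=True parameter, which waits for this operation to complete before returning a response. If you do not want the program to wait for the embeddings to become available, you can set this to wait=False.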
r = cfVect.add_documents(index_name=vectorize_index_name, documents=texts, wait=True)
print(json.dumps(r)[:300])
["433a614a-2253-4c54-951f-0e40379a52c4", "608a9cb6-ab71-4e5c-8831-ebedeb9749e8", "40a0eead-a781-46a7-a6a3-1940ec57c9ae", "64081e01-12d1-4760-9b3c-84ee1e4ba199", "af465fb9-9e3b-49a6-b00a-6a9eec4fc623", "2898e362-b667-46ab-ac20-651d8e13f2bf", "a2c0095b-2cbc-4724-bbcb-86cd702bfa84", "cc659763-37cb-42cb
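One way to picture what waiting for an asynchronous write involves is a polling loop around the index info. The sketch below is not the library's implementation: wait_for_mutation and the fetch_info callable are hypothetical, and comparing processedUpToMutation for exact equality is a simplification that only works when no later mutation has been processed in the meantime.

```python
import time
from typing import Callable, Dict


def wait_for_mutation(
    fetch_info: Callable[[], Dict[str, str]],
    mutation_id: str,
    poll_interval: float = 1.0,
    timeout: float = 60.0,
) -> bool:
    """Poll until the index reports the given mutation as processed, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_info().get("processedUpToMutation") == mutation_id:
            return True
        time.sleep(poll_interval)
    return False


# Fake index-info source that reports the mutation as processed on the third call.
responses = iter(
    [
        {"processedUpToMutation": "m0"},
        {"processedUpToMutation": "m0"},
        {"processedUpToMutation": "m1"},
    ]
)
print(wait_for_mutation(lambda: next(responses), "m1", poll_interval=0.0))
```

In real use, fetch_info would wrap a call such as get_index_info for the index being written to, and the timeout would be tuned to the size of the batch.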
Query vector store
We will do some searches on our embeddings. We can specify our search query and the number of top results we want with k.
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name, query="Workers AI", k=100, return_metadata="none"
)
print(f"{len(query_documents)} results:\n{query_documents[:3]}")
55 results:
[Document(id='24405ae0-c125-4177-a1c2-8b1849c13ab7', metadata={}, page_content="In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within"), Document(id='ca33b19e-4e28-4e1b-8ed7-94f133dbf8a7', metadata={}, page_content='based on queries by leveraging Workers AI.Cloudflare announced plans in September 2024 to launch a'), Document(id='14602058-73fe-4307-a1c2-95956d6392ad', metadata={}, page_content='=== Artificial intelligence ===')]
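Output
If you want to return metadata, you can pass return_metadata="all" | 'indexed'. The default is all.
If you want to return the embedding values, you can pass return_values=True. The default is False. Embeddings are returned in the metadata field under the special _values field.
Note: return_metadata="none" and return_values=True will return only the _values field in metadata.
Note: If you return metadata or values, the results are limited to the top 20.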
https://developers.cloudflare.com/vectorize/platform/limits/
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name,
query="Workers AI",
return_values=True,
return_metadata="all",
k=100,
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")
20 results:
page_content='In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within' metadata={'section': 'Artificial intelligence', '_values': [0.014350891, 0.0053482056, -0.022354126, 0.002948761, 0.010406494, -0.016067505, -0.002029419, -0.023513794, 0.020141602, 0.023742676, 0.01361084, 0.003019333, 0.02748108, -0.023162842, 0.008979797, -0.029373169, -0.03643799, -0.03842163, -0.004463196, 0.021255493, 0.02192688, -0.005947113, -0.060272217, -0.055389404, -0.031188965
If you would like the similarity scores to be returned, you can use similarity_search_with_score.
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="Workers AI",
k=100,
return_metadata="all",
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")
20 results:
(Document(id='24405ae0-c125-4177-a1c2-8b1849c13ab7', metadata={'section': 'Artificial intelligence'}, page_content="In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within"), 0.7851709)
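The index in this walkthrough was created with 'metric': 'cosine' (see the create_index output), so the returned scores are presumably cosine similarities between the query embedding and each stored embedding. A minimal pure-Python sketch of that metric, using toy three-dimensional vectors rather than real 1024-dimensional embeddings:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
print(cosine_similarity(query, [2.0, 0.0, 2.0]))  # same direction as the query
print(cosine_similarity(query, [0.0, 1.0, 0.0]))  # orthogonal to the query
```

Vectors pointing the same way score close to 1 regardless of length, while unrelated (orthogonal) vectors score close to 0.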
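Including D1 for "Raw Values"
All of the add and search methods on CloudflareVectorize support an include_d1 parameter (default=True).
This configures whether you want to store/retrieve raw values.
If you do not want to use D1 for this, you can set include_d1=False. Documents are then returned with an empty page_content field.
Note: Your D1 table name MUST MATCH your Vectorize index name! If you run 'create_index' with include_d1=True or cfVect(d1_database=...,), this D1 table will be created along with your Vectorize index.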
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="california",
k=100,
return_metadata="all",
include_d1=False,
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")
20 results:
(Document(id='64081e01-12d1-4760-9b3c-84ee1e4ba199', metadata={'section': 'Introduction'}, page_content=''), 0.60426825)
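The split between vector storage and raw-value storage can be illustrated locally. In the sketch below, an in-memory SQLite database stands in for D1 and the table is named after the index, as the note above requires. The column names id and text, and the helper fetch_raw_values, are assumptions made for illustration and are not the schema or API that langchain_cloudflare uses.

```python
import sqlite3
from typing import Dict, List

# In-memory SQLite stands in for D1; the table is named after the index.
index_name = "test-langchain-demo"
conn = sqlite3.connect(":memory:")
conn.execute(f'CREATE TABLE "{index_name}" (id TEXT PRIMARY KEY, text TEXT)')
conn.executemany(
    f'INSERT INTO "{index_name}" (id, text) VALUES (?, ?)',
    [
        ("a1", "Cloudflare's headquarters are in San Francisco"),
        ("b2", "== Products =="),
    ],
)


def fetch_raw_values(ids: List[str]) -> Dict[str, str]:
    """Look up raw text for the ids returned by a vector search."""
    if not ids:
        return {}
    placeholders = ", ".join("?" for _ in ids)
    rows = conn.execute(
        f'SELECT id, text FROM "{index_name}" WHERE id IN ({placeholders})', ids
    )
    return dict(rows.fetchall())


# Pretend the vector search returned these ids; "zz" has no stored text.
matched_ids = ["a1", "zz"]
raw = fetch_raw_values(matched_ids)
print([(doc_id, raw.get(doc_id, "")) for doc_id in matched_ids])
```

An id with no row in the table ends up with empty text, which corresponds to the empty page_content seen when raw values are not fetched.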
Query by turning into retriever
You can also transform the vector store into a retriever for easier usage in your chains.
retriever = cfVect.as_retriever(
search_type="similarity",
search_kwargs={"k": 1, "index_name": vectorize_index_name},
)
r = retriever.invoke("california")
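Searching with Metadata Filtering
As mentioned before, Vectorize supports filtered search via filtering on indexed metadata fields. Here is an example where we search for Introduction values within the indexed section metadata field.
More information on searching metadata fields is here: https://developers.cloudflare.com/vectorize/reference/metadata-filtering/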
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="California",
k=100,
md_filter={"section": "Introduction"},
return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents[:3])}")
6 results:
- [(Document(id='64081e01-12d1-4760-9b3c-84ee1e4ba199', metadata={'section': 'Introduction'}, page_content="and other services. Cloudflare's headquarters are in San Francisco, California. According to"), 0.60426825), (Document(id='608a9cb6-ab71-4e5c-8831-ebedeb9749e8', metadata={'section': 'Introduction'}, page_content='network services, cybersecurity, DDoS mitigation, wide area network services, reverse proxies,'), 0.52082914), (Document(id='433a614a-2253-4c54-951f-0e40379a52c4', metadata={'section': 'Introduction'}, page_content='Cloudflare, Inc., is an American company that provides content delivery network services,'), 0.50490546)]
You can do more sophisticated filtering as well:
https://developers.cloudflare.com/vectorize/reference/metadata-filtering/#valid-filter-examples
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="California",
k=100,
md_filter={"section": {"$ne": "Introduction"}},
return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents[:3])}")
20 results:
- [(Document(id='daeb7891-ec00-4c09-aa73-fc8e9a226ca8', metadata={}, page_content='== Products =='), 0.56540567), (Document(id='8c91ed93-d306-4cf9-ad1e-157e90a01ddf', metadata={'section': 'History'}, page_content='Since at least 2017, Cloudflare has been using a wall of lava lamps in their San Francisco'), 0.5604333), (Document(id='1400609f-0937-4700-acde-6e770d2dbd12', metadata={'section': 'History'}, page_content='their San Francisco headquarters as a source of randomness for encryption keys, alongside double'), 0.55573463)]
query_documents = cfVect.similarity_search_with_score(
index_name=vectorize_index_name,
query="DNS",
k=100,
md_filter={"section": {"$in": ["Products", "History"]}},
return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents)}")
20 results:
- [(Document(id='253a0987-1118-4ab2-a444-b8a50f0b4a63', metadata={'section': 'Products'}, page_content='protocols such as DNS over HTTPS, SMTP, and HTTP/2 with support for HTTP/2 Server Push. As of 2023,'), 0.7205538), (Document(id='112b61d1-6c34-41d6-a22f-7871bf1cca7b', metadata={'section': 'Products'}, page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content'), 0.58178145), (Document(id='36929a30-32a9-482a-add7-6c109bbf8f82', metadata={'section': 'Products'}, page_content='and a content distribution network to serve content across its network of servers. It supports'), 0.5797795), (Document(id='485ac8dc-c2ad-443a-90fc-8be9e004eaee', metadata={'section': 'History'}, page_content='the New York Stock Exchange under the stock ticker NET. It opened for public trading on September'), 0.5678468), (Document(id='1c7581d5-0b06-45d6-874c-554907f4f7f7', metadata={'section': 'Products'}, page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge'), 0.55722594), (Document(id='f2fd02ac-3bab-4565-a6e2-14d9963e8fd9', metadata={'section': 'History'}, page_content='Cloudflare has acquired web-services and security companies, including StopTheHacker (February'), 0.5558441), (Document(id='1315a8ff-6509-4350-ae84-21e11da282b3', metadata={'section': 'Products'}, page_content='Push. As of 2023, Cloudflare handles an average of 45 million HTTP requests per second.'), 0.55429655), (Document(id='f5b0c9d0-89c2-43ec-a9b7-5a5b376a5a85', metadata={'section': 'Products'}, page_content='It supports transport layer protocols TCP, UDP, QUIC, and many application layer protocols such as'), 0.54969466), (Document(id='cc659763-37cb-42cb-bf09-465df1b5bc1a', metadata={'section': 'History'}, page_content='Cloudflare was founded in July 2009 by Matthew Prince, Lee Holloway, and Michelle Zatlyn. 
Prince'), 0.54691005), (Document(id='b467348b-9a9b-4bf1-9104-27570891c9e4', metadata={'section': 'History'}, page_content='2019, Cloudflare submitted its S-1 filing for an initial public offering on the New York Stock'), 0.533554), (Document(id='7966591b-ff56-4346-aca8-341daece01fc', metadata={'section': 'History'}, page_content='Networks (March 2024), BastionZero (May 2024), and Kivera (October 2024).'), 0.53296596), (Document(id='c7657276-c631-4331-98ec-af308387ea49', metadata={'section': 'Products'}, page_content='Verizon’s October 2024 outage.'), 0.53137076), (Document(id='9418e10c-426b-45fa-a1a4-672074310890', metadata={'section': 'Products'}, page_content='Cloudflare also provides analysis and reports on large-scale outages, including Verizon’s October'), 0.53107977), (Document(id='db5507e2-0103-4275-a9f8-466f977255c0', metadata={'section': 'History'}, page_content='a product of Unspam Technologies that served as some inspiration for the basis of Cloudflare. From'), 0.528889), (Document(id='9d840318-be0e-4cf7-8a60-eaab50d45c9e', metadata={'section': 'History'}, page_content='of Cloudflare. From 2009, the company was venture-capital funded. On August 15, 2019, Cloudflare'), 0.52717584), (Document(id='db9137cc-051b-4b20-8d49-8a32bb2b99a7', metadata={'section': 'History'}, page_content='(December 2021), Vectrix (February 2022), Area 1 Security (February 2022), Nefeli Networks (March'), 0.52209044), (Document(id='dfaffd2f-4492-444d-accf-180b1f841463', metadata={'section': 'Products'}, page_content='As of 2024, Cloudflare servers are powered by AMD EPYC 9684X processors.'), 0.5169676), (Document(id='65bbd754-22d1-435a-860a-9259f6cf7dea', metadata={'section': 'History'}, page_content='(February 2014), CryptoSeal (June 2014), Eager Platform Co. 
(December 2016), Neumob (November'), 0.5132974), (Document(id='1400609f-0937-4700-acde-6e770d2dbd12', metadata={'section': 'History'}, page_content='their San Francisco headquarters as a source of randomness for encryption keys, alongside double'), 0.50999177), (Document(id='b77cef8b-1140-4d92-891b-0048ea70ae3a', metadata={'section': 'History'}, page_content='Neumob (November 2017), S2 Systems (January 2020), Linc (December 2020), Zaraz (December 2021),'), 0.5092492)]
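To make the filter shapes used above concrete, here is a small pure-Python sketch of how equality, $ne, and $in conditions can be evaluated against a metadata dictionary. The matches function is hypothetical and only covers the operators shown in this notebook; the actual filtering happens inside Vectorize. It treats a missing field as "not equal", which is consistent with the $ne output above, where a heading chunk with empty metadata is returned.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], md_filter: Dict[str, Any]) -> bool:
    """Return True if metadata satisfies every condition in md_filter."""
    for field, condition in md_filter.items():
        value = metadata.get(field)
        # A bare value is shorthand for an equality test.
        if not isinstance(condition, dict):
            condition = {"$eq": condition}
        for op, operand in condition.items():
            if op == "$eq":
                ok = value == operand
            elif op == "$ne":
                ok = value != operand
            elif op == "$in":
                ok = value in operand
            else:
                raise ValueError(f"operator not covered by this sketch: {op}")
            if not ok:
                return False
    return True


docs = [{"section": "Introduction"}, {"section": "History"}, {}]
print([d for d in docs if matches(d, {"section": {"$ne": "Introduction"}})])
print([d for d in docs if matches(d, {"section": {"$in": ["Products", "History"]}})])
```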
Search by Namespace
We can also search for vectors by namespace. We just need to add it to the namespaces array when adding the documents to our vector database.
namespace_name = f"test-namespace-{uuid.uuid4().hex[:8]}"
new_documents = [
Document(
page_content="This is a new namespace specific document!",
metadata={"section": "Namespace Test1"},
),
Document(
page_content="This is another namespace specific document!",
metadata={"section": "Namespace Test2"},
),
]
r = cfVect.add_documents(
index_name=vectorize_index_name,
documents=new_documents,
namespaces=[namespace_name] * len(new_documents),
wait=True,
)
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name,
query="California",
namespace=namespace_name,
)
print(f"{len(query_documents)} results:\n - {str(query_documents)}")
2 results:
- [Document(id='65c4f7f4-aa4f-46b4-85ba-c90ea18dc7ed', metadata={'section': 'Namespace Test2', '_namespace': 'test-namespace-9cc13b96'}, page_content='This is another namespace specific document!'), Document(id='96350f98-7053-41c7-b6bb-5acdd3ab67bd', metadata={'section': 'Namespace Test1', '_namespace': 'test-namespace-9cc13b96'}, page_content='This is a new namespace specific document!')]
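Search by IDs
We can also retrieve specific records for specific IDs. To do so, we need to set the Vectorize index name on the index_name Vectorize state parameter.
This will return both _namespace and _values as well as other metadata.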
sample_ids = [x.id for x in query_documents]
cfVect.index_name = vectorize_index_name
query_documents = cfVect.get_by_ids(
sample_ids,
)
print(str(query_documents[:3])[:500])
[Document(id='65c4f7f4-aa4f-46b4-85ba-c90ea18dc7ed', metadata={'section': 'Namespace Test2', '_namespace': 'test-namespace-9cc13b96', '_values': [-0.0005841255, 0.014480591, 0.040771484, 0.005218506, 0.015579224, 0.0007543564, -0.005138397, -0.022720337, 0.021835327, 0.038970947, 0.017456055, 0.022705078, 0.013450623, -0.015686035, -0.019119263, -0.01512146, -0.017471313, -0.007183075, -0.054382324, -0.01914978, 0.0005302429, 0.018600464, -0.083740234, -0.006462097, 0.0005598068, 0.024230957, -0
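The namespace is included in the _namespace field in metadata, along with your other metadata (if you requested it in return_metadata).
Note: You cannot set the _namespace or _values fields in metadata, as they are reserved. They will be stripped out during the insert process.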
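The effect of the reserved-field rule can be shown with a few lines of Python. The helper strip_reserved below is hypothetical and only illustrates the described behavior: user-supplied _namespace and _values keys do not survive insertion, while all other metadata is kept.

```python
from typing import Any, Dict

RESERVED_FIELDS = ("_namespace", "_values")


def strip_reserved(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of metadata without the reserved keys."""
    return {key: value for key, value in metadata.items() if key not in RESERVED_FIELDS}


print(strip_reserved({"section": "History", "_namespace": "mine", "_values": [0.1]}))
```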
Upserts
Vectorize supports upserts, which you can perform by setting upsert=True.
query_documents[0].page_content = "Updated: " + query_documents[0].page_content
print(query_documents[0].page_content)
Updated: This is another namespace specific document!
new_document_id = "12345678910"
new_document = Document(
id=new_document_id,
page_content="This is a new document!",
metadata={"section": "Introduction"},
)
r = cfVect.add_documents(
index_name=vectorize_index_name,
documents=[new_document, query_documents[0]],
upsert=True,
wait=True,
)
query_documents_updated = cfVect.get_by_ids([new_document_id, query_documents[0].id])
print(str(query_documents_updated[0])[:500])
print(query_documents_updated[0].page_content)
print(query_documents_updated[1].page_content)
page_content='This is a new document!' metadata={'section': 'Introduction', '_namespace': None, '_values': [-0.007522583, 0.0023021698, 0.009963989, 0.031051636, -0.021316528, 0.0048103333, 0.026046753, 0.01348114, 0.026306152, 0.040374756, 0.03225708, 0.007423401, 0.031021118, -0.007347107, -0.034179688, 0.002111435, -0.027191162, -0.020950317, -0.021636963, -0.0030593872, -0.04977417, 0.018859863, -0.08062744, -0.027679443, 0.012512207, 0.0053634644, 0.008079529, -0.010528564, 0.07312012, 0.02
This is a new document!
Updated: This is another namespace specific document!
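The difference between a plain insert and an upsert can be sketched with a toy in-memory store. This assumes, as an illustration rather than a statement about the library, that a plain insert leaves a record with an existing ID untouched while an upsert overwrites it. The store dictionary and the add function are hypothetical.

```python
from typing import Dict

store: Dict[str, str] = {"doc-1": "This is another namespace specific document!"}


def add(records: Dict[str, str], upsert: bool = False) -> None:
    """Toy in-memory store: upsert overwrites existing ids, plain insert skips them."""
    for record_id, text in records.items():
        if upsert or record_id not in store:
            store[record_id] = text


updated = "Updated: This is another namespace specific document!"

# Plain insert: the existing id is skipped, so the stored text is unchanged.
add({"doc-1": updated})
print(store["doc-1"])

# Upsert: the existing id is overwritten and the new id is added.
add({"doc-1": updated, "doc-2": "This is a new document!"}, upsert=True)
print(store["doc-1"])
print(store["doc-2"])
```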
Deleting Records
We can delete records by their IDs as well:
r = cfVect.delete(index_name=vectorize_index_name, ids=sample_ids, wait=True)
print(r)
True
And to confirm deletion:
query_documents = cfVect.get_by_ids(sample_ids)
assert len(query_documents) == 0
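Creating from Documents
LangChain stipulates that all vector stores must have a from_documents method to instantiate a new vector store from documents. This is a more streamlined method than the individual create and add steps shown above.
You can do that as shown here: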
vectorize_index_name = "test-langchain-from-docs"
cfVect = CloudflareVectorize.from_documents(
account_id=cf_acct_id,
index_name=vectorize_index_name,
documents=texts,
embedding=embedder,
d1_database_id=d1_database_id,
d1_api_token=cf_d1_token,
vectorize_api_token=cf_vectorize_token,
wait=True,
)
# query for documents
query_documents = cfVect.similarity_search(
index_name=vectorize_index_name,
query="Edge Computing",
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:300]}")
20 results:
page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content' metadata={'section': 'Products'}
Async Examples
This section shows some asynchronous examples.
Creating Indexes
vectorize_index_name1 = f"test-langchain-{uuid.uuid4().hex}"
vectorize_index_name2 = f"test-langchain-{uuid.uuid4().hex}"
# depending on your notebook environment you might need to include these:
# import nest_asyncio
# nest_asyncio.apply()
async_requests = [
cfVect.acreate_index(index_name=vectorize_index_name1),
cfVect.acreate_index(index_name=vectorize_index_name2),
]
res = await asyncio.gather(*async_requests);
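The pattern used here, building a list of coroutines and awaiting them together, can be tried without any Cloudflare credentials. In the sketch below, fake_create_index is a hypothetical stand-in that sleeps instead of making a network call. It shows that asyncio.gather runs the coroutines concurrently and returns results in the order the awaitables were passed, not the order in which they finish.

```python
import asyncio
from typing import List


async def fake_create_index(index_name: str, delay: float) -> str:
    """Stand-in for a network call; sleeps instead of contacting Cloudflare."""
    await asyncio.sleep(delay)
    return f"created {index_name}"


async def main() -> List[str]:
    # Both coroutines run concurrently; results come back in argument order.
    return await asyncio.gather(
        fake_create_index("index-1", 0.02),
        fake_create_index("index-2", 0.01),
    )


results = asyncio.run(main())
print(results)
```

In a plain script, asyncio.run starts the event loop as shown; inside a notebook, where a loop is already running, the top-level await used in the cells of this walkthrough plays the same role.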
Creating Metadata Indexes
async_requests = [
cfVect.acreate_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name1,
wait=True,
),
cfVect.acreate_metadata_index(
property_name="section",
index_type="string",
index_name=vectorize_index_name2,
wait=True,
),
]
await asyncio.gather(*async_requests);
Adding Documents
async_requests = [
cfVect.aadd_documents(index_name=vectorize_index_name1, documents=texts, wait=True),
cfVect.aadd_documents(index_name=vectorize_index_name2, documents=texts, wait=True),
]
await asyncio.gather(*async_requests);
Querying/Search
async_requests = [
cfVect.asimilarity_search(index_name=vectorize_index_name1, query="Workers AI"),
cfVect.asimilarity_search(index_name=vectorize_index_name2, query="Edge Computing"),
]
async_results = await asyncio.gather(*async_requests);
print(f"{len(async_results[0])} results:\n{str(async_results[0][0])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
20 results:
page_content='In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within'
20 results:
page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content'
Returning Metadata/Values
async_requests = [
cfVect.asimilarity_search(
index_name=vectorize_index_name1,
query="California",
return_values=True,
return_metadata="all",
),
cfVect.asimilarity_search(
index_name=vectorize_index_name2,
query="California",
return_values=True,
return_metadata="all",
),
]
async_results = await asyncio.gather(*async_requests);
print(f"{len(async_results[0])} results:\n{str(async_results[0][0])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.031219482, -0.018295288, -0.006000519, 0.017532349, 0.016403198, -0.029922485, -0.007133484, 0.004447937, 0.04559326, -0.011405945, 0.034820
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.031219482, -0.018295288, -0.006000519, 0.017532349, 0.016403198, -0.029922485, -0.007133484, 0.004447937, 0.04559326, -0.011405945, 0.034820
Searching with Metadata Filtering
async_requests = [
cfVect.asimilarity_search(
index_name=vectorize_index_name1,
query="Cloudflare services",
k=2,
md_filter={"section": "Products"},
return_metadata="all",
# return_values=True
),
cfVect.asimilarity_search(
index_name=vectorize_index_name2,
query="Cloudflare services",
k=2,
md_filter={"section": "Products"},
return_metadata="all",
# return_values=True
),
]
async_results = await asyncio.gather(*async_requests);
print(f"{len(async_results[0])} results:\n{str(async_results[0][-1])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
9 results:
page_content='It supports transport layer protocols TCP, UDP, QUIC, and many application layer protocols such as' metadata={'section': 'Products'}
9 results:
page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge' metadata={'section': 'Products'}
Cleanup
Let's finish by deleting all of the indexes we created in this notebook.
arr_indexes = cfVect.list_indexes()
arr_indexes = [x for x in arr_indexes if "test-langchain" in x.get("name")]
arr_async_requests = [
cfVect.adelete_index(index_name=x.get("name")) for x in arr_indexes
]
await asyncio.gather(*arr_async_requests);
API Reference
https://developers.cloudflare.com/api/resources/vectorize/
https://developers.cloudflare.com/vectorize/