
IBM Db2 Vector Store and Vector Search

LangChain's Db2 integration (langchain-db2) provides vector store and vector search capabilities for the IBM relational database Db2, version v12.1.2 and above. It is distributed under the MIT license. Users can use the provided implementation as-is or customize it for their specific needs. Key features include:

  • Vector storage with metadata
  • Vector similarity search and maximal marginal relevance (MMR) search, with metadata filtering options
  • Support for dot-product, cosine, and Euclidean distance metrics
  • Performance optimization via index creation and approximate nearest-neighbor search (coming soon)

Setup

Install the `langchain-db2` package, the integration package for the Db2 LangChain vector store and search.

Installing the package should also pull in its dependencies, such as `langchain-core` and `ibm_db`.

# pip install -U langchain-db2

Connect to the Db2 vector store

The example code below shows how to connect to a Db2 database. In addition to the dependencies above, you will need a running Db2 database instance (version v12.1.2 or above, which supports the vector data type).

import ibm_db
import ibm_db_dbi

database = ""
username = ""
password = ""

try:
    connection = ibm_db_dbi.connect(database, username, password)
    print("Connection successful!")
except Exception as e:
    print("Connection failed:", e)
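Because the vector data type requires Db2 v12.1.2 or later, it can be worth verifying the server level right after connecting. The sketch below is an optional, illustrative check: `SYSIBMADM.ENV_INST_INFO` is a standard Db2 LUW administrative view, but the helper's parsing of the `SERVICE_LEVEL` string is an assumption about its usual format.

```python
import re


def is_vector_capable(service_level: str) -> bool:
    """Rough check that a SERVICE_LEVEL string such as 'DB2 v12.1.2.0'
    reports at least v12.1.2 (illustrative helper, not part of langchain-db2)."""
    m = re.search(r"v(\d+)\.(\d+)\.(\d+)", service_level)
    if not m:
        return False
    return tuple(int(x) for x in m.groups()) >= (12, 1, 2)


# With the live connection from above you could run (commented, needs a server):
# cur = connection.cursor()
# cur.execute("SELECT SERVICE_LEVEL FROM SYSIBMADM.ENV_INST_INFO")
# print(is_vector_capable(cur.fetchone()[0]))
```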

Import the required dependencies

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_core.documents import Document
from langchain_db2 import db2vs
from langchain_db2.db2vs import DB2VS

Initialization

Create documents

# Define a list of documents
documents_json_list = [
    {
        "id": "doc_1_2_P4",
        "text": "Db2 handles LOB data differently than other kinds of data. As a result, you sometimes need to take additional actions when you define LOB columns and insert the LOB data.",
        "link": "https://www.ibm.com/docs/en/db2-for-zos/12?topic=programs-storing-lob-data-in-tables",
    },
    {
        "id": "doc_11.1.0_P1",
        "text": "Db2® column-organized tables add columnar capabilities to Db2 databases, which include data that is stored with column organization and vector processing of column data. Using this table format with star schema data marts provides significant improvements to storage, query performance, and ease of use through simplified design and tuning.",
        "link": "https://www.ibm.com/docs/en/db2/11.1.0?topic=organization-column-organized-tables",
    },
    {
        "id": "id_22.3.4.3.1_P2",
        "text": "Data structures are elements that are required to use Db2®. You can access and use these elements to organize your data. Examples of data structures include tables, table spaces, indexes, index spaces, keys, views, and databases.",
        "link": "https://www.ibm.com/docs/en/zos-basic-skills?topic=concepts-db2-data-structures",
    },
    {
        "id": "id_3.4.3.1_P3",
        "text": "Db2® maintains a set of tables that contain information about the data that Db2 controls. These tables are collectively known as the catalog. The catalog tables contain information about Db2 objects such as tables, views, and indexes. When you create, alter, or drop an object, Db2 inserts, updates, or deletes rows of the catalog that describe the object.",
        "link": "https://www.ibm.com/docs/en/zos-basic-skills?topic=objects-db2-catalog",
    },
]
# Create LangChain Documents

documents_langchain = []

for doc in documents_json_list:
    metadata = {"id": doc["id"], "link": doc["link"]}
    doc_langchain = Document(page_content=doc["text"], metadata=metadata)
    documents_langchain.append(doc_langchain)

Create vector stores with different distance metrics

First, we will create three vector stores, each using a different distance strategy.

(You can connect to the Db2 database manually and will see three tables: Documents_DOT, Documents_COSINE, and Documents_EUCLIDEAN.)

# Create Db2 Vector Stores using different distance strategies

# When using our API calls, start by initializing your vector store with a subset of your documents
# through from_documents(), then incrementally add more documents using add_texts().
# This approach prevents system overload and ensures efficient document processing.

model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

vector_store_dot = DB2VS.from_documents(
    documents_langchain,
    model,
    client=connection,
    table_name="Documents_DOT",
    distance_strategy=DistanceStrategy.DOT_PRODUCT,
)
vector_store_max = DB2VS.from_documents(
    documents_langchain,
    model,
    client=connection,
    table_name="Documents_COSINE",
    distance_strategy=DistanceStrategy.COSINE,
)
vector_store_euclidean = DB2VS.from_documents(
    documents_langchain,
    model,
    client=connection,
    table_name="Documents_EUCLIDEAN",
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
)
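As an optional sanity check, the Db2 catalog can confirm that the three tables above were actually created. The helper below is an illustrative sketch, not part of `langchain-db2`: it builds a `qmark`-style query against `SYSCAT.TABLES`, the standard Db2 LUW catalog view, upper-casing the names first since Db2 stores unqualified table names in upper case.

```python
def table_check_sql(table_names):
    """Build a parameterized SYSCAT.TABLES lookup for the given table names
    (illustrative helper; ibm_db_dbi uses the DB-API 'qmark' paramstyle)."""
    placeholders = ", ".join("?" for _ in table_names)
    sql = "SELECT TABNAME FROM SYSCAT.TABLES WHERE TABNAME IN (%s)" % placeholders
    return sql, [name.upper() for name in table_names]


# With the live connection from above (commented, needs a server):
# sql, params = table_check_sql(
#     ["Documents_DOT", "Documents_COSINE", "Documents_EUCLIDEAN"]
# )
# cur = connection.cursor()
# cur.execute(sql, params)
# print(sorted(row[0] for row in cur.fetchall()))
```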

Manage the vector stores

def manage_texts(vector_stores):
    """
    Adds texts to each vector store, demonstrates error handling for duplicate additions,
    and performs deletion of texts. Showcases similarity searches and index creation for each vector store.

    Args:
    - vector_stores (list): A list of DB2VS instances.
    """
    texts = ["Rohan", "Shailendra"]
    metadata = [
        {"id": "100", "link": "Document Example Test 1"},
        {"id": "101", "link": "Document Example Test 2"},
    ]

    for i, vs in enumerate(vector_stores, start=1):
        # Adding texts
        try:
            vs.add_texts(texts, metadata)
            print(f"\n\n\nAdd texts complete for vector store {i}\n\n\n")
        except Exception as ex:
            print(f"\n\n\nExpected error on duplicate add for vector store {i}\n\n\n")

        # Deleting texts using the value of 'id'
        vs.delete([metadata[0]["id"], metadata[1]["id"]])
        print(f"\n\n\nDelete texts complete for vector store {i}\n\n\n")

        # Similarity search
        results = vs.similarity_search("How are LOBS stored in Db2 Database", 2)
        print(f"\n\n\nSimilarity search results for vector store {i}: {results}\n\n\n")


vector_store_list = [
vector_store_dot,
vector_store_max,
vector_store_euclidean,
]
manage_texts(vector_store_list)

Query the vector stores

Demonstrate advanced searches on the vector stores, both with and without attribute filtering

With filtering, we select only the document with id 101 and nothing else.

# Conduct advanced searches
def conduct_advanced_searches(vector_stores):
    query = "How are LOBS stored in Db2 Database"
    # Constructing a filter for direct comparison against document metadata
    # This filter aims to include documents whose metadata 'id' is exactly '101'
    filter_criteria = {"id": ["101"]}  # Direct comparison filter

    for i, vs in enumerate(vector_stores, start=1):
        print(f"\n--- Vector Store {i} Advanced Searches ---")
        # Similarity search without a filter
        print("\nSimilarity search results without filter:")
        print(vs.similarity_search(query, 2))

        # Similarity search with a filter
        print("\nSimilarity search results with filter:")
        print(vs.similarity_search(query, 2, filter=filter_criteria))

        # Similarity search with relevance score
        print("\nSimilarity search with relevance score:")
        print(vs.similarity_search_with_score(query, 2))

        # Similarity search with relevance score with filter
        print("\nSimilarity search with relevance score with filter:")
        print(vs.similarity_search_with_score(query, 2, filter=filter_criteria))

        # Max marginal relevance search
        print("\nMax marginal relevance search results:")
        print(vs.max_marginal_relevance_search(query, 2, fetch_k=20, lambda_mult=0.5))

        # Max marginal relevance search with filter
        print("\nMax marginal relevance search results with filter:")
        print(
            vs.max_marginal_relevance_search(
                query, 2, fetch_k=20, lambda_mult=0.5, filter=filter_criteria
            )
        )


conduct_advanced_searches(vector_store_list)

Usage for retrieval-augmented generation
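The stores above plug into a RAG pipeline through the standard LangChain retriever interface. The sketch below is illustrative, not part of `langchain-db2`: `as_retriever()` is inherited from the LangChain `VectorStore` base class, and the prompt, chat model (`ChatOpenAI` here), and chain wiring are assumptions. The live parts are left commented because they need a running Db2 instance and an LLM.

```python
def format_docs(docs):
    # Join the retrieved documents' page contents into one context string
    # for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)


# Commented: requires the vector store created above plus a chat model.
# from langchain_core.output_parsers import StrOutputParser
# from langchain_core.prompts import ChatPromptTemplate
# from langchain_core.runnables import RunnablePassthrough
# from langchain_openai import ChatOpenAI
#
# retriever = vector_store_dot.as_retriever(search_kwargs={"k": 2})
# prompt = ChatPromptTemplate.from_template(
#     "Answer using only this context:\n{context}\n\nQuestion: {question}"
# )
# chain = (
#     {"context": retriever | format_docs, "question": RunnablePassthrough()}
#     | prompt
#     | ChatOpenAI(model="gpt-4o-mini")  # any chat model works here
#     | StrOutputParser()
# )
# print(chain.invoke("How are LOBs stored in Db2?"))
```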

API reference