Activeloop Deep Lake
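Activeloop Deep Lake is a multi-modal vector store that stores embeddings and their metadata, including text, JSON, images, audio, video, and more. It can save the data locally, in your cloud, or on Activeloop storage. It performs hybrid search over both embeddings and their attributes.
This notebook showcases basic functionality related to Activeloop Deep Lake. While Deep Lake can store embeddings, it is capable of storing any type of data. It is a serverless data lake with version control, a query engine, and streaming dataloaders to deep learning frameworks.
For more information, please see the Deep Lake documentation.
Setup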
%pip install --upgrade --quiet langchain-openai langchain-deeplake tiktoken
Example provided by Activeloop
Deep Lake locally
from langchain_deeplake.vectorstores import DeeplakeVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
if "ACTIVELOOP_TOKEN" not in os.environ:
    os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("activeloop token:")
from langchain_community.document_loaders import TextLoader
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
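To make the chunk_size and chunk_overlap parameters used above more concrete, here is a minimal sketch of fixed-size character chunking in plain Python. It is an illustration only, not how CharacterTextSplitter is implemented (that class splits on a separator and then merges the pieces); the helper name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    (Hypothetical helper for illustration; not part of LangChain.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

With chunk_overlap=0, as in the cell above, the windows do not share any characters.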
Create a local dataset
Create a dataset locally at ./my_deeplake/, then run similarity search. The Deeplake+LangChain integration uses Deep Lake datasets under the hood, so "dataset" and "vector store" are used interchangeably. To create a dataset in your own cloud or in Deep Lake storage, adjust the path accordingly.
db = DeeplakeVectorStore(
    dataset_path="./my_deeplake/", embedding_function=embeddings, overwrite=True
)
db.add_documents(docs)
# or shorter
# db = DeeplakeVectorStore.from_documents(docs, embeddings, dataset_path="./my_deeplake/", overwrite=True)
Query dataset
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
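Conceptually, a similarity search embeds the query and returns the stored texts whose vectors are closest to it. The following is a minimal sketch of that idea in plain Python with toy two-dimensional vectors. It makes no claim about how Deep Lake performs the search internally; the function names and data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=4):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 2-dimensional "embeddings"; real embeddings have far more dimensions.
stored = [
    ("about courts", [0.9, 0.1]),
    ("about taxes", [0.1, 0.9]),
    ("about judges", [0.8, 0.3]),
]
results = top_k([1.0, 0.0], stored, k=2)
print(results)
```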
Later, you can reload the dataset without recomputing embeddings
db = DeeplakeVectorStore(
    dataset_path="./my_deeplake/", embedding_function=embeddings, read_only=True
)
docs = db.similarity_search(query)
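Setting read_only=True prevents accidental modifications to the vector store when updates are not needed. This ensures that the data remains unchanged unless a change is explicitly intended. It is generally good practice to specify this argument to avoid unintended updates.
Retrieval Question/Answering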
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=db.as_retriever(),
)
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)
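The "stuff" chain type places all retrieved passages into a single prompt for the model. The sketch below illustrates that idea in plain Python. The helper name build_stuff_prompt and the prompt wording are assumptions made for illustration; they are not LangChain's actual template.

```python
def build_stuff_prompt(question: str, passages: list[str]) -> str:
    """Concatenate every retrieved passage into one prompt string.

    Hypothetical helper; the wording below is illustrative only.
    """
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What did the president say about Ketanji Brown Jackson",
    ["first retrieved passage", "second retrieved passage"],
)
print(prompt)
```

Because every passage goes into one prompt, this approach only works while the combined text fits in the model's context window.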
Attribute-based filtering in metadata
Let's create another vector store whose metadata includes the year each document was created.
import random
for d in docs:
    d.metadata["year"] = random.randint(2012, 2014)
db = DeeplakeVectorStore.from_documents(
    docs, embeddings, dataset_path="./my_deeplake/", overwrite=True
)
db.similarity_search(
    "What did the president say about Ketanji Brown Jackson",
    filter={"metadata": {"year": 2013}},
)
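The filter above restricts results to documents whose metadata matches the given values. As a rough sketch of that matching step in plain Python, assuming exact-match semantics on every requested key (the helper name matches and the sample records are hypothetical, and this says nothing about how Deep Lake evaluates filters internally):

```python
def matches(metadata: dict, wanted: dict) -> bool:
    """True if every key in wanted has the same value in metadata."""
    return all(metadata.get(key) == value for key, value in wanted.items())


# Hypothetical records standing in for stored documents.
records = [
    {"text": "chunk A", "metadata": {"year": 2012}},
    {"text": "chunk B", "metadata": {"year": 2013}},
    {"text": "chunk C", "metadata": {"year": 2013}},
    {"text": "chunk D", "metadata": {}},
]
selected = [r["text"] for r in records if matches(r["metadata"], {"year": 2013})]
print(selected)
```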
Choosing distance function
Use the distance function L2 for Euclidean distance or cos for cosine similarity.
db.similarity_search(
    "What did the president say about Ketanji Brown Jackson?", distance_metric="l2"
)
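The two metrics can rank the same vectors differently: Euclidean distance is sensitive to vector length, while cosine similarity depends only on direction. A small plain-Python sketch with toy vectors (the function names are illustrative, not part of any library):

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, but twice as long

dist = l2_distance(a, b)       # non-zero: the vectors differ in length
sim = cosine_similarity(a, b)  # maximal: the vectors point the same way
print(dist, sim)
```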
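Maximal marginal relevance
Use maximal marginal relevance (MMR) search, which balances relevance to the query against diversity among the returned documents.
db.max_marginal_relevance_search(
    "What did the president say about Ketanji Brown Jackson?"
)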
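As a rough sketch of the MMR idea: at each step, pick the candidate that maximizes a weighted relevance to the query minus its highest similarity to anything already selected. The plain-Python version below uses toy vectors; the function names and the lambda_mult weighting parameter are choices made for this sketch and are not taken from the Deep Lake integration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, candidates[i])
            # Highest similarity to any already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
picked = mmr([1.0, 0.0], candidates, k=2, lambda_mult=0.3)
print(picked)
```

A plain similarity ranking would return the two near-duplicates here; with a low lambda_mult, the second pick is the more distinct candidate instead.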
Delete dataset
db.delete_dataset()
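Deep Lake datasets on cloud (Activeloop, AWS, GCS, etc.) or in memory
By default, Deep Lake datasets are stored locally. To store them in memory, in the Deep Lake Managed DB, or in any object storage, provide the corresponding path and credentials when creating the vector store. Some paths require registering with Activeloop and creating an API token, which can be retrieved from Activeloop.
# Set the Activeloop API token if it is not already in the environment
if "ACTIVELOOP_TOKEN" not in os.environ:
    os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("activeloop token:")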
# Embed and store the texts
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
dataset_path = f"hub://{username}/langchain_testing_python" # could be also ./local/path (much faster locally), s3://bucket/path/to/dataset, gcs://path/to/dataset, etc.
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore(
    dataset_path=dataset_path, embedding_function=embeddings, overwrite=True
)
ids = db.add_documents(docs)
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
# Embed and store the texts
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
dataset_path = f"hub://{username}/langchain_testing"
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore(
    dataset_path=dataset_path,
    embedding_function=embeddings,
    overwrite=True,
)
ids = db.add_documents(docs)
TQL Search
The similarity_search method can also execute a query written in Deep Lake's Tensor Query Language (TQL).
search_id = db.dataset["ids"][0]
docs = db.similarity_search(
    query=None,
    tql=f"SELECT * WHERE ids == '{search_id}'",
)
db.dataset.summary()
Creating vector stores on AWS S3
dataset_path = "s3://BUCKET/langchain_test" # could be also ./local/path (much faster locally), hub://bucket/path/to/dataset, gcs://path/to/dataset, etc.
embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore.from_documents(
    docs,
    dataset_path=dataset_path,
    embedding=embeddings,
    overwrite=True,
    creds={
        "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "aws_session_token": os.environ["AWS_SESSION_TOKEN"], # Optional
    },
)
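Note that os.environ["AWS_SESSION_TOKEN"] raises KeyError when the variable is unset, even though the token is marked optional above. One way to handle this is to build the creds dictionary separately and include the session token only when it is present. This is a sketch: the helper name aws_creds_from_env is hypothetical, and it assumes the creds argument accepts a dictionary without the aws_session_token key, as the "Optional" comment suggests.

```python
import os


def aws_creds_from_env(env=os.environ) -> dict:
    """Build a creds dict from environment variables.

    The two key variables are required; the session token is added only if set.
    (Hypothetical helper for illustration.)
    """
    creds = {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        creds["aws_session_token"] = token
    return creds


# Example with an explicit mapping instead of the real environment.
example = aws_creds_from_env(
    {"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
)
print(sorted(example))
```

The result could then be passed as creds=aws_creds_from_env() in the call above.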
Deep Lake API
You can access the Deep Lake dataset at db.dataset
# get structure of the dataset
db.dataset.summary()
# get embeddings numpy array
embeds = db.dataset["embeddings"][:]
Transfer local dataset to cloud
Copy an already created dataset to the cloud. You can also transfer from the cloud to local storage.
import deeplake
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
source = f"hub://{username}/langchain_testing" # could be local, s3, gcs, etc.
destination = f"hub://{username}/langchain_test_copy" # could be local, s3, gcs, etc.
deeplake.copy(src=source, dst=destination)
db = DeeplakeVectorStore(dataset_path=destination, embedding_function=embeddings)
db.add_documents(docs)