Timescale Vector (Postgres)

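Timescale Vector is PostgreSQL++ for AI applications. It enables you to efficiently store and query billions of vector embeddings in PostgreSQL.

PostgreSQL, also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance.

This notebook shows how to use the Postgres vector database (TimescaleVector) to perform self-querying. In the notebook, we'll demo the SelfQueryRetriever wrapped around a TimescaleVector vector store.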

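What is Timescale Vector?

Timescale Vector is PostgreSQL++ for AI applications.

Timescale Vector enables you to efficiently store and query millions of vector embeddings in PostgreSQL:

Enhances pgvector with faster and more accurate similarity search on 1 billion+ vectors via a DiskANN-inspired indexing algorithm.
Enables fast time-based vector search via automatic time-based partitioning and indexing.
Provides a familiar SQL interface for querying vector embeddings and relational data.

Timescale Vector is cloud PostgreSQL for AI that scales with you from proof of concept (POC) to production:

Simplifies operations by letting you store relational metadata, vector embeddings, and time-series data in a single database.
Benefits from a rock-solid PostgreSQL foundation with enterprise-grade features such as streaming backups and replication, high availability, and row-level security.
Enables a worry-free experience with enterprise-grade security and compliance.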

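How to access Timescale Vector

Timescale Vector is available on Timescale, the cloud PostgreSQL platform. (There is no self-hosted version at this time.)

LangChain users get a 90-day free trial of Timescale Vector.

To get started, sign up for Timescale, create a new database, and follow this notebook!
See the Timescale Vector explainer blog for more details and performance benchmarks.
See the installation instructions for more details on using Timescale Vector in Python.

Creating a TimescaleVector vector store

First, we'll create a Timescale Vector vector store and seed it with some data. We've created a small demo set of documents containing summaries of movies.

Note: the self-query retriever requires you to have lark installed (pip install lark). We also need the timescale-vector package.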

%pip install --upgrade --quiet  lark

%pip install --upgrade --quiet  timescale-vector

In this example we'll use OpenAIEmbeddings, so let's load your OpenAI API key.

# Get the OpenAI API key by reading a local .env file
# (find_dotenv and load_dotenv come from the python-dotenv package)
# The .env file should contain a line starting with `OPENAI_API_KEY=sk-`
import os

from dotenv import find_dotenv, load_dotenv

_ = load_dotenv(find_dotenv())

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
# Alternatively, use getpass to enter the key in a prompt
# import os
# import getpass
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

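To connect to your PostgreSQL database, you'll need your service URI, which can be found in the cheatsheet or .env file you downloaded after creating a new database.

If you haven't already, sign up for Timescale and create a new database.

The URI will look something like this: postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require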

# Get the service url by reading local .env file
# The .env file should contain a line starting with `TIMESCALE_SERVICE_URL=postgresql://`
_ = load_dotenv(find_dotenv())
TIMESCALE_SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]

# Alternatively, use getpass to enter the service URL in a prompt
# import os
# import getpass
# TIMESCALE_SERVICE_URL = getpass.getpass("Timescale Service URL:")
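
Before handing the service URL to the vector store, it can help to sanity check that it has the shape shown above. The following is a minimal sketch using only the Python standard library. The helper name `check_service_url` and the example URL (password, id, and port) are placeholders invented for illustration; they are not part of timescale-vector or LangChain.

```python
from urllib.parse import parse_qs, urlparse


def check_service_url(url: str) -> dict:
    """Split a Postgres service URL into its parts and flag obvious problems.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("service URL has no host")
    query = parse_qs(parsed.query)
    return {
        "user": parsed.username,
        "host": parsed.hostname,
        "port": parsed.port,
        "database": parsed.path.lstrip("/"),
        "sslmode": query.get("sslmode", [None])[0],
    }


# Placeholder values only; substitute the URL of your own service.
example_url = (
    "postgres://tsdbadmin:secret@abc123.tsdb.cloud.timescale.com:5432/tsdb"
    "?sslmode=require"
)
parts = check_service_url(example_url)
print(parts)
```

The password is deliberately left out of the returned dict so that the result is safe to print or log.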

from langchain_community.vectorstores.timescalevector import TimescaleVector
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

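API Reference: TimescaleVector | Document | OpenAIEmbeddings

Here are the sample documents we'll use for this demo. The data is about movies, with a content field and metadata fields holding information about each movie.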

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "science fiction",
            "rating": 9.9,
        },
    ),
]
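
Not every sample document carries every metadata field, which matters later: a filter on `genre` or `director` can presumably only match documents that actually have that field. The sketch below counts how often each field appears. It copies the metadata from the documents above into plain dicts so that it runs without LangChain installed; the names `sample_metadata` and `field_counts` are illustrative.

```python
from collections import Counter

# Metadata copied from the sample documents above, as plain dicts.
sample_metadata = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    {"year": 1995, "genre": "animated"},
    {
        "year": 1979,
        "director": "Andrei Tarkovsky",
        "genre": "science fiction",
        "rating": 9.9,
    },
]

# Count in how many documents each metadata key occurs.
field_counts = Counter(key for meta in sample_metadata for key in meta)
for field, count in sorted(field_counts.items()):
    print(f"{field}: present in {count} of {len(sample_metadata)} documents")
```

Only `year` is present on all six documents; the other fields are missing from at least one.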

Finally, we'll create our Timescale Vector vector store. Note that the collection name will be the name of the PostgreSQL table in which the documents are stored.

COLLECTION_NAME = "langchain_self_query_demo"
vectorstore = TimescaleVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    service_url=TIMESCALE_SERVICE_URL,
)

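Creating our self-querying retriever

Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents.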

from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import OpenAI

# Give LLM info about the metadata fields
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"

# Instantiate the self-query retriever from an LLM
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)

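API Reference: AttributeInfo | SelfQueryRetriever | OpenAI

Self-querying retrieval with Timescale Vector

And now we can try actually using our retriever!

Run the queries below and note how you can specify a query, a filter, or a composite filter (filters combined with AND, OR) in natural language; the self-query retriever translates that request into SQL and performs the search on the Timescale Vector (Postgres) vector store.

This illustrates the power of the self-query retriever. You can use it to perform complex searches over your vector store without you or your users having to write any SQL directly!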

# This example only specifies a relevant query
retriever.invoke("What are some movies about dinosaurs")

query='dinosaur' filter=None limit=None

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),
 Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),
 Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'}),
 Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]

# This example only specifies a filter
retriever.invoke("I want to watch a movie rated higher than 8.5")

query=' ' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5) limit=None

[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),
 Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'rating': 8.6, 'director': 'Satoshi Kon'}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'rating': 8.6, 'director': 'Satoshi Kon'})]

# This example specifies a query and a filter
retriever.invoke("Has Greta Gerwig directed any movies about women")

query='women' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Greta Gerwig') limit=None

[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'rating': 8.3, 'director': 'Greta Gerwig'}),
 Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'rating': 8.3, 'director': 'Greta Gerwig'})]

# This example specifies a composite filter
retriever.invoke("What's a highly rated (above 8.5) science fiction film?")

query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='rating', value=8.5), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction')]) limit=None

[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),
 Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'})]

# This example specifies a query and composite filter
retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)

query='toys' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GT: 'gt'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2005), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='animated')]) limit=None

[Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]
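
The structured queries printed above can be read as predicates over each document's metadata. The sketch below evaluates such a predicate in plain Python to make the semantics concrete. It is only an illustration: the retriever itself runs the filter in the database, and the tuple-based filter encoding, the `matches` function, and the rule that a missing attribute never matches are assumptions made for this example. Since the sample documents have no title field, movies are identified by their release year.

```python
import operator
from typing import Any, Callable

# Comparator names as they appear in the printed structured queries.
COMPARATORS: dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches(metadata: dict, flt: tuple) -> bool:
    """Evaluate a nested filter tuple against one metadata dict.

    A filter is either ("and" | "or", [sub-filters]) or
    (comparator, attribute, value). In this sketch, a document that
    lacks the attribute never matches.
    """
    head = flt[0]
    if head == "and":
        return all(matches(metadata, sub) for sub in flt[1])
    if head == "or":
        return any(matches(metadata, sub) for sub in flt[1])
    comparator, attribute, value = flt
    if attribute not in metadata:
        return False
    return COMPARATORS[comparator](metadata[attribute], value)


# Metadata copied from the sample documents.
movies = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    {"year": 1995, "genre": "animated"},
    {"year": 1979, "director": "Andrei Tarkovsky",
     "genre": "science fiction", "rating": 9.9},
]

# Same shape as the composite filter in the last example above.
toy_filter = ("and", [
    ("gt", "year", 1990),
    ("lt", "year", 2005),
    ("eq", "genre", "animated"),
])
toy_years = [m["year"] for m in movies if matches(m, toy_filter)]

# Same shape as the "rated higher than 8.5" filter.
high_rated_years = [m["year"] for m in movies if matches(m, ("gt", "rating", 8.5))]

print(toy_years)
print(high_rated_years)
```

The selected years line up with the documents returned by the retriever for the corresponding queries.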

Filter k

We can also use the self-query retriever to specify k: the number of documents to fetch.

We can do this by passing enable_limit=True to the constructor.

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True,
)

# This example specifies a query with a LIMIT value
retriever.invoke("what are two movies about dinosaurs")

query='dinosaur' filter=None limit=2

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),
 Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7})]
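
To make the role of k concrete, here is a toy sketch of a top-k similarity search in pure Python. It is not how Timescale Vector executes the search (that happens inside PostgreSQL with an index). The two-dimensional vectors are made up for illustration rather than produced by OpenAIEmbeddings, and cosine similarity is an assumed metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], items: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the labels of the k items most similar to the query vector."""
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Made-up vectors standing in for real embeddings.
toy_embeddings = [
    ("dinosaurs", [1.0, 0.0]),
    ("toys", [0.6, 0.8]),
    ("dreams", [0.0, 1.0]),
]

nearest_two = top_k([1.0, 0.1], toy_embeddings, k=2)
print(nearest_two)
```

With enable_limit=True, the number mentioned in the natural-language request ("two movies") plays the role of `k` here.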
