
OpenSearch

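OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, and observability applications, licensed under Apache 2.0. OpenSearch is a distributed search and analytics engine based on Apache Lucene.

This notebook shows how to use functionality related to the OpenSearch database.

To run it, you need an OpenSearch instance up and running: see here for an easy Docker installation.

By default, similarity_search performs an approximate k-NN search, which uses one of several algorithms (for example lucene, nmslib, or faiss) and is recommended for large datasets. To perform a brute-force search instead, two other search methods are available: Script Scoring and Painless Scripting. Check this for more details.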
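To make the distinction concrete, here is a minimal sketch of what a brute-force (exact) k-NN search does conceptually: every stored vector is scored against the query, and the top k are kept. It is pure Python and does not call OpenSearch; the helper names `cosine_similarity` and `exact_knn` and the toy vectors are assumptions made for illustration, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query_vec, doc_vecs, k):
    """Score every document against the query and return the top-k (index, score) pairs."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (hypothetical values).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = exact_knn([1.0, 0.1], doc_vecs, k=2)
print([i for i, _ in top])
```

Because every vector is compared, the cost grows linearly with the number of documents. Approximate k-NN trades a small amount of accuracy for an index structure that avoids this full scan, which is why it is the recommended choice for large datasets.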

Installation

Install the Python client.

%pip install --upgrade --quiet  opensearch-py langchain-community

We want to use OpenAIEmbeddings, so we have to get an OpenAI API key.

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
API Reference: TextLoader

similarity_search using Approximate k-NN

similarity_search using Approximate k-NN Search with Custom Parameters

docsearch = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="http://localhost:9200"
)

# If using the default Docker installation, use this instantiation instead:
# docsearch = OpenSearchVectorSearch.from_documents(
#     docs,
#     embeddings,
#     opensearch_url="https://localhost:9200",
#     http_auth=("admin", "admin"),
#     use_ssl = False,
#     verify_certs = False,
#     ssl_assert_hostname = False,
#     ssl_show_warn = False,
# )

query = "What did the president say about Ketanji Brown Jackson"
# Store the hits under a separate name so that `docs` still holds the split documents.
results = docsearch.similarity_search(query, k=10)
print(results[0].page_content)

docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="http://localhost:9200",
    engine="faiss",
    space_type="innerproduct",
    ef_construction=256,
    m=48,
)

query = "What did the president say about Ketanji Brown Jackson"
results = docsearch.similarity_search(query)
print(results[0].page_content)
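The space_type parameter selects the measure used to compare vectors. The sketch below contrasts three common measures (inner product, cosine similarity, and Euclidean distance) on toy vectors to show that they can rank the same documents differently. It is plain Python with hypothetical helper names and values; it illustrates the underlying math only and is not how OpenSearch computes scores internally.

```python
import math


def inner_product(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors: ignores vector length."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


query_vec = [1.0, 0.0]
short_doc = [1.0, 0.0]  # same direction as the query, unit length
long_doc = [3.0, 3.0]   # 45 degrees off, but much longer

for name, vec in [("short_doc", short_doc), ("long_doc", long_doc)]:
    print(
        name,
        inner_product(query_vec, vec),
        round(cosine_similarity(query_vec, vec), 3),
        round(l2_distance(query_vec, vec), 3),
    )
```

Here the inner product favors long_doc because of its magnitude, while cosine similarity and Euclidean distance both favor short_doc. When all vectors have unit length, inner product and cosine similarity produce the same ranking.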

similarity_search using Script Scoring

similarity_search using Script Scoring with Custom Parameters

# is_appx_search=False builds an index intended for brute-force search.
docsearch = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="http://localhost:9200", is_appx_search=False
)

query = "What did the president say about Ketanji Brown Jackson"
results = docsearch.similarity_search(
    query,
    k=1,
    search_type="script_scoring",
)
print(results[0].page_content)

similarity_search using Painless Scripting

similarity_search using Painless Scripting with Custom Parameters

docsearch = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="http://localhost:9200", is_appx_search=False
)

# Restrict the candidates to documents whose text contains the term "smuggling".
filter = {"bool": {"filter": {"term": {"text": "smuggling"}}}}
query = "What did the president say about Ketanji Brown Jackson"
results = docsearch.similarity_search(
    query,
    search_type="painless_scripting",
    space_type="cosineSimilarity",
    pre_filter=filter,
)
print(results[0].page_content)
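The idea behind a pre-filter is that documents are first narrowed down by a structured condition and only the survivors are scored against the query. The following sketch imitates that order of operations locally. The helper names `term_filter` and `prefilter_then_rank`, the whitespace tokenization, and the hard-coded scores are all assumptions made for illustration; a real term query matches against the tokens produced by the index's analyzer.

```python
def term_filter(field, value):
    """Build a bool/filter/term clause in the same shape as the filter used above."""
    return {"bool": {"filter": {"term": {field: value}}}}


def prefilter_then_rank(docs, token, score):
    """Keep only documents containing the token, then rank the survivors by score."""
    candidates = [d for d in docs if token in d["text"].lower().split()]
    return sorted(candidates, key=score, reverse=True)


# Hypothetical documents with precomputed similarity scores.
sample_docs = [
    {"text": "Stop drug smuggling", "score": 0.2},
    {"text": "Judge Jackson", "score": 0.9},
    {"text": "smuggling and trafficking", "score": 0.5},
]

print(term_filter("text", "smuggling"))
ranked = prefilter_then_rank(sample_docs, "smuggling", score=lambda d: d["score"])
print([d["text"] for d in ranked])
```

Note that the highest-scoring document is excluded because it fails the filter; filtering happens before ranking, not after.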

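Maximum marginal relevance search (MMR)

If you'd like to look up similar documents but also want the results to be diverse, MMR is the method to consider. Maximal marginal relevance optimizes for both similarity to the query and diversity among the selected documents.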

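query = "What did the president say about Ketanji Brown Jackson"
# fetch_k candidates are retrieved first; k of them are then selected for relevance and diversity.
results = docsearch.max_marginal_relevance_search(query, k=2, fetch_k=10, lambda_param=0.5)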
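As a rough illustration of the selection rule, here is a minimal pure-Python sketch of greedy MMR: at each step it picks the candidate that maximizes lambda_param times its similarity to the query, minus (1 - lambda_param) times its highest similarity to anything already selected. The function name `mmr` and the toy vectors are assumptions for this example; it is a sketch of the general technique, not the library's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, lambda_param=0.5):
    """Greedily select k document indices balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            # Redundancy is the similarity to the closest already-selected document.
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_param * relevance - (1 - lambda_param) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 points in a different direction.
doc_vecs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
query_vec = [1.0, 0.2]

print(mmr(query_vec, doc_vecs, k=2, lambda_param=0.5))
print(mmr(query_vec, doc_vecs, k=2, lambda_param=1.0))
```

With lambda_param=1.0 the selection is by relevance alone, so both near-duplicates are returned. With lambda_param=0.5 the second pick is penalized for resembling the first, and the more distinct document is chosen instead.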

Using a preexisting OpenSearch instance

It's also possible to use a preexisting OpenSearch instance whose documents already have vectors stored.

# This is just an example; change these values to point to another OpenSearch instance.
docsearch = OpenSearchVectorSearch(
    index_name="index-*",
    embedding_function=embeddings,
    opensearch_url="http://localhost:9200",
)

# You can specify custom field names to match the fields you're using to store
# your embedding, document text value, and metadata.
results = docsearch.similarity_search(
    "Who was asking about getting lunch today?",
    search_type="script_scoring",
    space_type="cosinesimil",
    vector_field="message_embedding",
    text_field="message",
    metadata_field="message_metadata",
)
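To show why the field names matter, the sketch below maps a stored record to a text-plus-metadata result using configurable field names. The function `hit_to_document`, the sample record, and the default field names used here are hypothetical and chosen only for illustration; this is not the library's internal code.

```python
def hit_to_document(source, text_field="text", metadata_field="metadata"):
    """Extract the document text and metadata from a stored record by field name."""
    return {
        "page_content": source[text_field],
        "metadata": source.get(metadata_field, {}),
    }


# A hypothetical record from an index that uses custom field names.
source = {
    "message": "Anyone want lunch?",
    "message_metadata": {"channel": "general"},
    "message_embedding": [0.1, 0.2],
}

doc = hit_to_document(source, text_field="message", metadata_field="message_metadata")
print(doc)
```

If the field names passed at query time do not match those used when the index was populated, the lookup fails (a KeyError in this sketch), which is why they have to be supplied for an index created outside this integration.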

Using AOSS (Amazon OpenSearch Service Serverless)

This is an example of AOSS with the faiss engine and efficient_filter.

We need to install several Python packages.

%pip install --upgrade --quiet boto3 requests requests-aws4auth

import boto3
from opensearchpy import RequestsHttpConnection
from requests_aws4auth import AWS4Auth

service = "aoss"  # must set the service as 'aoss'
region = "us-east-2"
credentials = boto3.Session(
    aws_access_key_id="xxxxxx", aws_secret_access_key="xxxxx"
).get_credentials()
awsauth = AWS4Auth("xxxxx", "xxxxxx", region, service, session_token=credentials.token)

docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="host url",
    http_auth=awsauth,
    timeout=300,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    index_name="test-index-using-aoss",
    engine="faiss",
)

# `filter` is a query DSL filter dict, such as the one defined in the Painless Scripting example.
results = docsearch.similarity_search(
    "What is feature selection",
    efficient_filter=filter,
    k=200,
)

Using AOS (Amazon OpenSearch Service)

%pip install --upgrade --quiet boto3

# This is just an example to show how to use Amazon OpenSearch Service; you need to set proper values.
import boto3
from opensearchpy import RequestsHttpConnection
from requests_aws4auth import AWS4Auth  # provided by the requests-aws4auth package

service = "es"  # must set the service as 'es'
region = "us-east-2"
credentials = boto3.Session(
    aws_access_key_id="xxxxxx", aws_secret_access_key="xxxxx"
).get_credentials()
awsauth = AWS4Auth("xxxxx", "xxxxxx", region, service, session_token=credentials.token)

docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="host url",
    http_auth=awsauth,
    timeout=300,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    index_name="test-index",
)

results = docsearch.similarity_search(
    "What is feature selection",
    k=200,
)
