Azure Cosmos DB NoSQL
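This notebook shows how to use this integrated vector database to store documents in collections, create indexes, and run vector search queries. Searches use approximate nearest neighbor algorithms with distance metrics such as COS (cosine distance), L2 (Euclidean distance), and IP (inner product) to locate documents close to the query vector.
Azure Cosmos DB is the database that powers OpenAI's ChatGPT service. It offers single-digit millisecond response times, automatic and instant scalability, and guaranteed speed at any scale.
Azure Cosmos DB for NoSQL now offers vector indexing and search in preview. This feature is designed to handle high-dimensional vectors, enabling efficient and accurate vector search at any scale. You can now store vectors directly in your documents alongside your data. Each document in the database can therefore contain not only traditional schema-free data but also high-dimensional vectors as additional properties. Co-locating data and vectors allows efficient indexing and searching, because the vectors are stored in the same logical unit as the data they represent. This simplifies data management and AI application architectures, and improves the efficiency of vector-based operations.
See the Azure Cosmos DB for NoSQL vector search documentation for more details.
Sign up for lifetime free access to get started today.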
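To make the idea of co-locating data and vectors concrete, the following is a minimal sketch of what a single stored item might look like. The field names `text`, `embedding`, and `metadata` are assumptions chosen to line up with the `/text` and `/embedding` paths used in the policies later in this notebook; the exact item shape written by the LangChain integration may differ.

```python
import json

# Hypothetical item shape: the field names "text", "embedding", and "metadata"
# are assumptions chosen to match the /text and /embedding paths used later.
item = {
    "id": "example-1",
    "text": "GPT-4 is a Transformer-based model.",
    "metadata": {"source": "example.pdf", "page": 0},
    # A real text-embedding-ada-002 vector has 1536 floats;
    # four values are used here only to keep the example readable.
    "embedding": [0.12, -0.03, 0.57, 0.08],
}

# The vector is an ordinary JSON array, so the whole item round-trips as JSON.
serialized = json.dumps(item)
restored = json.loads(serialized)
print(sorted(restored.keys()))
```

Because the vector is just another property of the item, the same read or write touches both the text and its embedding.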
%pip install --upgrade --quiet azure-cosmos langchain-openai langchain-community pypdf
Note: you may need to restart the kernel to use updated packages.
OPENAI_API_KEY = "YOUR_KEY"
OPENAI_API_TYPE = "azure"
OPENAI_API_VERSION = "2023-05-15"
OPENAI_API_BASE = "YOUR_ENDPOINT"
OPENAI_EMBEDDINGS_MODEL_NAME = "text-embedding-ada-002"
OPENAI_EMBEDDINGS_MODEL_DEPLOYMENT = "text-embedding-ada-002"
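Hardcoding keys in a notebook is convenient for a demo but easy to leak. As one possible alternative, the sketch below reads the same settings from environment variables and falls back to the placeholder values above. The helper name `load_setting` is hypothetical and not part of any library.

```python
import os


def load_setting(name: str, default: str) -> str:
    """Return the environment variable `name`, or `default` if unset or empty."""
    return os.environ.get(name) or default


# Same names as the settings above; the fallbacks are the placeholder values.
OPENAI_API_KEY = load_setting("OPENAI_API_KEY", "YOUR_KEY")
OPENAI_API_BASE = load_setting("OPENAI_API_BASE", "YOUR_ENDPOINT")
OPENAI_API_VERSION = load_setting("OPENAI_API_VERSION", "2023-05-15")
```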
Insert Data
from langchain_community.document_loaders import PyPDFLoader
# Load the PDF
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
API Reference: PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)
print(docs[0])
page_content='GPT-4 Technical Report
OpenAI∗
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
humans in many real-world scenarios, GPT-4 exhibits human-level performance
on various professional and academic benchmarks, including passing a simulated
bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-
based model pre-trained to predict the next token in a document. The post-training
alignment process results in improved performance on measures of factuality and
adherence to desired behavior. A core component of this project was developing
infrastructure and optimization methods that behave predictably across a wide
range of scales. This allowed us to accurately predict some aspects of GPT-4’s
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction' metadata={'source': 'https://arxiv.org/pdf/2303.08774.pdf', 'page': 0}
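The `chunk_size` and `chunk_overlap` parameters control how much text each chunk holds and how much neighbouring chunks share. The sketch below illustrates only that overlap idea with a plain sliding window over characters. It is a deliberately simplified stand-in, not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraphs and lines; the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk begins with the last three characters of the previous one, which is the effect the overlap setting is meant to have: context at a chunk boundary appears in both neighbours.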
Create AzureCosmosDB NoSQL Vector Search
indexing_policy = {
"indexingMode": "consistent",
"includedPaths": [{"path": "/*"}],
"excludedPaths": [{"path": '/"_etag"/?'}],
"vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
"fullTextIndexes": [{"path": "/text"}],
}
vector_embedding_policy = {
"vectorEmbeddings": [
{
"path": "/embedding",
"dataType": "float32",
"distanceFunction": "cosine",
"dimensions": 1536,
}
]
}
full_text_policy = {
"defaultLanguage": "en-US",
"fullTextPaths": [{"path": "/text", "language": "en-US"}],
}
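The three policies refer to each other by path: the vector index on `/embedding` is expected to have a matching entry in the vector embedding policy, and the full-text index on `/text` a matching entry in the full-text policy. The sketch below is a small client-side consistency check written for illustration only. The function `check_policies` is hypothetical and not part of `azure-cosmos` or LangChain, and the service performs its own validation when the container is created.

```python
def check_policies(indexing_policy: dict, vector_embedding_policy: dict,
                   full_text_policy: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    embedded = {e["path"] for e in vector_embedding_policy.get("vectorEmbeddings", [])}
    for index in indexing_policy.get("vectorIndexes", []):
        if index["path"] not in embedded:
            problems.append(f"vector index {index['path']} has no embedding definition")
    full_text = {e["path"] for e in full_text_policy.get("fullTextPaths", [])}
    for index in indexing_policy.get("fullTextIndexes", []):
        if index["path"] not in full_text:
            problems.append(f"full-text index {index['path']} has no full-text path")
    return problems


# Trimmed copies of the policies defined above, so the example runs on its own.
example_indexing = {
    "vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
    "fullTextIndexes": [{"path": "/text"}],
}
example_embedding = {"vectorEmbeddings": [{"path": "/embedding", "dimensions": 1536}]}
example_full_text = {"fullTextPaths": [{"path": "/text", "language": "en-US"}]}

print(check_policies(example_indexing, example_embedding, example_full_text))
```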
from azure.cosmos import CosmosClient, PartitionKey
from langchain_community.vectorstores.azure_cosmos_db_no_sql import (
AzureCosmosDBNoSqlVectorSearch,
)
from langchain_openai import OpenAIEmbeddings
HOST = "AZURE_COSMOS_DB_ENDPOINT"
KEY = "AZURE_COSMOS_DB_KEY"
cosmos_client = CosmosClient(HOST, KEY)
database_name = "langchain_python_db"
container_name = "langchain_python_container"
partition_key = PartitionKey(path="/id")
cosmos_container_properties = {"partition_key": partition_key}
openai_embeddings = OpenAIEmbeddings(
    deployment="smart-agent-embedding-ada",
    model="text-embedding-ada-002",
    chunk_size=1,
    # Pass the key defined at the top of the notebook, not the literal string
    openai_api_key=OPENAI_API_KEY,
)
# insert the documents in AzureCosmosDBNoSql with their embedding
vector_search = AzureCosmosDBNoSqlVectorSearch.from_documents(
documents=docs,
embedding=openai_embeddings,
cosmos_client=cosmos_client,
database_name=database_name,
container_name=container_name,
vector_embedding_policy=vector_embedding_policy,
full_text_policy=full_text_policy,
indexing_policy=indexing_policy,
cosmos_container_properties=cosmos_container_properties,
cosmos_database_properties={},
full_text_search_enabled=True,
)
Vector Search
# Perform a similarity search between the embedding of the query and the embeddings of the documents
import json
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print(results[0].page_content)
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction
This technical report presents GPT-4, a large multimodal model capable of processing image and
text inputs and producing text outputs. Such models are an important area of study as they have the
potential to be used in a wide range of applications, such as dialogue systems, text summarization,
and machine translation. As such, they have been the subject of substantial interest and progress in
recent years [1–34].
One of the main goals of developing such models is to improve their ability to understand and generate
natural language text, particularly in more complex and nuanced scenarios. To test its capabilities
in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In
these evaluations it performs quite well and often outscores the vast majority of human test takers.
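The ranking above depends on the distance function configured in `vector_embedding_policy` (`"cosine"` in this notebook). As a reminder of what the three metrics named in the introduction compute, here is a minimal pure-Python sketch. It uses tiny hand-written vectors rather than real 1536-dimensional embeddings, and it says nothing about how the service evaluates them internally.

```python
import math


def inner_product(a: list[float], b: list[float]) -> float:
    """IP: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """L2: straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """COS: inner product of the vectors after normalising their lengths."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(cosine_similarity(query_vec, same_direction))
print(cosine_similarity(query_vec, orthogonal))
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

Cosine similarity ignores vector length and looks only at direction, which is why `same_direction` scores as a perfect match even though it is twice as long as the query vector.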
Vector Search with Score
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search_with_score(
query=query,
k=5,
)
# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print(f"Score {i+1}: ", results[i][1])
    print("\n")
Result 1: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"9d59c3ed-deac-48cb-9382-a8ab079334e5"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 1: 0.8394796122122777
Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":43,"id":"e5610de3-8af6-43b9-8266-51c26d76eaa3"},"page_content":"2 GPT-4 Observed Safety Challenges\nGPT-4 demonstrates increased performance in areas such as reasoning, knowledge retention, and\ncoding, compared to earlier models such as GPT-2[ 22] and GPT-3.[ 10] Many of these improvements\nalso present new safety challenges, which we highlight in this section.\nWe conducted a range of qualitative and quantitative evaluations of GPT-4. These evaluations\nhelped us gain an understanding of GPT-4’s capabilities, limitations, and risks; prioritize our\nmitigation efforts; and iteratively test and build safer versions of the model. Some of the specific\nrisks we explored are:6\n•Hallucinations\n•Harmful content\n•Harms of representation, allocation, and quality of service\n•Disinformation and influence operations\n•Proliferation of conventional and unconventional weapons\n•Privacy\n•Cybersecurity\n•Potential for risky emergent behaviors\n•Interactions with other systems\n•Economic impacts\n•Acceleration\n•Overreliance","type":"Document"}
Score 2: 0.8299261339098007
Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"cddfb7ac-d953-46f4-8a48-76655f116bcf"},"page_content":"GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction","type":"Document"}
Score 3: 0.8286253517208215
Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":3,"id":"4f3152cd-c543-4a4f-b94e-c52c4139c4a8"},"page_content":"plan to refine these methods and register performance predictions across various capabilities before\nlarge model training begins, and we hope this becomes a common goal in the field.\n4 Capabilities\nWe tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally\ndesigned for humans.4We did no specific training for these exams. A minority of the problems in the\nexams were seen by the model during training; for each exam we run a variant with these questions\nremoved and report the lower score of the two. We believe the results to be representative. For further\ndetails on contamination (methodology and per-exam statistics), see Appendix C.\nExams were sourced from publicly-available materials. Exam questions included both multiple-\nchoice and free-response questions; we designed separate prompts for each format, and images were\nincluded in the input for questions which required it. The evaluation setup was designed based","type":"Document"}
Score 4: 0.8278858118680015
Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":28,"id":"18b4f43d-27d2-404e-9d66-f9328a3588c6"},"page_content":"overall GPT-4 training budget. When mixing in data from these math benchmarks, a portion of the\ntraining data was held back, so each individual training example may or may not have been seen by\nGPT-4 during training.\nWe conducted contamination checking to verify the test set for GSM-8K is not included in the training\nset (see Appendix D). We recommend interpreting the performance results reported for GPT-4\nGSM-8K in Table 2 as something in-between true few-shot transfer and full benchmark-specific\ntuning.\nF Multilingual MMLU\nWe translated all questions and answers from MMLU [ 49] using Azure Translate. We used an\nexternal model to perform the translation, instead of relying on GPT-4 itself, in case the model had\nunrepresentative performance for its own translations. We selected a range of languages that cover\ndifferent geographic regions and scripts, we show an example question taken from the astronomy","type":"Document"}
Score 5: 0.8272138555588135
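The scores above are close together, so a fixed `k` can return chunks that are only loosely related to the query. One possible post-processing step is to keep only the results above a similarity threshold. The sketch below assumes that a higher score means a closer match, which fits the descending order shown above, and uses a hypothetical `FakeDoc` stand-in instead of real LangChain `Document` objects so that it runs without a database. The threshold value is arbitrary.

```python
from typing import NamedTuple


class FakeDoc(NamedTuple):
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict


def filter_by_score(results, min_score: float):
    """Keep (document, score) pairs whose score is at least min_score."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Scores copied from the output above; the page contents are shortened.
example_results = [
    (FakeDoc("performance based on models trained with ...", {"page": 0}), 0.8394796122122777),
    (FakeDoc("2 GPT-4 Observed Safety Challenges ...", {"page": 43}), 0.8299261339098007),
    (FakeDoc("GPT-4 Technical Report ...", {"page": 0}), 0.8286253517208215),
    (FakeDoc("plan to refine these methods ...", {"page": 3}), 0.8278858118680015),
    (FakeDoc("overall GPT-4 training budget ...", {"page": 28}), 0.8272138555588135),
]

kept = filter_by_score(example_results, min_score=0.8285)
for doc, score in kept:
    print(doc.metadata["page"], round(score, 4))
```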
Vector Search with Filtering
query = "What were the compute requirements for training GPT 4"
pre_filter = {
"conditions": [
{"property": "metadata.page", "operator": "$eq", "value": 0},
],
}
results = vector_search.similarity_search_with_score(
query=query,
k=5,
pre_filter=pre_filter,
)
# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print(f"Score {i+1}: ", results[i][1])
    print("\n")
Result 1: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"9d59c3ed-deac-48cb-9382-a8ab079334e5"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 1: 0.8394796122122777
Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"cddfb7ac-d953-46f4-8a48-76655f116bcf"},"page_content":"GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction","type":"Document"}
Score 2: 0.8286253517208215
Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"ba814d15-2c12-40d2-8934-db58b393ecb8"},"page_content":"model capability results, as well as model safety improvements and results, in more detail in later\nsections.\nThis report also discusses a key challenge of the project, developing deep learning infrastructure and\noptimization methods that behave predictably across a wide range of scales. This allowed us to make\npredictions about the expected performance of GPT-4 (based on small runs trained in similar ways)\nthat were tested against the final run to increase confidence in our training.\nDespite its capabilities, GPT-4 has similar limitations to earlier GPT models [ 1,37,38]: it is not fully\nreliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn\n∗Please cite this work as “OpenAI (2023)\". Full authorship contribution statements appear at the end of the\ndocument. Correspondence regarding this technical report can be sent to [email protected]:2303.08774v6 [cs.CL] 4 Mar 2024","type":"Document"}
Score 3: 0.8215997601854081
Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"dd040c08-6ae1-4c73-8b85-7f034d337891"},"page_content":"these evaluations it performs quite well and often outscores the vast majority of human test takers.\nFor example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers.\nThis contrasts with GPT-3.5, which scores in the bottom 10%.\nOn a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models\nand most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).\nOn the MMLU benchmark [ 35,36], an English-language suite of multiple-choice questions covering\n57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but\nalso demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4\nsurpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these\nmodel capability results, as well as model safety improvements and results, in more detail in later\nsections.","type":"Document"}
Score 4: 0.8209972517303962
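Every result above comes from page 0, as the `pre_filter` requires. To show what a condition such as `{"property": "metadata.page", "operator": "$eq", "value": 0}` means, the sketch below evaluates the same filter shape against plain dictionaries on the client. This is only an illustration of the semantics: the integration is expected to apply the filter as part of the database query, and the helper names `get_path` and `matches` are hypothetical. Treating multiple conditions as a conjunction is also an assumption of this sketch.

```python
def get_path(item: dict, dotted: str):
    """Follow a dotted property path such as 'metadata.page'; None if missing."""
    current = item
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(item: dict, pre_filter: dict) -> bool:
    """Return True if the item satisfies every '$eq' condition in the filter."""
    for condition in pre_filter["conditions"]:
        if condition["operator"] != "$eq":
            raise NotImplementedError(condition["operator"])
        if get_path(item, condition["property"]) != condition["value"]:
            return False
    return True


page_filter = {
    "conditions": [
        {"property": "metadata.page", "operator": "$eq", "value": 0},
    ],
}
items = [
    {"text": "GPT-4 Technical Report", "metadata": {"page": 0}},
    {"text": "2 GPT-4 Observed Safety Challenges", "metadata": {"page": 43}},
]
print([item["text"] for item in items if matches(item, page_filter)])
```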
Full Text Search
from langchain_community.vectorstores.azure_cosmos_db_no_sql import CosmosDBQueryType
query = "What were the compute requirements for training GPT 4"
pre_filter = {
"conditions": [
{
"property": "text",
"operator": "$full_text_contains_any",
"value": "What were the compute requirements for training GPT 4",
},
],
}
results = vector_search.similarity_search_with_score(
query=query,
k=5,
query_type=CosmosDBQueryType.FULL_TEXT_SEARCH,
pre_filter=pre_filter,
)
# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print("\n")
API Reference: CosmosDBQueryType
Result 1: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"cddfb7ac-d953-46f4-8a48-76655f116bcf"},"page_content":"GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction","type":"Document"}
Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"9d59c3ed-deac-48cb-9382-a8ab079334e5"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"dd040c08-6ae1-4c73-8b85-7f034d337891"},"page_content":"these evaluations it performs quite well and often outscores the vast majority of human test takers.\nFor example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers.\nThis contrasts with GPT-3.5, which scores in the bottom 10%.\nOn a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models\nand most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).\nOn the MMLU benchmark [ 35,36], an English-language suite of multiple-choice questions covering\n57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but\nalso demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4\nsurpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these\nmodel capability results, as well as model safety improvements and results, in more detail in later\nsections.","type":"Document"}
Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"ba814d15-2c12-40d2-8934-db58b393ecb8"},"page_content":"model capability results, as well as model safety improvements and results, in more detail in later\nsections.\nThis report also discusses a key challenge of the project, developing deep learning infrastructure and\noptimization methods that behave predictably across a wide range of scales. This allowed us to make\npredictions about the expected performance of GPT-4 (based on small runs trained in similar ways)\nthat were tested against the final run to increase confidence in our training.\nDespite its capabilities, GPT-4 has similar limitations to earlier GPT models [ 1,37,38]: it is not fully\nreliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn\n∗Please cite this work as “OpenAI (2023)\". Full authorship contribution statements appear at the end of the\ndocument. Correspondence regarding this technical report can be sent to [email protected]:2303.08774v6 [cs.CL] 4 Mar 2024","type":"Document"}
Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"9edf6760-a2d0-4a0b-a652-25fc89de1d34"},"page_content":"from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts\nwhere reliability is important.\nGPT-4’s capabilities and limitations create significant and novel safety challenges, and we believe\ncareful study of these challenges is an important area of research given the potential societal impact.\nThis report includes an extensive system card (after the Appendix) describing some of the risks we\nforesee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more.\nIt also describes interventions we made to mitigate potential harms from the deployment of GPT-4,\nincluding adversarial testing with domain experts, and a model-assisted safety pipeline.\n2 Scope and Limitations of this Technical Report\nThis report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a\nTransformer-style model [ 39] pre-trained to predict the next token in a document, using both publicly","type":"Document"}
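The `$full_text_contains_any` operator matches a document when it contains at least one of the search terms. The sketch below illustrates that "any term" idea with a naive lowercase word tokenizer. Both the tokenizer and the assumption that the filter value is split into individual words are simplifications made for this example; the service's own text analysis is likely to differ.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def contains_any(text: str, query: str) -> bool:
    """True if the text shares at least one token with the query."""
    return bool(tokenize(text) & tokenize(query))


question = "What were the compute requirements for training GPT 4"
print(contains_any("GPT-4 is a Transformer-based model", question))
print(contains_any("Sorting Things Out", "compute requirements"))
```

Under this reading, a full natural-language question is a very permissive filter, because common words such as "the" or "for" are enough to produce a match. A short list of distinctive keywords, such as the `"compute requirements"` value used later in this notebook, narrows the results more effectively.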
Full Text Search BM25 Ranking
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search_with_score(
query=query,
k=5,
query_type=CosmosDBQueryType.FULL_TEXT_RANK,
)
# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print("\n")
Result 1: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":2,"id":"f2746fd3-bbcb-4197-b2d5-ee7b355b6009"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}
Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"20153a6c-7c2c-4328-9b0e-e3502d7ac4dd"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\nmodel-specific tuning. To address this, we developed infrastructure and optimization methods that\nhave very predictable behavior across multiple scales. These improvements allowed us to reliably\npredict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–\n10,000×less compute.\n3.1 Loss Prediction\nThe final loss of properly-trained large language models is thought to be well approximated by power\nlaws in the amount of compute used to train the model [41, 42, 2, 14, 15].\nTo verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our","type":"Document"}
Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":2,"id":"6d88f369-4147-4530-9bfb-0ed008413211"},"page_content":"Observed\nPrediction\ngpt-4\n100p 10n 1µ 100µ 0.01 1\nCompute1.02.03.04.05.06.0Bits per wordOpenAI codebase next word predictionFigure 1. Performance of GPT-4 and smaller models. The metric is final loss on a dataset derived\nfrom our internal codebase. This is a convenient, large dataset of code tokens which is not contained in\nthe training set. We chose to look at loss because it tends to be less noisy than other measures across\ndifferent amounts of training compute. A power law fit to the smaller models (excluding GPT-4) is\nshown as the dotted line; this fit accurately predicts GPT-4’s final loss. The x-axis is training compute\nnormalized so that GPT-4 is 1.\nObserved\nPrediction\ngpt-4\n1µ 10µ 100µ 0.001 0.01 0.1 1\nCompute012345– Mean Log Pass RateCapability prediction on 23 coding problems\nFigure 2. Performance of GPT-4 and smaller models. The metric is mean log pass rate on a subset of\nthe HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted","type":"Document"}
Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"90e4bafe-55bb-406b-afba-a0143c810842"},"page_content":"which measures the ability to synthesize Python functions of varying complexity. We successfully\npredicted the pass rate on a subset of the HumanEval dataset by extrapolating from models trained\nwith at most 1,000×less compute (Figure 2).\nFor an individual problem in HumanEval, performance may occasionally worsen with scale. Despite\nthese challenges, we find an approximate power law relationship −EP[log(pass _rate(C))] = α∗C−k\n2In addition to the accompanying system card, OpenAI will soon publish additional thoughts on the social\nand economic implications of AI systems, including the need for effective regulation.\n2","type":"Document"}
Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":71,"id":"10ff5a7f-6638-4446-85b2-6e4314eca938"},"page_content":"Unsupervised Multitask Learners,” 2019.\n[23]G. C. Bowker and S. L. Star, Sorting Things Out . MIT Press, Aug. 2000.\n[24]L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor, A. Glaese, M. Cheng,\nB. Balle, A. Kasirzadeh, C. Biles, S. Brown, Z. Kenton, W. Hawkins, T. Stepleton, A. Birhane,\nL. A. Hendricks, L. Rimell, W. Isaac, J. Haas, S. Legassick, G. Irving, and I. Gabriel, “Taxonomy\nof Risks posed by Language Models,” in 2022 ACM Conference on Fairness, Accountability,\nand Transparency , FAccT ’22, (New York, NY, USA), pp. 214–229, Association for Computing\nMachinery, June 2022.\n72","type":"Document"}
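BM25 ranks documents by how often the query terms occur in them, weighted by how rare each term is across the collection and normalised by document length. The sketch below is a textbook-style BM25 scorer over pre-tokenized documents, included only to show the shape of the formula. The parameter values `k1=1.5` and `b=0.75` are common defaults assumed here; the parameters and tokenization used by Azure Cosmos DB are not stated in this notebook.

```python
import math
from collections import Counter


def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per tokenized document."""
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    ["training", "compute", "compute"],
    ["safety", "challenges"],
    ["training", "data"],
]
scores = bm25_scores(["compute", "training"], docs)
print(scores)
```

The first document wins because it contains both query terms and repeats the rarer one, while the second contains neither term and scores zero. This matches the general pattern in the results above, where short chunks that repeat "training" and "compute" rank highly.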
Hybrid Search
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search_with_score(
query=query,
k=5,
query_type=CosmosDBQueryType.HYBRID,
)
# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print(f"Score {i+1}: ", results[i][1])
    print("\n")
Result 1: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":97,"id":"36dfcd6c-d3cf-4e34-a5d6-cc4d63013cba"},"page_content":"Figure 11: Results on IF evaluations across GPT3.5, GPT3.5-Turbo, GPT-4-launch\n98","type":"Document"}
Score 1: 0.8173275975778744
Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":7,"id":"3d6e4715-4a38-40b1-89f1-e768bad5f9c8"},"page_content":"Preliminary results on a narrow set of academic vision benchmarks can be found in the GPT-4 blog\npost [ 65]. We plan to release more information about GPT-4’s visual capabilities in follow-up work.\n8","type":"Document"}
Score 2: 0.8176419674549597
Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":2,"id":"f2746fd3-bbcb-4197-b2d5-ee7b355b6009"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}
Score 3: 0.8053881702559759
Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"9d59c3ed-deac-48cb-9382-a8ab079334e5"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 4: 0.8394796122122777
Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"20153a6c-7c2c-4328-9b0e-e3502d7ac4dd"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\nmodel-specific tuning. To address this, we developed infrastructure and optimization methods that\nhave very predictable behavior across multiple scales. These improvements allowed us to reliably\npredict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–\n10,000×less compute.\n3.1 Loss Prediction\nThe final loss of properly-trained large language models is thought to be well approximated by power\nlaws in the amount of compute used to train the model [41, 42, 2, 14, 15].\nTo verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our","type":"Document"}
Score 5: 0.8213247840132897
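The scores printed above are not in descending order. The score for document `9d59c3ed-...` is the same value it received in the pure vector search, yet it appears in fourth place, which is consistent with the order coming from a fusion of the vector ranking and the full-text ranking rather than from that score alone. Hybrid search in Azure Cosmos DB is documented as combining rankings with Reciprocal Rank Fusion (RRF). The sketch below shows the general RRF idea; the constant `k=60` and the document IDs are assumptions for illustration, not values taken from the service.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


vector_ranking = ["a", "b", "c"]      # hypothetical order from vector similarity
full_text_ranking = ["c", "a", "d"]   # hypothetical order from BM25

fused = reciprocal_rank_fusion([vector_ranking, full_text_ranking])
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both lists ("a" and "c") rise to the top even though neither ranking alone placed "c" first in the vector list, because RRF rewards agreement between the two rankings rather than a high score in just one.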
Hybrid Search with Filtering
query = "What were the compute requirements for training GPT 4"
pre_filter = {
"conditions": [
{
"property": "text",
"operator": "$full_text_contains_any",
"value": "compute requirements",
},
{"property": "metadata.page", "operator": "$eq", "value": 0},
],
"logical_operator": "$and",
}
results = vector_search.similarity_search_with_score(
    query=query,
    k=5,
    query_type=CosmosDBQueryType.HYBRID,
    # Pass the filter defined above; without it this is a plain hybrid search
    pre_filter=pre_filter,
)
# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print(f"Score {i+1}: ", results[i][1])
    print("\n")