How to cache chat model responses
LangChain provides an optional caching layer for chat models. This is useful for two main reasons:
- It can save you money by reducing the number of API calls you make to the LLM provider, if you're often requesting the same completion multiple times. This is especially useful during app development.
- It can speed up your application by reducing the number of API calls you make to the LLM provider.
This guide will walk you through how to enable this in your apps.
Select a chat model:
pip install -qU langchain-openai
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
from langchain_core.globals import set_llm_cache
API Reference: set_llm_cache
In-memory cache
This is an ephemeral cache that stores model calls in memory. It is wiped when your environment restarts, and it is not shared across processes.
%%time
from langchain_core.caches import InMemoryCache
set_llm_cache(InMemoryCache())
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
API Reference: InMemoryCache
CPU times: user 645 ms, sys: 214 ms, total: 859 ms
Wall time: 829 ms
AIMessage(content="Why don't scientists trust atoms?\n\nBecause they make up everything!", response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 11, 'total_tokens': 24}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None}, id='run-b6836bdd-8c30-436b-828f-0ac5fc9ab50e-0')
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 822 µs, sys: 288 µs, total: 1.11 ms
Wall time: 1.06 ms
AIMessage(content="Why don't scientists trust atoms?\n\nBecause they make up everything!", response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 11, 'total_tokens': 24}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None}, id='run-b6836bdd-8c30-436b-828f-0ac5fc9ab50e-0')
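Note that cache lookups require an exact match: the cache is keyed on the serialized prompt together with the model's parameters, so a reworded prompt or a changed setting (e.g. temperature) results in a fresh API call. Here is a minimal sketch (not part of the original walkthrough) illustrating this, along with the clear() method for emptying the cache:

cache = InMemoryCache()
set_llm_cache(cache)

llm.invoke("Tell me a joke")        # miss: calls the API and stores the response
llm.invoke("Tell me a joke")        # hit: served from memory, no API call
llm.invoke("Tell me a short joke")  # miss: different prompt, so a new API call

cache.clear()  # empty the cache; the next identical call hits the API again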
SQLite cache
This cache implementation uses a SQLite database to store responses, so cached results persist across process restarts.
!rm .langchain.db  # delete any existing cache database so the demo starts fresh
# We can do the same thing with a SQLite cache
from langchain_community.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))
API Reference: SQLiteCache
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 9.91 ms, sys: 7.68 ms, total: 17.6 ms
Wall time: 657 ms
AIMessage(content='Why did the scarecrow win an award? Because he was outstanding in his field!', response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 11, 'total_tokens': 28}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None}, id='run-39d9e1e8-7766-4970-b1d8-f50213fd94c5-0')
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 52.2 ms, sys: 60.5 ms, total: 113 ms
Wall time: 127 ms
AIMessage(content='Why did the scarecrow win an award? Because he was outstanding in his field!', id='run-39d9e1e8-7766-4970-b1d8-f50213fd94c5-0')
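If you later want to turn caching back off, set_llm_cache accepts None, which clears the global cache. A minimal sketch, assuming you also want a single uncached model (the per-model cache parameter is an assumption here; check that your langchain-core version supports it):

set_llm_cache(None)  # remove the global cache; every invoke hits the API again

# Assumption: chat models accept a `cache` parameter that overrides the
# global setting; cache=False opts this model out of caching entirely.
llm_uncached = ChatOpenAI(model="gpt-4o-mini", cache=False)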
Next steps
You've now learned how to cache model responses to save time and money.
Next, check out the other how-to guides on chat models in this section, like how to get a model to return structured output or how to create your own custom chat model.