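How to pass multimodal data to models
Here we demonstrate how to pass multimodal input directly to models.
LangChain supports multimodal data as input to chat models:
- Following provider-specific formats
- Following a cross-provider standard
Below, we demonstrate the cross-provider standard. For details on a specific provider's native format, see the chat model integrations.
Most chat models that support multimodal **image** inputs also accept those values in OpenAI's Chat Completions format: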
{
    "type": "image_url",
    "image_url": {"url": image_url},
}
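As a minimal sketch of how inline data can fit this OpenAI-style block, the helper below embeds raw bytes as a base64 `data:` URL in the `url` field. The name `to_openai_image_block` is hypothetical and not part of LangChain; whether a given provider accepts data URLs in this position should be checked against its documentation.

```python
import base64


def to_openai_image_block(raw: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build an OpenAI-style ``image_url`` content block from raw image bytes.

    Hypothetical helper (not a LangChain API): the bytes are embedded as a
    base64 ``data:`` URL in the ``url`` field.
    """
    encoded = base64.b64encode(raw).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


# Placeholder bytes stand in for a real image file.
block = to_openai_image_block(b"fake image bytes", "image/png")
print(block["type"])
```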
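Images
Many providers accept images passed inline as base64 data. Some also accept an image from a URL directly.
Images from base64 data
To pass images inline, format them as content blocks of the following form: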
{
    "type": "image",
    "source_type": "base64",
    "mime_type": "image/jpeg",  # or image/png, etc.
    "data": "<base64 data string>",
}
Example:
import base64
import httpx
from langchain.chat_models import init_chat_model
# Fetch image data
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
# Pass to LLM
llm = init_chat_model("anthropic:claude-3-5-sonnet-latest")
message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the weather in this image:",
        },
        {
            "type": "image",
            "source_type": "base64",
            "data": image_data,
            "mime_type": "image/jpeg",
        },
    ],
}
response = llm.invoke([message])
print(response.text())
The image shows a beautiful clear day with bright blue skies and wispy cirrus clouds stretching across the horizon. The clouds are thin and streaky, creating elegant patterns against the blue backdrop. The lighting suggests it's during the day, possibly late afternoon given the warm, golden quality of the light on the grass. The weather appears calm with no signs of wind (the grass looks relatively still) and no indication of rain. It's the kind of perfect, mild weather that's ideal for walking along the wooden boardwalk through the marsh grass.
See the LangSmith trace for more details.
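When the image is already available as bytes (for instance, read from disk), the base64 content block can be assembled without a network fetch. The sketch below is one possible approach: `image_block_from_bytes` is a hypothetical helper, and inferring the MIME type from the file name with the standard-library `mimetypes` module is our assumption, not something the format requires.

```python
import base64
import mimetypes


def image_block_from_bytes(raw: bytes, filename: str) -> dict:
    """Build a base64 image content block, guessing the MIME type from the name.

    Hypothetical helper: ``filename`` is only used to infer the MIME type.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {filename!r}")
    return {
        "type": "image",
        "source_type": "base64",
        "mime_type": mime_type,
        "data": base64.b64encode(raw).decode("utf-8"),
    }


# Placeholder bytes; a real call would pass the contents of an image file.
block = image_block_from_bytes(b"\xff\xd8\xff", "boardwalk.jpg")
print(block["mime_type"])
```

The resulting dictionary can be placed in a message's `content` list in the same position as the inline block in the example above.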
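Images from a URL
Some providers (including OpenAI, Anthropic, and Google Gemini) also accept images from a URL directly.
To pass images as URLs, format them as content blocks of the following form: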
{
    "type": "image",
    "source_type": "url",
    "url": "https://...",
}
Example:
message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the weather in this image:",
        },
        {
            "type": "image",
            "source_type": "url",
            "url": image_url,
        },
    ],
}
response = llm.invoke([message])
print(response.text())
The weather in this image appears to be pleasant and clear. The sky is mostly blue with a few scattered, light clouds, and there is bright sunlight illuminating the green grass and plants. There are no signs of rain or stormy conditions, suggesting it is a calm, likely warm day—typical of spring or summer.
We can also pass in multiple images:
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Are these two images the same?"},
        {"type": "image", "source_type": "url", "url": image_url},
        {"type": "image", "source_type": "url", "url": image_url},
    ],
}
response = llm.invoke([message])
print(response.text())
Yes, these two images are the same. They depict a wooden boardwalk going through a grassy field under a blue sky with some clouds. The colors, composition, and elements in both images are identical.
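For an arbitrary number of images, the message can be assembled in a loop rather than written out by hand. The following is a rough sketch; `build_image_message` is a hypothetical helper and the example URLs are placeholders.

```python
from typing import Sequence


def build_image_message(prompt: str, image_urls: Sequence[str]) -> dict:
    """Build a user message with one text block followed by one block per URL.

    Hypothetical helper that mirrors the multi-image message shown above.
    """
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image", "source_type": "url", "url": url})
    return {"role": "user", "content": content}


# Placeholder URLs for illustration only.
message = build_image_message(
    "Are these two images the same?",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
)
print(len(message["content"]))
```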
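Documents (PDF)
Some providers (including OpenAI, Anthropic, and Google Gemini) accept PDF documents.
OpenAI requires file names to be specified for PDF inputs. When using LangChain's format, include the filename key. See the example below.
Documents from base64 data
To pass documents inline, format them as content blocks of the following form: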
{
    "type": "file",
    "source_type": "base64",
    "mime_type": "application/pdf",
    "data": "<base64 data string>",
}
Example:
import base64
import httpx
from langchain.chat_models import init_chat_model
# Fetch PDF data
pdf_url = "https://pdfobject.com/pdf/sample.pdf"
pdf_data = base64.b64encode(httpx.get(pdf_url).content).decode("utf-8")
# Pass to LLM
llm = init_chat_model("anthropic:claude-3-5-sonnet-latest")
message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the document:",
        },
        {
            "type": "file",
            "source_type": "base64",
            "data": pdf_data,
            "mime_type": "application/pdf",
        },
    ],
}
response = llm.invoke([message])
print(response.text())
This document appears to be a sample PDF file that contains Lorem ipsum placeholder text. It begins with a title "Sample PDF" followed by the subtitle "This is a simple PDF file. Fun fun fun."
The rest of the document consists of several paragraphs of Lorem ipsum text, which is a commonly used placeholder text in design and publishing. The text is formatted in a clean, readable layout with consistent paragraph spacing. The document appears to be a single page containing four main paragraphs of this placeholder text.
The Lorem ipsum text, while appearing to be Latin, is actually scrambled Latin-like text that is used primarily to demonstrate the visual form of a document or typeface without the distraction of meaningful content. It's commonly used in publishing and graphic design when the actual content is not yet available but the layout needs to be demonstrated.
The document has a professional, simple layout with generous margins and clear paragraph separation, making it an effective example of basic PDF formatting and structure.
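Because some providers expect a file name alongside the PDF data, it can be convenient to build the block in one place and attach the name only when it is supplied. This is a minimal sketch; `pdf_block_from_bytes` is a hypothetical helper, not a LangChain function.

```python
import base64
from typing import Optional


def pdf_block_from_bytes(raw: bytes, filename: Optional[str] = None) -> dict:
    """Build a base64 ``file`` content block for a PDF.

    Hypothetical helper: ``filename`` is added only when provided, for
    providers that require a file name on PDF inputs.
    """
    block = {
        "type": "file",
        "source_type": "base64",
        "mime_type": "application/pdf",
        "data": base64.b64encode(raw).decode("utf-8"),
    }
    if filename is not None:
        block["filename"] = filename
    return block


# Placeholder bytes; a real call would pass the contents of a PDF file.
plain_block = pdf_block_from_bytes(b"%PDF-1.4 placeholder")
named_block = pdf_block_from_bytes(b"%PDF-1.4 placeholder", filename="my-file")
print(sorted(named_block))
```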
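Documents from a URL
Some providers (specifically Anthropic) also accept documents from a URL directly.
To pass documents as URLs, format them as content blocks of the following form: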
{
    "type": "file",
    "source_type": "url",
    "url": "https://...",
}
Example:
message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the document:",
        },
        {
            "type": "file",
            "source_type": "url",
            "url": pdf_url,
        },
    ],
}
response = llm.invoke([message])
print(response.text())
This document appears to be a sample PDF file with both text and an image. It begins with a title "Sample PDF" followed by the text "This is a simple PDF file. Fun fun fun." The rest of the document contains Lorem ipsum placeholder text arranged in several paragraphs. The content is shown both as text and as an image of the formatted PDF, with the same content displayed in a clean, formatted layout with consistent spacing and typography. The document consists of a single page containing this sample text.
Audio
Some providers (including OpenAI and Google Gemini) accept audio inputs.
Audio from base64 data
To pass audio inline, format it as a content block of the following form:
{
    "type": "audio",
    "source_type": "base64",
    "mime_type": "audio/wav",  # or appropriate mime-type
    "data": "<base64 data string>",
}
Example:
import base64
import httpx
from langchain.chat_models import init_chat_model
# Fetch audio data
audio_url = "https://upload.wikimedia.org/wikipedia/commons/3/3d/Alcal%C3%A1_de_Henares_%28RPS_13-04-2024%29_canto_de_ruise%C3%B1or_%28Luscinia_megarhynchos%29_en_el_Soto_del_Henares.wav"
audio_data = base64.b64encode(httpx.get(audio_url).content).decode("utf-8")
# Pass to LLM
llm = init_chat_model("google_genai:gemini-2.0-flash-001")
message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe this audio:",
        },
        {
            "type": "audio",
            "source_type": "base64",
            "data": audio_data,
            "mime_type": "audio/wav",
        },
    ],
}
response = llm.invoke([message])
print(response.text())
The audio appears to consist primarily of bird sounds, specifically bird vocalizations like chirping and possibly other bird songs.
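The `mime_type` field has to match the actual audio encoding. One way to keep it predictable is an explicit extension table, sketched below. Both `AUDIO_MIME_TYPES` and `audio_block_from_bytes` are hypothetical names, the table is deliberately small, and the sketch does not check which of these types a given provider accepts.

```python
import base64
import os

# Illustrative mapping only; extend it for the formats you actually use.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def audio_block_from_bytes(raw: bytes, filename: str) -> dict:
    """Build a base64 audio content block, choosing the MIME type by extension."""
    suffix = os.path.splitext(filename)[1].lower()
    if suffix not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unsupported audio extension: {suffix!r}")
    return {
        "type": "audio",
        "source_type": "base64",
        "mime_type": AUDIO_MIME_TYPES[suffix],
        "data": base64.b64encode(raw).decode("utf-8"),
    }


# Placeholder bytes; a real call would pass the contents of an audio file.
block = audio_block_from_bytes(b"RIFF placeholder", "birdsong.WAV")
print(block["mime_type"])
```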
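Provider-specific parameters
Some providers support or require additional fields on content blocks containing multimodal data. For example, Anthropic lets you specify caching of specific content to reduce token consumption.
To use these fields, you can either:
- Store them directly on the content block; or
- Use the native format supported by each provider (see the chat model integrations for details).
We show three examples below.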
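Before the provider examples, here is a minimal sketch of the first option: copying a standard content block and storing extra provider fields on it. The helper `with_extra_fields` is hypothetical; it refuses to overwrite existing keys so that a provider field cannot silently replace a standard one.

```python
def with_extra_fields(block: dict, **extra) -> dict:
    """Return a copy of ``block`` with provider-specific fields added.

    Hypothetical helper: raises if an extra field would overwrite a key
    that is already present on the block.
    """
    overlap = set(block) & set(extra)
    if overlap:
        raise ValueError(f"Fields already present on block: {sorted(overlap)}")
    return {**block, **extra}


# Placeholder URL for illustration only.
base_block = {"type": "image", "source_type": "url", "url": "https://example.com/a.jpg"}
cached_block = with_extra_fields(base_block, cache_control={"type": "ephemeral"})
print(sorted(cached_block))
```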
Example: Anthropic prompt caching
llm = init_chat_model("anthropic:claude-3-5-sonnet-latest")
message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the weather in this image:",
        },
        {
            "type": "image",
            "source_type": "url",
            "url": image_url,
            "cache_control": {"type": "ephemeral"},
        },
    ],
}
response = llm.invoke([message])
print(response.text())
response.usage_metadata
The image shows a beautiful, clear day with partly cloudy skies. The sky is a vibrant blue with wispy, white cirrus clouds stretching across it. The lighting suggests it's during daylight hours, possibly late afternoon or early evening given the warm, golden quality of the light on the grass. The weather appears calm with no signs of wind (the grass looks relatively still) and no threatening weather conditions. It's the kind of perfect weather you'd want for a walk along this wooden boardwalk through the marshland or grassland area.
{'input_tokens': 1586,
'output_tokens': 117,
'total_tokens': 1703,
'input_token_details': {'cache_read': 0, 'cache_creation': 1582}}
next_message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Summarize that in 5 words.",
        }
    ],
}
response = llm.invoke([message, response, next_message])
print(response.text())
response.usage_metadata
Clear blue skies, wispy clouds.
{'input_tokens': 1716,
'output_tokens': 12,
'total_tokens': 1728,
'input_token_details': {'cache_read': 1582, 'cache_creation': 0}}
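To see how much of a request was served from the cache, the usage dictionaries above can be reduced to a single fraction. This is a rough sketch: `cache_read_fraction` is a hypothetical helper, and it assumes `input_tokens` already counts cached tokens, which is consistent with the first output above (1586 input tokens, of which 1582 were written to the cache) but should be verified for other providers.

```python
def cache_read_fraction(usage: dict) -> float:
    """Return the share of input tokens that were read from the cache.

    Assumes ``input_tokens`` includes cached tokens (an assumption based
    on the sample outputs above, not a documented guarantee).
    """
    input_tokens = usage.get("input_tokens", 0)
    details = usage.get("input_token_details") or {}
    cache_read = details.get("cache_read", 0)
    return cache_read / input_tokens if input_tokens else 0.0


# Usage metadata copied from the two responses shown above.
first_usage = {
    "input_tokens": 1586,
    "output_tokens": 117,
    "total_tokens": 1703,
    "input_token_details": {"cache_read": 0, "cache_creation": 1582},
}
second_usage = {
    "input_tokens": 1716,
    "output_tokens": 12,
    "total_tokens": 1728,
    "input_token_details": {"cache_read": 1582, "cache_creation": 0},
}
print(f"{cache_read_fraction(first_usage):.2%}")
print(f"{cache_read_fraction(second_usage):.2%}")
```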
Example: Anthropic citations
message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Generate a 5 word summary of this document.",
        },
        {
            "type": "file",
            "source_type": "base64",
            "data": pdf_data,
            "mime_type": "application/pdf",
            "citations": {"enabled": True},
        },
    ],
}
response = llm.invoke([message])
response.content
[{'citations': [{'cited_text': 'Sample PDF\r\nThis is a simple PDF file. Fun fun fun.\r\n',
'document_index': 0,
'document_title': None,
'end_page_number': 2,
'start_page_number': 1,
'type': 'page_location'}],
'text': 'Simple PDF file: fun fun',
'type': 'text'}]
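The citation data arrives nested inside the response content, so a small traversal helps when only the cited passages are needed. The sketch below operates on a list shaped like the output above; `collect_citations` is a hypothetical helper, and other providers or citation types may use different keys.

```python
from typing import Any, Dict, List, Union


def collect_citations(content: Union[str, List[Any]]) -> List[Dict[str, Any]]:
    """Flatten citations out of a response content list.

    Hypothetical helper: handles plain-string content (no citations) and
    skips blocks that are not dictionaries.
    """
    if isinstance(content, str):
        return []
    found = []
    for block in content:
        if not isinstance(block, dict):
            continue
        for citation in block.get("citations") or []:
            found.append(
                {
                    "answer_text": block.get("text", ""),
                    "cited_text": citation.get("cited_text", ""),
                    "pages": (
                        citation.get("start_page_number"),
                        citation.get("end_page_number"),
                    ),
                }
            )
    return found


# Content copied from the response shown above.
sample_content = [
    {
        "citations": [
            {
                "cited_text": "Sample PDF\r\nThis is a simple PDF file. Fun fun fun.\r\n",
                "document_index": 0,
                "document_title": None,
                "end_page_number": 2,
                "start_page_number": 1,
                "type": "page_location",
            }
        ],
        "text": "Simple PDF file: fun fun",
        "type": "text",
    }
]
citations = collect_citations(sample_content)
for item in citations:
    print(item["pages"], repr(item["cited_text"]))
```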
Example: OpenAI file names
OpenAI requires that PDF documents be associated with file names:
llm = init_chat_model("openai:gpt-4.1")
message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the document:",
        },
        {
            "type": "file",
            "source_type": "base64",
            "data": pdf_data,
            "mime_type": "application/pdf",
            "filename": "my-file",
        },
    ],
}
response = llm.invoke([message])
print(response.text())
The document is a sample PDF file containing placeholder text. It consists of one page, titled "Sample PDF". The content is a mixture of English and the commonly used filler text "Lorem ipsum dolor sit amet..." and its extensions, which are often used in publishing and web design as generic text to demonstrate font, layout, and other visual elements.
**Key points about the document:**
- Length: 1 page
- Purpose: Demonstrative/sample content
- Content: No substantive or meaningful information, just demonstration text in paragraph form
- Language: English (with the Latin-like "Lorem Ipsum" text used for layout purposes)
There are no charts, tables, diagrams, or images on the page—only plain text. The document serves as an example of what a PDF file looks like rather than providing actual, useful content.
Tool calls
Some multimodal models support tool calling features as well. To call tools using such models, simply bind tools to them in the usual way, and invoke the model using content blocks of the desired type (e.g., containing image data).
from typing import Literal
from langchain_core.tools import tool
@tool
def weather_tool(weather: Literal["sunny", "cloudy", "rainy"]) -> None:
    """Describe the weather"""
    pass
llm_with_tools = llm.bind_tools([weather_tool])
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the weather in this image:"},
        {"type": "image", "source_type": "url", "url": image_url},
    ],
}
response = llm_with_tools.invoke([message])
response.tool_calls
[{'name': 'weather_tool',
'args': {'weather': 'sunny'},
'id': 'toolu_01G6JgdkhwggKcQKfhXZQPjf',
'type': 'tool_call'}]
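Once the model returns tool calls in this shape, each one can be routed to an ordinary Python function by name. The sketch below works on plain dictionaries like the output above; `run_tool_calls` and `report_weather` are hypothetical, and the result dictionaries are an illustrative shape rather than LangChain's own tool message type.

```python
from typing import Any, Callable, Dict, List


def run_tool_calls(
    tool_calls: List[Dict[str, Any]],
    registry: Dict[str, Callable[..., Any]],
) -> List[Dict[str, Any]]:
    """Call the registered function for each tool call and collect the outputs."""
    results = []
    for call in tool_calls:
        name = call["name"]
        if name not in registry:
            raise KeyError(f"No function registered for tool {name!r}")
        output = registry[name](**call["args"])
        results.append({"tool_call_id": call["id"], "name": name, "output": output})
    return results


def report_weather(weather: str) -> str:
    """Hypothetical stand-in for real tool logic."""
    return f"The weather looks {weather}."


# Tool call copied from the response shown above.
sample_tool_calls = [
    {
        "name": "weather_tool",
        "args": {"weather": "sunny"},
        "id": "toolu_01G6JgdkhwggKcQKfhXZQPjf",
        "type": "tool_call",
    }
]
results = run_tool_calls(sample_tool_calls, {"weather_tool": report_weather})
print(results[0]["output"])
```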