如何解析 XML 输出
先决条件
本指南假设您熟悉以下概念
来自不同提供商的大型语言模型(LLM)通常根据其训练的特定数据而具有不同的优势。这也意味着某些模型在生成非 JSON 格式的输出时可能“更好”且更可靠。
本指南将向您展示如何使用 XMLOutputParser
来提示模型生成 XML 输出,然后解析该输出为可用的格式。
注意
请记住,大型语言模型是存在漏洞的抽象!您需要使用具有足够能力生成格式良好的 XML 的 LLM。
在以下示例中,我们使用 Anthropic 的 Claude-2 模型(https://docs.anthropic.com/claude/docs),该模型是针对 XML 标签优化的模型之一。
%pip install -qU langchain langchain-anthropic
import os
from getpass import getpass
if "ANTHROPIC_API_KEY" not in os.environ:
os.environ["ANTHROPIC_API_KEY"] = getpass()
让我们从对模型的一个简单请求开始。
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import XMLOutputParser
from langchain_core.prompts import PromptTemplate
model = ChatAnthropic(model="claude-2.1", max_tokens_to_sample=512, temperature=0.1)
actor_query = "Generate the shortened filmography for Tom Hanks."
output = model.invoke(
f"""{actor_query}
Please enclose the movies in <movie></movie> tags"""
)
print(output.content)
Here is the shortened filmography for Tom Hanks, with movies enclosed in XML tags:
<movie>Splash</movie>
<movie>Big</movie>
<movie>A League of Their Own</movie>
<movie>Sleepless in Seattle</movie>
<movie>Forrest Gump</movie>
<movie>Toy Story</movie>
<movie>Apollo 13</movie>
<movie>Saving Private Ryan</movie>
<movie>Cast Away</movie>
<movie>The Da Vinci Code</movie>
这实际上效果很好!但是如果能将 XML 解析成更易于使用的格式就更好了。我们可以使用 XMLOutputParser
来向提示添加默认的格式指令,并将输出的 XML 解析为字典。
parser = XMLOutputParser()
# We will add these instructions to the prompt below
parser.get_format_instructions()
'The output should be formatted as a XML file.\n1. Output should conform to the tags below. \n2. If tags are not given, make them on your own.\n3. Remember to always open and close all the tags.\n\nAs an example, for the tags ["foo", "bar", "baz"]:\n1. String "<foo>\n <bar>\n <baz></baz>\n </bar>\n</foo>" is a well-formatted instance of the schema. \n2. String "<foo>\n <bar>\n </foo>" is a badly-formatted instance.\n3. String "<foo>\n <tag>\n </tag>\n</foo>" is a badly-formatted instance.\n\nHere are the output tags:\n\`\`\`\nNone\n\`\`\`'
prompt = PromptTemplate(
template="""{query}\n{format_instructions}""",
input_variables=["query"],
partial_variables={"format_instructions": parser.get_format_instructions()},
)
chain = prompt | model | parser
output = chain.invoke({"query": actor_query})
print(output)
{'filmography': [{'movie': [{'title': 'Big'}, {'year': '1988'}]}, {'movie': [{'title': 'Forrest Gump'}, {'year': '1994'}]}, {'movie': [{'title': 'Toy Story'}, {'year': '1995'}]}, {'movie': [{'title': 'Saving Private Ryan'}, {'year': '1998'}]}, {'movie': [{'title': 'Cast Away'}, {'year': '2000'}]}]}
我们还可以添加一些标签来定制输出以满足我们的需求。您可以在提示的其他部分尝试添加您自己的格式提示,以增强或替换默认指令。
parser = XMLOutputParser(tags=["movies", "actor", "film", "name", "genre"])
# We will add these instructions to the prompt below
parser.get_format_instructions()
'The output should be formatted as a XML file.\n1. Output should conform to the tags below. \n2. If tags are not given, make them on your own.\n3. Remember to always open and close all the tags.\n\nAs an example, for the tags ["foo", "bar", "baz"]:\n1. String "<foo>\n <bar>\n <baz></baz>\n </bar>\n</foo>" is a well-formatted instance of the schema. \n2. String "<foo>\n <bar>\n </foo>" is a badly-formatted instance.\n3. String "<foo>\n <tag>\n </tag>\n</foo>" is a badly-formatted instance.\n\nHere are the output tags:\n\`\`\`\n[\'movies\', \'actor\', \'film\', \'name\', \'genre\']\n\`\`\`'
prompt = PromptTemplate(
template="""{query}\n{format_instructions}""",
input_variables=["query"],
partial_variables={"format_instructions": parser.get_format_instructions()},
)
chain = prompt | model | parser
output = chain.invoke({"query": actor_query})
print(output)
{'movies': [{'actor': [{'name': 'Tom Hanks'}, {'film': [{'name': 'Forrest Gump'}, {'genre': 'Drama'}]}, {'film': [{'name': 'Cast Away'}, {'genre': 'Adventure'}]}, {'film': [{'name': 'Saving Private Ryan'}, {'genre': 'War'}]}]}]}
此输出解析器还支持部分块的流式传输。这是一个例子
for s in chain.stream({"query": actor_query}):
print(s)
{'movies': [{'actor': [{'name': 'Tom Hanks'}]}]}
{'movies': [{'actor': [{'film': [{'name': 'Forrest Gump'}]}]}]}
{'movies': [{'actor': [{'film': [{'genre': 'Drama'}]}]}]}
{'movies': [{'actor': [{'film': [{'name': 'Cast Away'}]}]}]}
{'movies': [{'actor': [{'film': [{'genre': 'Adventure'}]}]}]}
{'movies': [{'actor': [{'film': [{'name': 'Saving Private Ryan'}]}]}]}
{'movies': [{'actor': [{'film': [{'genre': 'War'}]}]}]}
后续步骤
您现在已经了解了如何提示模型返回 XML。接下来,查看关于获取结构化输出的更广泛指南,了解其他相关技术。