跳到主要内容
Open In ColabOpen on GitHub

如何对CSV文件进行问答

大型语言模型(LLM)非常适合构建针对各种数据源的问答系统。在本节中,我们将介绍如何针对存储在CSV文件中的数据构建问答系统。与使用SQL数据库类似,处理CSV文件的关键是让LLM能够访问查询和与数据交互的工具。主要有两种方法可以实现这一点:

  • 推荐:将CSV文件加载到SQL数据库中,并使用《SQL 教程》中概述的方法。
  • 让LLM访问Python环境,在其中可以使用Pandas等库与数据交互。

本指南将介绍这两种方法。

⚠️ 安全注意事项 ⚠️

上述两种方法都存在重大风险。使用SQL需要执行模型生成的SQL查询。使用Pandas等库则需要允许模型执行Python代码。由于SQL连接权限更容易严格限定并且SQL查询更容易进行清理,而不是沙盒化Python环境,因此**我们强烈建议通过SQL与CSV数据进行交互。**有关一般安全最佳实践的更多信息,请参见此处

设置

本指南的依赖项

%pip install -qU langchain langchain-openai langchain-community langchain-experimental pandas

设置所需环境变量

# Using LangSmith is recommended but not required. Uncomment below lines to use.
# import os
# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

如果尚未拥有,请下载泰坦尼克号数据集

!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv -O titanic.csv
import pandas as pd

df = pd.read_csv("titanic.csv")
print(df.shape)
print(df.columns.tolist())
(887, 8)
['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']

SQL

使用SQL与CSV数据交互是推荐的方法,因为它比任意Python代码更容易限制权限和清理查询。

大多数SQL数据库都可以轻松地将CSV文件加载为表(如DuckDBSQLite等)。完成此操作后,您就可以使用《SQL 教程》中概述的所有链和代理创建技术。以下是使用SQLite实现此目的的快速示例:

from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine

engine = create_engine("sqlite:///titanic.db")
df.to_sql("titanic", engine, index=False)
API 参考:SQLDatabase
887
db = SQLDatabase(engine=engine)
print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT * FROM titanic WHERE Age < 2;"))
sqlite
['titanic']
[(1, 2, 'Master. Alden Gates Caldwell', 'male', 0.83, 0, 2, 29.0), (0, 3, 'Master. Eino Viljami Panula', 'male', 1.0, 4, 1, 39.6875), (1, 3, 'Miss. Eleanor Ileen Johnson', 'female', 1.0, 1, 1, 11.1333), (1, 2, 'Master. Richard F Becker', 'male', 1.0, 2, 1, 39.0), (1, 1, 'Master. Hudson Trevor Allison', 'male', 0.92, 1, 2, 151.55), (1, 3, 'Miss. Maria Nakid', 'female', 1.0, 0, 2, 15.7417), (0, 3, 'Master. Sidney Leonard Goodwin', 'male', 1.0, 5, 2, 46.9), (1, 3, 'Miss. Helene Barbara Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 3, 'Miss. Eugenie Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 2, 'Master. Viljo Hamalainen', 'male', 0.67, 1, 1, 14.5), (1, 3, 'Master. Bertram Vere Dean', 'male', 1.0, 1, 2, 20.575), (1, 3, 'Master. Assad Alexander Thomas', 'male', 0.42, 0, 1, 8.5167), (1, 2, 'Master. Andre Mallet', 'male', 1.0, 0, 2, 37.0042), (1, 2, 'Master. George Sibley Richards', 'male', 0.83, 1, 1, 18.75)]

并创建一个SQL 代理与其交互

pip install -qU "langchain[google-genai]"
import getpass
import os

if not os.environ.get("GOOGLE_API_KEY"):
os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")
from langchain_community.agent_toolkits import create_sql_agent

agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)
API 参考:create_sql_agent
agent_executor.invoke({"input": "what's the average age of survivors"})


> Entering new SQL Agent Executor chain...

Invoking: `sql_db_list_tables` with `{}`


titanic
Invoking: `sql_db_schema` with `{'table_names': 'titanic'}`



CREATE TABLE titanic (
"Survived" BIGINT,
"Pclass" BIGINT,
"Name" TEXT,
"Sex" TEXT,
"Age" FLOAT,
"Siblings/Spouses Aboard" BIGINT,
"Parents/Children Aboard" BIGINT,
"Fare" FLOAT
)

/*
3 rows from titanic table:
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 3 Mr. Owen Harris Braund male 22.0 1 0 7.25
1 1 Mrs. John Bradley (Florence Briggs Thayer) Cumings female 38.0 1 0 71.2833
1 3 Miss. Laina Heikkinen female 26.0 0 0 7.925
*/
Invoking: `sql_db_query` with `{'query': 'SELECT AVG(Age) AS Average_Age FROM titanic WHERE Survived = 1'}`


[(28.408391812865496,)]The average age of survivors in the Titanic dataset is approximately 28.41 years.

> Finished chain.
{'input': "what's the average age of survivors",
'output': 'The average age of survivors in the Titanic dataset is approximately 28.41 years.'}

这种方法可以轻松推广到多个CSV文件,因为我们可以将每个文件作为独立的表加载到数据库中。请参阅下面的“多个 CSV 文件”部分。

Pandas

除了SQL,我们还可以使用Pandas等数据分析库以及LLM的代码生成能力来与CSV数据交互。同样,**除非您有完善的安全防护措施,否则这种方法不适合生产环境使用**。因此,我们的代码执行工具和构造函数位于 langchain-experimental 包中。

大多数LLM都经过了足够的Pandas Python代码训练,以至于只要提出要求,它们就能生成相应的代码。

ai_msg = llm.invoke(
"I have a pandas DataFrame 'df' with columns 'Age' and 'Fare'. Write code to compute the correlation between the two columns. Return Markdown for a Python code snippet and nothing else."
)
print(ai_msg.content)
\`\`\`python
correlation = df['Age'].corr(df['Fare'])
correlation
\`\`\`

我们可以将这种能力与Python执行工具结合起来,创建一个简单的数据分析链。我们首先需要将CSV表加载为DataFrame,并赋予该工具访问此DataFrame的权限。

import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.tools import PythonAstREPLTool

df = pd.read_csv("titanic.csv")
tool = PythonAstREPLTool(locals={"df": df})
tool.invoke("df['Fare'].mean()")
32.30542018038331

为了帮助强制正确使用我们的Python工具,我们将使用工具调用功能。

llm_with_tools = llm.bind_tools([tool], tool_choice=tool.name)
response = llm_with_tools.invoke(
"I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
response
AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_SBrK246yUbdnJemXFC8Iod05', 'function': {'arguments': '{"query":"df.corr()[\'Age\'][\'Fare\']"}', 'name': 'python_repl_ast'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 125, 'total_tokens': 138}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_3b956da36b', 'finish_reason': 'stop', 'logprobs': None}, id='run-1fd332ba-fa72-4351-8182-d464e7368311-0', tool_calls=[{'name': 'python_repl_ast', 'args': {'query': "df.corr()['Age']['Fare']"}, 'id': 'call_SBrK246yUbdnJemXFC8Iod05'}])
response.tool_calls
[{'name': 'python_repl_ast',
'args': {'query': "df.corr()['Age']['Fare']"},
'id': 'call_SBrK246yUbdnJemXFC8Iod05'}]

我们将添加一个工具输出解析器,将函数调用提取为字典。

from langchain_core.output_parsers.openai_tools import JsonOutputKeyToolsParser

parser = JsonOutputKeyToolsParser(key_name=tool.name, first_tool_only=True)
(llm_with_tools | parser).invoke(
"I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
{'query': "df[['Age', 'Fare']].corr()"}

并结合提示,这样我们只需指定问题,而无需每次调用都指定DataFrame信息。

system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:

\`\`\`
{df.head().to_markdown()}
\`\`\`

Given a user question, write the Python code to answer it. \
Return ONLY the valid Python code and nothing else. \
Don't assume you have access to any libraries other than built-in Python ones and pandas."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
code_chain = prompt | llm_with_tools | parser
code_chain.invoke({"question": "What's the correlation between age and fare"})
{'query': "df[['Age', 'Fare']].corr()"}

最后,我们将添加我们的Python工具,以便实际执行生成的代码。

chain = prompt | llm_with_tools | parser | tool
chain.invoke({"question": "What's the correlation between age and fare"})
0.11232863699941621

就这样,我们有了一个简单的数据分析链。我们可以通过查看LangSmith跟踪来了解中间步骤:https://smith.langchain.com/public/b1309290-7212-49b7-bde2-75b39a32b49a/r

我们可以在最后添加一个额外的LLM调用来生成对话式响应,这样我们就不仅仅是回复工具输出了。为此,我们需要在提示中添加一个聊天历史消息占位符(MessagesPlaceholder)。

from operator import itemgetter

from langchain_core.messages import ToolMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough

system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:

\`\`\`
{df.head().to_markdown()}
\`\`\`

Given a user question, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas.
Respond directly to the question once you have enough information to answer it."""
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
system,
),
("human", "{question}"),
# This MessagesPlaceholder allows us to optionally append an arbitrary number of messages
# at the end of the prompt using the 'chat_history' arg.
MessagesPlaceholder("chat_history", optional=True),
]
)


def _get_chat_history(x: dict) -> list:
"""Parse the chain output up to this point into a list of chat history messages to insert in the prompt."""
ai_msg = x["ai_msg"]
tool_call_id = x["ai_msg"].additional_kwargs["tool_calls"][0]["id"]
tool_msg = ToolMessage(tool_call_id=tool_call_id, content=str(x["tool_output"]))
return [ai_msg, tool_msg]


chain = (
RunnablePassthrough.assign(ai_msg=prompt | llm_with_tools)
.assign(tool_output=itemgetter("ai_msg") | parser | tool)
.assign(chat_history=_get_chat_history)
.assign(response=prompt | llm | StrOutputParser())
.pick(["tool_output", "response"])
)
chain.invoke({"question": "What's the correlation between age and fare"})
{'tool_output': 0.11232863699941616,
'response': 'The correlation between age and fare is approximately 0.1123.'}

以下是此运行的LangSmith跟踪:https://smith.langchain.com/public/14e38d70-45b1-4b81-8477-9fd2b7c07ea6/r

代理

对于复杂问题,LLM能够迭代执行代码并同时维护其先前执行的输入和输出会很有帮助。这就是代理(Agents)发挥作用的地方。它们允许LLM决定工具需要被调用的次数,并跟踪迄今为止的执行情况。`create_pandas_dataframe_agent` 是一个内置代理,可以轻松处理DataFrames。

from langchain_experimental.agents import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(
llm, df, agent_type="openai-tools", verbose=True, allow_dangerous_code=True
)
agent.invoke(
{
"input": "What's the correlation between age and fare? is that greater than the correlation between fare and survival?"
}
)


> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df[['Age', 'Fare']].corr().iloc[0,1]"}`


0.11232863699941621
Invoking: `python_repl_ast` with `{'query': "df[['Fare', 'Survived']].corr().iloc[0,1]"}`


0.2561785496289603The correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.

Therefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).

> Finished chain.
{'input': "What's the correlation between age and fare? is that greater than the correlation between fare and survival?",
'output': 'The correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.\n\nTherefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).'}

以下是此运行的LangSmith跟踪:https://smith.langchain.com/public/6a86aee2-4f22-474a-9264-bd4c7283e665/r

多个 CSV 文件

为了处理多个CSV文件(或DataFrames),我们只需将多个DataFrames传递给我们的Python工具。我们的 `create_pandas_dataframe_agent` 构造函数可以直接做到这一点,我们可以传入一个DataFrame列表而不是单个DataFrame。如果我们要自己构建一个链,可以这样做:

df_1 = df[["Age", "Fare"]]
df_2 = df[["Fare", "Survived"]]

tool = PythonAstREPLTool(locals={"df_1": df_1, "df_2": df_2})
llm_with_tool = llm.bind_tools(tools=[tool], tool_choice=tool.name)
df_template = """\`\`\`python
{df_name}.head().to_markdown()
>>> {df_head}
\`\`\`"""
df_context = "\n\n".join(
df_template.format(df_head=_df.head().to_markdown(), df_name=df_name)
for _df, df_name in [(df_1, "df_1"), (df_2, "df_2")]
)

system = f"""You have access to a number of pandas dataframes. \
Here is a sample of rows from each dataframe and the python code that was used to generate the sample:

{df_context}

Given a user question about the dataframes, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas. \
Make sure to refer only to the variables mentioned above."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])

chain = prompt | llm_with_tool | parser | tool
chain.invoke(
{
"question": "return the difference in the correlation between age and fare and the correlation between fare and survival"
}
)
0.14384991262954416

以下是此运行的LangSmith跟踪:https://smith.langchain.com/public/cc2a7d7f-7c5a-4e77-a10c-7b5420fcd07f/r

沙盒代码执行

有一些工具,如E2BBearly,提供了Python代码执行的沙盒环境,以实现更安全的代码执行链和代理。

下一步

对于更高级的数据分析应用,我们建议查看:

  • SQL 教程:处理SQL数据库和CSV文件时遇到的许多挑战对于任何结构化数据类型都是通用的,因此即使您使用Pandas进行CSV数据分析,阅读SQL技术也会很有用。
  • 工具使用:关于使用调用工具的链和代理时一般最佳实践的指南
  • 代理:了解构建LLM代理的基础知识。
  • 集成:E2BBearly 等沙盒环境,SQLDatabase 等实用程序,Spark DataFrame agent 等相关代理。