如何加载 JSON
JSON (JavaScript 对象表示法) 是一种开放标准文件格式和数据交换格式,它使用人类可读的文本来存储和传输由属性-值对和数组(或其他可序列化的值)组成的数据对象。
JSON Lines 是一种文件格式,其中每行都是有效的 JSON 值。
LangChain 实现了 JSONLoader,用于将 JSON 和 JSONL 数据转换为 LangChain 文档 对象。它使用指定的 jq 模式来解析 JSON 文件,从而允许将特定字段提取到 LangChain 文档的内容和元数据中。
它使用 jq
Python 包。查看此 手册以获取有关 jq
语法的详细文档。
在这里我们将演示
- 如何将 JSON 和 JSONL 数据加载到 LangChain
文档
的内容中; - 如何将 JSON 和 JSONL 数据加载到与
文档
关联的元数据中。
#!pip install jq
from langchain_community.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint
file_path='./example_data/facebook_chat.json'
data = json.loads(Path(file_path).read_text())
pprint(data)
{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
'is_still_participant': True,
'joinable_mode': {'link': '', 'mode': 1},
'magic_words': [],
'messages': [{'content': 'Bye!',
'sender_name': 'User 2',
'timestamp_ms': 1675597571851},
{'content': 'Oh no worries! Bye',
'sender_name': 'User 1',
'timestamp_ms': 1675597435669},
{'content': 'No Im sorry it was my mistake, the blue one is not '
'for sale',
'sender_name': 'User 2',
'timestamp_ms': 1675596277579},
{'content': 'I thought you were selling the blue one!',
'sender_name': 'User 1',
'timestamp_ms': 1675595140251},
{'content': 'Im not interested in this bag. Im interested in the '
'blue one!',
'sender_name': 'User 1',
'timestamp_ms': 1675595109305},
{'content': 'Here is $129',
'sender_name': 'User 2',
'timestamp_ms': 1675595068468},
{'photos': [{'creation_timestamp': 1675595059,
'uri': 'url_of_some_picture.jpg'}],
'sender_name': 'User 2',
'timestamp_ms': 1675595060730},
{'content': 'Online is at least $100',
'sender_name': 'User 2',
'timestamp_ms': 1675595045152},
{'content': 'How much do you want?',
'sender_name': 'User 1',
'timestamp_ms': 1675594799696},
{'content': 'Goodmorning! $50 is too low.',
'sender_name': 'User 2',
'timestamp_ms': 1675577876645},
{'content': 'Hi! Im interested in your bag. Im offering $50. Let '
'me know if you are interested. Thanks!',
'sender_name': 'User 1',
'timestamp_ms': 1675549022673}],
'participants': [{'name': 'User 1'}, {'name': 'User 2'}],
'thread_path': 'inbox/User 1 and User 2 chat',
'title': 'User 1 and User 2 chat'}
使用 JSONLoader
假设我们有兴趣提取 JSON 数据中 messages
键下的 content
字段中的值。这可以通过 JSONLoader
轻松完成,如下所示。
JSON 文件
loader = JSONLoader(
file_path='./example_data/facebook_chat.json',
jq_schema='.messages[].content',
text_content=False)
data = loader.load()
pprint(data)
[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),
Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),
Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3}),
Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4}),
Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5}),
Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6}),
Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7}),
Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8}),
Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9}),
Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10}),
Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11})]
JSON Lines 文件
如果你想从 JSON Lines 文件加载文档,你需要传入 json_lines=True
并指定 jq_schema
来从单个 JSON 对象中提取 page_content
。
file_path = './example_data/facebook_chat_messages.jsonl'
pprint(Path(file_path).read_text())
('{"sender_name": "User 2", "timestamp_ms": 1675597571851, "content": "Bye!"}\n'
'{"sender_name": "User 1", "timestamp_ms": 1675597435669, "content": "Oh no '
'worries! Bye"}\n'
'{"sender_name": "User 2", "timestamp_ms": 1675596277579, "content": "No Im '
'sorry it was my mistake, the blue one is not for sale"}\n')
loader = JSONLoader(
file_path='./example_data/facebook_chat_messages.jsonl',
jq_schema='.content',
text_content=False,
json_lines=True)
data = loader.load()
pprint(data)
[Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}),
Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 2}),
Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 3})]
另一种选择是设置 jq_schema='.'
并提供 content_key
。
loader = JSONLoader(
file_path='./example_data/facebook_chat_messages.jsonl',
jq_schema='.',
content_key='sender_name',
json_lines=True)
data = loader.load()
pprint(data)
[Document(page_content='User 2', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}),
Document(page_content='User 1', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 2}),
Document(page_content='User 2', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 3})]
使用 jq schema 和 content_key
的 JSON 文件
要使用 jq schema 中的 content_key 从 JSON 文件加载文档,请设置 is_content_key_jq_parsable=True。确保 content_key 与 jq schema 兼容并且可以使用它进行解析。
file_path = './sample.json'
pprint(Path(file_path).read_text())
{"data": [
{"attributes": {
"message": "message1",
"tags": [
"tag1"]},
"id": "1"},
{"attributes": {
"message": "message2",
"tags": [
"tag2"]},
"id": "2"}]}
loader = JSONLoader(
file_path=file_path,
jq_schema=".data[]",
content_key=".attributes.message",
is_content_key_jq_parsable=True,
)
data = loader.load()
pprint(data)
[Document(page_content='message1', metadata={'source': '/path/to/sample.json', 'seq_num': 1}),
Document(page_content='message2', metadata={'source': '/path/to/sample.json', 'seq_num': 2})]
提取元数据
通常,我们希望将 JSON 文件中可用的元数据包含到我们从内容创建的文档中。
以下演示了如何使用 JSONLoader
提取元数据。
需要注意一些关键的变化。在之前我们没有收集元数据的示例中,我们设法直接在 schema 中指定了 page_content
的值可以从哪里提取。
.messages[].content
在当前的示例中,我们必须告诉加载器遍历 messages
字段中的记录。然后 jq_schema 必须是
.messages[]
这允许我们将记录(字典)传递到必须实现的 metadata_func
中。metadata_func
负责识别记录中的哪些信息应包含在最终 Document
对象中存储的元数据中。
此外,我们现在必须通过 content_key
参数在加载器中明确指定从记录中提取 page_content
值的键。
# Define the metadata extraction function.
def metadata_func(record: dict, metadata: dict) -> dict:
metadata["sender_name"] = record.get("sender_name")
metadata["timestamp_ms"] = record.get("timestamp_ms")
return metadata
loader = JSONLoader(
file_path='./example_data/facebook_chat.json',
jq_schema='.messages[]',
content_key="content",
metadata_func=metadata_func
)
data = loader.load()
pprint(data)
[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),
Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),
Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3, 'sender_name': 'User 2', 'timestamp_ms': 1675596277579}),
Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4, 'sender_name': 'User 1', 'timestamp_ms': 1675595140251}),
Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5, 'sender_name': 'User 1', 'timestamp_ms': 1675595109305}),
Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6, 'sender_name': 'User 2', 'timestamp_ms': 1675595068468}),
Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7, 'sender_name': 'User 2', 'timestamp_ms': 1675595060730}),
Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8, 'sender_name': 'User 2', 'timestamp_ms': 1675595045152}),
Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9, 'sender_name': 'User 1', 'timestamp_ms': 1675594799696}),
Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10, 'sender_name': 'User 2', 'timestamp_ms': 1675577876645}),
Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11, 'sender_name': 'User 1', 'timestamp_ms': 1675549022673})]
现在,您将看到文档包含与我们提取的内容关联的元数据。
metadata_func
如上所示,metadata_func
接受由 JSONLoader
生成的默认元数据。 这允许用户完全控制元数据的格式。
例如,默认元数据包含 source
和 seq_num
键。但是,JSON 数据也可能包含这些键。然后,用户可以利用 metadata_func
来重命名默认键,并使用来自 JSON 数据的键。
下面的示例展示了如何修改 source
,使其仅包含相对于 langchain
目录的文件源的信息。
# Define the metadata extraction function.
def metadata_func(record: dict, metadata: dict) -> dict:
metadata["sender_name"] = record.get("sender_name")
metadata["timestamp_ms"] = record.get("timestamp_ms")
if "source" in metadata:
source = metadata["source"].split("/")
source = source[source.index("langchain"):]
metadata["source"] = "/".join(source)
return metadata
loader = JSONLoader(
file_path='./example_data/facebook_chat.json',
jq_schema='.messages[]',
content_key="content",
metadata_func=metadata_func
)
data = loader.load()
pprint(data)
[Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),
Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),
Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3, 'sender_name': 'User 2', 'timestamp_ms': 1675596277579}),
Document(page_content='I thought you were selling the blue one!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4, 'sender_name': 'User 1', 'timestamp_ms': 1675595140251}),
Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5, 'sender_name': 'User 1', 'timestamp_ms': 1675595109305}),
Document(page_content='Here is $129', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6, 'sender_name': 'User 2', 'timestamp_ms': 1675595068468}),
Document(page_content='', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7, 'sender_name': 'User 2', 'timestamp_ms': 1675595060730}),
Document(page_content='Online is at least $100', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8, 'sender_name': 'User 2', 'timestamp_ms': 1675595045152}),
Document(page_content='How much do you want?', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9, 'sender_name': 'User 1', 'timestamp_ms': 1675594799696}),
Document(page_content='Goodmorning! $50 is too low.', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10, 'sender_name': 'User 2', 'timestamp_ms': 1675577876645}),
Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11, 'sender_name': 'User 1', 'timestamp_ms': 1675549022673})]
具有 jq schema 的常见 JSON 结构
以下列表提供了用户可以用来从 JSON 数据中提取内容的可能 jq_schema
的参考,具体取决于结构。
JSON -> [{"text": ...}, {"text": ...}, {"text": ...}]
jq_schema -> ".[].text"
JSON -> {"key": [{"text": ...}, {"text": ...}, {"text": ...}]}
jq_schema -> ".key[].text"
JSON -> ["...", "...", "..."]
jq_schema -> ".[]"