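Generate Synthetic Data

Synthetic data is artificially generated data, rather than data collected from real-world events. It is used to simulate real data without compromising privacy or running into real-world limitations.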

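Benefits of Synthetic Data

Privacy and security: No real personal data is at risk of being leaked.
Data augmentation: Expands datasets for machine learning.
Flexibility: Lets you create specific or rare scenarios.
Cost-effective: Often cheaper than collecting real-world data.
Regulatory compliance: Helps you comply with strict data protection laws.
Model robustness: Can lead to AI models that generalize better.
Rapid prototyping: Enables quick testing without real data.
Controlled experimentation: Lets you simulate specific conditions.
Access to data: Provides an alternative when real data isn't available.

Note: Despite these benefits, synthetic data should be used with care, as it may not always capture the complexity of the real world.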

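Quickstart

In this notebook, we'll dive into generating synthetic medical billing records using the langchain library. This tool is particularly useful when you want to develop or test algorithms but don't want to use real patient data because of privacy concerns or data availability issues.

Setup

First, you'll need to install the langchain library along with its dependencies. Since we're using the OpenAI generator chain, we'll install langchain-openai as well. The synthetic data generator lives in an experimental package, so langchain_experimental must also be included in the install. We'll then import the necessary modules.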

%pip install --upgrade --quiet  langchain langchain_experimental langchain-openai

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

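API Reference: FewShotPromptTemplate | PromptTemplate | OPENAI_TEMPLATE | create_openai_data_generator | ChatOpenAI

1. Define Your Data Model

Every dataset has a structure, or "schema". The MedicalBilling class below serves as the schema for our synthetic data. By defining it, we tell the synthetic data generator about the shape and nature of the data we expect.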

class MedicalBilling(BaseModel):
    patient_id: int
    patient_name: str
    diagnosis_code: str
    procedure_code: str
    total_charge: float
    insurance_claim_amount: float

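For instance, every record will have a patient_id that's an integer, a patient_name that's a string, and so on.

2. Sample Data

To guide the synthetic data generator, it helps to provide a few real-world-like examples. These examples serve as a "seed": they are representative of the kind of data you want, and the generator will use them to create more data that looks similar.

Here are some fictional medical billing records: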

examples = [
    {
        "example": """Patient ID: 123456, Patient Name: John Doe, Diagnosis Code: 
        J20.9, Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350"""
    },
    {
        "example": """Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis 
        Code: M54.5, Procedure Code: 99213, Total Charge: $150, Insurance Claim Amount: $120"""
    },
    {
        "example": """Patient ID: 345678, Patient Name: Emily Stone, Diagnosis Code: 
        E11.9, Procedure Code: 99214, Total Charge: $300, Insurance Claim Amount: $250"""
    },
]
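Each seed string above carries the same six fields as the MedicalBilling schema. As a minimal sketch of that correspondence (not part of the LangChain workflow; `parse_seed_example`, `LABEL_TO_FIELD`, and `CASTS` are hypothetical names, and the parsing assumes the exact comma-separated `Label: value` layout used above), the following turns one seed string into a dict with the schema's field names and types:

```python
import re

# Maps the labels used in the seed strings to the MedicalBilling field names.
LABEL_TO_FIELD = {
    "Patient ID": "patient_id",
    "Patient Name": "patient_name",
    "Diagnosis Code": "diagnosis_code",
    "Procedure Code": "procedure_code",
    "Total Charge": "total_charge",
    "Insurance Claim Amount": "insurance_claim_amount",
}

# Type conversion per field, mirroring the annotations on MedicalBilling.
CASTS = {
    "patient_id": int,
    "total_charge": lambda s: float(s.lstrip("$")),
    "insurance_claim_amount": lambda s: float(s.lstrip("$")),
}


def parse_seed_example(text: str) -> dict:
    """Parse 'Label: value, Label: value, ...' into a schema-shaped dict."""
    # Collapse the line break and indentation left by the triple-quoted string.
    flat = re.sub(r"\s+", " ", text).strip()
    record = {}
    for part in flat.split(", "):
        label, _, value = part.partition(": ")
        field = LABEL_TO_FIELD[label.strip()]
        # Fields without an explicit cast stay as strings.
        record[field] = CASTS.get(field, str)(value.strip())
    return record


seed = """Patient ID: 123456, Patient Name: John Doe, Diagnosis Code:
        J20.9, Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350"""
print(parse_seed_example(seed))
```

A helper like this is only useful for inspecting the seeds; the generator itself consumes the raw strings.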

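3. Craft a Prompt Template

The generator doesn't magically know how to create our data; we need to guide it. We do this by creating a prompt template, which instructs the underlying language model on how to produce synthetic data in the desired format.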

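# Note: this rebinds the OPENAI_TEMPLATE name imported above to a minimal
# template that passes each example string through unchanged.
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")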

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

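The FewShotPromptTemplate includes:

prefix and suffix: These likely contain guiding context or instructions.
examples: The sample data we defined earlier.
input_variables: These variables ("subject", "extra") are placeholders you can fill in dynamically later. For instance, "subject" might be filled with "medical_billing" to guide the model further.
example_prompt: The prompt template that defines the format of each example row included in the prompt.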
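To make the roles of these pieces concrete, here is a plain-Python sketch of how a few-shot prompt can be assembled from a prefix, one formatted line per example, and a suffix. This is not LangChain's implementation: `PREFIX`, `SUFFIX`, `EXAMPLE_TEMPLATE`, and `build_few_shot_prompt` are hypothetical stand-ins, and the wording is invented for illustration rather than copied from SYNTHETIC_FEW_SHOT_PREFIX or SYNTHETIC_FEW_SHOT_SUFFIX.

```python
# Hypothetical stand-ins for the library's prefix and suffix constants.
PREFIX = "Generate synthetic data about {subject}. Examples:"
SUFFIX = "Now produce one new record about {subject}. Additional instructions: {extra}"
EXAMPLE_TEMPLATE = "{example}"  # same shape as the example_prompt above


def build_few_shot_prompt(examples: list[dict], subject: str, extra: str) -> str:
    """Join prefix, one formatted line per example, and suffix into one prompt."""
    # Fill the input variables in the prefix and suffix only, so braces that
    # happen to appear inside an example are never treated as placeholders.
    prefix = PREFIX.format(subject=subject)
    suffix = SUFFIX.format(subject=subject, extra=extra)
    example_lines = [EXAMPLE_TEMPLATE.format(**ex) for ex in examples]
    return "\n\n".join([prefix, *example_lines, suffix])


# Shortened versions of two seed examples, for readability.
demo_examples = [
    {"example": "Patient ID: 123456, Patient Name: John Doe"},
    {"example": "Patient ID: 789012, Patient Name: Johnson Smith"},
]
prompt = build_few_shot_prompt(demo_examples, "medical_billing", "pick unusual names")
print(prompt)
```

The point of the sketch is the ordering: instructions first, then the seeds, then the request that carries the values supplied for "subject" and "extra".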

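4. Creating the Data Generator

With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to get synthetic data.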

synthetic_data_generator = create_openai_data_generator(
    output_schema=MedicalBilling,
    llm=ChatOpenAI(
        temperature=1
    ),  # You'll need to replace with your actual Language Model instance
    prompt=prompt_template,
)

5. Generate Synthetic Data

Finally, let's get our synthetic data!

synthetic_results = synthetic_data_generator.generate(
    subject="medical_billing",
    extra="the name must be chosen at random. Make it something you wouldn't normally choose.",
    runs=10,
)

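This command asks the generator to produce 10 synthetic medical billing records. The results are stored in synthetic_results. The output will be a list of MedicalBilling pydantic models.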
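Because the model's output is not guaranteed to be plausible, it can help to sanity-check generated records before using them. The sketch below rests on several assumptions: a stdlib dataclass `BillingRecord` stands in for the MedicalBilling pydantic model so the snippet runs without an API key, the records are placeholders built from the seed values rather than real generator output, and the rule that a claim should not exceed the total charge is an assumed business rule that the library does not enforce.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class BillingRecord:
    """Stand-in with the same fields as the MedicalBilling schema."""

    patient_id: int
    patient_name: str
    diagnosis_code: str
    procedure_code: str
    total_charge: float
    insurance_claim_amount: float


def is_plausible(rec: BillingRecord) -> bool:
    """Assumed rule: the claim is non-negative and does not exceed the charge."""
    return 0 <= rec.insurance_claim_amount <= rec.total_charge


def to_csv(records: list[BillingRecord]) -> str:
    """Serialize records to CSV text with one column per schema field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(BillingRecord)])
    writer.writeheader()
    for rec in records:
        writer.writerow(asdict(rec))
    return buffer.getvalue()


# Placeholder records: the first two copy seed values; the third has its
# claim amount deliberately raised above the charge so the filter drops it.
records = [
    BillingRecord(123456, "John Doe", "J20.9", "99203", 500.0, 350.0),
    BillingRecord(789012, "Johnson Smith", "M54.5", "99213", 150.0, 120.0),
    BillingRecord(345678, "Emily Stone", "E11.9", "99214", 300.0, 450.0),
]
valid = [rec for rec in records if is_plausible(rec)]
csv_text = to_csv(valid)
print(csv_text)
```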

Other implementations

from langchain_experimental.synthetic_data import (
    DatasetGenerator,
    create_data_generation_chain,
)
from langchain_openai import ChatOpenAI

API Reference: DatasetGenerator | create_data_generation_chain | ChatOpenAI

# LLM
model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)
chain = create_data_generation_chain(model)

chain({"fields": ["blue", "yellow"], "preferences": {}})

{'fields': ['blue', 'yellow'],
 'preferences': {},
 'text': 'The vibrant blue sky contrasted beautifully with the bright yellow sun, creating a stunning display of colors that instantly lifted the spirits of all who gazed upon it.'}

chain(
    {
        "fields": {"colors": ["blue", "yellow"]},
        "preferences": {"style": "Make it in a style of a weather forecast."},
    }
)

{'fields': {'colors': ['blue', 'yellow']},
 'preferences': {'style': 'Make it in a style of a weather forecast.'},
 'text': "Good morning! Today's weather forecast brings a beautiful combination of colors to the sky, with hues of blue and yellow gently blending together like a mesmerizing painting."}

chain(
    {
        "fields": {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]},
        "preferences": None,
    }
)

{'fields': {'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
 'preferences': None,
 'text': 'Tom Hanks, the renowned actor known for his incredible versatility and charm, has graced the silver screen in unforgettable movies such as "Forrest Gump" and "Green Mile".'}

chain(
    {
        "fields": [
            {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]},
            {"actor": "Mads Mikkelsen", "movies": ["Hannibal", "Another round"]},
        ],
        "preferences": {"minimum_length": 200, "style": "gossip"},
    }
)

{'fields': [{'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
  {'actor': 'Mads Mikkelsen', 'movies': ['Hannibal', 'Another round']}],
 'preferences': {'minimum_length': 200, 'style': 'gossip'},
 'text': 'Did you know that Tom Hanks, the beloved Hollywood actor known for his roles in "Forrest Gump" and "Green Mile", has shared the screen with the talented Mads Mikkelsen, who gained international acclaim for his performances in "Hannibal" and "Another round"? These two incredible actors have brought their exceptional skills and captivating charisma to the big screen, delivering unforgettable performances that have enthralled audiences around the world. Whether it\'s Hanks\' endearing portrayal of Forrest Gump or Mikkelsen\'s chilling depiction of Hannibal Lecter, these movies have solidified their places in cinematic history, leaving a lasting impact on viewers and cementing their status as true icons of the silver screen.'}

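As we can see, the created examples are diverse and contain the information we wanted them to have. Their style also reflects the given preferences quite well.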
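Whether a generated text really contains the requested information can also be checked mechanically. As a rough sketch (`flatten_values` and `missing_values` are hypothetical helper names, and case-insensitive substring matching is an assumed, fairly crude criterion), the following flattens a `fields` value of any of the shapes used above and reports which values do not appear in the text:

```python
def flatten_values(fields) -> list[str]:
    """Collect every leaf value from nested lists and dicts as strings."""
    if isinstance(fields, str):
        return [fields]
    if isinstance(fields, dict):
        return [v for value in fields.values() for v in flatten_values(value)]
    if isinstance(fields, (list, tuple)):
        return [v for item in fields for v in flatten_values(item)]
    return [str(fields)]


def missing_values(result: dict) -> list[str]:
    """Return the requested values that do not occur in the generated text."""
    text = result["text"].lower()
    return [v for v in flatten_values(result["fields"]) if v.lower() not in text]


# The first chain output shown above.
result = {
    "fields": ["blue", "yellow"],
    "preferences": {},
    "text": "The vibrant blue sky contrasted beautifully with the bright yellow sun, creating a stunning display of colors that instantly lifted the spirits of all who gazed upon it.",
}
print(missing_values(result))  # an empty list means every value was mentioned
```

Substring matching will miss paraphrases and abbreviations, so a non-empty result is a prompt for manual review rather than proof of an error.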

Generating an example dataset for extraction benchmarking purposes

inp = [
    {
        "Actor": "Tom Hanks",
        "Film": [
            "Forrest Gump",
            "Saving Private Ryan",
            "The Green Mile",
            "Toy Story",
            "Catch Me If You Can",
        ],
    },
    {
        "Actor": "Tom Hardy",
        "Film": [
            "Inception",
            "The Dark Knight Rises",
            "Mad Max: Fury Road",
            "The Revenant",
            "Dunkirk",
        ],
    },
]

generator = DatasetGenerator(model, {"style": "informal", "minimal length": 500})
dataset = generator(inp)

dataset

[{'fields': {'Actor': 'Tom Hanks',
   'Film': ['Forrest Gump',
    'Saving Private Ryan',
    'The Green Mile',
    'Toy Story',
    'Catch Me If You Can']},
  'preferences': {'style': 'informal', 'minimal length': 500},
  'text': 'Tom Hanks, the versatile and charismatic actor, has graced the silver screen in numerous iconic films including the heartwarming and inspirational "Forrest Gump," the intense and gripping war drama "Saving Private Ryan," the emotionally charged and thought-provoking "The Green Mile," the beloved animated classic "Toy Story," and the thrilling and captivating true story adaptation "Catch Me If You Can." With his impressive range and genuine talent, Hanks continues to captivate audiences worldwide, leaving an indelible mark on the world of cinema.'},
 {'fields': {'Actor': 'Tom Hardy',
   'Film': ['Inception',
    'The Dark Knight Rises',
    'Mad Max: Fury Road',
    'The Revenant',
    'Dunkirk']},
  'preferences': {'style': 'informal', 'minimal length': 500},
  'text': 'Tom Hardy, the versatile actor known for his intense performances, has graced the silver screen in numerous iconic films, including "Inception," "The Dark Knight Rises," "Mad Max: Fury Road," "The Revenant," and "Dunkirk." Whether he\'s delving into the depths of the subconscious mind, donning the mask of the infamous Bane, or navigating the treacherous wasteland as the enigmatic Max Rockatansky, Hardy\'s commitment to his craft is always evident. From his breathtaking portrayal of the ruthless Eames in "Inception" to his captivating transformation into the ferocious Max in "Mad Max: Fury Road," Hardy\'s dynamic range and magnetic presence captivate audiences and leave an indelible mark on the world of cinema. In his most physically demanding role to date, he endured the harsh conditions of the freezing wilderness as he portrayed the rugged frontiersman John Fitzgerald in "The Revenant," earning him critical acclaim and an Academy Award nomination. In Christopher Nolan\'s war epic "Dunkirk," Hardy\'s stoic and heroic portrayal of Royal Air Force pilot Farrier showcases his ability to convey deep emotion through nuanced performances. With his chameleon-like ability to inhabit a wide range of characters and his unwavering commitment to his craft, Tom Hardy has undoubtedly solidified his place as one of the most talented and sought-after actors of his generation.'}]

Extraction from generated examples

Okay, let's see if we can now extract output from this generated data and how it compares with our case!

from typing import List

from langchain.chains import create_extraction_chain_pydantic
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
from pydantic import BaseModel, Field

API Reference: create_extraction_chain_pydantic | PydanticOutputParser | PromptTemplate | OpenAI

class Actor(BaseModel):
    Actor: str = Field(description="name of an actor")
    Film: List[str] = Field(description="list of names of films they starred in")

Parser

llm = OpenAI()
parser = PydanticOutputParser(pydantic_object=Actor)

prompt = PromptTemplate(
    template="Extract fields from a given text.\n{format_instructions}\n{text}\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

_input = prompt.format_prompt(text=dataset[0]["text"])
output = llm.invoke(_input.to_string())  # invoke() replaces the deprecated direct call

parsed = parser.parse(output)
parsed

Actor(Actor='Tom Hanks', Film=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Toy Story', 'Catch Me If You Can'])

(parsed.Actor == inp[0]["Actor"]) & (parsed.Film == inp[0]["Film"])

True

Extractor

extractor = create_extraction_chain_pydantic(pydantic_schema=Actor, llm=model)
extracted = extractor.run(dataset[1]["text"])
extracted

[Actor(Actor='Tom Hardy', Film=['Inception', 'The Dark Knight Rises', 'Mad Max: Fury Road', 'The Revenant', 'Dunkirk'])]

(extracted[0].Actor == inp[1]["Actor"]) & (extracted[0].Film == inp[1]["Film"])

True
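The two checks above compare a single record at a time. For a benchmark over the whole dataset, the same comparison can be aggregated into an exact-match score. The sketch below assumes the extracted results have already been converted to plain dicts; `exact_match_rate` is a hypothetical helper, and `extracted_dicts` is a hand-written placeholder (with one film deliberately dropped from the second record) rather than real model output.

```python
def exact_match_rate(expected: list[dict], extracted: list[dict]) -> float:
    """Fraction of records whose extracted fields all equal the expected ones."""
    if len(expected) != len(extracted):
        raise ValueError("expected and extracted must have the same length")
    if not expected:
        return 0.0
    hits = sum(1 for exp, got in zip(expected, extracted) if exp == got)
    return hits / len(expected)


# Same content as the `inp` list used to generate the dataset.
expected = [
    {
        "Actor": "Tom Hanks",
        "Film": ["Forrest Gump", "Saving Private Ryan", "The Green Mile", "Toy Story", "Catch Me If You Can"],
    },
    {
        "Actor": "Tom Hardy",
        "Film": ["Inception", "The Dark Knight Rises", "Mad Max: Fury Road", "The Revenant", "Dunkirk"],
    },
]

# Placeholder extraction results: the first record matches exactly, the
# second is missing "Dunkirk" to show how a partial extraction is scored.
extracted_dicts = [
    {"Actor": "Tom Hanks", "Film": list(expected[0]["Film"])},
    {"Actor": "Tom Hardy", "Film": ["Inception", "The Dark Knight Rises", "Mad Max: Fury Road", "The Revenant"]},
]
print(exact_match_rate(expected, extracted_dicts))
```

List equality is order-sensitive here; if the order of films should not matter, comparing sets of film names would be a reasonable variation.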
