ScrapFly

ScrapFly is a web scraping API with headless browser support, proxies, and anti-bot bypass. It extracts web page data as LLM-accessible markdown or text.

Installation

Install the ScrapFly Python SDK and the required LangChain packages with pip:

pip install scrapfly-sdk langchain langchain-community

Usage

from langchain_community.document_loaders import ScrapflyLoader

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)

API Reference: ScrapflyLoader
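
Hardcoding the key, as in the example above, is fine for a quick test. A minimal sketch of reading it from the environment instead is shown here. The variable name `SCRAPFLY_API_KEY` and the helper `get_scrapfly_api_key` are assumptions made for illustration; neither is part of the ScrapFly SDK or LangChain.

```python
import os


def get_scrapfly_api_key(env_var: str = "SCRAPFLY_API_KEY") -> str:
    """Return the API key stored in `env_var` (a hypothetical name), or raise if it is unset."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your ScrapFly API key"
        )
    return api_key
```

The returned value can then be passed as `api_key=get_scrapfly_api_key()` when constructing `ScrapflyLoader`.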

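ScrapflyLoader also accepts ScrapeConfig options, passed as a dictionary through the scrape_config parameter, to customize the scrape request. For full feature details and API parameters, see the documentation: https://scrapfly.io/docs/scrape-api/getting-started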

from langchain_community.document_loaders import ScrapflyLoader

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
    scrape_config=scrapfly_scrape_config,  # Pass the scrape_config object
    scrape_format="markdown",  # The scrape result format, either `markdown`(default) or `text`
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)

API Reference: ScrapflyLoader
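
The configuration dictionary above can also be assembled programmatically. The sketch below uses a hypothetical helper, `build_scrape_config`, that keeps only the options that were explicitly set. The keys are the ones shown in the example above; the helper itself is an illustration and not part of ScrapFly or LangChain.

```python
from typing import Any, Dict, Optional


def build_scrape_config(
    asp: bool = False,
    render_js: bool = False,
    proxy_pool: Optional[str] = None,
    country: Optional[str] = None,
    auto_scroll: bool = False,
    js: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a scrape_config dict, omitting unset options."""
    options = {
        "asp": asp,
        "render_js": render_js,
        "proxy_pool": proxy_pool,
        "country": country,
        "auto_scroll": auto_scroll,
        "js": js,
    }
    # Drop options left at their "unset" value (False, None, or an empty string).
    return {key: value for key, value in options.items() if value}
```

For instance, `build_scrape_config(asp=True, country="us")` yields a dictionary with only the `asp` and `country` keys, which can be passed as `scrape_config` to `ScrapflyLoader`.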

Related

Document loader conceptual guide
Document loader how-to guides
