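Olostep + LangChain Integration

The Olostep LangChain integration provides a full set of tools for building AI agents that can search, scrape, analyze, and structure data from any website. It works with both LangChain and LangGraph applications.

Features

The integration exposes all five Olostep API capabilities:

Scrape

Extract content from any single URL in multiple formats (Markdown, HTML, JSON, text).

Batch

Process up to 10,000 URLs in parallel. Batch jobs typically complete in 5-8 minutes.

Answers

AI-powered web search with natural-language queries and structured output.

Map

Extract all URLs of a website to analyze its structure.

Crawl

Automatically discover and scrape an entire website by following links.

Installation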

pip install langchain-olostep

Setup

Set your Olostep API key as an environment variable:

export OLOSTEP_API_KEY="your_olostep_api_key_here"

Get your API key from the Olostep Dashboard.
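
Before invoking any tool, it can help to fail fast when the key is missing. The following is a minimal sketch: the helper name `require_olostep_key` is hypothetical and not part of langchain-olostep, and it assumes only that the integration reads the `OLOSTEP_API_KEY` variable described above.

```python
import os


def require_olostep_key(env=None):
    """Return the Olostep API key, or raise a clear error if it is unset.

    Hypothetical helper, not part of langchain-olostep.
    """
    # Allow a custom mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = env.get("OLOSTEP_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OLOSTEP_API_KEY is not set; export it before using the Olostep tools."
        )
    return key
```

Calling `require_olostep_key()` once at startup surfaces a missing key before the first tool call rather than in the middle of an agent run.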

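Available Tools

scrape_website

Extracts content from a single URL. Supports multiple output formats and JavaScript rendering.

url (string, required): The URL of the website to scrape (must include http:// or https://).

format (string, default: "markdown"): Output format: markdown, html, json, or text.

country (string): Country code for location-specific content (for example, "US", "GB", "CA").

wait_before_scraping (integer): Time to wait for JavaScript rendering, in milliseconds (0-10000).

parser (string): Optional parser ID for specialized extraction (for example, "@olostep/amazon-product").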
from langchain_olostep import scrape_website
import asyncio

# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))

print(content)
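
The parameter constraints above can be checked locally before a call is made. This is a sketch only: `build_scrape_input` is a hypothetical helper, not part of langchain-olostep, and its checks simply mirror the parameter descriptions in this section.

```python
from urllib.parse import urlparse

ALLOWED_FORMATS = {"markdown", "html", "json", "text"}


def build_scrape_input(url, format="markdown", country=None,
                       wait_before_scraping=None, parser=None):
    """Validate arguments and return an input dict for scrape_website.

    Hypothetical helper; the rules come from the parameter list above.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("url must start with http:// or https://")
    if format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    if wait_before_scraping is not None and not 0 <= wait_before_scraping <= 10000:
        raise ValueError("wait_before_scraping must be between 0 and 10000 ms")

    payload = {"url": url, "format": format}
    # Only include optional fields that were actually provided.
    for name, value in (("country", country),
                        ("wait_before_scraping", wait_before_scraping),
                        ("parser", parser)):
        if value is not None:
            payload[name] = value
    return payload
```

The returned dict can be passed directly to `scrape_website.ainvoke(...)`, for example `build_scrape_input("https://example.com", wait_before_scraping=2000)`.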

scrape_batch

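Processes multiple URLs in parallel (up to 10,000 per call).

urls (array, required): The list of URLs to scrape.

format (string, default: "markdown"): Output format for all URLs: markdown, html, json, or text.

country (string): Country code for location-specific content.

wait_before_scraping (integer): Time to wait for JavaScript rendering, in milliseconds.

parser (string): Optional parser ID for specialized extraction.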

from langchain_olostep import scrape_batch
import asyncio

# Scrape multiple URLs
result = asyncio.run(scrape_batch.ainvoke({
    "urls": [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ],
    "format": "markdown"
}))

print(result)
# Returns: {"batch_id": "batch_xxx", "status": "in_progress", ...}
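
When a URL list may exceed the 10,000-URL limit, it has to be split across several scrape_batch calls. A minimal sketch of that splitting step follows; `chunk_urls` is a hypothetical helper, and removing duplicates first is a design choice of this sketch rather than something the API requires.

```python
MAX_BATCH_SIZE = 10_000  # per-call limit stated above


def chunk_urls(urls, size=MAX_BATCH_SIZE):
    """Deduplicate URLs (keeping order) and split them into lists of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # dict.fromkeys preserves first-seen order while dropping duplicates.
    unique = list(dict.fromkeys(urls))
    return [unique[i:i + size] for i in range(0, len(unique), size)]
```

Each returned chunk can then be submitted as the `urls` value of its own scrape_batch call.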

answer_question

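Searches the web and returns an AI-generated answer with its sources. Well suited to data enrichment and research.

task (string, required): The question or task to search for.

json_schema (object): Optional JSON schema, as a dict or string, describing the desired output format.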

from langchain_olostep import answer_question
import asyncio

# Ask a simple question
result = asyncio.run(answer_question.ainvoke({
    "task": "What is the capital of France?"
}))

print(result)
# Returns: {"answer": {"result": "Paris"}, "sources": [...]}
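
To work with the result programmatically, the answer and its sources can be separated. This sketch assumes the result has the shape shown in the comment above and arrives either as a JSON string or as an already-decoded dict; `parse_answer` is a hypothetical helper.

```python
import json


def parse_answer(raw):
    """Return (answer, sources) from an answer_question result.

    Assumes the shape {"answer": ..., "sources": [...]} shown above.
    Accepts either a JSON string or an already-decoded dict.
    """
    data = json.loads(raw) if isinstance(raw, str) else raw
    if not isinstance(data, dict) or "answer" not in data:
        raise ValueError("unexpected answer_question result")
    # Sources are treated as optional so a missing key does not fail the call.
    return data["answer"], data.get("sources", [])
```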

extract_urls

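Extracts all URLs of a website to analyze its structure.

url (string, required): The URL of the website to extract URLs from.

search_query (string): Optional search query used to filter the URLs.

top_n (integer): Limits the number of URLs returned.

include_urls (array): Glob patterns to include (for example, ["/blog/**"]).

exclude_urls (array): Glob patterns to exclude (for example, ["/admin/**"]).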

from langchain_olostep import extract_urls
import asyncio

# Get all URLs of a website
result = asyncio.run(extract_urls.ainvoke({
    "url": "https://example.com",
    "top_n": 100
}))

print(result)
# Returns: {"urls": [...], "total_urls": 100, ...}
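
To get a feel for how include and exclude patterns interact, the same idea can be tried locally on a list of URLs. This is an approximation built on Python's `fnmatch`, in which both `*` and `**` match across `/`; how the Olostep API itself evaluates glob patterns may differ, and `filter_urls` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def filter_urls(urls, include=None, exclude=None):
    """Keep URLs whose path matches any include pattern and no exclude pattern.

    Local approximation only; not the API's own matching logic.
    """
    kept = []
    for url in urls:
        # Patterns such as "/blog/**" are matched against the URL path.
        path = urlparse(url).path or "/"
        if include and not any(fnmatchcase(path, p) for p in include):
            continue
        if exclude and any(fnmatchcase(path, p) for p in exclude):
            continue
        kept.append(url)
    return kept
```

A filter like this can also be applied to the `urls` list returned by extract_urls when a second, stricter pass is needed.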

crawl_website

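Automatically discovers and scrapes an entire website by following links.

start_url (string, required): The URL where the crawl starts.

max_pages (integer, default: 100): Maximum number of pages to crawl.

include_urls (array): Glob patterns to include (for example, ["/**"] to match everything).

exclude_urls (array): Glob patterns to exclude (for example, ["/admin/**"]).

max_depth (integer): Maximum crawl depth, measured from start_url.

include_external (boolean, default: false): Whether to include external URLs.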

from langchain_olostep import crawl_website
import asyncio

# Crawl an entire documentation site
result = asyncio.run(crawl_website.ainvoke({
    "start_url": "https://docs.example.com",
    "max_pages": 100
}))

print(result)
# Returns: {"crawl_id": "crawl_xxx", "status": "in_progress", ...}

LangChain Agent Integration

Build intelligent agents that can search and scrape the web:

from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import (
    scrape_website,
    answer_question,
    extract_urls
)

# Create an agent with the Olostep tools
tools = [scrape_website, answer_question, extract_urls]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Run the agent
result = agent.run("""
Research the company at https://company.com:
1. Scrape their about page
2. Search for their latest funding round
3. Extract all their product pages
""")

print(result)

LangGraph Integration

Build complex multi-step workflows with LangGraph:

from langgraph.graph import StateGraph, END
from langchain_olostep import (
    scrape_website,
    scrape_batch,
    answer_question,
    extract_urls
)
from langchain_openai import ChatOpenAI
import json

def create_research_agent():
    workflow = StateGraph(dict)
    
    def discover_pages(state):
        # Extract product URLs from the target website
        result = extract_urls.invoke({
            "url": state["target_url"],
            "include_urls": ["/product/**"],
            "top_n": 50
        })
        state["urls"] = json.loads(result)["urls"]
        return state
    
    def scrape_pages(state):
        # Batch-scrape the discovered pages
        result = scrape_batch.invoke({
            "urls": state["urls"],
            "format": "markdown"
        })
        state["batch_id"] = json.loads(result)["batch_id"]
        return state
    
    def answer_questions(state):
        # Use AI to answer the research question
        result = answer_question.invoke({
            "task": state["research_question"],
            "json_schema": state["desired_format"]
        })
        state["answer"] = json.loads(result)["answer"]
        return state
    
    workflow.add_node("discover", discover_pages)
    workflow.add_node("scrape", scrape_pages)
    workflow.add_node("analyze", answer_questions)
    
    workflow.set_entry_point("discover")
    workflow.add_edge("discover", "scrape")
    workflow.add_edge("scrape", "analyze")
    workflow.add_edge("analyze", END)
    
    return workflow.compile()

# Run the workflow
agent = create_research_agent()
result = agent.invoke({
    "target_url": "https://store.com",
    "research_question": "What are the top 5 most expensive products?",
    "desired_format": {
        "products": [{"name": "", "price": "", "url": ""}]
    }
})

Advanced Use Cases

Data Enrichment

Enrich spreadsheet data with information from the web:

from langchain_olostep import answer_question

companies = ["Stripe", "Shopify", "Square"]

for company in companies:
    result = answer_question.invoke({
        "task": f"Find information about {company}",
        "json_schema": {
            "ceo": "",
            "headquarters": "",
            "employee_count": "",
            "latest_funding": ""
        }
    })
    print(f"{company}: {result}")

E-commerce Product Scraping

Scrape product data with a specialized parser:

from langchain_olostep import scrape_website

# Scrape an Amazon product
result = scrape_website.invoke({
    "url": "https://www.amazon.com/dp/PRODUCT_ID",
    "parser": "@olostep/amazon-product",
    "format": "json"
})
# Returns structured product data: price, title, rating, and so on

SEO Audit

Analyze an entire website for SEO:

from langchain_olostep import extract_urls, scrape_batch
import json

# 1. Discover all pages
urls_result = extract_urls.invoke({
    "url": "https://yoursite.com",
    "top_n": 1000
})

# 2. Scrape all pages
urls = json.loads(urls_result)["urls"]
batch_result = scrape_batch.invoke({
    "urls": urls,
    "format": "html"
})

Documentation Scraping

Crawl a documentation site and extract its content:

from langchain_olostep import crawl_website

# Crawl an entire documentation site
result = crawl_website.invoke({
    "start_url": "https://docs.example.com",
    "max_pages": 500,
    "include_urls": ["/docs/**"],
    "exclude_urls": ["/api/**", "/v1/**"]
})

Specialized Parsers

Olostep provides prebuilt parsers for popular websites:

@olostep/google-search - Google search results

Pass the parser ID in the parser parameter to use one:

scrape_website.invoke({
    "url": "https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
    "parser": "@olostep/google-search"
})

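Error Handling

import asyncio

from langchain_core.exceptions import LangChainException
from langchain_olostep import scrape_website

async def main():
    # await is only valid inside a coroutine, so the call lives in main()
    try:
        result = await scrape_website.ainvoke({
            "url": "https://example.com"
        })
        print(result)
    except LangChainException as e:
        print(f"Scraping failed: {e}")

asyncio.run(main())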

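Best Practices

Use batching for multiple URLs

When scraping more than 3-5 URLs, use scrape_batch instead of multiple scrape_website calls. Batching is faster and more cost-effective.

Set an appropriate wait time

For JavaScript-heavy websites, set the wait_before_scraping parameter (2000-5000 ms is typical) so that dynamic content has time to load.

Use specialized parsers

For popular websites (Amazon, LinkedIn, Google), use the prebuilt parsers to get structured data automatically.

Filter URLs efficiently

When using extract_urls or crawl_website, use glob patterns to focus on relevant pages and avoid unnecessary processing.

Handle rate limits

The API handles most rate limiting automatically. For rate-limit errors that still reach your code, implement exponential backoff.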
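
Exponential backoff can be sketched as a small retry wrapper. The code below is a generic illustration, not part of langchain-olostep: `retry_with_backoff` is a hypothetical helper, and because this page does not specify how rate-limit errors surface from the tools, the caller supplies an `is_rate_limited` predicate that encodes that assumption.

```python
import random
import time


def retry_with_backoff(call, is_rate_limited, max_attempts=5,
                       base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `call()` and retry with exponential backoff on rate-limit errors.

    `is_rate_limited(exc)` decides whether an exception is worth retrying.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Give up on other errors, or once the last attempt has failed.
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            # The delay ceiling doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries from concurrent workers apart.
            sleep(random.uniform(0, delay))
```

A call might look like `retry_with_backoff(lambda: scrape_website.invoke({"url": "https://example.com"}), lambda exc: "429" in str(exc))`, where matching on "429" is itself an assumption about the error message.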

Support

PyPI package: langchain-olostep
Documentation: docs.olostep.com
Issues: GitHub Issues
Email: info@olostep.com

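Related Resources

Scrape API: learn about the scrape endpoint

Batch API: learn about the batch endpoint

Answers API: learn about the answers endpoint

Map API: learn about the map endpoint

Crawl API: learn about the crawl endpoint

Python SDK: explore the Python SDK

LangChain website: the LangChain platform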
