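Olostep + LangChain Integration

The Olostep LangChain integration provides a comprehensive set of tools for building AI agents that can search, scrape, analyze, and structure data from any website. It works well with both LangChain and LangGraph applications.

Features

The integration exposes all five Olostep API capabilities:

Scrape

Extract content from any single URL in multiple formats (Markdown, HTML, JSON, text)

Batch

Process up to 10,000 URLs in parallel. Batch jobs complete in 5-8 minutes

Answer

AI-powered web search with natural-language queries and structured output

Map

Extract all URLs from a website for site-structure analysis

Crawl

Automatically discover and scrape an entire website by following links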

Installation

pip install langchain-olostep

Setup

Set your Olostep API key as an environment variable:

export OLOSTEP_API_KEY="your_olostep_api_key_here"

Get your API key from the Olostep Dashboard.
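Before invoking any tool, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch that relies only on the `OLOSTEP_API_KEY` variable name from the export above; the `require_api_key` helper is hypothetical and not part of `langchain-olostep`.

```python
import os


def require_api_key(env=os.environ):
    """Return the Olostep API key, or raise a clear error if it is missing."""
    # Hypothetical helper: only the variable name comes from the setup step.
    key = env.get("OLOSTEP_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OLOSTEP_API_KEY is not set; export it before running.")
    return key
```

Calling `require_api_key()` at program start surfaces a missing key immediately instead of at the first tool call.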

Available Tools

scrape_website

Extracts content from a single URL. Supports multiple formats and JavaScript rendering.

url

string

Required

URL of the website to scrape (must include http:// or https://)

format

string

Default: "markdown"

Output format: markdown, html, json, or text

country

string

Country code for location-specific content (e.g., "US", "GB", "CA")

wait_before_scraping

integer

Wait time for JavaScript rendering, in milliseconds (0-10000)

parser

string

Optional parser ID for specialized extraction (e.g., "@olostep/amazon-product")

from langchain_olostep import scrape_website
import asyncio

# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))

print(content)

scrape_batch

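Processes multiple URLs in parallel (up to 10,000 at once).

urls

array

Required

List of URLs to scrape

format

string

Default: "markdown"

Output format for all URLs: markdown, html, json, or text

country

string

Country code for location-specific content

wait_before_scraping

integer

Wait time for JavaScript rendering, in milliseconds

parser

string

Optional parser ID for specialized extraction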

from langchain_olostep import scrape_batch
import asyncio

# Scrape multiple URLs
result = asyncio.run(scrape_batch.ainvoke({
    "urls": [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ],
    "format": "markdown"
}))

print(result)
# Returns: {"batch_id": "batch_xxx", "status": "in_progress", ...}

answer_question

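Searches the web and returns AI-powered answers with sources. Ideal for data enrichment and research.

task

string

Required

The question or task to search for

json_schema

object

Optional JSON schema, as a dict or string, describing the desired output format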
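Since the parameter description above allows either a dict or a string, the same schema can be prepared in both forms. This is a sketch only: the field names are illustrative, and the empty-string placeholder style is an assumption about what the tool accepts.

```python
import json

# Shape of the desired answer; empty strings are placeholders to be filled in.
# The field names here are illustrative, not prescribed by the integration.
schema = {"city": "", "country": "", "population": ""}

# Dict form of the json_schema parameter.
payload_with_dict = {
    "task": "What is the capital of France?",
    "json_schema": schema,
}

# String form of the same schema, serialized with json.dumps.
payload_with_string = {
    "task": "What is the capital of France?",
    "json_schema": json.dumps(schema),
}
```

Either payload could then be passed to `answer_question.invoke(...)`.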

from langchain_olostep import answer_question
import asyncio

# Ask a simple question
result = asyncio.run(answer_question.ainvoke({
    "task": "What is the capital of France?"
}))

print(result)
# Returns: {"answer": {"result": "Paris"}, "sources": [...]}

extract_urls

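Extracts all URLs from a website for site-structure analysis.

url

string

Required

URL of the website to extract URLs from

search_query

string

Optional search query for filtering URLs

top_n

integer

Limits the number of URLs returned

include_urls

array

Glob patterns to include (e.g., ["/blog/**"])

exclude_urls

array

Glob patterns to exclude (e.g., ["/admin/**"])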

from langchain_olostep import extract_urls
import asyncio

# Get all URLs from a website
result = asyncio.run(extract_urls.ainvoke({
    "url": "https://example.com",
    "top_n": 100
}))

print(result)
# Returns: {"urls": [...], "total_urls": 100, ...}

crawl_website

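Automatically discovers and scrapes an entire website by following links.

start_url

string

Required

Starting URL for the crawl

max_pages

integer

Default: 100

Maximum number of pages to crawl

include_urls

array

Glob patterns to include (e.g., ["/**"] for everything)

exclude_urls

array

Glob patterns to exclude (e.g., ["/admin/**"])

max_depth

integer

Maximum depth to crawl from start_url

include_external

boolean

Default: false

Include external URLs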

from langchain_olostep import crawl_website
import asyncio

# Crawl an entire documentation site
result = asyncio.run(crawl_website.ainvoke({
    "start_url": "https://docs.example.com",
    "max_pages": 100
}))

print(result)
# Returns: {"crawl_id": "crawl_xxx", "status": "in_progress", ...}

LangChain Agent Integration

Build intelligent agents that can search and scrape the web:

from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import (
    scrape_website,
    answer_question,
    extract_urls
)

# Create an agent with the Olostep tools
tools = [scrape_website, answer_question, extract_urls]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
result = agent.run("""
Research the company at https://company.com:
1. Scrape their about page
2. Search for their latest funding round
3. Extract all their product pages
""")

print(result)

LangGraph Integration

Build complex multi-step workflows with LangGraph:

from langgraph.graph import StateGraph, END
from langchain_olostep import (
    scrape_website,
    scrape_batch,
    answer_question,
    extract_urls
)
from langchain_openai import ChatOpenAI
import json

def create_research_agent():
    workflow = StateGraph(dict)
    
    def discover_pages(state):
        # Extract all URLs from the target site
        result = extract_urls.invoke({
            "url": state["target_url"],
            "include_urls": ["/product/**"],
            "top_n": 50
        })
        state["urls"] = json.loads(result)["urls"]
        return state
    
    def scrape_pages(state):
        # Scrape the discovered pages as a batch
        result = scrape_batch.invoke({
            "urls": state["urls"],
            "format": "markdown"
        })
        state["batch_id"] = json.loads(result)["batch_id"]
        return state
    
    def answer_questions(state):
        # Use AI to answer questions about the data
        result = answer_question.invoke({
            "task": state["research_question"],
            "json_schema": state["desired_format"]
        })
        state["answer"] = json.loads(result)["answer"]
        return state
    
    workflow.add_node("discover", discover_pages)
    workflow.add_node("scrape", scrape_pages)
    workflow.add_node("analyze", answer_questions)
    
    workflow.set_entry_point("discover")
    workflow.add_edge("discover", "scrape")
    workflow.add_edge("scrape", "analyze")
    workflow.add_edge("analyze", END)
    
    return workflow.compile()

# Use the agent
agent = create_research_agent()
result = agent.invoke({
    "target_url": "https://store.com",
    "research_question": "What are the top 5 most expensive products?",
    "desired_format": {
        "products": [{"name": "", "price": "", "url": ""}]
    }
})

Advanced Use Cases

Data Enrichment

Enrich spreadsheet data with information from the web:

from langchain_olostep import answer_question

companies = ["Stripe", "Shopify", "Square"]

for company in companies:
    result = answer_question.invoke({
        "task": f"Find information about {company}",
        "json_schema": {
            "ceo": "",
            "headquarters": "",
            "employee_count": "",
            "latest_funding": ""
        }
    })
    print(f"{company}: {result}")

E-commerce Product Scraping

Scrape product data with specialized parsers:

from langchain_olostep import scrape_website

# Scrape an Amazon product
result = scrape_website.invoke({
    "url": "https://www.amazon.com/dp/PRODUCT_ID",
    "parser": "@olostep/amazon-product",
    "format": "json"
})
# Returns structured product data: price, title, rating, etc.

SEO Audit

Analyze an entire website for SEO:

from langchain_olostep import extract_urls, scrape_batch
import json

# 1. Discover all pages
urls_result = extract_urls.invoke({
    "url": "https://yoursite.com",
    "top_n": 1000
})

# 2. Scrape all pages
urls = json.loads(urls_result)["urls"]
batch_result = scrape_batch.invoke({
    "urls": urls,
    "format": "html"
})

Documentation Scraping

Crawl and extract documentation:

from langchain_olostep import crawl_website

# Crawl an entire documentation site
result = crawl_website.invoke({
    "start_url": "https://docs.example.com",
    "max_pages": 500,
    "include_urls": ["/docs/**"],
    "exclude_urls": ["/api/**", "/v1/**"]
})

Specialized Parsers

Olostep provides pre-built parsers for popular websites:

@olostep/google-search - Google search results

Use them with the parser parameter:

scrape_website.invoke({
    "url": "https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
    "parser": "@olostep/google-search"
})

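Error Handling

import asyncio

from langchain_core.exceptions import LangChainException
from langchain_olostep import scrape_website


async def main():
    try:
        # await is only valid inside a coroutine, so the call lives in main()
        result = await scrape_website.ainvoke({
            "url": "https://example.com"
        })
        print(result)
    except LangChainException as e:
        print(f"Scraping failed: {e}")


asyncio.run(main())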

Best Practices

Use batch processing for multiple URLs

When scraping more than 3-5 URLs, use scrape_batch instead of multiple scrape_website calls. Batch processing is much faster and more cost-efficient.
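The rule of thumb above can be sketched as a small planning helper. The tool names and the 10,000-URL limit come from this document; the `plan_scrape_calls` function and the threshold of 5 are assumptions for illustration. The helper only builds payloads and does not call the API.

```python
MAX_BATCH_SIZE = 10_000  # per-batch limit stated in this document
BATCH_THRESHOLD = 5      # assumption: above this many URLs, prefer scrape_batch


def plan_scrape_calls(urls):
    """Return a list of (tool_name, payload) pairs covering all URLs."""
    if len(urls) <= BATCH_THRESHOLD:
        # A handful of URLs: one scrape_website call per URL.
        return [("scrape_website", {"url": u}) for u in urls]
    # Many URLs: split into chunks that respect the batch size limit.
    return [
        ("scrape_batch", {"urls": urls[i:i + MAX_BATCH_SIZE]})
        for i in range(0, len(urls), MAX_BATCH_SIZE)
    ]
```

Each returned payload can then be passed to the matching tool's `invoke` method.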

Set appropriate timeouts

For JavaScript-heavy sites, use the wait_before_scraping parameter (2000-5000 ms is typical). This gives dynamic content time to load fully.
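As a sketch of this advice, the helper below builds a scrape_website payload and keeps the wait inside the documented 0-10000 ms range. The `build_scrape_payload` name and the 3000 ms default are assumptions chosen from the typical range mentioned above.

```python
def build_scrape_payload(url, wait_ms=3000):
    """Build a scrape_website payload with a clamped wait_before_scraping value."""
    # 0-10000 ms is the range documented for wait_before_scraping.
    clamped = max(0, min(int(wait_ms), 10_000))
    return {"url": url, "format": "markdown", "wait_before_scraping": clamped}
```

The result can be passed directly, for example `scrape_website.invoke(build_scrape_payload("https://example.com"))`.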

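Use specialized parsers

For popular websites (Amazon, LinkedIn, Google), use the pre-built parsers to get structured data automatically.

Filter URLs efficiently

When using extract_urls or crawl_website, use glob patterns to focus on relevant pages and avoid unnecessary processing.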
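To get a feel for what a set of patterns would keep before spending API calls, the filtering can be approximated locally. This sketch uses Python's `fnmatch`, which is an assumption and may not match Olostep's own glob semantics exactly (for instance, `*` here also matches `/`); `preview_filter` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def preview_filter(urls, include=("/**",), exclude=()):
    """Approximate which URLs the include/exclude glob patterns would keep."""
    kept = []
    for url in urls:
        # Patterns in this document are written against the URL path.
        path = urlparse(url).path or "/"
        if not any(fnmatchcase(path, pattern) for pattern in include):
            continue
        if any(fnmatchcase(path, pattern) for pattern in exclude):
            continue
        kept.append(url)
    return kept
```

For example, `preview_filter(urls, include=["/docs/**"], exclude=["/api/**"])` gives a rough local preview of the same patterns passed to crawl_website.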

Handle rate limits

The API handles most rate limiting internally, but implement exponential backoff for any rate-limit errors that still reach your code.
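A minimal sketch of exponential backoff with jitter follows. It wraps any zero-argument callable, so it makes no assumptions about the integration's API. Which exception signals a rate limit is not stated in this document, so the sketch retries on every exception; narrow that in real code.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and jitter on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Assumption: every exception is retryable. In practice, catch only
            # the error type the integration raises for rate limits.
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt, plus random jitter to spread retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Usage could look like `call_with_backoff(lambda: scrape_website.invoke({"url": "https://example.com"}))`.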

Support

PyPI package: langchain-olostep
Documentation: docs.olostep.com
Issues: GitHub Issues
Email: info@olostep.com

Related Resources

Scrapes API

Learn about the Scrapes endpoint

Batches API

Learn about the Batches endpoint

Answers API

Learn about the Answers endpoint

Maps API

Learn about the Maps endpoint

Crawls API

Learn about the Crawls endpoint

Python SDK

Explore the Python SDK

LangChain Website

The LangChain platform
