Create Crawl

curl --request POST \
  --url https://api.olostep.com/v1/crawls \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "start_url": "<string>",
  "max_pages": 123,
  "include_urls": [
    "<string>"
  ],
  "exclude_urls": [
    "<string>"
  ],
  "max_depth": 123,
  "include_external": true,
  "include_subdomain": true,
  "search_query": "<string>",
  "top_n": 123,
  "webhook": "<string>",
  "timeout": 123,
  "follow_robots_txt": true,
  "scrape_options": {
    "formats": [
      "markdown",
      "screenshot"
    ],
    "parser": "@olostep/extract-emails"
  }
}
'

{
  "id": "<string>",
  "object": "<string>",
  "status": "<string>",
  "created": 123,
  "start_date": "<string>",
  "start_url": "<string>",
  "max_pages": 123,
  "max_depth": 123,
  "exclude_urls": [
    "<string>"
  ],
  "include_urls": [
    "<string>"
  ],
  "include_external": true,
  "search_query": "<string>",
  "top_n": 123,
  "current_depth": 123,
  "pages_count": 123,
  "webhook": "<string>",
  "follow_robots_txt": true
}

Authorization

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
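As a rough illustration of the header format above, the following sketch builds the same request as the curl example using only the Python standard library, without sending it. The function name `build_crawl_request` and the placeholder token are assumptions made for this example; only the endpoint URL and header names come from this page.

```python
import json
import urllib.request

# Endpoint taken from the curl example on this page.
CRAWLS_URL = "https://api.olostep.com/v1/crawls"


def build_crawl_request(token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that carries a Bearer token.

    Hypothetical helper for illustration; not part of any official SDK.
    """
    return urllib.request.Request(
        CRAWLS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "YOUR_API_TOKEN" is a placeholder, not a real credential.
req = build_crawl_request(
    "YOUR_API_TOKEN",
    {"start_url": "https://example.com", "max_pages": 10},
)
print(req.get_method(), req.full_url)
```

To actually send the request, pass `req` to `urllib.request.urlopen` and decode the JSON body of the response.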

Body

application/json

start_url

string

required

The starting point of the crawl.

max_pages

number

required

Maximum number of pages to crawl. Recommended for most use cases, such as crawling an entire website.

include_urls

string[]

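URL path patterns to include in the crawl, written in glob syntax. Defaults to /**, which includes all URLs. Use a pattern such as /blog/** to crawl a specific section (for example, only blog pages), /products/*.html for product pages, or several patterns to cover different sections. Supports standard glob features such as * (any characters) and ** (recursive matching).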

exclude_urls

string[]

URL path names to exclude, written as glob patterns. For example: /careers/**. Excluded URLs take precedence over included URLs.
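The interaction between include_urls and exclude_urls can be approximated locally. The sketch below is only a guess at the matching rules: it assumes * matches within a single path segment and ** matches across segments, and that exclusion is checked first. The service's real matcher may differ in edge cases, and the helper names `glob_to_regex` and `should_crawl` are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple glob into a regex.

    Assumed semantics: '**' matches any characters including '/',
    '*' matches any characters except '/'.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def should_crawl(path: str, include=("/**",), exclude=()) -> bool:
    """Return True if the path matches an include pattern and no exclude pattern."""
    # Exclusions win over inclusions, as stated for exclude_urls.
    if any(glob_to_regex(p).fullmatch(path) for p in exclude):
        return False
    return any(glob_to_regex(p).fullmatch(path) for p in include)


print(should_crawl("/blog/2024/post", include=["/blog/**"]))
print(should_crawl("/careers/engineering", exclude=["/careers/**"]))
```

Running patterns through a local check like this before starting a crawl can catch a pattern that is accidentally too broad or too narrow.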

max_depth

number

Maximum depth of the crawl. Useful for extracting links only up to n levels deep.

include_external

boolean

Crawl first-degree external links.

include_subdomain

boolean

Include subdomains of the website. Defaults to false.

search_query

string

Optional search query used to find specific links and sort the results by relevance.

top_n

number

Optional number. When set, only the top N links on each page that are most relevant to the search query are crawled.

webhook

string<uri>

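URL that receives a POST request when the crawl completes. It must be a publicly accessible URL using the http:// or https:// scheme, and it cannot point to localhost or a private IP address. See Webhooks for the payload format and retry behavior.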
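The webhook constraints above can be checked on the client before submitting a crawl. The following is a minimal sketch, assuming a purely syntactic check is enough: it does not resolve DNS names, so a hostname that resolves to a private address would still pass. The function name `looks_like_valid_webhook` is hypothetical, and the server's own validation may be stricter.

```python
import ipaddress
from urllib.parse import urlparse


def looks_like_valid_webhook(url: str) -> bool:
    """Best-effort local check of the documented webhook URL rules."""
    parsed = urlparse(url)
    # Only http:// and https:// URLs with a host are acceptable.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname
    if host == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as a DNS name (not resolved here).
        return True
    # Reject loopback, private, and link-local IP literals.
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)


print(looks_like_valid_webhook("https://example.com/hooks/crawl"))
print(looks_like_valid_webhook("http://localhost:8000/hook"))
```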

timeout

number

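End the crawl after n seconds and finish with the pages collected up to that point. The crawl may run about 10 seconds longer than the timeout provided.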

follow_robots_txt

boolean

default: true

Whether to respect robots.txt rules. If set to false, the crawler fetches the site regardless of any disallow directives in robots.txt. Defaults to true.

scrape_options

object

Controls what is requested from the Olostep API for each individual page. All fields are optional.
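To show how the required and optional body fields fit together, here is a small sketch that assembles a request body and rejects field names not listed on this page. The helper `build_crawl_payload` is hypothetical; the field names and the scrape_options values ("markdown", "@olostep/extract-emails") are taken from the curl example, and no claim is made about which other values the API accepts.

```python
import json

# Optional body fields documented on this page.
ALLOWED_OPTIONAL = {
    "include_urls", "exclude_urls", "max_depth", "include_external",
    "include_subdomain", "search_query", "top_n", "webhook", "timeout",
    "follow_robots_txt", "scrape_options",
}


def build_crawl_payload(start_url: str, max_pages: int, **optional) -> dict:
    """Assemble a crawl request body, dropping optional fields set to None."""
    unknown = set(optional) - ALLOWED_OPTIONAL
    if unknown:
        raise ValueError(f"Unknown fields: {sorted(unknown)}")
    payload = {"start_url": start_url, "max_pages": max_pages}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


payload = build_crawl_payload(
    "https://example.com",
    50,
    include_urls=["/blog/**"],
    exclude_urls=["/careers/**"],
    max_depth=3,
    webhook=None,  # omitted from the body because it is None
    scrape_options={
        "formats": ["markdown"],
        "parser": "@olostep/extract-emails",
    },
)
print(json.dumps(payload, indent=2))
```

Leaving unset options out of the body, rather than sending nulls, lets the documented defaults (such as follow_robots_txt being true) apply.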


Response

Crawl started successfully.

id

string

Crawl ID

object

string

The kind of object. "crawl" for this endpoint.

status

string

in_progress or completed

created

number

Creation time in epoch format.

start_date

string

Creation time as a date string.

start_url

string

max_pages

number

max_depth

number

exclude_urls

string[]

include_urls

string[]

include_external

boolean

search_query

string

top_n

number

current_depth

number

The current depth of the crawl process.

pages_count

number

Count of pages crawled

webhook

string

follow_robots_txt

boolean
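A response with the fields listed above could be handled along these lines. The sketch parses a JSON body into a small dataclass and checks whether the crawl has finished. The class `CrawlSummary`, the function `parse_crawl_response`, and every value in the sample body (including the ID) are made up for illustration; only the field names and the two status values come from this page.

```python
import json
from dataclasses import dataclass


@dataclass
class CrawlSummary:
    id: str
    status: str
    pages_count: int
    current_depth: int

    @property
    def finished(self) -> bool:
        # Documented status values are "in_progress" and "completed".
        return self.status == "completed"


def parse_crawl_response(body: str) -> CrawlSummary:
    """Extract the progress-related fields from a crawl response body."""
    data = json.loads(body)
    return CrawlSummary(
        id=data["id"],
        status=data["status"],
        pages_count=data.get("pages_count", 0),
        current_depth=data.get("current_depth", 0),
    )


# Illustrative response body; the values are invented for this example.
sample_body = json.dumps({
    "id": "crawl_example123",
    "object": "crawl",
    "status": "in_progress",
    "start_url": "https://example.com",
    "max_pages": 50,
    "current_depth": 1,
    "pages_count": 7,
})

summary = parse_crawl_response(sample_body)
print(summary.id, summary.status, summary.finished)
```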
