With the /v1/crawls endpoint you can crawl a website and get the content from all of its pages.
- Crawl a website and get the content from all subpages (or limit the depth of the crawl)
- Use glob patterns to crawl specific pages (e.g. /blog/**)
- Pass a webhook_url to get notified when the crawl is completed
- Use a search query to find only specific pages and sort them by relevance
Installation
Start a crawl
Provide the starting URL, include/exclude URL globs, and max_pages. Optional: max_depth, include_external, include_subdomain, search_query, top_n, webhook_url, timeout.
The response contains a crawl object with properties such as id and status, which you can use to track the crawl.
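For example, a minimal sketch of starting a crawl with Python's requests library. The base URL, the bearer-token auth, and the exact field names for the starting URL and URL globs (url, include_globs, exclude_globs below) are assumptions, not confirmed by this doc:

```python
import requests

BASE_URL = "https://api.example.com"              # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme

# Start a crawl from a seed URL. Parameter names not spelled out above
# (url, include_globs, exclude_globs) are assumptions for illustration.
resp = requests.post(
    f"{BASE_URL}/v1/crawls",
    headers=HEADERS,
    json={
        "url": "https://example.com",
        "include_globs": ["/blog/**"],        # assumed name for include patterns
        "exclude_globs": ["/blog/tags/**"],   # assumed name for exclude patterns
        "max_pages": 100,
        "max_depth": 3,
        "webhook_url": "https://example.com/hooks/crawl-finished",
    },
)
resp.raise_for_status()
crawl = resp.json()  # the crawl object described above
print(crawl["id"], crawl["status"])
```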
Check crawl status
Poll the crawl to track progress until status is completed. Alternatively, pass a webhook_url when starting the crawl to be notified when it completes.
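A polling sketch, assuming the crawl object can be re-fetched with a GET on /v1/crawls/{crawl_id} (the status endpoint's exact path is not spelled out above):

```python
import time
import requests

BASE_URL = "https://api.example.com"              # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
crawl_id = "YOUR_CRAWL_ID"

# Poll until the crawl leaves in_progress; GET /v1/crawls/{crawl_id} is assumed
# to return the crawl object with its current status.
while True:
    crawl = requests.get(f"{BASE_URL}/v1/crawls/{crawl_id}", headers=HEADERS).json()
    if crawl["status"] != "in_progress":
        break
    time.sleep(5)  # back off between polls
print("crawl finished with status:", crawl["status"])
```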
List pages (paginate/stream with cursor)
Fetch pages and iterate using cursor and limit. Works while the crawl is in_progress or completed.
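A cursor-pagination sketch; the envelope key holding the page list (pages below) is an assumption:

```python
import requests

BASE_URL = "https://api.example.com"              # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
crawl_id = "YOUR_CRAWL_ID"

# Page through results with cursor/limit; stop once no cursor is returned.
pages, cursor = [], None
while True:
    params = {"limit": 50}
    if cursor:
        params["cursor"] = cursor
    data = requests.get(
        f"{BASE_URL}/v1/crawls/{crawl_id}/pages", headers=HEADERS, params=params
    ).json()
    pages.extend(data["pages"])   # assumed key for the list of pages
    cursor = data.get("cursor")   # absent once every page has been listed
    if not cursor:
        break
print(f"fetched {len(pages)} pages")
```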
Search query (limit to top N relevant)
Use search_query when starting the crawl, and optionally filter the listing with search_query as well. Limit per-page exploration with top_n.
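A sketch combining both uses of search_query, under the same base URL and auth assumptions as above:

```python
import requests

BASE_URL = "https://api.example.com"              # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Start a crawl that ranks pages against a search query and explores only
# the top_n most relevant links per page.
resp = requests.post(
    f"{BASE_URL}/v1/crawls",
    headers=HEADERS,
    json={
        "url": "https://example.com",
        "search_query": "pricing and plans",
        "top_n": 5,
        "max_pages": 50,
    },
)
crawl_id = resp.json()["id"]

# Optionally filter the page listing by a search query as well.
listing = requests.get(
    f"{BASE_URL}/v1/crawls/{crawl_id}/pages",
    headers=HEADERS,
    params={"search_query": "pricing and plans", "limit": 20},
).json()
```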
Retrieve content
Use each page’s retrieve_id with /v1/retrieve to fetch html_content and/or markdown_content.
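A retrieval sketch; sending retrieve_id in a POST body is an assumption, so check the /v1/retrieve reference for the exact request shape:

```python
import requests

BASE_URL = "https://api.example.com"              # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Fetch full content for one page by its retrieve_id. The POST body shape
# ({"retrieve_id": ...}) is an assumption for illustration.
resp = requests.post(
    f"{BASE_URL}/v1/retrieve",
    headers=HEADERS,
    json={"retrieve_id": "PAGE_RETRIEVE_ID"},
)
resp.raise_for_status()
doc = resp.json()
html = doc.get("html_content")
markdown = doc.get("markdown_content")
```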
Notes
- Pagination is cursor-based; repeat requests until cursor is absent.
- Content fields on /v1/crawls/{crawl_id}/pages are deprecated; prefer /v1/retrieve.
- Webhooks: set webhook_url to receive a POST when the crawl completes.
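If you handle the completion webhook yourself, a minimal receiver might look like the following. Flask is used purely for illustration, and the webhook payload shape is not documented above, so this simply logs whatever JSON the POST carries:

```python
from flask import Flask, request

app = Flask(__name__)

# Receive the completion POST sent to the webhook_url you registered.
@app.route("/hooks/crawl-finished", methods=["POST"])
def crawl_finished():
    event = request.get_json(silent=True) or {}
    print("crawl webhook received:", event)
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)
```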