Start Crawl
Starts a new crawl. You receive an ID to track its progress. The operation may take 1-10 minutes depending on the site and on the depth and page-limit parameters.
Authorizations
Bearer authentication header of the form `Bearer <token>`, where `<token>` is your auth token.
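For example, the header can be attached as shown below. This is a minimal sketch: the `AUTH_TOKEN` environment variable is a placeholder, not part of this reference.

```ts
// Minimal sketch: attaching the Bearer token to a request.
// AUTH_TOKEN is a placeholder; supply your own auth token.
const AUTH_TOKEN = process.env.AUTH_TOKEN ?? "";

const headers = {
  Authorization: `Bearer ${AUTH_TOKEN}`,
  "Content-Type": "application/json",
};
```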
Body
The starting point of the crawl.
Maximum number of pages to crawl. Setting a page limit is recommended for most use cases, such as crawling an entire website.
URL path patterns to include in the crawl, using glob syntax. Defaults to `/**`, which includes all URLs. Use patterns like `/blog/**` to crawl specific sections (e.g., only blog pages), `/products/*.html` for product pages, or multiple patterns for different sections. Supports standard glob features such as `*` (any characters) and `**` (recursive matching).
URL path patterns to exclude, using glob syntax. For example: `/careers/**`. Exclusion patterns supersede inclusion patterns.
Maximum depth of the crawl. Useful for extracting only links up to n degrees from the starting URL.
Whether to crawl first-degree external links.
An optional search query used to find specific links and to sort the results by relevance.
An optional number N: only the top N links most relevant to the search query are crawled on each page.
An optional endpoint that receives a POST request when this crawl completes. The request body is the same as the response of the `v1/crawls/{crawl_id}` endpoint.
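Putting the body parameters together, a request might look like the sketch below. The endpoint path `v1/crawls` is inferred from the `v1/crawls/{crawl_id}` status endpoint mentioned above, and every field name shown (`url`, `limit`, `include_paths`, `exclude_paths`, `depth`) is a hypothetical illustration of the documented parameters, not a confirmed schema; the base URL is also a placeholder.

```ts
// Sketch of starting a crawl. All field names below are hypothetical
// illustrations of the documented parameters; check the schema for the
// real names. The base URL and AUTH_TOKEN are placeholders.
const response = await fetch("https://api.example.com/v1/crawls", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.AUTH_TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",     // starting point of the crawl
    limit: 500,                     // maximum number of pages to crawl
    include_paths: ["/blog/**"],    // glob patterns to include
    exclude_paths: ["/careers/**"], // exclusions supersede inclusions
    depth: 3,                       // maximum crawl depth
  }),
});

const { id } = await response.json(); // crawl ID used to track progress
```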
Response
Crawl ID
The kind of object: `"crawl"` for this endpoint.
The status of the crawl: `in_progress` or `completed`.
Creation time as a Unix epoch timestamp.
Creation time as a date string.
The current depth of the crawl process.
Count of pages crawled.
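Since a crawl can take several minutes, a client will typically poll the `v1/crawls/{crawl_id}` endpoint until the status leaves `in_progress`. The sketch below assumes the same hypothetical base URL as above, and a `status` field name inferred from the documented response values.

```ts
// Sketch of polling a crawl until it completes. The `status` field name
// is an assumption based on the documented in_progress/completed values.
async function waitForCrawl(id: string): Promise<unknown> {
  while (true) {
    const res = await fetch(`https://api.example.com/v1/crawls/${id}`, {
      headers: { Authorization: `Bearer ${process.env.AUTH_TOKEN}` },
    });
    const crawl = await res.json();
    if (crawl.status === "completed") return crawl;
    await new Promise((r) => setTimeout(r, 5000)); // wait 5s between polls
  }
}
```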