POST /v1/crawls
Start a new crawl
curl --request POST \
  --url https://api.olostep.com/v1/crawls \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "start_url": "<string>",
  "max_pages": 123,
  "include_urls": [
    "<string>"
  ],
  "exclude_urls": [
    "<string>"
  ],
  "max_depth": 123,
  "include_external": true,
  "include_subdomain": true,
  "search_query": "<string>",
  "top_n": 123,
  "webhook": "<string>",
  "timeout": 123,
  "follow_robots_txt": true
}
'
{
  "id": "<string>",
  "object": "<string>",
  "status": "<string>",
  "created": 123,
  "start_date": "<string>",
  "start_url": "<string>",
  "max_pages": 123,
  "max_depth": 123,
  "exclude_urls": [
    "<string>"
  ],
  "include_urls": [
    "<string>"
  ],
  "include_external": true,
  "search_query": "<string>",
  "top_n": 123,
  "current_depth": 123,
  "pages_count": 123,
  "webhook": "<string>",
  "follow_robots_txt": true
}
Get notified on completion: Pass the webhook parameter with your endpoint URL to receive an HTTP POST when the crawl completes. See Webhooks for details.

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json
start_url
string
required

The starting point of the crawl.

max_pages
number
required

Maximum number of pages to crawl. Setting this is recommended for most use cases, such as crawling an entire website.

include_urls
string[]

URL path patterns to include in the crawl using glob syntax.

Defaults to /** which includes all URLs. Use patterns like /blog/** to crawl specific sections (e.g., only blog pages), /products/*.html for product pages, or multiple patterns for different sections. Supports standard glob features like * (any characters) and ** (recursive matching).

exclude_urls
string[]

URL path patterns to exclude, in glob syntax. For example: /careers/**. Excluded patterns take precedence over included patterns.
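
For instance, a crawl limited to a blog section while skipping its archive might combine the two parameters as follows. This is a minimal sketch; the domain, paths, and max_pages value are placeholders:

curl --request POST \
  --url https://api.olostep.com/v1/crawls \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "start_url": "https://example.com",
  "max_pages": 100,
  "include_urls": ["/blog/**"],
  "exclude_urls": ["/blog/archive/**"]
}
'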

max_depth
number

Maximum depth of the crawl. Useful for limiting the crawl to links within n hops of the start URL.

include_external
boolean

Crawl first-degree external links (pages linked directly from the crawled site but hosted on other domains).

include_subdomain
boolean

Include subdomains of the website. Defaults to false.

search_query
string

An optional search query used to find specific links and sort the results by relevance.

top_n
number

An optional limit: crawl only the top N most relevant links on each page, ranked against search_query.
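
For example, a relevance-guided crawl that follows only the five most promising links on each page might look like the following sketch. The start URL, query, and limits are illustrative placeholders:

curl --request POST \
  --url https://api.olostep.com/v1/crawls \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "start_url": "https://example.com",
  "max_pages": 50,
  "search_query": "pricing plans",
  "top_n": 5
}
'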

webhook
string<uri>

URL to receive a POST request when the crawl completes. Must be a publicly accessible URL using the http:// or https:// protocol; it cannot point to localhost or a private IP address. See Webhooks for payload format and retry behavior.

timeout
number

Ends the crawl after n seconds and returns the pages completed up to that point. Termination may take roughly 10 seconds longer than the provided timeout.

follow_robots_txt
boolean
default:true

Whether to respect robots.txt rules. If set to false, the crawler will scrape the website regardless of robots.txt disallow directives. Defaults to true.
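
Putting several of these options together, a request that caps the crawl at 60 seconds and notifies an endpoint on completion could look like the sketch below. The webhook URL and other values are placeholders, not real endpoints:

curl --request POST \
  --url https://api.olostep.com/v1/crawls \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "start_url": "https://example.com",
  "max_pages": 200,
  "timeout": 60,
  "webhook": "https://hooks.example.com/crawl-complete"
}
'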

Response

Crawl started successfully.

id
string

Crawl ID

object
string

The kind of object. "crawl" for this endpoint.

status
string

Either in_progress or completed.

created
number

Creation time as a Unix epoch timestamp.

start_date
string

Creation time as a date string.

start_url
string
max_pages
number
max_depth
number
exclude_urls
string[]
include_urls
string[]
include_external
boolean
search_query
string
top_n
number
current_depth
number

The current depth of the crawl process.

pages_count
number

Number of pages crawled so far.

webhook
string
follow_robots_txt
boolean
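
Since a new crawl typically starts with status in_progress, clients that do not use the webhook parameter often poll until the crawl reaches completed. A minimal sketch, assuming a retrieve endpoint of the form GET /v1/crawls/{id} (not documented in this section) and using grep rather than a JSON parser for brevity:

# Poll the crawl until it completes.
# CRAWL_ID is the "id" field returned by the POST request above.
CRAWL_ID="<id>"
while true; do
  STATUS=$(curl --silent \
    --url "https://api.olostep.com/v1/crawls/$CRAWL_ID" \
    --header 'Authorization: Bearer <token>' |
    grep -o '"status": *"[^"]*"')
  echo "$STATUS"
  case "$STATUS" in *completed*) break ;; esac
  sleep 5
done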