> ## Documentation Index
> Fetch the complete documentation index at: https://docs.olostep.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Create Crawl

> Starts a new crawl. You receive a `id` to track the progress. The operation may take 1-10 mins depending upon the site and depth and pages parameters.

<Tip>
  **Get notified on completion:** Pass the `webhook` parameter with your endpoint URL to receive an HTTP POST when the crawl completes. See [Webhooks](/api-reference/common/webhooks) for details.
</Tip>


## OpenAPI

````yaml openapi/crawls.json POST /v1/crawls
openapi: 3.0.3
info:
  title: Crawl API
  version: 1.0.0
servers:
  - url: https://api.olostep.com
security: []
paths:
  /v1/crawls:
    post:
      summary: Start a new crawl
      description: Initiates a new crawl process with the specified parameters.
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                start_url:
                  type: string
                  description: The starting point of the crawl.
                include_urls:
                  type: array
                  items:
                    type: string
                  description: >-
                    URL path patterns to include in the crawl using glob syntax.


                    Defaults to `/**` which includes all URLs. Use patterns like
                    `/blog/**` to crawl specific sections (e.g., only blog
                    pages), `/products/*.html` for product pages, or multiple
                    patterns for different sections. Supports standard glob
                    features like * (any characters) and ** (recursive
                    matching).
                exclude_urls:
                  type: array
                  items:
                    type: string
                  description: >-
                    URL path names in glob pattern to exclude. For example:
                    `/careers/**`. Excluded URLs will supersede included URLs.
                max_pages:
                  type: number
                  description: >-
                    Maximum number of pages to crawl. Recommended for most use
                    cases like crawling an entire website.
                max_depth:
                  type: number
                  description: >-
                    Maximum depth of the crawl. Useful to extract only up to
                    n-degree of links.
                include_external:
                  type: boolean
                  description: Crawl first-degree external links.
                include_subdomain:
                  type: boolean
                  description: Include subdomains of the website. `false` by default.
                search_query:
                  type: string
                  description: >-
                    An optional search query to find specific links and also
                    sort the results by relevance.
                top_n:
                  type: number
                  description: >-
                    An optional number to only crawl the top N most relevant
                    links on every page as per search query.
                webhook:
                  type: string
                  format: uri
                  description: >-
                    HTTPS URL to receive a POST request when the crawl
                    completes. Must be a publicly accessible URL using `http://`
                    or `https://` protocol. Cannot point to localhost or private
                    IP addresses. See [Webhooks](/api-reference/common/webhooks)
                    for payload format and retry behavior.
                timeout:
                  type: number
                  description: >-
                    End the crawl after n seconds with the pages completed until
                    then. May take ~10s extra from provided timeout.
                follow_robots_txt:
                  type: boolean
                  description: >-
                    Whether to respect robots.txt rules. If set to `false`, the
                    crawler will scrape the website regardless of robots.txt
                    disallow directives. `true` by default.
                  default: true
                scrape_options:
                  type: object
                  description: >-
                    Controls what each individual page scrape requests from the
                    Olostep API. All fields are optional.
                  properties:
                    formats:
                      type: array
                      items:
                        type: string
                        enum:
                          - html
                          - markdown
                          - text
                          - json
                          - screenshot
                      description: >-
                        Output formats to request for each scraped page.
                        Defaults to `["html", "markdown"]` when omitted. `html`
                        is always included automatically. `json` is
                        automatically added when a `parser` is provided. Note:
                        `raw_pdf` is not supported — PDFs cannot be crawled.
                      example:
                        - markdown
                        - screenshot
                    parser:
                      type: string
                      description: >-
                        Parser name to run on each page and produce structured
                        `json` output (e.g. `"@olostep/extract-emails"`).
                        Automatically adds `json` to `formats` when set.
                      example: '@olostep/extract-emails'
              required:
                - start_url
                - max_pages
      responses:
        '200':
          description: Crawl started successfully.
          content:
            application/json:
              schema:
                type: object
                properties:
                  id:
                    type: string
                    description: Crawl ID
                  object:
                    type: string
                    description: The kind of object. "crawl" for this endpoint.
                  status:
                    type: string
                    description: '`in_progress` or `completed`'
                  created:
                    type: number
                    description: Created time in epoch
                  start_date:
                    type: string
                    description: Created time in date
                  start_url:
                    type: string
                  max_pages:
                    type: number
                  max_depth:
                    type: number
                  exclude_urls:
                    type: array
                    items:
                      type: string
                  include_urls:
                    type: array
                    items:
                      type: string
                  include_external:
                    type: boolean
                  search_query:
                    type: string
                  top_n:
                    type: number
                  current_depth:
                    type: number
                    description: The current depth of the crawl process.
                  pages_count:
                    type: number
                    description: Count of pages crawled
                  webhook:
                    type: string
                  follow_robots_txt:
                    type: boolean
                  credits_consumed:
                    type: integer
                    nullable: true
                    description: >-
                      Number of credits consumed by this request. Populated
                      after execution completes. Credits are the source of truth
                      for billing.
                  cost_usd:
                    type: number
                    nullable: true
                    description: >-
                      Estimated cost in USD for this request. Populated after
                      execution completes. Calculated from credits consumed and
                      your plan rate — 99% accurate, but credits_consumed is the
                      authoritative value.
        '400':
          description: Bad request due to incorrect or missing parameters.
        '500':
          description: Internal server error.
      security:
        - Authorization: []
components:
  securitySchemes:
    Authorization:
      type: http
      scheme: bearer
      description: >-
        Bearer authentication header of the form Bearer <token>, where <token>
        is your auth token.

````