Through the Olostep /v1/crawls endpoint you can crawl a website and get the content from all the pages.
  • Crawl a website and get the content from all subpages (or limit the depth of the crawl)
  • Use special patterns to crawl specific pages (e.g. /blog/**)
  • Pass a webhook_url to get notified when the crawl is completed
  • Pass a search_query to find specific pages and sort results by relevance
For details, see the Crawl Endpoint API Reference.

Installation

# pip install requests

import requests

Start a crawl

Provide the starting URL, include/exclude URL globs, and max_pages. Optional: max_depth, include_external, include_subdomain, search_query, top_n, webhook_url, timeout.
import time, json

API_URL = 'https://api.olostep.com'
API_KEY = '<YOUR_API_KEY>'
HEADERS = { 
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {API_KEY}' 
}

data = {
  "start_url": "https://sugarbooandco.com",
  "max_pages": 100,
  "include_urls": ["/**"],
  "exclude_urls": ["/collections/**"],
  "include_external": False
}

res = requests.post(f"{API_URL}/v1/crawls", headers=HEADERS, json=data)
crawl = res.json()
print(json.dumps(crawl, indent=2))
Since everything in Olostep is an object, you will receive a crawl object in response. The crawl object has a few properties like id and status, which you can use to track the crawl.
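The include/exclude globs and the optional max_depth combine to target specific sections of a site. Below is a minimal variation of the request above, with illustrative values, that crawls only pages under /blog/ and caps the depth:
data = {
  "start_url": "https://sugarbooandco.com",
  "max_pages": 100,
  "include_urls": ["/blog/**"],  # special pattern: only crawl pages matching /blog/**
  "max_depth": 2                 # optional: limit how far the crawler follows links
}
crawl = requests.post(f"{API_URL}/v1/crawls", headers=HEADERS, json=data).json()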

Check crawl status

Poll the crawl to track progress until status is completed.
import time

def get_crawl_info(crawl_id):
    return requests.get(f'{API_URL}/v1/crawls/{crawl_id}', headers=HEADERS).json()

crawl_id = crawl['id']
while True:
    info = get_crawl_info(crawl_id)
    print(info['status'], info.get('pages_count'))
    if info['status'] == 'completed':
        break
    time.sleep(5)
Alternatively, you can pass a webhook_url when starting the crawl to be notified when the crawl is completed.
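For example, the start request can carry a webhook_url pointing at an endpoint you control. The sketch below uses a placeholder URL and, since the webhook payload fields are not documented here, the hypothetical Flask receiver simply logs the raw JSON body.
data = {
  "start_url": "https://sugarbooandco.com",
  "max_pages": 100,
  "include_urls": ["/**"],
  "webhook_url": "https://example.com/olostep/webhook"  # placeholder endpoint you control
}
requests.post(f"{API_URL}/v1/crawls", headers=HEADERS, json=data)

# Hypothetical receiver: logs whatever JSON is POSTed when the crawl completes.
from flask import Flask, request as flask_request

app = Flask(__name__)

@app.post("/olostep/webhook")
def olostep_webhook():
    print(flask_request.get_json(silent=True))
    return "", 200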

List pages (paginate/stream with cursor)

Fetch pages and iterate using cursor and limit. Works while the crawl is in_progress or completed.
def get_pages(crawl_id, cursor=None, limit=10, search_query=None):
    params = {
        'cursor': cursor,
        'limit': limit
    }
    # Only send search_query when provided, so unfiltered listings stay unchanged.
    if search_query:
        params['search_query'] = search_query
    return requests.get(f'{API_URL}/v1/crawls/{crawl_id}/pages', headers=HEADERS, params=params).json()

cursor = 0
while True:
    page_batch = get_pages(crawl_id, cursor=cursor, limit=10)
    for page in page_batch['pages']:
        print(page['url'], page['retrieve_id'])
    if 'cursor' not in page_batch:
        break
    cursor = page_batch['cursor']
    time.sleep(5)
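If you only need the full list once the crawl has completed, the cursor loop collapses into a small helper. This is a sketch that reuses get_pages from above:
def list_all_pages(crawl_id, limit=10):
    # Follow the cursor until the response no longer returns one.
    pages, cursor = [], 0
    while True:
        batch = get_pages(crawl_id, cursor=cursor, limit=limit)
        pages.extend(batch.get('pages', []))
        if 'cursor' not in batch:
            return pages
        cursor = batch['cursor']

all_pages = list_all_pages(crawl_id)
print(f"{len(all_pages)} pages listed")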

Search query (limit to top N relevant)

Pass search_query when starting the crawl to rank pages by relevance, and use top_n to limit the crawl to the top N results. The same search_query parameter can also be passed when listing pages to filter them.
data = {
  "start_url": "https://sugarbooandco.com",
  "max_pages": 100,
  "include_urls": ["/**"],
  "search_query": "contact us",
  "top_n": 5
}
crawl = requests.post(f'{API_URL}/v1/crawls', headers=HEADERS, json=data).json()
pages = requests.get(f"{API_URL}/v1/crawls/{crawl['id']}/pages", headers=HEADERS, params={'search_query': 'contact us'}).json()
print(len(pages['pages']))

Retrieve content

Use each page’s retrieve_id with /v1/retrieve to fetch html_content and/or markdown_content.
def retrieve_content(retrieve_id):
    return requests.get(f"{API_URL}/v1/retrieve", headers=HEADERS, params={"retrieve_id": retrieve_id}).json()

for page in get_pages(crawl['id'], limit=5)['pages']:
    retrieved = retrieve_content(page['retrieve_id'])
    print(retrieved.get('markdown_content'))
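As a usage example, the retrieved Markdown can be written to disk, one file per page (a minimal sketch; the output directory and the retrieve_id-based filenames are arbitrary choices):
import pathlib

out_dir = pathlib.Path("crawl_output")
out_dir.mkdir(exist_ok=True)

for page in get_pages(crawl['id'], limit=5)['pages']:
    retrieved = retrieve_content(page['retrieve_id'])
    # Name each file after its retrieve_id to keep filenames unique and filesystem-safe.
    (out_dir / f"{page['retrieve_id']}.md").write_text(retrieved.get('markdown_content') or '', encoding='utf-8')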

Notes

  • Pagination is cursor-based; repeat requests until cursor is absent.
  • Content fields on /v1/crawls/{crawl_id}/pages are deprecated; prefer /v1/retrieve.
  • Webhooks: set webhook_url to receive a POST when the crawl completes.