Through the Olostep /v1/crawls endpoint you can crawl a website and get the content from all of its pages.
  • Crawl a website and get the content from all subpages (or limit the depth of the crawl)
  • Use special patterns to crawl specific pages (e.g. /blog/**)
  • Pass a webhook_url to get notified when the crawl is completed
  • Pass a search_query to find only specific pages, sorted by relevance
For API details see the Crawl Endpoint API Reference.

Installation

pip install olostep

Start a crawl

Provide the starting URL, include/exclude URL globs, and max_pages. Optional: max_depth, include_external, include_subdomain, search_query, top_n, webhook_url, timeout.
from olostep import Olostep

client = Olostep(api_key="YOUR_REAL_KEY")

crawl = client.crawls.create(
    start_url="https://olostep.com",
    max_pages=100,
    include_urls=["/**"],
    exclude_urls=["/collections/**"],
    include_external=False,
)

print(crawl.id, crawl.status)
Since everything in Olostep is an object, you will receive a crawl object in response. The crawl object has a few properties like id and status, which you can use to track the crawl.

Check crawl status

Poll the crawl to track progress until status is completed.
# Using the crawl object from the previous step
info = crawl.info()
print(info.status, info.pages_count)

# Or wait until completed
crawl.wait_till_done(check_every_n_secs=5)
Alternatively, you can pass a webhook_url when starting the crawl to be notified when it completes, as in the sketch below.
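A minimal sketch of the same flow with a webhook instead of polling; webhook_url is the documented parameter, while the receiver URL below is a hypothetical endpoint you would host yourself.
from olostep import Olostep

client = Olostep(api_key="YOUR_REAL_KEY")

# Instead of polling, Olostep will POST to webhook_url when the crawl completes.
# https://example.com/webhooks/crawl is a placeholder for your own receiver.
crawl = client.crawls.create(
    start_url="https://olostep.com",
    max_pages=100,
    include_urls=["/**"],
    webhook_url="https://example.com/webhooks/crawl",
)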

List pages (paginate/stream with cursor)

Fetch pages and iterate using cursor and limit. Listing works while the crawl is in_progress as well as after it is completed; a manual-pagination sketch follows the loop below.
# Iterate all pages (auto-waits for crawl completion, handles pagination)
for page in crawl.pages():
    print(page.url, page.retrieve_id)
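If you prefer to paginate manually against the REST endpoint instead of using crawl.pages(), the listing is cursor-based (see Notes). The sketch below uses requests and assumes the base URL https://api.olostep.com, Bearer authentication, and a response body with pages and cursor fields; confirm these against the Crawl Endpoint API Reference.
import requests

API_KEY = "YOUR_REAL_KEY"
BASE_URL = "https://api.olostep.com"  # assumed base URL; check the API Reference
crawl_id = crawl.id  # crawl object from "Start a crawl" above

cursor = None
while True:
    params = {"limit": 50}  # page size per request
    if cursor:
        params["cursor"] = cursor
    resp = requests.get(
        f"{BASE_URL}/v1/crawls/{crawl_id}/pages",
        headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
        params=params,
    )
    resp.raise_for_status()
    data = resp.json()
    for page in data["pages"]:  # assumed response field name
        print(page["url"], page["retrieve_id"])
    cursor = data.get("cursor")
    if not cursor:  # repeat requests until cursor is absent
        break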

Search query (limit to top N relevant)

Pass search_query when starting the crawl, and optionally filter the page listing with search_query as well. Use top_n to limit the crawl to the top N most relevant pages.
from olostep import Olostep

client = Olostep(api_key="YOUR_REAL_KEY")

crawl = client.crawls.create(
    start_url="https://olostep.com",
    max_pages=100,
    include_urls=["/**"],
    search_query="contact us",
    top_n=5,
)

for page in crawl.pages(search_query="contact us"):
    print(page.url)

Retrieve content

Use each page’s retrieve_id with /v1/retrieve to fetch html_content and/or markdown_content.
# Retrieve content for each crawled page
for page in crawl.pages():
    content = page.retrieve(["markdown"])
    print(content.markdown_content)
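Since /v1/retrieve can return html_content and/or markdown_content, both formats can be requested in one call. A sketch, assuming retrieve() accepts "html" alongside "markdown" in its format list:
# Request HTML and Markdown together ("html" format name is an assumption)
for page in crawl.pages():
    content = page.retrieve(["html", "markdown"])
    print(content.html_content)
    print(content.markdown_content)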

Notes

  • Pagination is cursor-based; repeat requests until cursor is absent.
  • Content fields on /v1/crawls/{crawl_id}/pages are deprecated; prefer /v1/retrieve.
  • Webhooks: set webhook_url to receive a POST when the crawl completes (a minimal receiver sketch follows this list).
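To act on that webhook you need an HTTP endpoint that accepts the POST. A minimal Flask receiver sketch; the payload field names (id, status) are assumptions based on the crawl object's properties, so inspect a real delivery to confirm them.
from flask import Flask, request

app = Flask(__name__)

@app.post("/webhooks/crawl")
def crawl_completed():
    payload = request.get_json()
    # Assumed payload fields; verify against an actual webhook delivery.
    print("crawl", payload.get("id"), "finished with status", payload.get("status"))
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)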

Pricing

A crawl costs 1 credit per page crawled.