Through the Olostep /v1/crawls endpoint you can crawl a website and get the content from all of its pages.
  • Crawl a website and get the content from all subpages (or limit the depth of the crawl)
  • Use special patterns to crawl specific pages (e.g. /blog/**)
  • Pass a webhook_url to get notified when the crawl is completed
  • Pass a search_query to find only specific pages, sorted by relevance
For API details see the Crawl Endpoint API Reference.

Installation

pip install olostep

Start a crawl

Provide the starting URL, include/exclude URL globs, and max_pages. Optional: max_depth, include_external, include_subdomain, search_query, top_n, webhook_url, timeout.
from olostep import Olostep

client = Olostep(api_key="YOUR_REAL_KEY")

crawl = client.crawls.create(
    start_url="https://olostep.com",
    max_pages=100,
    include_urls=["/**"],
    exclude_urls=["/collections/**"],
    include_external=False,
)

print(crawl.id, crawl.status)
Since everything in Olostep is an object, you will receive a crawl object in response. The crawl object has a few properties like id and status, which you can use to track the crawl.

Check crawl status

Poll the crawl to track progress until status is completed.
# Using the crawl object from the previous step
info = crawl.info()
print(info.status, info.pages_count)

# Or wait until completed
crawl.wait_till_done(check_every_n_secs=5)
Alternatively, you can pass a webhook_url when starting the crawl to be notified when it completes, as in the sketch below.
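A minimal sketch of the same flow with a webhook instead of polling; webhook_url is the documented parameter, while the receiver URL below is a hypothetical endpoint you would host yourself.
from olostep import Olostep

client = Olostep(api_key="YOUR_REAL_KEY")

# Instead of polling, Olostep will POST to webhook_url when the crawl completes.
# https://example.com/webhooks/crawl is a placeholder for your own receiver.
crawl = client.crawls.create(
    start_url="https://olostep.com",
    max_pages=100,
    include_urls=["/**"],
    webhook_url="https://example.com/webhooks/crawl",
)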

List pages (paginate/stream with cursor)

Fetch pages and iterate using cursor and limit. Listing works while the crawl is in_progress as well as after it is completed; a manual-pagination sketch follows the loop below.
# Iterate all pages (auto-waits for crawl completion, handles pagination)
for page in crawl.pages():
    print(page.url, page.retrieve_id)
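If you prefer to paginate manually against the REST endpoint instead of using crawl.pages(), the listing is cursor-based (see Notes). The sketch below uses requests and assumes the base URL https://api.olostep.com, Bearer authentication, and a response body with pages and cursor fields; confirm these against the Crawl Endpoint API Reference.
import requests

API_KEY = "YOUR_REAL_KEY"
BASE_URL = "https://api.olostep.com"  # assumed base URL; check the API Reference
crawl_id = crawl.id  # crawl object from "Start a crawl" above

cursor = None
while True:
    params = {"limit": 50}  # page size per request
    if cursor:
        params["cursor"] = cursor
    resp = requests.get(
        f"{BASE_URL}/v1/crawls/{crawl_id}/pages",
        headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
        params=params,
    )
    resp.raise_for_status()
    data = resp.json()
    for page in data["pages"]:  # assumed response field name
        print(page["url"], page["retrieve_id"])
    cursor = data.get("cursor")
    if not cursor:  # repeat requests until cursor is absent
        break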

Search query (limit to top N relevant)

Pass search_query when starting the crawl, and optionally filter the page listing with search_query as well. Use top_n to limit the crawl to the top N most relevant pages.
from olostep import Olostep

client = Olostep(api_key="YOUR_REAL_KEY")

crawl = client.crawls.create(
    start_url="https://olostep.com",
    max_pages=100,
    include_urls=["/**"],
    search_query="contact us",
    top_n=5,
)

for page in crawl.pages(search_query="contact us"):
    print(page.url)

Retrieve content

Use each page’s retrieve_id with /v1/retrieve to fetch html_content and/or markdown_content.
# Retrieve content for each crawled page
for page in crawl.pages():
    content = page.retrieve(["markdown"])
    print(content.markdown_content)
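Since /v1/retrieve can return html_content and/or markdown_content, both formats can be requested in one call. A sketch, assuming retrieve() accepts "html" alongside "markdown" in its format list:
# Request HTML and Markdown together ("html" format name is an assumption)
for page in crawl.pages():
    content = page.retrieve(["html", "markdown"])
    print(content.html_content)
    print(content.markdown_content)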

Notes

  • Pagination is cursor-based; repeat requests until cursor is absent.
  • Content fields on /v1/crawls/{crawl_id}/pages are deprecated; prefer /v1/retrieve.
  • Webhooks: set webhook_url to receive a POST when the crawl completes (a minimal receiver sketch follows this list).
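To act on that webhook you need an HTTP endpoint that accepts the POST. A minimal Flask receiver sketch; the payload field names (id, status) are assumptions based on the crawl object's properties, so inspect a real delivery to confirm them.
from flask import Flask, request

app = Flask(__name__)

@app.post("/webhooks/crawl")
def crawl_completed():
    payload = request.get_json()
    # Assumed payload fields; verify against an actual webhook delivery.
    print("crawl", payload.get("id"), "finished with status", payload.get("status"))
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)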

Pricing

A crawl costs 1 credit per page crawled.