Overview

Olostep’s Batches endpoint allows you to start a batch of up to 10,000 URLs and get the content back in 5–7 minutes. You can run up to 10 batches at a time to extract content from 100,000 URLs in one go. If you need more scale, please reach out to us.

This is useful if you already have the URLs you want to process, for example to aggregate data for analysis, build a specialized search tool, or monitor multiple websites for changes.

In this guide, we’ll walk through how to start a batch with a list of URLs and retrieve the content in markdown format.

Gist with Full Code

Here’s all the code in one gist that you can copy and paste to try out batch scraping with Olostep: https://gist.github.com/olostep/e903f2e4fc28f8093b834b4df68b8031

The gist shows how to start a batch of 5 Google Search queries, check its status, and retrieve the content for each item.

Prerequisites

Before getting started, ensure you have the following:

  • A valid Olostep API key. You can get one by signing up at Olostep.
  • Python installed on your system.
  • The requests library (install it with pip install requests if needed). hashlib is part of the Python standard library, so it requires no installation.

Step 1: Create a Batch from Local URLs

If you already have a list of URLs you want to process, you can define them directly in your script. Otherwise, you can read them from a file or database.
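
For example, if your URLs live in a plain-text file with one URL per line, a minimal loader might look like the sketch below (the urls.txt filename is just an illustration):

def load_urls(path="urls.txt"):
    # Read one URL per line, ignoring blank lines and # comments
    with open(path) as f:
        return [line.strip() for line in f if line.strip() and not line.lstrip().startswith("#")]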

filename="start_batch.py"
import requests
import hashlib

API_KEY = "YOUR_API_KEY"

def create_hash_id(url):
    # Short, deterministic ID derived from the URL so results can be
    # mapped back to their source later
    return hashlib.sha256(url.encode()).hexdigest()[:16]

def compose_items_array():
    urls = [
        "https://www.google.com/search?q=nikola+tesla&gl=us&hl=en",
        "https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
        "https://www.google.com/search?q=google+solar+eclipse&gl=us&hl=en",
        "https://www.google.com/search?q=crispr&gl=us&hl=en",
        "https://www.google.com/search?q=genghis%20khan&gl=us&hl=en"
    ]
    return [{"custom_id": create_hash_id(url), "url": url} for url in urls]

def start_batch(items):
    # POST the items to the Batches endpoint and return the new batch's ID
    payload = {
        "items": items
    }
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.post(
        "https://api.olostep.com/v1/batches",
        headers=headers,
        json=payload
    )
    response.raise_for_status()  # surface HTTP errors early
    return response.json()["id"]

if __name__ == "__main__":
    items = compose_items_array()
    batch_id = start_batch(items)
    print("Batch started. ID:", batch_id)

Step 2: Monitor Batch Status

Once the batch has started, you can monitor its status using the batch_id returned when you created the batch.

filename="check_status.py"
import requests

API_KEY = "YOUR_API_KEY"

def check_batch_status(batch_id):
    # Return the batch's current status ("completed" once all items are done)
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(
        f"https://api.olostep.com/v1/batches/{batch_id}",
        headers=headers
    )
    return response.json()["status"]

You can poll the status every few seconds (e.g. every 10 seconds) until the batch is complete:

import time

def poll_batch(batch_id, interval=10):
    # Check the status repeatedly until the batch reports "completed"
    while True:
        status = check_batch_status(batch_id)
        print("Status:", status)
        if status == "completed":
            print("Batch is complete!")
            break
        time.sleep(interval)

Step 3: Retrieve Completed Items

Once the batch is marked complete, fetch the processed items.

import requests

API_KEY = "YOUR_API_KEY"

def get_completed_items(batch_id):
    # List the batch's items; each completed item carries a retrieve_id
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(
        f"https://api.olostep.com/v1/batches/{batch_id}/items",
        headers=headers
    )
    return response.json()["items"]

Each item includes a retrieve_id, which you can use to fetch the scraped content.

items = get_completed_items(batch_id)
for item in items:
    print(f"URL: {item['url']}\nCustom ID: {item['custom_id']}\nRetrieve ID: {item['retrieve_id']}\n---")

Step 4: Retrieve the Content

Use the retrieve_id to get the extracted content in markdown, HTML, or JSON format. Here is an example that retrieves the content in markdown:

filename="retrieve_content.py"
def retrieve_content(retrieve_id):
    url = "https://api.olostep.com/v1/retrieve"
    headers = {"Authorization": f"Bearer {API_KEY}"}
    params = {"retrieve_id": retrieve_id, "formats": ["markdown"]}

    response = requests.get(url, headers=headers, params=params)
    return response.json()

# Example usage:
items = get_completed_items(batch_id)
for item in items:
    content = retrieve_content(item['retrieve_id'])
    print(content)
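
To persist the results, you could write each item's markdown to its own file. Note that the markdown_content key below is an assumption for illustration; inspect the actual /v1/retrieve response to confirm the field name:

import pathlib

out_dir = pathlib.Path("scraped")
out_dir.mkdir(exist_ok=True)

for item in get_completed_items(batch_id):
    content = retrieve_content(item["retrieve_id"])
    # NOTE: "markdown_content" is an assumed key; check the real response shape
    markdown = content.get("markdown_content", "")
    (out_dir / f"{item['custom_id']}.md").write_text(markdown)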

Hosted Content

We also host the content for 7 days, so you can retrieve it multiple times without re-scraping. Markdown content, for example, is available at a hosted URL.

Example Use Cases

1. Build Search Engines

Use Olostep to extract content from industry-specific websites (legal, medical, AI) and build a searchable database.

2. Website Monitoring

Monitor product availability, price changes, or news updates on multiple websites by scheduling daily batch scrapes.
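
As a sketch of how change detection could work: hash each page's content after every daily run and compare it against the previous run's hash (the local JSON store here is just an assumption):

import hashlib
import json
import pathlib

HASH_FILE = pathlib.Path("content_hashes.json")  # assumed local store of previous hashes

def detect_changes(items_with_content):
    # items_with_content: iterable of (custom_id, content) pairs from today's batch
    old = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    new, changed = {}, []
    for custom_id, content in items_with_content:
        digest = hashlib.sha256(content.encode()).hexdigest()
        new[custom_id] = digest
        if old.get(custom_id) != digest:
            changed.append(custom_id)
    HASH_FILE.write_text(json.dumps(new))
    return changed  # custom_ids whose content changed since the last run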

3. Social Media Monitoring

Scrape mentions of your brand or keywords across forums or content sources and extract structured data.

4. Aggregators

Build a job board, news aggregator, or real estate listing platform by pulling data from dozens of sources.

Conclusion

With batch scraping, you can extract content from up to 100,000 URLs quickly and efficiently. Whether you’re building search tools, aggregators, or monitoring systems, Olostep Batches simplify the job.

Want to extract only structured data? Use Parsers to get just the fields you need. Need help? Reach out to info@olostep.com for support or have us write custom scripts for your use case.