Olostep Docs - Scrape endpoint

Through the Olostep /v1/scrapes endpoint you can extract LLM-friendly Markdown, HTML, text, screenshots, or structured JSON from any URL in real time.

Outputs clean Markdown, structured data, screenshots, or HTML
Extracts JSON through Parsers or LLM extraction
Handles dynamic content: JS-rendered sites, login flows via actions, PDFs

For API details see the Scrape Endpoint API Reference.

Scraping a URL

Use the /v1/scrapes endpoint to scrape a single URL and choose output formats.

Installation

pip install olostep

Usage

The required parameters are url_to_scrape and formats. Other common parameters include wait_before_scraping (in milliseconds), remove_css_selectors (default, none, or an array of selectors), and country.

from olostep import Olostep

client = Olostep(api_key="YOUR_REAL_KEY")

result = client.scrapes.create(
    url_to_scrape="https://en.wikipedia.org/wiki/Alexander_the_Great",
    formats=["markdown", "html"],
)

print(result.markdown_content)
print(result.html_content)

Response

The API returns a scrape object. Its top-level properties include id and result. The result object has the following fields; any field whose format was not requested in formats may be null:

html_content: the HTML content of the page. Pass formats: ["html"] to get this.
markdown_content: the Markdown content of the page. Pass formats: ["markdown"] to get this.
text_content: the text content of the page. Pass formats: ["text"] to get this.
json_content: the JSON content of the page. Pass formats: ["json"] to get this and also provide a parser or llm_extract parameter.
screenshot_hosted_url: the hosted URL of the screenshot.
html_hosted_url: the hosted URL of the HTML content
markdown_hosted_url: the hosted URL of the Markdown content
json_hosted_url: the hosted URL of the JSON content
text_hosted_url: the hosted URL of the text content
links_on_page: the links on the page
page_metadata: the metadata of the page

{
  "id": "scrape_6h89o8u1kt",
  "object": "scrape",
  "created": 1745673871,
  "metadata": {},
  "retrieve_id": "6h89o8u1kt",
  "url_to_scrape": "https://en.wikipedia.org/wiki/Alexander_the_Great",
  "result": {
    "html_content": "<html...",
    "markdown_content": "## Alexander the Great...",
    "text_content": null,
    "json_content": null,
    "screenshot_hosted_url": null,
    "html_hosted_url": "https://olostep-storage.s3.us-east-1.amazonaws.com/text_6h89o8u1kt.txt",
    "markdown_hosted_url": "https://olostep-storage.s3.us-east-1.amazonaws.com/markDown_6h89o8u1kt.txt",
    "json_hosted_url": null,
    "text_hosted_url": null,
    "links_on_page": [],
    "page_metadata": { "status_code": 200, "title": "" }
  }
}
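
As a minimal sketch of reading such a response, the helper below collects only the *_content fields that are populated. It assumes the JSON body has already been parsed into a plain dict shaped like the example above; available_content is a hypothetical helper name, not part of the Olostep SDK.

```python
import json


# Hypothetical helper (not part of the Olostep SDK):
# return the *_content fields of a scrape that are not null.
def available_content(scrape: dict) -> dict:
    result = scrape.get("result") or {}
    return {
        key: value
        for key, value in result.items()
        if key.endswith("_content") and value is not None
    }


# Trimmed copy of the example response shown above.
raw = """
{
  "id": "scrape_6h89o8u1kt",
  "result": {
    "html_content": "<html...",
    "markdown_content": "## Alexander the Great...",
    "text_content": null,
    "json_content": null
  }
}
"""

scrape = json.loads(raw)
content = available_content(scrape)
print(sorted(content))  # ['html_content', 'markdown_content']
```

Filtering on null rather than on the requested formats keeps the helper independent of how the request was built.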

Caching

To optimize speed, Olostep provides an optional shared caching layer for HTML, Markdown, text, and parsed JSON results.

How it works

When a scrape is requested, Olostep checks if a matching scrape already exists with the same parameters. If a fresh enough match is found, the content is served instantly from Olostep’s storage without launching a new browser scrape.

Shared Cache: The cache is shared globally. If another request scraped the exact same URL with the exact same configuration within your freshness window, you benefit from the speedup.
Post-processing is still live: Operations like llm_extract and links_on_page filters are run on-the-fly on top of the cached document. You only cache the core page retrieval, keeping your structured extractions dynamic.

Freshness and `max_age`

By default, the production API always performs a live scrape to guarantee real-time accuracy. You can opt in to caching using the max_age parameter.

Parameter	Type	Default	Description
`max_age`	`integer`	`0`	Acceptable content age in seconds. If a cached copy exists and is newer than `max_age` seconds, it is served from the cache.

Default API Behavior (max_age: 0): Every API request triggers a fresh scrape.
Default Playground Behavior: In the dashboard playground, max_age defaults to 24 hours (86400 seconds) to prevent redundant scrapes and save credits while you build and test.
Maximum Age: The cache has a hard limit of 7 days (604800 seconds). Any max_age above this limit falls back to 7 days.
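
The freshness rules above can be sketched as a small calculation. This is only an illustration of the documented behavior, not Olostep's implementation; effective_max_age and would_serve_from_cache are hypothetical names introduced for this example.

```python
MAX_CACHE_AGE_SECONDS = 604800  # 7-day hard limit stated above


def effective_max_age(requested: int) -> int:
    """Cap a requested max_age at the 7-day limit."""
    return min(requested, MAX_CACHE_AGE_SECONDS)


def would_serve_from_cache(cached_age_seconds: float, requested_max_age: int) -> bool:
    """True if a cached copy of this age is newer than the effective max_age."""
    return cached_age_seconds < effective_max_age(requested_max_age)


print(effective_max_age(30 * 86400))        # 604800
print(would_serve_from_cache(3600, 86400))  # True
print(would_serve_from_cache(3600, 0))      # False
```

With max_age set to 0 no cached copy can ever be newer than the limit, which matches the default of always scraping live.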

Usage Examples

from olostep import Olostep

client = Olostep(api_key="YOUR_REAL_KEY")

# Opt in to caching: accept results up to 1 day (86400 seconds) old
result = client.scrapes.create(
    url_to_scrape="https://example.com",
    formats=["markdown"],
    max_age=86400
)

When is the cache skipped?

The cache is automatically bypassed (forcing a live scrape) for features that require unique sessions, real-time visual outputs, or custom file handling:

Interactive sessions: Requests using session_id or loading a custom browser context.
Visuals: Visualizer tools and screenshots (htmlVisualizer).
Special file types: Binary file downloads or raw PDF rendering.
Debugging & Network: Capturing network_calls or using async parser jobs.

Extracting links

Pass a links_on_page object in the request to collect the links found on the page. All links are returned as absolute URLs.

"links_on_page": {
  "include_links": ["/blog/*"],
  "exclude_links": ["*.pdf"],
  "query_to_order_links_by": "pricing"
}

include_links / exclude_links: glob patterns matched against each link’s URL path.
query_to_order_links_by: re-orders the returned links by relevance to this text.

Glob patterns match path segments. A single * does not cross /, so "/blog/*" matches "/blog/post-1" but not the index "/blog" itself, and it never matches "/blog?tag=x" because query strings are not part of the path. To include the index too, use "{/blog,/blog/**}". By the same rule, "/blog*" matches the index "/blog" (and sibling paths such as "/blog-archive") but not the posts beneath it.
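
To make the segment rule concrete, here is a small stand-alone matcher that follows the behavior described above. It is a sketch written for illustration and is not Olostep's matcher: it handles only *, ** and {a,b} alternation (no character classes), and glob_to_regex and path_matches are hypothetical names.

```python
import re
from urllib.parse import urlsplit


def glob_to_regex(pattern: str) -> str:
    """Translate a glob into regex source where * stays inside one path segment."""
    out = []
    i = 0
    depth = 0  # nesting level inside {a,b} groups
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("**", i):
            out.append(".*")      # ** may cross "/"
            i += 2
            continue
        if ch == "*":
            out.append("[^/]*")   # * stops at "/"
        elif ch == "{":
            out.append("(?:")
            depth += 1
        elif ch == "}" and depth:
            out.append(")")
            depth -= 1
        elif ch == "," and depth:
            out.append("|")
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


def path_matches(pattern: str, url: str) -> bool:
    """Match the pattern against the URL path only; the query string is ignored."""
    path = urlsplit(url).path
    return re.fullmatch(glob_to_regex(pattern), path) is not None


print(path_matches("/blog/*", "https://example.com/blog/post-1"))    # True
print(path_matches("/blog/*", "https://example.com/blog"))           # False
print(path_matches("/blog/*", "https://example.com/blog?tag=x"))     # False
print(path_matches("{/blog,/blog/**}", "https://example.com/blog"))  # True
```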

Scrape Formats

Choose one or more output formats via formats:

markdown: LLM-friendly markdown
html: cleaned HTML
text: plain text
json: structured output (via parser or llm_extract)
raw_pdf: raw PDF bytes extracted to hosted URL
screenshot: set via actions to capture a screenshot and return a hosted URL

Each requested format is returned inside result as a *_content field, together with a matching *_hosted_url field.

Extract structured data

You can extract structured JSON in two ways: using Parsers or LLM extraction.

Using a Parser (recommended for scale)

Define formats: ["json"] and provide a parser id.

from olostep import Olostep

client = Olostep(api_key="YOUR_REAL_KEY")

result = client.scrapes.create(
    url_to_scrape="https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
    formats=["json"],
    parser="@olostep/google-search",
)

print(result.json_content)

Olostep has a few pre-built parsers for popular websites, but you can also create your own parsers through the dashboard or ask our team to build one for you. Parsers are self-healing: they update themselves when the target website changes.

Using LLM extraction (schema and/or prompt)

Provide llm_extract with a JSON Schema (schema) and/or a natural language instruction (prompt). If you pass both, schema takes precedence. If you pass only a prompt, the LLM extracts the data based on the prompt and decides the data structure on its own.

from olostep import LLMExtract, Olostep

client = Olostep(api_key="YOUR_REAL_KEY")

result = client.scrapes.create(
    url_to_scrape="https://www.berklee.edu/events/stefano-marchese-friends",
    formats=["markdown", "json"],
    llm_extract=LLMExtract(
        schema={
            "event": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "date": {"type": "string"},
                    "description": {"type": "string"},
                    "venue": {"type": "string"},
                    "address": {"type": "string"},
                    "start_time": {"type": "string"},
                },
            }
        }
    ),
)

print(result.json_content)

Note: result.json_content is returned as a JSON string. Parse it in your code if you need an object.

Pricing: llm_extract costs 10 credits per scrape. To lower the cost, you can bring your own API keys or enable usage-based pricing. Contact info@olostep.com to get access.
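
Because json_content arrives as a string, a typical next step is to decode it. The sketch below uses a made-up string in place of result.json_content (the field values are illustrative only), and parse_json_content is a hypothetical helper name.

```python
import json

# Stand-in for result.json_content; the values are made up for illustration.
json_content = '{"event": {"title": "Example Concert", "venue": "Example Hall"}}'


def parse_json_content(value):
    """Decode a JSON string into a Python object; pass None through unchanged."""
    if value is None:
        return None
    return json.loads(value)


data = parse_json_content(json_content)
print(data["event"]["title"])  # Example Concert
```

Passing None through is convenient because json_content is null whenever the json format was not requested.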

Extract links on the page

With the links_on_page option, you can extract all the links present on the page you scrape. It accepts the following parameters to help filter and order the extracted links:

absolute_links (boolean, default: true): When true, it returns complete URLs (e.g., https://example.com/page) instead of relative paths (e.g., /page).
query_to_order_links_by (string): Orders the returned links by their similarity to the provided query text, prioritizing the most relevant matches first.
include_links (array of strings): Filter extracted links using glob patterns. Use patterns like *.pdf to match file extensions, /blog/* for specific paths, or full URLs like https://example.com/*. Supports wildcards (*), character classes ([a-z]), and alternation ({pattern1,pattern2}).
exclude_links (array of strings): Exclude specific links using glob patterns, following the same syntax as include_links.

Interacting with the page with Actions

Perform actions before scraping to interact with dynamic sites. Supported actions:

wait with milliseconds
click with selector
fill_input with selector and value
scroll with direction and amount

It is often useful to use wait before/after other actions to allow the page to load.
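
One way to apply this advice is to insert the waits programmatically. The sketch below assumes actions are written as plain dicts with a "type" key, as the click action in the example on this page is, and that wait and fill_input use the key names from the list above; with_waits is a hypothetical helper, not an SDK function.

```python
def with_waits(actions: list[dict], milliseconds: int = 500) -> list[dict]:
    """Return a new list with a wait action inserted after every non-wait action."""
    padded = []
    for action in actions:
        padded.append(action)
        if action.get("type") != "wait":
            # Assumed dict shape for a wait action (see the lead-in above).
            padded.append({"type": "wait", "milliseconds": milliseconds})
    return padded


steps = [
    {"type": "fill_input", "selector": "input[type=email]", "value": "john@example.com"},
    {"type": "click", "selector": "button[type=submit]"},
]

padded = with_waits(steps, milliseconds=1000)
print(len(padded))  # 4
print(padded[1])    # {'type': 'wait', 'milliseconds': 1000}
```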

Example

from olostep import FillInputAction, Olostep, WaitAction

client = Olostep(api_key="YOUR_REAL_KEY")

result = client.scrapes.create(
    url_to_scrape="https://example.com/login",
    formats=["markdown"],
    actions=[
        FillInputAction(selector="input[type=email]", value="john@example.com"),
        WaitAction(milliseconds=500),
        FillInputAction(selector="input[type=password]", value="secret"),
        {"type": "click", "selector": "button[type=\"submit\"]"},
        WaitAction(milliseconds=1500),
    ],
)

print(result.markdown_content)

The response will include any requested formats (e.g., markdown_content).

Use Cases

Below are a few practical ways customers use the /v1/scrapes endpoint.

Content Analysis & Research

Competitive Analysis: Extract product details, pricing, and features from competitor websites
Market Research: Analyze landing pages, product descriptions, and customer testimonials
Academic Research: Gather specific data from scientific publications or research portals
Legal Documentation: Extract case studies, regulations, or legal precedents from official websites

E-commerce & Retail

Dynamic Pricing Strategies: Get real-time product pricing from competing stores
Product Information Management: Extract detailed specifications and descriptions
Stock/Inventory Monitoring: Check product availability at other retailers
Review Analysis: Gather consumer feedback and sentiment for specific products

Marketing & Content Creation

Content Curation: Extract relevant articles and blog posts for newsletters
SEO Analysis: Examine competitors’ keyword usage, meta descriptions, and page structure
Lead Generation: Extract contact information from business directories or company pages
Influencer Research: Gather engagement metrics and content styles from influencer profiles
Personalised Social Media Generation: Create AI-powered social media marketing by analyzing customers' websites

Data Applications

AI Training Data Collection: Gather specific examples for machine learning models
Custom Knowledge Base Building: Extract documentation or instructions from software sites
Historical Data Archives: Preserve website content at specific points in time
Structured Data Extraction: Transform web content into formatted datasets for analysis

Monitoring & Alerts

Regulatory Compliance Monitoring: Track changes to legal or regulatory websites
Crisis Management: Monitor news sites for mentions of specific events or organizations
Event Tracking: Extract details about upcoming events from venue or organizer websites
Service Status Monitoring: Check service status pages for specific platforms or tools

Publishing & Media

News Aggregation: Extract breaking news from official sources
Media Monitoring: Track specific topics across news sites
Content Verification: Extract information to fact-check claims or statements
Multimedia Extraction: Gather embedded videos, images, or audio for media libraries

Financial Applications

Investment Research: Extract financial statements or annual reports from company websites
Economic Indicators: Gather economic data from government or financial institution websites
Cryptocurrency Data: Extract real-time pricing and market cap information
Financial News Analysis: Monitor financial news sites for specific market signals

Technical Applications

API Documentation Extraction: Gather technical documentation for reference
Integration Testing: Extract website elements to verify third-party integrations
Accessibility Testing: Analyze website structure for compliance with accessibility standards
Web Archive Creation: Capture full website content for historical preservation

Integration Scenarios

CRM Systems: Enhance customer profiles with data from company websites or LinkedIn
Content Management Systems: Import relevant external content
Business Intelligence Tools: Supplement internal data with external market information
Project Management Software: Extract specifications or requirements from client websites
Custom Dashboards: Display extracted data alongside internal metrics

Error Handling

All errors follow a shared envelope shape. Check error.type and error.code to branch programmatically:

{
  "id": "error_abc123",
  "object": "error",
  "created": 1745673871,
  "url": "https://example.com",
  "metadata": {},
  "error": {
    "type": "...",
    "code": "...",
    "message": "..."
  }
}

HTTP	`error.type`	`error.code`	Meaning
400	`invalid_request_error`	`dns_resolution_failed`	The domain does not exist or the URL has a typo.
400	`invalid_request_error`	`invalid_url`	The URL is malformed.
502	`invalid_request_error`	`tls_error`	The website has an invalid or incompatible TLS/SSL certificate. `error.detail` carries the low-level SSL code.
504	`request_timeout`	`scrape_poll_timeout`	The scrape did not finish within the ~55-second wait budget.

DNS failure (400)

The domain does not resolve. Check the URL for typos.

{
  "error": {
    "type": "invalid_request_error",
    "code": "dns_resolution_failed",
    "message": "The URL contains a typo, or the domain does not exist."
  }
}

TLS/SSL error (502)

The target website has a broken or incompatible HTTPS configuration. error.detail provides the specific SSL error code for diagnostics; error.code is always tls_error.

{
  "error": {
    "type": "invalid_request_error",
    "code": "tls_error",
    "detail": "err_ssl_tlsv1_alert_internal_error",
    "message": "The website closed or rejected the TLS handshake. The server may be misconfigured or use an unsupported SSL/TLS version."
  }
}

Request timeout (504)

The scrape did not complete within the wait budget. The page may be slow, bot-protected, or temporarily unavailable. This response is safe to retry.

{
  "error": {
    "type": "request_timeout",
    "code": "scrape_poll_timeout",
    "message": "Request timed out while waiting for scrape result. The page may be slow, blocked for our fetchers, or temporarily unavailable."
  }
}
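
Branching on error.code, as suggested above, might look like the following sketch. It works on an error envelope already parsed into a dict and uses only the codes documented in this section. The returned labels and the next_step name are hypothetical, chosen for this example.

```python
RETRYABLE_CODES = {"scrape_poll_timeout"}  # documented above as safe to retry
URL_PROBLEM_CODES = {"dns_resolution_failed", "invalid_url"}


def next_step(envelope: dict) -> str:
    """Map a parsed error envelope to a coarse action label."""
    error = envelope.get("error") or {}
    code = error.get("code")
    if code in RETRYABLE_CODES:
        return "retry"
    if code in URL_PROBLEM_CODES:
        return "fix_url"
    if code == "tls_error":
        # error.detail carries the low-level SSL code for diagnostics.
        return f"target_site_tls:{error.get('detail', 'unknown')}"
    return "unhandled"


timeout = {"error": {"type": "request_timeout", "code": "scrape_poll_timeout"}}
tls = {
    "error": {
        "type": "invalid_request_error",
        "code": "tls_error",
        "detail": "err_ssl_tlsv1_alert_internal_error",
    }
}

print(next_step(timeout))  # retry
print(next_step(tls))      # target_site_tls:err_ssl_tlsv1_alert_internal_error
```

Keying on error.code rather than the HTTP status keeps the branches specific, since two different codes share status 400.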

Pricing

A scrape costs 1 credit by default. If you use a parser, the cost varies by parser (1-5 credits). If you use LLM extraction, it costs 10 credits.