Through the Olostep /v1/scrapes endpoint you can extract LLM-friendly Markdown, HTML, text, screenshots, or structured JSON from any URL in real time.
  • Outputs clean Markdown, structured data, screenshots, or HTML
  • Extracts JSON through Parsers or LLM extraction
  • Handles dynamic content: JS-rendered sites, login flows via actions, PDFs
For API details see the Scrape Endpoint API Reference.

Scraping a URL

Use the /v1/scrapes endpoint to scrape a single URL and choose output formats.

Installation

# pip install requests

import requests

Usage

The only required parameters are url_to_scrape and formats. Other common parameters include wait_before_scraping (in milliseconds), remove_css_selectors (default, none, or an array of selectors), and country.
import requests, json

endpoint = "https://api.olostep.com/v1/scrapes"
payload = {
    "url_to_scrape": "https://en.wikipedia.org/wiki/Alexander_the_Great",
    "formats": ["markdown", "html"]
}
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json"
}

response = requests.post(endpoint, json=payload, headers=headers)
print(json.dumps(response.json(), indent=2))
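
For example, you can extend the payload with the optional parameters described above. A minimal sketch reusing endpoint and headers from the example (the parameter values here are illustrative; check the API reference for accepted values):

payload = {
    "url_to_scrape": "https://en.wikipedia.org/wiki/Alexander_the_Great",
    "formats": ["markdown"],
    "wait_before_scraping": 1000,       # wait 1 second before scraping (milliseconds)
    "remove_css_selectors": "default",  # or "none", or an array of CSS selectors to strip
    "country": "US"                     # illustrative country value
}

response = requests.post(endpoint, json=payload, headers=headers)
print(json.dumps(response.json(), indent=2))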

Response

The API returns a scrape object. The scrape has properties such as id and result. The result object contains the following fields (depending on the formats you request, some may be null):
  • html_content: the HTML content of the page. Pass formats: ["html"] to get this.
  • markdown_content: the Markdown content of the page. Pass formats: ["markdown"] to get this.
  • text_content: the text content of the page. Pass formats: ["text"] to get this.
  • json_content: the JSON content of the page. Pass formats: ["json"] to get this and also provide a parser or llm_extract parameter.
  • screenshot_hosted_url: the hosted URL of the screenshot.
  • html_hosted_url: the hosted URL of the HTML content.
  • markdown_hosted_url: the hosted URL of the Markdown content.
  • json_hosted_url: the hosted URL of the JSON content.
  • text_hosted_url: the hosted URL of the text content.
  • links_on_page: the links found on the page.
  • page_metadata: the metadata of the page.
{
  "id": "scrape_6h89o8u1kt",
  "object": "scrape",
  "created": 1745673871,
  "metadata": {},
  "retrieve_id": "6h89o8u1kt",
  "url_to_scrape": "https://en.wikipedia.org/wiki/Alexander_the_Great",
  "result": {
    "html_content": "<html...",
    "markdown_content": "## Alexander the Great...",
    "text_content": null,
    "json_content": null,
    "screenshot_hosted_url": null,
    "html_hosted_url": "https://olostep-storage.s3.us-east-1.amazonaws.com/text_6h89o8u1kt.txt",
    "markdown_hosted_url": "https://olostep-storage.s3.us-east-1.amazonaws.com/markDown_6h89o8u1kt.txt",
    "json_hosted_url": null,
    "text_hosted_url": null,
    "links_on_page": [],
    "page_metadata": { "status_code": 200, "title": "" }
  }
}
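
A minimal sketch of reading these fields from the response above (continuing from the Usage example; only the formats you requested will be non-null):

data = response.json()
result = data["result"]

print(data["id"])                           # e.g. "scrape_6h89o8u1kt"
print(result["page_metadata"]["status_code"])
print(result["markdown_content"][:200])     # present because "markdown" was requested
print(result["html_hosted_url"])            # hosted copy of the HTML content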

Scrape Formats

Choose one or more output formats via formats:
  • markdown: LLM-friendly markdown
  • html: cleaned HTML
  • text: plain text
  • json: structured output (via parser or llm_extract)
  • raw_pdf: raw PDF bytes extracted to hosted URL
  • screenshot: set via actions to capture a screenshot and return a hosted URL
Outputs are returned inside result as *_content fields, each with a corresponding *_hosted_url field (screenshots are returned only as screenshot_hosted_url).
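
This naming convention makes it easy to read back whichever formats you requested. A quick sketch, assuming the response from a previous request is stored in response:

requested = ["markdown", "html", "text"]
result = response.json()["result"]

for fmt in requested:
    content = result.get(f"{fmt}_content")        # inline content, null if not requested
    hosted_url = result.get(f"{fmt}_hosted_url")  # hosted copy of the same output
    print(fmt, "inline" if content else "-", hosted_url)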

Extract structured data

You can extract structured JSON in two ways: with Parsers or with LLM extraction.

Using a Parser

Set formats: ["json"] and provide a parser id.
import requests, json

endpoint = "https://api.olostep.com/v1/scrapes"
payload = {
  "url_to_scrape": "https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
  "formats": ["json"],
  "parser": { 
    "id": "@olostep/google-search" 
  }
}
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>", 
    "Content-Type": "application/json"
}

res = requests.post(endpoint, json=payload, headers=headers)
print(json.dumps(res.json(), indent=2))
Olostep has pre-built parsers for popular websites, but you can also create your own through the dashboard or ask our team to build one for you. Parsers are self-healing and update themselves when the target website changes.

Using LLM extraction (schema and/or prompt)

Provide llm_extract with a JSON Schema (schema) and/or a natural-language instruction (prompt). You can pass both, but if you do, schema takes precedence. If you pass only a prompt, the LLM extracts the data based on it and decides the data structure on its own.
import requests, json

endpoint = "https://api.olostep.com/v1/scrapes"
payload = {
  "url_to_scrape": "https://www.berklee.edu/events/stefano-marchese-friends",
  "formats": ["markdown", "json"],
  "llm_extract": {
    "schema": {
      "event": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "date": {"type": "string"},
          "description": {"type": "string"},
          "venue": {"type": "string"},
          "address": {"type": "string"},
          "start_time": {"type": "string"}
        }
      }
    }
  }
}
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json"
}
res = requests.post(endpoint, json=payload, headers=headers)
print(json.dumps(res.json(), indent=2))
Note: result.json_content is returned as a JSON string. Parse it in your code if you need an object.
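
For example, continuing from the request above, a minimal sketch of turning json_content into a Python object:

result = res.json()["result"]
event_data = json.loads(result["json_content"])  # json_content is a JSON string
print(event_data)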

Interacting with the page with Actions

Perform actions before scraping to interact with dynamic sites. Supported actions:
  • wait with milliseconds
  • click with selector
  • fill_input with selector and value
  • scroll with direction and amount
It is often useful to add a wait before or after other actions to give the page time to load.

Example

import requests, json

endpoint = "https://api.olostep.com/v1/scrapes"
payload = {
  "url_to_scrape": "https://example.com/login",
  "formats": ["markdown"],
  "actions": [
    {"type": "fill_input", "selector": "input[type=email]", "value": "[email protected]"},
    {"type": "wait", "milliseconds": 500},
    {"type": "fill_input", "selector": "input[type=password]", "value": "secret"},
    {"type": "click", "selector": "button[type=\"submit\"]"},
    {"type": "wait", "milliseconds": 1500}
  ]
}
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>", 
    "Content-Type": "application/json"
}
res = requests.post(endpoint, json=payload, headers=headers)
print(json.dumps(res.json(), indent=2))
The response will include any requested formats (e.g., markdown_content).

Use Cases

Below are a few practical ways customers use the /v1/scrapes endpoint.

Content Analysis & Research

  • Competitive Analysis: Extract product details, pricing, and features from competitor websites
  • Market Research: Analyze landing pages, product descriptions, and customer testimonials
  • Academic Research: Gather specific data from scientific publications or research portals
  • Legal Documentation: Extract case studies, regulations, or legal precedents from official websites

E-commerce & Retail

  • Dynamic Pricing Strategies: Get real-time product pricing from competing stores
  • Product Information Management: Extract detailed specifications and descriptions
  • Stock/Inventory Monitoring: Check product availability at other retailers
  • Review Analysis: Gather consumer feedback and sentiment for specific products

Marketing & Content Creation

  • Content Curation: Extract relevant articles and blog posts for newsletters
  • SEO Analysis: Examine competitors’ keyword usage, meta descriptions, and page structure
  • Lead Generation: Extract contact information from business directories or company pages
  • Influencer Research: Gather engagement metrics and content styles from influencer profiles
  • Personalized Social Media Generation: Create AI-powered social media marketing by analyzing customers' websites

Data Applications

  • AI Training Data Collection: Gather specific examples for machine learning models
  • Custom Knowledge Base Building: Extract documentation or instructions from software sites
  • Historical Data Archives: Preserve website content at specific points in time
  • Structured Data Extraction: Transform web content into formatted datasets for analysis

Monitoring & Alerts

  • Regulatory Compliance Monitoring: Track changes to legal or regulatory websites
  • Crisis Management: Monitor news sites for mentions of specific events or organizations
  • Event Tracking: Extract details about upcoming events from venue or organizer websites
  • Service Status Monitoring: Check service status pages for specific platforms or tools

Publishing & Media

  • News Aggregation: Extract breaking news from official sources
  • Media Monitoring: Track specific topics across news sites
  • Content Verification: Extract information to fact-check claims or statements
  • Multimedia Extraction: Gather embedded videos, images, or audio for media libraries

Financial Applications

  • Investment Research: Extract financial statements or annual reports from company websites
  • Economic Indicators: Gather economic data from government or financial institution websites
  • Cryptocurrency Data: Extract real-time pricing and market cap information
  • Financial News Analysis: Monitor financial news sites for specific market signals

Technical Applications

  • API Documentation Extraction: Gather technical documentation for reference
  • Integration Testing: Extract website elements to verify third-party integrations
  • Accessibility Testing: Analyze website structure for compliance with accessibility standards
  • Web Archive Creation: Capture full website content for historical preservation

Integration Scenarios

  • CRM Systems: Enhance customer profiles with data from company websites or LinkedIn
  • Content Management Systems: Import relevant external content
  • Business Intelligence Tools: Supplement internal data with external market information
  • Project Management Software: Extract specifications or requirements from client websites
  • Custom Dashboards: Display extracted data alongside internal metrics

Pricing

A scrape costs 1 credit by default. If you also pass a parser, the cost varies by parser (1-5 credits). LLM extraction costs 20 credits.