# Scrape

> Turn any URL into clean data

Through the Olostep `/v1/scrapes` endpoint you can extract LLM-friendly Markdown, HTML, text, screenshots, or structured JSON from any URL in real time.

* Outputs clean Markdown, structured data, screenshots, or HTML
* Extract JSON through [Parsers](/features/structured-content/parsers) or [LLM extraction](/features/structured-content/llm-extraction)
* Handles dynamic content: JS-rendered sites, login flows via actions, PDFs

For API details, see the [Scrape Endpoint API Reference](/api-reference/scrapes/create).

## Scraping a URL

Use the `/v1/scrapes` endpoint to scrape a single URL and choose output formats.

### Installation

<CodeGroup>
  ```python Python theme={null}
  pip install olostep
  ```

  ```javascript Node theme={null}
  npm install olostep
  ```

  ```bash cURL theme={null}
  # curl is available by default on macOS, Linux, and Windows
  ```

  ```javascript Node (API) theme={null}
  npm install node-fetch
  ```

  ```bash Python (API) theme={null}
  pip install requests
  ```
</CodeGroup>

### Usage

Every request needs two parameters: `url_to_scrape` and `formats`.

Other common parameters are `wait_before_scraping` (in milliseconds), `remove_css_selectors` (`default`, `none`, or an array of selectors), and `country`.
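
As a rough sketch of how these parameters fit together in one request body, the helper below assembles a payload and leaves out any option you do not set. `build_scrape_payload` is a hypothetical helper, not part of the Olostep SDK, and the `country` value `"US"` is an assumed format; check the API reference for the accepted values.

```python
from typing import Any, Optional


def build_scrape_payload(
    url_to_scrape: str,
    formats: list,
    wait_before_scraping: Optional[int] = None,  # milliseconds
    remove_css_selectors: Any = None,            # "default", "none", or selectors
    country: Optional[str] = None,
) -> dict:
    """Assemble a /v1/scrapes request body, omitting options that are not set."""
    if not url_to_scrape or not formats:
        raise ValueError("url_to_scrape and formats are mandatory")
    payload = {"url_to_scrape": url_to_scrape, "formats": list(formats)}
    optional = {
        "wait_before_scraping": wait_before_scraping,
        "remove_css_selectors": remove_css_selectors,
        "country": country,
    }
    # Only send the options the caller actually provided.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


payload = build_scrape_payload(
    "https://en.wikipedia.org/wiki/Alexander_the_Great",
    ["markdown"],
    wait_before_scraping=2000,
    remove_css_selectors="none",
    country="US",  # assumed value format, not confirmed by this page
)
print(payload)
```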

<CodeGroup>
  ```python Python theme={null}
  from olostep import Olostep

  client = Olostep(api_key="YOUR_REAL_KEY")

  result = client.scrapes.create(
      url_to_scrape="https://en.wikipedia.org/wiki/Alexander_the_Great",
      formats=["markdown", "html"],
  )

  print(result.markdown_content)
  print(result.html_content)
  ```

  ```js Node theme={null}
  import Olostep from 'olostep'

  const client = new Olostep({ apiKey: 'YOUR_REAL_KEY' })

  const result = await client.scrapes.create({
    url: 'https://en.wikipedia.org/wiki/Alexander_the_Great',
    formats: ['markdown', 'html'],
  })

  console.log(result.markdown_content)
  console.log(result.html_content)
  ```

  ```bash cURL theme={null}
  curl -s -X POST "https://api.olostep.com/v1/scrapes" \
    -H "Authorization: Bearer $OLOSTEP_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url_to_scrape": "https://en.wikipedia.org/wiki/Alexander_the_Great",
      "formats": ["markdown", "html"]
    }'
  ```

  ```js Node (API) theme={null}
  const endpoint = 'https://api.olostep.com/v1/scrapes'
  const payload = {
    url_to_scrape: 'https://en.wikipedia.org/wiki/Alexander_the_Great',
    formats: ['markdown', 'html']
  }
  const res = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer <YOUR_API_KEY>',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(payload)
  })
  const data = await res.json()
  console.log(data)
  ```

  ```python Python (API) theme={null}
  import requests
  import json

  endpoint = "https://api.olostep.com/v1/scrapes"
  payload = {
      "url_to_scrape": "https://en.wikipedia.org/wiki/Alexander_the_Great",
      "formats": ["markdown", "html"]
  }
  headers = {
      "Authorization": "Bearer <YOUR_API_KEY>",
      "Content-Type": "application/json"
  }

  response = requests.post(endpoint, json=payload, headers=headers)
  print(json.dumps(response.json(), indent=2))
  ```
</CodeGroup>

### Response

The API returns a `scrape` object.

The `scrape` object has properties such as `id` and `result`.

The `result` object has the following fields. Depending on the `formats` parameter, some may be null:

* `html_content`: the HTML content of the page. Pass `formats: ["html"]` to get this.
* `markdown_content`: the Markdown content of the page. Pass `formats: ["markdown"]` to get this.
* `text_content`: the text content of the page. Pass `formats: ["text"]` to get this.
* `json_content`: the JSON content of the page. Pass `formats: ["json"]` to get this and also provide a `parser` or `llm_extract` parameter.
* `screenshot_hosted_url`: the hosted URL of the screenshot.
* `html_hosted_url`: the hosted URL of the HTML content.
* `markdown_hosted_url`: the hosted URL of the Markdown content.
* `json_hosted_url`: the hosted URL of the JSON content.
* `text_hosted_url`: the hosted URL of the text content.
* `links_on_page`: the links found on the page.
* `page_metadata`: the metadata of the page.

```json theme={null}
{
  "id": "scrape_6h89o8u1kt",
  "object": "scrape",
  "created": 1745673871,
  "metadata": {},
  "retrieve_id": "6h89o8u1kt",
  "url_to_scrape": "https://en.wikipedia.org/wiki/Alexander_the_Great",
  "result": {
    "html_content": "<html...",
    "markdown_content": "## Alexander the Great...",
    "text_content": null,
    "json_content": null,
    "screenshot_hosted_url": null,
    "html_hosted_url": "https://olostep-storage.s3.us-east-1.amazonaws.com/text_6h89o8u1kt.txt",
    "markdown_hosted_url": "https://olostep-storage.s3.us-east-1.amazonaws.com/markDown_6h89o8u1kt.txt",
    "json_hosted_url": null,
    "text_hosted_url": null,
    "links_on_page": [],
    "page_metadata": { "status_code": 200, "title": "" }
  }
}
```
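
Because unrequested formats come back as null, it can help to filter `result` down to the fields that actually hold data. A minimal sketch, assuming the response has already been decoded into a Python dict; `available_outputs` is a hypothetical helper name:

```python
from typing import Any


def available_outputs(scrape: dict) -> dict:
    """Return the non-null `*_content` and `*_hosted_url` fields of `result`."""
    result = scrape.get("result") or {}
    return {
        key: value
        for key, value in result.items()
        if key.endswith(("_content", "_hosted_url")) and value is not None
    }


# Trimmed copy of the sample response above.
sample: dict = {
    "id": "scrape_6h89o8u1kt",
    "result": {
        "html_content": "<html...",
        "markdown_content": "## Alexander the Great...",
        "text_content": None,
        "json_content": None,
        "screenshot_hosted_url": None,
        "links_on_page": [],
        "page_metadata": {"status_code": 200, "title": ""},
    },
}

print(sorted(available_outputs(sample)))
# → ['html_content', 'markdown_content']
```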

## Scrape Formats

Choose one or more output formats via `formats`:

* `markdown`: LLM-friendly markdown
* `html`: cleaned HTML
* `text`: plain text
* `json`: structured output (via `parser` or `llm_extract`)
* `raw_pdf`: the raw PDF bytes, saved to a hosted URL
* `screenshot`: a screenshot of the page, captured via actions and returned as a hosted URL

Outputs are returned inside `result` as `*_content` fields, each with a corresponding `*_hosted_url`.
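
A quick client-side check can catch a misspelled format, or a `json` request with no extractor, before the request is sent. The sketch below is an assumption-laden convenience, not API behavior: `check_formats` is a hypothetical helper and `KNOWN_FORMATS` simply mirrors the list above, so the API remains the source of truth.

```python
# Mirrors the formats listed on this page; the API may accept others.
KNOWN_FORMATS = {"markdown", "html", "text", "json", "raw_pdf", "screenshot"}


def check_formats(formats, parser=None, llm_extract=None):
    """Raise ValueError for unknown formats or for `json` without an extractor."""
    unknown = sorted(set(formats) - KNOWN_FORMATS)
    if unknown:
        raise ValueError(f"unknown formats: {unknown}")
    # `json` output needs either a parser or llm_extract to produce it.
    if "json" in formats and parser is None and llm_extract is None:
        raise ValueError("'json' requires a parser or llm_extract")
    return list(formats)


print(check_formats(["markdown", "html"]))
print(check_formats(["json"], parser={"id": "@olostep/google-search"}))
```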

## Extract structured data

You can extract structured JSON in two ways: using Parsers or LLM extraction.

### Using a Parser (recommended for scale)

Set `formats: ["json"]` and provide a parser `id`.

<CodeGroup>
  ```python Python theme={null}
  from olostep import Olostep

  client = Olostep(api_key="YOUR_REAL_KEY")

  result = client.scrapes.create(
      url_to_scrape="https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
      formats=["json"],
      parser="@olostep/google-search",
  )

  print(result.json_content)
  ```

  ```js Node theme={null}
  import Olostep from 'olostep'

  const client = new Olostep({ apiKey: 'YOUR_REAL_KEY' })

  const result = await client.scrapes.create({
    url: 'https://www.google.com/search?q=alexander+the+great&gl=us&hl=en',
    formats: ['json'],
    parser: '@olostep/google-search',
  })

  console.log(result.json_content)
  ```

  ```bash cURL theme={null}
  curl -s -X POST "https://api.olostep.com/v1/scrapes" \
    -H "Authorization: Bearer $OLOSTEP_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url_to_scrape": "https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
      "formats": ["json"],
      "parser": {"id": "@olostep/google-search"}
    }'
  ```

  ```js Node (API) theme={null}
  const res = await fetch('https://api.olostep.com/v1/scrapes', {
    method: 'POST',
    headers: { 'Authorization': 'Bearer <YOUR_API_KEY>', 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url_to_scrape: 'https://www.google.com/search?q=alexander+the+great&gl=us&hl=en',
      formats: ['json'],
      parser: { id: '@olostep/google-search' }
    })
  })
  console.log(await res.json())
  ```

  ```python Python (API) theme={null}
  import requests, json

  endpoint = "https://api.olostep.com/v1/scrapes"
  payload = {
      "url_to_scrape": "https://www.google.com/search?q=alexander+the+great&gl=us&hl=en",
      "formats": ["json"],
      "parser": {"id": "@olostep/google-search"}
  }
  headers = {
      "Authorization": "Bearer <YOUR_API_KEY>",
      "Content-Type": "application/json"
  }

  res = requests.post(endpoint, json=payload, headers=headers)
  print(json.dumps(res.json(), indent=2))
  ```
</CodeGroup>

Olostep offers pre-built parsers for [popular websites](https://www.olostep.com/store). You can also create your own parsers through the dashboard or ask our team to build one for you.

Parsers are self-healing: they update themselves to match the latest version of the website.

### Using LLM extraction (schema and/or prompt)

Provide `llm_extract` with a JSON Schema (`schema`), a natural-language instruction (`prompt`), or both. If both are provided, `schema` takes precedence.

If you pass only a `prompt`, the LLM extracts the data based on the prompt and decides the data structure on its own.

<CodeGroup>
  ```python Python theme={null}
  from olostep import LLMExtract, Olostep

  client = Olostep(api_key="YOUR_REAL_KEY")

  result = client.scrapes.create(
      url_to_scrape="https://www.berklee.edu/events/stefano-marchese-friends",
      formats=["markdown", "json"],
      llm_extract=LLMExtract(
          schema={
              "event": {
                  "type": "object",
                  "properties": {
                      "title": {"type": "string"},
                      "date": {"type": "string"},
                      "description": {"type": "string"},
                      "venue": {"type": "string"},
                      "address": {"type": "string"},
                      "start_time": {"type": "string"},
                  },
              }
          }
      ),
  )

  print(result.json_content)
  ```

  ```js Node theme={null}
  import Olostep from 'olostep'

  const client = new Olostep({ apiKey: 'YOUR_REAL_KEY' })

  const result = await client.scrapes.create({
    url: 'https://www.berklee.edu/events/stefano-marchese-friends',
    formats: ['markdown', 'json'],
    llmExtract: {
      schema: {
        event: {
          type: 'object',
          properties: {
            title: { type: 'string' },
            date: { type: 'string' },
            description: { type: 'string' },
            venue: { type: 'string' },
            address: { type: 'string' },
            start_time: { type: 'string' },
          },
        },
      },
    },
  })

  console.log(result.json_content)
  ```

  ```bash cURL theme={null}
  curl -s -X POST "https://api.olostep.com/v1/scrapes" \
    -H "Authorization: Bearer $OLOSTEP_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url_to_scrape": "https://www.berklee.edu/events/stefano-marchese-friends",
      "formats": ["json"],
      "llm_extract": {
        "prompt": "Extract the event title, date, description, venue, address, and start time from the page."
      }
    }'
  ```

  ```js Node (API) theme={null}
  const res = await fetch('https://api.olostep.com/v1/scrapes', {
    method: 'POST',
    headers: { 'Authorization': 'Bearer <YOUR_API_KEY>', 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url_to_scrape: 'https://www.berklee.edu/events/stefano-marchese-friends',
      formats: ['json'],
      llm_extract: {
        prompt: 'Extract the event title, date, description, venue, address, and start time from the page.'
      }
    })
  })
  console.log(await res.json())
  ```

  ```python Python (API) theme={null}
  import requests, json

  endpoint = "https://api.olostep.com/v1/scrapes"
  payload = {
      "url_to_scrape": "https://www.berklee.edu/events/stefano-marchese-friends",
      "formats": ["markdown", "json"],
      "llm_extract": {
          "schema": {
              "event": {
                  "type": "object",
                  "properties": {
                      "title": {"type": "string"},
                      "date": {"type": "string"},
                      "description": {"type": "string"},
                      "venue": {"type": "string"},
                      "address": {"type": "string"},
                      "start_time": {"type": "string"}
                  }
              }
          }
      }
  }
  headers = {
      "Authorization": "Bearer <YOUR_API_KEY>",
      "Content-Type": "application/json"
  }
  res = requests.post(endpoint, json=payload, headers=headers)
  print(json.dumps(res.json(), indent=2))
  ```
</CodeGroup>

Note: `result.json_content` is returned as a stringified JSON value. Parse it in your code if you need an object.
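
In Python, the standard `json` module handles the parsing. A minimal sketch: `parse_json_content` is a hypothetical helper, and the string below is an invented stand-in for a real `json_content` value.

```python
import json


def parse_json_content(json_content):
    """Decode a stringified `json_content`; pass through None or decoded values."""
    if isinstance(json_content, str):
        return json.loads(json_content)
    return json_content


# Invented stand-in for result.json_content, for illustration only.
raw = '{"event": {"title": "Example Concert", "venue": "Example Hall"}}'

data = parse_json_content(raw)
print(data["event"]["title"])
# → Example Concert
```

Passing `None` through unchanged keeps the helper safe to call when `json` was not among the requested formats.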

## Interacting with the page with Actions

Perform actions before scraping to interact with dynamic sites. Supported actions:

* `wait` with `milliseconds`
* `click` with `selector`
* `fill_input` with `selector` and `value`
* `scroll` with `direction` and `amount`

Adding a `wait` before or after other actions is often useful, because it gives the page time to load.
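
One way to apply that advice consistently is to pad an action list with waits automatically. The sketch below is a hypothetical helper (`with_waits` is not part of the SDK), and the 500 ms default is an arbitrary choice; it adds a `wait` after every action that is not already followed by one.

```python
def with_waits(actions, milliseconds=500):
    """Return a copy of `actions` with a `wait` after each non-wait action."""
    padded = []
    for index, action in enumerate(actions):
        padded.append(action)
        following = actions[index + 1] if index + 1 < len(actions) else None
        is_wait = action.get("type") == "wait"
        next_is_wait = following is not None and following.get("type") == "wait"
        # Skip padding when a wait is already in place.
        if not is_wait and not next_is_wait:
            padded.append({"type": "wait", "milliseconds": milliseconds})
    return padded


actions = [
    {"type": "fill_input", "selector": "input[type=email]", "value": "john@example.com"},
    {"type": "click", "selector": "button[type=\"submit\"]"},
]

for action in with_waits(actions):
    print(action["type"])
```

The padded list can be passed as the `actions` array of the request.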

### Example

<CodeGroup>
  ```python Python theme={null}
  from olostep import FillInputAction, Olostep, WaitAction

  client = Olostep(api_key="YOUR_REAL_KEY")

  result = client.scrapes.create(
      url_to_scrape="https://example.com/login",
      formats=["markdown"],
      actions=[
          FillInputAction(selector="input[type=email]", value="john@example.com"),
          WaitAction(milliseconds=500),
          FillInputAction(selector="input[type=password]", value="secret"),
          {"type": "click", "selector": "button[type=\"submit\"]"},
          WaitAction(milliseconds=1500),
      ],
  )

  print(result.markdown_content)
  ```

  ```js Node theme={null}
  import Olostep from 'olostep'

  const client = new Olostep({ apiKey: 'YOUR_REAL_KEY' })

  const result = await client.scrapes.create({
    url: 'https://example.com/login',
    formats: ['markdown'],
    actions: [
      { type: 'fill_input', selector: 'input[type=email]', value: 'john@example.com' },
      { type: 'wait', milliseconds: 500 },
      { type: 'fill_input', selector: 'input[type=password]', value: 'secret' },
      { type: 'click', selector: 'button[type="submit"]' },
      { type: 'wait', milliseconds: 1500 },
    ],
  })

  console.log(result.markdown_content)
  ```

  ```bash cURL theme={null}
  curl -s -X POST "https://api.olostep.com/v1/scrapes" \
    -H "Authorization: Bearer $OLOSTEP_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url_to_scrape": "https://example.com/login",
      "formats": ["markdown"],
      "actions": [
        { "type": "fill_input", "selector": "input[type=email]", "value": "john@example.com" },
        { "type": "wait", "milliseconds": 500 },
        { "type": "fill_input", "selector": "input[type=password]", "value": "secret" },
        { "type": "click", "selector": "button[type=\"submit\"]" },
        { "type": "wait", "milliseconds": 1500 }
      ]
    }'
  ```

  ```js Node (API) theme={null}
  const res = await fetch('https://api.olostep.com/v1/scrapes', {
    method: 'POST',
    headers: { 'Authorization': 'Bearer <YOUR_API_KEY>', 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url_to_scrape: 'https://example.com/login',
      formats: ['markdown'],
      actions: [
        { type: 'fill_input', selector: 'input[type=email]', value: 'john@example.com' },
        { type: 'wait', milliseconds: 500 },
        { type: 'fill_input', selector: 'input[type=password]', value: 'secret' },
        { type: 'click', selector: 'button[type="submit"]' },
        { type: 'wait', milliseconds: 1500 }
      ]
    })
  })
  console.log(await res.json())
  ```

  ```python Python (API) theme={null}
  import requests, json

  endpoint = "https://api.olostep.com/v1/scrapes"
  payload = {
      "url_to_scrape": "https://example.com/login",
      "formats": ["markdown"],
      "actions": [
          {"type": "fill_input", "selector": "input[type=email]", "value": "john@example.com"},
          {"type": "wait", "milliseconds": 500},
          {"type": "fill_input", "selector": "input[type=password]", "value": "secret"},
          {"type": "click", "selector": "button[type=\"submit\"]"},
          {"type": "wait", "milliseconds": 1500}
      ]
  }
  headers = {
      "Authorization": "Bearer <YOUR_API_KEY>",
      "Content-Type": "application/json"
  }
  res = requests.post(endpoint, json=payload, headers=headers)
  print(json.dumps(res.json(), indent=2))
  ```
</CodeGroup>

The response will include any requested formats (e.g., `markdown_content`).

## Use Cases

Below are a few practical ways customers use the `/v1/scrapes` endpoint.

### Content Analysis & Research

* **Competitive Analysis**: Extract product details, pricing, and features from competitor websites
* **Market Research**: Analyze landing pages, product descriptions, and customer testimonials
* **Academic Research**: Gather specific data from scientific publications or research portals
* **Legal Documentation**: Extract case studies, regulations, or legal precedents from official websites

### E-commerce & Retail

* **Dynamic Pricing Strategies**: Get real-time product pricing from competing stores
* **Product Information Management**: Extract detailed specifications and descriptions
* **Stock/Inventory Monitoring**: Check product availability at other retailers
* **Review Analysis**: Gather consumer feedback and sentiment for specific products

### Marketing & Content Creation

* **Content Curation**: Extract relevant articles and blog posts for newsletters
* **SEO Analysis**: Examine competitors' keyword usage, meta descriptions, and page structure
* **Lead Generation**: Extract contact information from business directories or company pages
* **Influencer Research**: Gather engagement metrics and content styles from influencer profiles
* **Personalized Social Media Generation**: Create AI-powered social media marketing by analyzing customers' websites

### Data Applications

* **AI Training Data Collection**: Gather specific examples for machine learning models
* **Custom Knowledge Base Building**: Extract documentation or instructions from software sites
* **Historical Data Archives**: Preserve website content at specific points in time
* **Structured Data Extraction**: Transform web content into formatted datasets for analysis

### Monitoring & Alerts

* **Regulatory Compliance Monitoring**: Track changes to legal or regulatory websites
* **Crisis Management**: Monitor news sites for mentions of specific events or organizations
* **Event Tracking**: Extract details about upcoming events from venue or organizer websites
* **Service Status Monitoring**: Check service status pages for specific platforms or tools

### Publishing & Media

* **News Aggregation**: Extract breaking news from official sources
* **Media Monitoring**: Track specific topics across news sites
* **Content Verification**: Extract information to fact-check claims or statements
* **Multimedia Extraction**: Gather embedded videos, images, or audio for media libraries

### Financial Applications

* **Investment Research**: Extract financial statements or annual reports from company websites
* **Economic Indicators**: Gather economic data from government or financial institution websites
* **Cryptocurrency Data**: Extract real-time pricing and market cap information
* **Financial News Analysis**: Monitor financial news sites for specific market signals

### Technical Applications

* **API Documentation Extraction**: Gather technical documentation for reference
* **Integration Testing**: Extract website elements to verify third-party integrations
* **Accessibility Testing**: Analyze website structure for compliance with accessibility standards
* **Web Archive Creation**: Capture full website content for historical preservation

### Integration Scenarios

* **CRM Systems**: Enhance customer profiles with data from company websites or LinkedIn
* **Content Management Systems**: Import relevant external content
* **Business Intelligence Tools**: Supplement internal data with external market information
* **Project Management Software**: Extract specifications or requirements from client websites
* **Custom Dashboards**: Display extracted data alongside internal metrics

## Error Handling

All errors follow a shared envelope shape. Check `error.type` and `error.code` to branch programmatically:

```json theme={null}
{
  "id": "error_abc123",
  "object": "error",
  "created": 1745673871,
  "url": "https://example.com",
  "metadata": {},
  "error": {
    "type": "...",
    "code": "...",
    "message": "..."
  }
}
```

| HTTP | `error.type`            | `error.code`            | Meaning                                                                                                        |
| ---- | ----------------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------- |
| 400  | `invalid_request_error` | `dns_resolution_failed` | The domain does not exist or the URL has a typo.                                                               |
| 400  | `invalid_request_error` | `invalid_url`           | The URL is malformed.                                                                                          |
| 502  | `invalid_request_error` | `tls_error`             | The website has an invalid or incompatible TLS/SSL certificate. `error.detail` carries the low-level SSL code. |
| 504  | `request_timeout`       | `scrape_poll_timeout`   | The scrape did not finish within the \~55-second wait budget.                                                  |

### DNS failure (400)

The domain does not resolve. Check the URL for typos.

```json theme={null}
{
  "error": {
    "type": "invalid_request_error",
    "code": "dns_resolution_failed",
    "message": "The URL contains a typo, or the domain does not exist."
  }
}
```

### TLS/SSL error (502)

The target website has a broken or incompatible HTTPS configuration. `error.detail` provides the specific SSL error code for diagnostics; `error.code` is always `tls_error`.

```json theme={null}
{
  "error": {
    "type": "invalid_request_error",
    "code": "tls_error",
    "detail": "err_ssl_tlsv1_alert_internal_error",
    "message": "The website closed or rejected the TLS handshake. The server may be misconfigured or use an unsupported SSL/TLS version."
  }
}
```

### Request timeout (504)

The scrape did not complete within the wait budget. The page may be slow, bot-protected, or temporarily unavailable. This response is safe to retry.

```json theme={null}
{
  "error": {
    "type": "request_timeout",
    "code": "scrape_poll_timeout",
    "message": "Request timed out while waiting for scrape result. The page may be slow, blocked for our fetchers, or temporarily unavailable."
  }
}
```
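
Putting the error envelope to work, the sketch below retries only the 504 `scrape_poll_timeout` case and raises on everything else. It is one possible approach, not an official client: `send` is a hypothetical callable that you supply (for example, a wrapper around `requests.post` that returns the status code and decoded body), and the attempt count and backoff are arbitrary.

```python
import time

RETRYABLE_CODES = {"scrape_poll_timeout"}


def is_retryable(status_code, body):
    """True only for the 504 timeout that this page describes as safe to retry."""
    error = (body or {}).get("error") or {}
    return status_code == 504 and error.get("code") in RETRYABLE_CODES


def scrape_with_retry(send, max_attempts=3, backoff_seconds=1.0, sleep=time.sleep):
    """Call `send()` -> (status_code, body); retry timeouts, raise other errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status_code, body = send()
        if status_code < 400:
            return body
        if attempt < max_attempts and is_retryable(status_code, body):
            sleep(backoff_seconds * attempt)  # simple linear backoff
            continue
        error = (body or {}).get("error") or {}
        raise RuntimeError(
            f"{status_code} {error.get('type')}/{error.get('code')}: "
            f"{error.get('message')}"
        )


# Simulated responses: one timeout, then a success.
responses = iter([
    (504, {"error": {"type": "request_timeout",
                     "code": "scrape_poll_timeout",
                     "message": "Request timed out."}}),
    (200, {"id": "scrape_6h89o8u1kt", "object": "scrape"}),
])

body = scrape_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(body["object"])
# → scrape
```

Errors such as `dns_resolution_failed` or `invalid_url` are raised immediately, since retrying the same URL would not change the outcome.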

## Pricing

A scrape costs 1 credit by default. If you use a [parser](/features/structured-content/parsers), the cost varies by parser (1-5 credits). If you use [LLM extraction](/features/structured-content/llm-extraction), it costs 20 credits.
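
For budgeting, a rough estimate can be computed from these figures. The sketch below rests on an assumption this page does not confirm: that each figure is the total cost of one scrape rather than a surcharge on top of the base credit. `estimate_credits` is a hypothetical helper, and your dashboard remains the authoritative source for usage.

```python
def estimate_credits(scrapes, parser_credits=None, llm_extract=False):
    """Estimate total credits, assuming each figure is the full per-scrape cost."""
    if llm_extract:
        per_scrape = 20
    elif parser_credits is not None:
        if not 1 <= parser_credits <= 5:
            raise ValueError("parser cost is documented as 1-5 credits")
        per_scrape = parser_credits
    else:
        per_scrape = 1
    return scrapes * per_scrape


print(estimate_credits(1000))                    # → 1000
print(estimate_credits(1000, parser_credits=3))  # → 3000
print(estimate_credits(1000, llm_extract=True))  # → 20000
```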
