POST /v1/scrapes
Initiate a web page scrape
curl --request POST \
  --url https://api.olostep.com/v1/scrapes \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "url_to_scrape": "<string>",
  "wait_before_scraping": 123,
  "formats": [
    "html"
  ],
  "remove_css_selectors": "default",
  "actions": [
    {
      "type": "wait",
      "milliseconds": 1
    }
  ],
  "country": "<string>",
  "transformer": "postlight",
  "remove_images": false,
  "remove_class_names": [
    "<string>"
  ],
  "parser": {
    "id": "<string>"
  },
  "llm_extract": {
    "schema": {}
  },
  "links_on_page": {
    "absolute_links": true,
    "query_to_order_links_by": "<string>",
    "include_links": [
      "<string>"
    ],
    "exclude_links": [
      "<string>"
    ]
  },
  "screen_size": {
    "screen_type": "default",
    "screen_width": 123,
    "screen_height": 123
  },
  "metadata": {}
}'
{
  "id": "<string>",
  "object": "<string>",
  "created": 123,
  "metadata": {},
  "url_to_scrape": "<string>",
  "result": {
    "html_content": "<string>",
    "markdown_content": "<string>",
    "text_content": "<string>",
    "json_content": "<string>",
    "screenshot_hosted_url": "<string>",
    "html_hosted_url": "<string>",
    "markdown_hosted_url": "<string>",
    "text_hosted_url": "<string>",
    "links_on_page": [
      "<string>"
    ],
    "page_metadata": {
      "status_code": 123,
      "title": "<string>"
    }
  }
}

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
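As a minimal sketch, the curl request above could be assembled in Python using only the standard library. The helper name build_scrape_request is hypothetical and the token is a placeholder. Nothing is sent until the request is passed to urlopen.

```python
import json
import urllib.request

API_URL = "https://api.olostep.com/v1/scrapes"  # endpoint from the curl example


def build_scrape_request(token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request with Bearer authentication."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "<token>" is a placeholder; substitute your own auth token.
request = build_scrape_request("<token>", {"url_to_scrape": "https://example.com"})
# To send it: call urllib.request.urlopen(request), then read and JSON-decode the body.
```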

Body

application/json
url_to_scrape
string<uri>
required

The URL to start scraping from.

wait_before_scraping
integer

Time to wait, in milliseconds, before scraping starts.

formats
enum<string>[]

Formats in which you want the content.

remove_css_selectors
enum<string>

Option to remove certain CSS selectors from the content. Optionally, you can pass a JSON stringified array of the specific selectors you want to remove. The selectors removed when this option is set to default are: nav, footer, script, style, noscript, svg, [role=alert], [role=banner], [role=dialog], [role=alertdialog], [role=region][aria-label*=skip i], [aria-modal=true].

Available options: default, none, array

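Because a custom selector list is passed as a JSON stringified array, the field value is a string rather than a nested array. A minimal sketch, with illustrative selectors that are not taken from this page:

```python
import json

# Illustrative selectors; replace with the ones to strip from your page.
selectors = ["nav", "footer", ".cookie-banner"]

payload = {
    "url_to_scrape": "https://example.com",
    # The field takes a string: "default", "none", or a JSON stringified array.
    "remove_css_selectors": json.dumps(selectors),
}

print(payload["remove_css_selectors"])
```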
actions
(Wait · object | Click · object | Fill Input · object | Scroll · object)[]

Actions to perform on the page before getting the content.
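The request example shows only the Wait action, with type and milliseconds fields. The sketch below builds a list of such actions. The helper wait_action and its non-negative check are assumptions for illustration. Click, Fill Input, and Scroll are left out because their fields are not listed in this section.

```python
def wait_action(milliseconds: int) -> dict:
    """Return a Wait action in the shape used by the request example."""
    if milliseconds < 0:
        # Client-side sanity check; an assumption, not a documented rule.
        raise ValueError("milliseconds must be non-negative")
    return {"type": "wait", "milliseconds": milliseconds}


payload = {
    "url_to_scrape": "https://example.com",
    # Actions are sent as a list, as in the request example.
    "actions": [wait_action(500), wait_action(1500)],
}
```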

country
string

Residential country to load the request from.

Supported values are:

  • US (United States)
  • CA (Canada)
  • IT (Italy)
  • IN (India)
  • GB (United Kingdom)
  • JP (Japan)
  • MX (Mexico)
  • AU (Australia)
  • ID (Indonesia)
  • UA (UAE)
  • RU (Russia)
  • RANDOM

Some operations, like scraping Google Search and Google News, support all countries.
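A client could check the country value before sending a request. The sketch below copies the codes from the list above. The helper normalize_country and the upper-casing step are assumptions, and the check would be too strict for operations that support all countries.

```python
# Codes copied from the supported values list above.
SUPPORTED_COUNTRIES = {
    "US", "CA", "IT", "IN", "GB", "JP", "MX", "AU", "ID", "UA", "RU", "RANDOM",
}


def normalize_country(code: str) -> str:
    """Upper-case the code and reject values missing from the list."""
    normalized = code.strip().upper()
    if normalized not in SUPPORTED_COUNTRIES:
        raise ValueError(f"unsupported country: {code!r}")
    return normalized


payload = {
    "url_to_scrape": "https://example.com",
    "country": normalize_country("it"),
}
```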

transformer
enum<string>

Specify the HTML transformer to use, if any. Postlight's Mercury Parser library is used to remove ads and other unwanted content from the scraped content.

Available options: postlight, none

remove_images
boolean
default:false

Option to remove images from the scraped content. Defaults to false.

remove_class_names
string[]

List of class names to remove from the content.

parser
object

When json is included in formats, use this parameter to specify the parser. Parsers extract structured content from web pages. Olostep has built-in parsers for the most common web pages, and you can also create your own.
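A sketch of a request body that pairs the json format with a parser, using the field shapes from the request example. The helper name json_parser_payload is hypothetical, and "<parser-id>" is a placeholder for a real parser ID.

```python
def json_parser_payload(url: str, parser_id: str) -> dict:
    """Request body asking for json output produced by the given parser."""
    return {
        "url_to_scrape": url,
        "formats": ["json"],          # the parser applies to the json format
        "parser": {"id": parser_id},  # shape taken from the request example
    }


payload = json_parser_payload("https://example.com", "<parser-id>")
```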

llm_extract
object

Accepts a schema object, as shown in the request example.

links_on_page
object

With this option, you can get all the links present on the page you scrape.
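The request example lists four fields under links_on_page: absolute_links, query_to_order_links_by, include_links, and exclude_links. Their exact semantics are not described in this section, so the sketch below only shows how the object could be assembled. The helper links_options is hypothetical, and the query and pattern values are illustrative.

```python
from typing import List, Optional


def links_options(
    query: Optional[str] = None,
    include: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
    absolute: bool = True,
) -> dict:
    """Build a links_on_page object, leaving out fields that were not set."""
    options: dict = {"absolute_links": absolute}
    if query is not None:
        options["query_to_order_links_by"] = query
    if include is not None:
        options["include_links"] = include
    if exclude is not None:
        options["exclude_links"] = exclude
    return options


payload = {
    "url_to_scrape": "https://example.com",
    "links_on_page": links_options(query="pricing", exclude=["<string>"]),
}
```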

screen_size
object

Configuration for screen size. Preset dimensions are available through screen_type: desktop (1920x1080), mobile (414x896), or default (768x1024).
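The presets can be mirrored in a small lookup table. The sketch below fills screen_width and screen_height from the preset, following the request example, which sends all three fields together. Whether explicit dimensions are required alongside a preset is not stated here, and the helper screen_size_for is hypothetical.

```python
# Preset dimensions (width, height) as listed in the description above.
SCREEN_PRESETS = {
    "desktop": (1920, 1080),
    "mobile": (414, 896),
    "default": (768, 1024),
}


def screen_size_for(screen_type: str) -> dict:
    """Return a screen_size object for a preset; KeyError if it is unknown."""
    width, height = SCREEN_PRESETS[screen_type]
    return {
        "screen_type": screen_type,
        "screen_width": width,
        "screen_height": height,
    }


payload = {
    "url_to_scrape": "https://example.com",
    "screen_size": screen_size_for("mobile"),
}
```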

metadata
object

User-defined metadata. Not supported yet.

Response

Successful response with the scrape initiation details.

id
string

Scrape ID

object
string

The kind of object. "scrape" for this endpoint.

created
number

Creation time, as an epoch timestamp.
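The unit of created is not stated in this section. The sketch below converts it to a UTC datetime, assuming seconds by default and allowing milliseconds through a hypothetical unit argument.

```python
from datetime import datetime, timezone


def created_to_datetime(created: float, unit: str = "s") -> datetime:
    """Convert an epoch value to an aware UTC datetime.

    unit is "s" for seconds (assumed default) or "ms" for milliseconds.
    """
    if unit == "ms":
        created = created / 1000
    elif unit != "s":
        raise ValueError("unit must be 's' or 'ms'")
    return datetime.fromtimestamp(created, tz=timezone.utc)


print(created_to_datetime(0).isoformat())
```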

metadata
object

User-defined metadata.

url_to_scrape
string

The URL that was scraped.

result
object
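In the response example, result carries inline fields such as html_content alongside hosted URLs such as html_hosted_url. The sketch below prefers inline content and falls back to the hosted URL. That preference order is an assumption, the helper pick_content is hypothetical, and sample_result is made-up illustrative data, not real API output.

```python
from typing import Optional, Tuple


def pick_content(result: dict, fmt: str) -> Optional[Tuple[str, str]]:
    """Return ("inline", content) or ("hosted", url) for fmt, else None.

    fmt is one of "html", "markdown", or "text", matching the
    <fmt>_content and <fmt>_hosted_url fields in the response example.
    """
    inline = result.get(f"{fmt}_content")
    if inline:
        return ("inline", inline)
    hosted = result.get(f"{fmt}_hosted_url")
    if hosted:
        return ("hosted", hosted)
    return None


# Made-up sample in the shape of the response example.
sample_result = {
    "markdown_content": "# Title",
    "html_content": None,
    "html_hosted_url": "https://example.com/page.html",
    "page_metadata": {"status_code": 200, "title": "Title"},
}

markdown = pick_content(sample_result, "markdown")  # inline content is present
html = pick_content(sample_result, "html")          # falls back to the hosted URL
```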