POST /v1/scrapes
curl --request POST \
  --url https://api.olostep.com/v1/scrapes \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "url_to_scrape": "<string>",
  "wait_before_scraping": 123,
  "formats": [
    "html"
  ],
  "remove_css_selectors": "default",
  "actions": [
    {
      "type": "wait",
      "milliseconds": 1
    }
  ],
  "country": "<string>",
  "transformer": "postlight",
  "remove_images": false,
  "remove_class_names": [
    "<string>"
  ],
  "parser_extract": {
    "parser_id": "<string>",
    "schema": {}
  },
  "llm_extract": {
    "schema": {}
  },
  "links_on_page": {
    "absolute_links": true,
    "query_to_order_links_by": "<string>",
    "include_links": [
      "<string>"
    ],
    "exclude_links": [
      "<string>"
    ]
  },
  "screen_size": {
    "screen_type": "default",
    "screen_width": 123,
    "screen_height": 123
  },
  "metadata": {}
}'
{
  "id": "<string>",
  "object": "<string>",
  "created": 123,
  "metadata": {},
  "url_to_scrape": "<string>",
  "result": {
    "html_content": "<string>",
    "markdown_content": "<string>",
    "text_content": "<string>",
    "json_content": "<string>",
    "screenshot_hosted_url": "<string>",
    "html_hosted_url": "<string>",
    "markdown_hosted_url": "<string>",
    "text_hosted_url": "<string>",
    "links_on_page": [
      "<string>"
    ],
    "page_metadata": {
      "status_code": 123,
      "title": "<string>"
    }
  }
}

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
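As an illustration of the header format above, here is a minimal Python sketch that builds the same request as the curl sample using only the standard library. The helper name build_scrape_request and the placeholder token are assumptions made for this example, not part of the API. The request is constructed but not sent.

```python
import json
import urllib.request

API_URL = "https://api.olostep.com/v1/scrapes"


def build_scrape_request(token: str, url_to_scrape: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the scrapes endpoint.

    Hypothetical helper for illustration; only url_to_scrape is required
    in the body according to the field list.
    """
    body = json.dumps({"url_to_scrape": url_to_scrape}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            # Bearer authentication header of the form "Bearer <token>"
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# "YOUR_TOKEN" is a placeholder; substitute your own auth token.
req = build_scrape_request("YOUR_TOKEN", "https://example.com")
```

To actually send the request, pass req to urllib.request.urlopen and decode the JSON body of the response.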

Body

application/json
url_to_scrape
string
required

The URL to start scraping from.

wait_before_scraping
integer

Time to wait, in milliseconds, before scraping starts.

formats
enum<string>[]

Formats in which you want the content.

Available options:
html,
markdown,
text,
parser_extract,
raw_pdf
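
A small client-side check can catch a mistyped format name before the request is sent. The sketch below is only a convenience under that assumption; build_formats is a hypothetical helper, and the allowed values are copied from the list above.

```python
# Values copied from the "Available options" list for formats.
ALLOWED_FORMATS = ("html", "markdown", "text", "parser_extract", "raw_pdf")


def build_formats(*requested: str) -> list:
    """Return the requested formats in order, without duplicates.

    Raises ValueError for any name that is not in the documented list.
    """
    formats = []
    for name in requested:
        if name not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {name!r}")
        if name not in formats:
            formats.append(name)
    return formats


payload = {
    "url_to_scrape": "https://example.com",  # placeholder URL
    "formats": build_formats("markdown", "html", "markdown"),
}
```
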
remove_css_selectors
enum<string>

Removes elements matching certain CSS selectors from the content. You can also pass a JSON-stringified array of the specific selectors you want to remove. When this option is set to default, the following selectors are removed: ['nav','footer','script','style','noscript','svg',[role=alert],[role=banner],[role=dialog],[role=alertdialog],[role=region][aria-label*=skip i],[aria-modal=true]]

Available options:
default,
none,
array
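
The "JSON stringified array" form can be produced with json.dumps. This sketch assumes the stringified array is passed directly as the value of remove_css_selectors, which the description suggests but does not spell out. The selector list is a made-up example.

```python
import json

# Hypothetical selectors chosen for illustration.
selectors = ["nav", "footer", ".cookie-banner"]

payload = {
    "url_to_scrape": "https://example.com",  # placeholder URL
    # The field takes a string, so the array is serialized first.
    "remove_css_selectors": json.dumps(selectors),
}
```
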
actions
object[]

Actions to perform on the page before getting the content.

country
string

Residential country to load the request from.

Supported values are:

  • US (United States)
  • CA (Canada)
  • IT (Italy)
  • IN (India)
  • GB (United Kingdom)
  • JP (Japan)
  • MX (Mexico)
  • AU (Australia)
  • ID (Indonesia)
  • UA (UAE)
  • RU (Russia)
  • RANDOM

Some operations, like scraping Google Search and Google News, support all countries.
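
The supported values above can be checked on the client before sending a request. The following is a sketch only: normalize_country is a hypothetical helper, uppercasing the input is an assumption about what the API expects, and the check should be skipped for operations that support all countries.

```python
# Codes copied from the supported values list above.
SUPPORTED_COUNTRIES = {
    "US", "CA", "IT", "IN", "GB", "JP",
    "MX", "AU", "ID", "UA", "RU", "RANDOM",
}


def normalize_country(code: str) -> str:
    """Return the code trimmed and uppercased, or raise if it is not listed."""
    value = code.strip().upper()
    if value not in SUPPORTED_COUNTRIES:
        raise ValueError(f"unsupported country: {code!r}")
    return value


payload = {
    "url_to_scrape": "https://example.com",  # placeholder URL
    "country": normalize_country(" gb "),
}
```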

transformer
enum<string>

Specify the HTML transformer to use, if any. The postlight option uses Postlight's Mercury Parser library to remove ads and other unwanted content from the scraped content.

Available options:
postlight,
none
remove_images
boolean
default:false

Option to remove images from the scraped content. Defaults to false.

remove_class_names
string[]

List of class names to remove from the content.

parser_extract
object

Configuration for parser extraction.

llm_extract
object

Configuration for LLM extraction.

links_on_page
object

With this option, you can get all the links present on the page you scrape.

screen_size
object

Configuration for screen size. Preset dimensions are available through screen_type: desktop (1920x1080), mobile (414x896), or default (768x1024).
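
The presets above can be expressed as a lookup table. This sketch builds a screen_size object with all three keys, as in the sample request body. Whether screen_width and screen_height are needed when a preset screen_type is given is not stated here, so treat that as an assumption. screen_size_for is a hypothetical helper.

```python
# Preset dimensions (width, height) as listed in the description.
SCREEN_PRESETS = {
    "desktop": (1920, 1080),
    "mobile": (414, 896),
    "default": (768, 1024),
}


def screen_size_for(screen_type: str = "default") -> dict:
    """Return a screen_size object for one of the documented presets."""
    width, height = SCREEN_PRESETS[screen_type]  # KeyError for unknown presets
    return {
        "screen_type": screen_type,
        "screen_width": width,
        "screen_height": height,
    }


payload = {
    "url_to_scrape": "https://example.com",  # placeholder URL
    "screen_size": screen_size_for("mobile"),
}
```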

metadata
object

User-defined metadata. Not supported yet.

Response

200
application/json
Successful response with the scrape initiation details.
id
string

Scrape ID

object
string

The kind of object. "scrape" for this endpoint.

created
number

Creation time as an epoch timestamp.

metadata
object

User-defined metadata.

url_to_scrape
string

The URL that was scraped.

result
object
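
A minimal sketch of reading the response follows. It uses a hand-written sample shaped like the response body shown at the top of this page, and every value in it is invented for illustration. It assumes that content fields for formats you did not request may be missing or null, which is not stated here. first_content is a hypothetical helper.

```python
import json
from typing import Optional, Tuple

# Hand-written sample with invented values, shaped like the documented response.
SAMPLE_RESPONSE = """
{
  "id": "scrape_123",
  "object": "scrape",
  "created": 1700000000,
  "metadata": {},
  "url_to_scrape": "https://example.com",
  "result": {
    "html_content": null,
    "markdown_content": "# Example Domain",
    "text_content": null,
    "links_on_page": [],
    "page_metadata": {"status_code": 200, "title": "Example Domain"}
  }
}
"""

# Order of preference when several content fields are present.
CONTENT_FIELDS = ("markdown_content", "html_content", "text_content")


def first_content(scrape: dict) -> Optional[Tuple[str, str]]:
    """Return (field_name, content) for the first non-empty content field."""
    result = scrape.get("result") or {}
    for field in CONTENT_FIELDS:
        value = result.get(field)
        if value:
            return field, value
    return None


scrape = json.loads(SAMPLE_RESPONSE)
found = first_content(scrape)
status_code = scrape["result"]["page_metadata"]["status_code"]
```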