You can use the endpoint to scrape a single URL and choose output formats. The mandatory parameters are url_to_scrape and formats. Other common parameters include wait_before_scraping (in milliseconds), remove_css_selectors (default, none, or an array of selectors), and country.
The API returns a scrape object in the response. The scrape has properties such as id and result. The result object has the following fields (depending on the formats parameter, some may be null):
html_content: the HTML content of the page. Pass formats: ["html"] to get this.
markdown_content: the Markdown content of the page. Pass formats: ["markdown"] to get this.
text_content: the text content of the page. Pass formats: ["text"] to get this.
json_content: the JSON content of the page. Pass formats: ["json"] to get this and also provide a parser or llm_extract parameter.
screenshot_hosted_url: the hosted URL of the screenshot.
html_hosted_url: the hosted URL of the HTML content.
markdown_hosted_url: the hosted URL of the Markdown content.
json_hosted_url: the hosted URL of the JSON content.
text_hosted_url: the hosted URL of the text content.
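As a rough illustration of how the result fields relate to the requested formats, the sketch below works on a plain dictionary shaped like the result object described above. The helper name available_content and the sample values (including the hosted URL) are hypothetical; only the field names come from this page.

```python
# Field names are taken from the list above; everything else is illustrative.
CONTENT_FIELDS = {
    "html": "html_content",
    "markdown": "markdown_content",
    "text": "text_content",
    "json": "json_content",
}


def available_content(result: dict) -> dict:
    """Return only the formats whose content field is present and not null."""
    return {
        fmt: result[field]
        for fmt, field in CONTENT_FIELDS.items()
        if result.get(field) is not None
    }


# Hypothetical result for a request made with formats=["markdown"].
sample_result = {
    "html_content": None,
    "markdown_content": "# Example Domain",
    "text_content": None,
    "json_content": None,
    "markdown_hosted_url": "https://example.com/hosted.md",  # placeholder value
}

print(available_content(sample_result))  # {'markdown': '# Example Domain'}
```

Checking for null fields this way avoids assuming that every format was requested.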
When a scrape is requested, Olostep checks if a matching scrape already exists with the same parameters. If a fresh enough match is found, the content is served instantly from Olostep’s storage without launching a new browser scrape.
Shared Cache: The cache is shared globally. If another request scraped the exact same URL with the exact same configuration within your freshness window, you benefit from the speedup.
Post-processing is still live: Operations like llm_extract and links_on_page filters are run on-the-fly on top of the cached document. You only cache the core page retrieval, keeping your structured extractions dynamic.
By default, the production API always performs a live scrape to guarantee real-time accuracy. You can opt in to caching with the max_age parameter.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_age | integer | 0 | Acceptable content age in seconds. If a cached copy exists and is newer than max_age seconds, it is served from the cache. |
Default API Behavior (max_age: 0): Every API request triggers a fresh scrape.
Default Playground Behavior: In the dashboard playground, max_age defaults to 24 hours (86400 seconds) to prevent redundant scrapes and save credits while you build and test.
Maximum Age: The cache has a hard limit of 7 days (604800 seconds). Any max_age above this limit falls back to 7 days.
    from olostep import Olostep

    client = Olostep(api_key="YOUR_REAL_KEY")

    # Opt in to caching: accept results up to 1 day (86400 seconds) old
    result = client.scrapes.create(
        url_to_scrape="https://example.com",
        formats=["markdown"],
        max_age=86400,
    )
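The 7-day ceiling on max_age can be mirrored client-side so that logs and dashboards show the value that will actually apply. This is a minimal sketch: the helper name effective_max_age is hypothetical, and the clamping itself is described above as happening on Olostep's side, so this only reproduces the documented rule locally. Negative inputs are not covered by the documentation and are not handled here.

```python
# 7 days in seconds, the hard cache limit stated above.
MAX_CACHE_AGE_SECONDS = 7 * 24 * 60 * 60  # 604800


def effective_max_age(requested_seconds: int) -> int:
    """Return the max_age that applies after the documented 7-day fallback."""
    return min(requested_seconds, MAX_CACHE_AGE_SECONDS)


print(effective_max_age(86400))     # 86400 (1 day, under the limit)
print(effective_max_age(2_000_000)) # 604800 (falls back to 7 days)
```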
The cache is automatically bypassed (forcing a live scrape) for features that require unique sessions, real-time visual outputs, or custom file handling:
Interactive sessions: Requests using session_id or loading a custom browser context.
Visuals: Visualizer tools and screenshots (htmlVisualizer).
Special file types: Binary file downloads or raw PDF rendering.
Debugging & Network: Capturing network_calls or using async parser jobs.
include_links / exclude_links: glob patterns matched against each link’s URL path.
query_to_order_links_by: re-orders the returned links by relevance to this text.
Glob patterns match path segments. A single * does not cross /, so "/blog/*" matches "/blog/post-1" but not the index "/blog" itself — and it never matches "/blog?tag=x" because query strings are not part of the path. To include the index too, use "/blog*" or "{/blog,/blog/**}".
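To make the segment rule concrete, here is a small sketch that translates a subset of the glob syntax into a regular expression and matches it against a URL's path only. It is not Olostep's actual matcher: the function names are hypothetical, character classes and nested braces are not handled, and the treatment of ** as "may cross /" is an assumption based on the "{/blog,/blog/**}" example above.

```python
import re
from urllib.parse import urlsplit


def glob_to_regex(pattern: str) -> str:
    """Translate a path glob into regex source (illustrative subset only)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("**", i):
            out.append(".*")      # assumed: ** may cross "/" boundaries
            i += 2
        elif ch == "*":
            out.append("[^/]*")   # a single * stays inside one path segment
            i += 1
        elif ch == "{":
            end = pattern.index("}", i)  # no nested braces in this sketch
            options = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(glob_to_regex(o) for o in options) + ")")
            i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


def path_matches(pattern: str, url: str) -> bool:
    """Match the glob against the URL path; the query string is ignored."""
    path = urlsplit(url).path
    return re.fullmatch(glob_to_regex(pattern), path) is not None


print(path_matches("/blog/*", "https://example.com/blog/post-1"))   # True
print(path_matches("/blog/*", "https://example.com/blog"))          # False
print(path_matches("/blog/*", "https://example.com/blog?tag=x"))    # False
print(path_matches("{/blog,/blog/**}", "https://example.com/blog")) # True
```

Matching on urlsplit(url).path is what keeps "/blog?tag=x" from ever matching "/blog/*": the query string is stripped before the pattern is applied.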
Olostep has a few pre-built parsers for popular websites, but you can also create your own parsers through the dashboard or ask our team to do it for you. Parsers are self-healing and will update themselves to the latest version of the website.
Provide llm_extract with a JSON Schema (schema) and/or a natural language instruction (prompt). If both are provided, schema takes precedence. If you pass only a prompt, the LLM extracts the data based on the prompt and decides the data structure on its own.
Note: result.json_content returns a stringified JSON. Parse it in your code if you need an object.
Pricing: llm_extract costs 10 credits per scrape. To lower the cost, you can bring your own API keys or enable usage-based pricing. Contact info@olostep.com to get access.
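The note about stringified JSON can be illustrated with a short sketch. The schema below is a hypothetical llm_extract value (its field names are made up for illustration), and sample_result is a hand-written stand-in for the result object, not real API output.

```python
import json

# Hypothetical llm_extract value: a JSON Schema describing the desired output.
llm_extract = {
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "price": {"type": "number"},
        },
        "required": ["title"],
    }
}


def parse_json_content(result: dict):
    """Decode result["json_content"], which arrives as a JSON string."""
    raw = result.get("json_content")
    if raw is None:
        return None  # "json" was not among the requested formats
    return json.loads(raw)


# Stand-in for a result whose json_content is a string, as the note describes.
sample_result = {"json_content": '{"title": "Example", "price": 9.5}'}
data = parse_json_content(sample_result)

print(data["title"])  # Example
```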
With the links_on_page option, you can extract all the links present on the page you scrape. It accepts the following parameters to help filter and order the extracted links:
absolute_links (boolean, default: true): When true, it returns complete URLs (e.g., https://example.com/page) instead of relative paths (e.g., /page).
query_to_order_links_by (string): Orders the returned links by their similarity to the provided query text, prioritizing the most relevant matches first.
include_links (array of strings): Filter extracted links using glob patterns. Use patterns like *.pdf to match file extensions, /blog/* for specific paths, or full URLs like https://example.com/*. Supports wildcards (*), character classes ([a-z]), and alternation ({pattern1,pattern2}).
exclude_links (array of strings): Exclude specific links using glob patterns, following the same syntax as include_links.
The target website has a broken or incompatible HTTPS configuration. error.detail provides the specific SSL error code for diagnostics; error.code is always tls_error.
    {
      "error": {
        "type": "invalid_request_error",
        "code": "tls_error",
        "detail": "err_ssl_tlsv1_alert_internal_error",
        "message": "The website closed or rejected the TLS handshake. The server may be misconfigured or use an unsupported SSL/TLS version."
      }
    }
The scrape did not complete within the wait budget. The page may be slow, bot-protected, or temporarily unavailable. This response is safe to retry.
    {
      "error": {
        "type": "request_timeout",
        "code": "scrape_poll_timeout",
        "message": "Request timed out while waiting for scrape result. The page may be slow, blocked for our fetchers, or temporarily unavailable."
      }
    }
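One way to act on these two error shapes is to branch on error.code. The sketch below is a minimal example that operates on the raw JSON body; the helper name is_retryable is hypothetical, the messages are shortened, and treating tls_error as non-retryable is an assumption (only the timeout is described above as safe to retry).

```python
import json

# Only scrape_poll_timeout is documented as safe to retry; treating all other
# codes, including tls_error, as non-retryable is an assumption of this sketch.
RETRYABLE_CODES = {"scrape_poll_timeout"}


def is_retryable(response_body: str) -> bool:
    """Return True when the error code in a JSON error body is worth retrying."""
    error = json.loads(response_body).get("error", {})
    return error.get("code") in RETRYABLE_CODES


# Error bodies modeled on the examples above (messages shortened).
timeout_body = json.dumps({
    "error": {
        "type": "request_timeout",
        "code": "scrape_poll_timeout",
        "message": "Request timed out while waiting for scrape result.",
    }
})
tls_body = json.dumps({
    "error": {
        "type": "invalid_request_error",
        "code": "tls_error",
        "detail": "err_ssl_tlsv1_alert_internal_error",
        "message": "The website closed or rejected the TLS handshake.",
    }
})

print(is_retryable(timeout_body))  # True
print(is_retryable(tls_body))      # False
```

Keying on error.code rather than on the message text keeps the check stable if the wording of the messages changes.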