Extracting Markdown and Plain Text
Learn how to extract content in text and markdown from any web page.
Overview
Olostep’s scrape endpoint lets you extract content from any website. Content in plain text or markdown is useful if you want to feed it to an LLM without all the HTML.
The text format strips more content (slashes, asterisks, and other markdown syntax) than the markdown format does, so use the markdown format if you want to keep some of the formatting.
In this guide we will see how to extract text and markdown from a website like https://www.nea.com/team.
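To make the difference between the two formats concrete, here is a rough local illustration of what stripping markdown syntax does to a string. It is not Olostep's actual conversion logic: the strip_markdown helper and the sample string are made up for this sketch, and the real text output may differ.

```python
import re


def strip_markdown(md: str) -> str:
    """Rough illustration only: remove a few common markdown markers."""
    # [label](url) -> label
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", md)
    # Leading heading markers such as "# " or "## "
    text = re.sub(r"^#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Leading bullet markers such as "- " or "* "
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)
    # Emphasis asterisks and inline-code backticks
    text = re.sub(r"[*`]+", "", text)
    return text


# Made-up sample, not real page content
sample = "# Team\n\n- **Example Person**, [Partner](https://example.com/person)\n"
print(strip_markdown(sample))
```

The markdown version keeps the heading, list, emphasis, and link structure, while the stripped version keeps only the words.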
Prerequisites
Before getting started, ensure you have the following:
- A valid Olostep API key. You can get one by signing up at Olostep.
- Python installed on your system
- The requests and json libraries (json is part of the Python standard library; requests is a third-party package that you can install with pip install requests if needed)
Extracting Text from a Website
The following Python script demonstrates how to extract text and markdown content from a website using Olostep’s API.
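A minimal sketch of such a script is shown here. It relies on several assumptions that should be checked against the Olostep API reference: the endpoint URL https://api.olostep.com/v1/scrapes, the Bearer scheme in the Authorization header, and the format values "text" and "markdown". The helper names build_request, scrape, and main are hypothetical, chosen only for this example.

```python
import json

# Assumed endpoint for the scrape API; confirm it in the Olostep docs.
API_URL = "https://api.olostep.com/v1/scrapes"
API_KEY = "<YOUR_API_KEY>"  # placeholder, replace with your own key


def build_request(url, formats, api_key):
    """Return the (headers, payload) pair for a scrape request."""
    headers = {
        # The Bearer scheme is an assumption.
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "url_to_scrape": url,  # page to extract content from
        "formats": formats,    # assumed values: "text", "markdown"
    }
    return headers, payload


def scrape(url, formats, api_key=API_KEY):
    """Send the scrape request and return the parsed JSON response."""
    # Third-party dependency, imported here so build_request works without it.
    import requests

    headers, payload = build_request(url, formats, api_key)
    response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
    response.raise_for_status()  # raise on 4xx/5xx instead of failing silently
    return response.json()


def main():
    data = scrape("https://www.nea.com/team", ["text", "markdown"])
    # Pretty-print the JSON response for readability.
    print(json.dumps(data, indent=4))


# main()  # uncomment to send the real request
```

Keeping the payload construction in its own function makes it easy to inspect what will be sent before spending an API call.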
Example Response
A successful response will look something like this:
Explanation
- url_to_scrape: specifies the website URL to extract content from.
- formats: defines the output formats to return (text, markdown, or both).
- Authorization: contains your API key to authenticate the request.
- The response is formatted as JSON and printed for readability.
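Once the JSON is parsed, the extracted content can be pulled out of it. The sketch below assumes the content sits under a result object in fields named text_content and markdown_content. Those field names are assumptions and should be checked against a real response. The extract_content helper and the sample dictionary are hypothetical.

```python
def extract_content(data: dict) -> dict:
    """Pick the text and markdown content out of a parsed scrape response.

    Assumes the fields result.text_content and result.markdown_content;
    a missing field yields None instead of raising.
    """
    result = data.get("result") or {}
    return {
        "text": result.get("text_content"),
        "markdown": result.get("markdown_content"),
    }


# Hypothetical response shape, made up for illustration only
sample_response = {
    "result": {"text_content": "Team", "markdown_content": "# Team"},
}
content = extract_content(sample_response)
print(content["markdown"])
```

Using .get rather than direct indexing keeps the script from crashing when a format was not requested and its field is absent.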
Conclusion
Using Olostep, you can easily extract text and markdown content from any website. This is useful if you want to get content from a website and feed it to an LLM for data extraction and analysis. If you want to extract content at scale from the same website repeatedly (e.g. data monitoring or price tracking), we recommend using a custom parser to get the content in JSON format.