Overview

Instead of mapping entire website, you might want to focus on specific sections. In this guide, we’ll show you how to extract only the blog URLs from Stripe’s website.

Extracting Only Blog URLs

To extract only blog URLs from Stripe’s website, use the maps endpoint with path pattern filters. The include_urls parameter allows you to specify exactly which URL patterns you want to include in the results.

import requests
import time
import json

# Configuration
API_URL = 'https://api.olostep.com/v1'
API_KEY = '<your_olostep_api_key>'
HEADERS = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {API_KEY}'
}

# Start time for latency tracking
start_time = time.time()

# Define the payload with URL patterns to include
payload = {
    "url": "https://stripe.com",
    "include_urls": ["/blog", "/blog/**"]  # Match /blog and all paths under /blog
}

# Make the request
response = requests.post(f'{API_URL}/maps', headers=HEADERS, json=payload)

# Calculate latency
latency = round((time.time() - start_time) * 1000, 2)
print(f"Request completed in {latency}ms")

# Process the results
data = response.json()
print(f"Found {data['urls_count']} blog URLs on Stripe's website")

# Print the first 10 URLs as a sample
print("\nSample blog URLs:")
for url in data['urls'][:10]:
    print(f"- {url}")
    
# Save blog URLs to a file for further processing
with open('stripe_blog_urls.json', 'w') as f:
    json.dump(data, f, indent=2)
print(f"\nAll blog URLs saved to stripe_blog_urls.json")

Understanding the URL Patterns

In the example above, we’re using two pattern specifications:

  • /blog - Matches exactly the main blog page (https://stripe.com/blog)
  • /blog/** - Matches all subpaths under /blog, including individual blog posts, category pages, etc.

This combination ensures we capture all blog-related content while excluding other sections of the website.

Example Response

{
  "id": "map_xyz789abc",
  "urls_count": 278,
  "urls": [
    "https://stripe.com/blog",
    "https://stripe.com/blog/page/1",
    "https://stripe.com/blog/page/2",
    "https://stripe.com/blog/engineering",
    "https://stripe.com/blog/product",
    "https://stripe.com/blog/how-we-built-it-usage-based-billing",
    "https://stripe.com/blog/using-ml-to-detect-and-respond-to-performance-degradations",
    "https://stripe.com/blog/stripe-radar-responded-to-card-testing",
    "https://stripe.com/blog/future-of-real-time-payments",
    "https://stripe.com/blog/ml-flywheel-improve-models"
    // ... additional URLs omitted for brevity
  ]
}

Filtering Blog URLs by Category

You can further refine your extraction to focus on specific blog categories. For example, if you’re only interested in Stripe’s engineering blog posts:

# Define the payload with more specific URL patterns
payload = {
    "url": "https://stripe.com",
    "include_urls": ["/blog/engineering", "/blog/engineering/**"]
}

Next Steps

Now that you have extracted all of Stripe’s blog URLs,

  1. You can fetch their content individually using the scrape API.
  2. Or, use the next guide to crawl and extract the actual content from these blog pages directly with inbuilt filters.