Overview

Before diving into specific sections of a website, it’s often useful to get a complete picture of its structure. In this guide, we’ll show you how to extract all URLs from Stripe’s website, which will help you:

  • Understand the overall site architecture
  • Discover content sections you might not be aware of
  • Use LLMs to decide which URLs to further scrape

Extracting All Stripe URLs

To extract all URLs from Stripe’s website, use the maps endpoint with Stripe’s domain. This will return a comprehensive list of all discoverable URLs on their site.

import requests
import time
import json

# Configuration
API_URL = 'https://api.olostep.com/v1'
API_KEY = '<your_olostep_api_key>'
HEADERS = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {API_KEY}'
}

# Start time for latency tracking
start_time = time.time()

# Define the payload with just the base URL
payload = {
    "url": "https://stripe.com"
}

# Make the request
response = requests.post(f'{API_URL}/maps', headers=HEADERS, json=payload)

# Calculate latency
latency = round((time.time() - start_time) * 1000, 2)
print(f"Request completed in {latency}ms")

# Process the results
data = response.json()
print(f"Found {data['urls_count']} URLs on Stripe's website")

# Print the first 10 URLs as a sample
print("\nSample URLs:")
for url in data['urls'][:10]:
    print(f"- {url}")
    
# Save all URLs to a file for further analysis
with open('stripe_urls.json', 'w') as f:
    json.dump(data, f, indent=2)
print(f"\nAll URLs saved to stripe_urls.json")

Example Response

{
  "id": "map_abc123xyz",
  "urls_count": 3842,
  "urls": [
    "https://stripe.com",
    "https://stripe.com/about",
    "https://stripe.com/blog",
    "https://stripe.com/docs",
    "https://stripe.com/pricing",
    "https://stripe.com/customers",
    "https://stripe.com/partners",
    "https://stripe.com/enterprise",
    "https://stripe.com/payments",
    "https://stripe.com/billing"
    // ... thousands more URLs
  ]
}

Analyzing Stripe’s Website Structure

After extracting all URLs, you can analyze the structure to identify patterns. This is particularly useful for understanding how Stripe organizes their content. For example, you might notice these URL patterns:

  • /blog/** - Blog posts and articles
  • /docs/** - Documentation pages
  • /payments/** - Payment product information
  • /billing/** - Billing product information

In some cases, you only want to get URLs in a specific section of the website. For instance, all blog posts. You can use our inbuilt filter in the next guide.