WEB SCRAPING IS DEAD SIMPLE WITH CLAUDE 4.5

Reverse engineer APIs, bypass anti-bot detection, and extract data at scale using Claude Opus

~ 8 MIN READ

Web scraping used to mean writing custom parsers, handling edge cases, and debugging failed requests for hours. With Claude 4.5 Opus, you describe what data you want and it reverse engineers the APIs for you. The model writes Python, runs it, analyzes the output, and iterates until the job is done.

WHY THIS MATTERS

Most websites don’t publish their APIs. The data you see on screen comes from internal endpoints that were never meant for external consumption. Finding these endpoints manually means hours in DevTools, tracing network requests, and reverse engineering authentication flows.

Claude 4.5 does this automatically. Give it a HAR file containing your browsing session and it identifies the API patterns, writes extraction scripts, and handles pagination. What took days now takes minutes.

WHAT MAKES CLAUDE 4.5 DIFFERENT

Claude 4.5 models analyze arbitrary data formats by executing Python code rather than just reading the files. Feed it HTML, JSON, XML, or CSV and it writes scripts to parse, filter, and extract exactly what you need.

The key difference is the feedback loop. Claude writes a script, runs it, reads the output, then refines its approach based on results. This self-correcting behavior means it digs deeper into complex nested data structures without manual intervention.

THE HEREDOC TRICK

The trick is prompting Claude to use HEREDOC syntax in bash. This lets it write and execute Python in a single tool call without creating temporary files. Tell Claude to analyze data using “Python in HEREDOC patterns” and it will generate inline scripts, run them, and iterate based on the output.
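
A minimal sketch of the pattern in a single bash tool call, assuming a captured ./website.har like the one used in the workflow below:

python3 <<'EOF'
# Inline script: written, executed, and its output read back in one tool call, no temp files.
import json

with open("./website.har") as f:
    har = json.load(f)

print(f"{len(har['log']['entries'])} requests captured")
EOF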

This pattern unlocks autonomous analysis. Claude can run a script, see that pagination goes to page 500 instead of 50, adjust its approach, and continue without asking for permission.

THE HAR FILE WORKFLOW

Capturing Traffic

Use Proxyman or Chrome DevTools to intercept network traffic. Open the Network tab, then navigate through the website performing the actions you want to automate: browsing listings, clicking profiles, loading more results.

Every API call the site makes gets recorded. You’ll see JSON responses containing the exact data rendered on the page, often with more fields than displayed. Export everything to a HAR file.

Prompting for Analysis

Structure your prompt with three parts: tell Claude to use Python HEREDOC for analysis, list the specific functions you want identified, and define the end goal. Being explicit about what API patterns to find prevents Claude from getting lost in irrelevant requests.

Analyse ./website.har to reverse engineer the API calls and identify the functions.

functions to identify:
 - get_all_users: all user data metrics including pagination
 - user_profile: all user profile data
 - user_profile_reviews: all user profile reviews including pagination

finally, summarise a plan to save as .py script for exporting all data into a single .json file
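
Given a prompt like this, the first inline pass typically narrows the capture down to the JSON endpoints. A sketch of that step, assuming the standard HAR 1.2 layout:

python3 <<'EOF'
# Keep only the calls that returned JSON: these are the internal API endpoints.
import json

with open("./website.har") as f:
    entries = json.load(f)["log"]["entries"]

for e in entries:
    mime = e["response"]["content"].get("mimeType", "")
    if "json" in mime:
        print(e["response"]["status"], e["request"]["method"], e["request"]["url"])
EOF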

BYPASSING ANTI-BOT

curl_cffi for Most Websites

Start with curl_cffi and Chrome impersonation. It handles TLS fingerprinting and HTTP/2 correctly, which defeats most bot detection. Add impersonate="chrome" to your requests and most sites treat you as a real browser.
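
A minimal sketch, with a hypothetical endpoint for illustration:

from curl_cffi import requests

# impersonate="chrome" matches Chrome's TLS and HTTP/2 fingerprint,
# which is what most bot-detection checks look at first.
resp = requests.get(
    "https://example.com/api/listings",  # hypothetical endpoint
    impersonate="chrome",
)
print(resp.status_code)
print(resp.json())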

Pydoll for Hard Targets

When curl_cffi fails, switch to Pydoll. It runs a real browser instance that bypasses 99% of anti-bot systems including Cloudflare and DataDome. Heavier and slower, but necessary for sites with aggressive detection.

Cookie Management

Cookies maintain your authenticated state across requests. Extract them from your browser session or HAR file and include them in API calls. For long-running scrapes, refresh session tokens before they expire.
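
A sketch of pulling cookies out of the HAR capture and reusing them, assuming the standard HAR 1.2 layout and a hypothetical authenticated endpoint:

import json

from curl_cffi import requests

with open("./website.har") as f:
    entries = json.load(f)["log"]["entries"]

# Collect the session cookies the browser sent during the capture.
cookies = {}
for e in entries:
    for c in e["request"].get("cookies", []):
        cookies[c["name"]] = c["value"]

resp = requests.get(
    "https://example.com/api/user_profile",  # hypothetical endpoint
    impersonate="chrome",
    cookies=cookies,
)
print(resp.status_code)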

HTML PARSING FALLBACK

When APIs Don’t Exist

When no API exists or the HTML is simpler than the API, fall back to BeautifulSoup. Use bs4 to extract the data directly from rendered pages. This works for static sites or when the API requires complex authentication.
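
A minimal bs4 sketch, with a saved page and illustrative selectors (the real class names come from inspecting the HTML):

from bs4 import BeautifulSoup

with open("listing.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Selectors below are hypothetical; adjust to the site's actual markup.
for card in soup.select("div.listing-card"):
    title = card.select_one("h2.title")
    price = card.select_one("span.price")
    print(title.get_text(strip=True), price.get_text(strip=True))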

Large Context Requirements

Complex HTML pages require large context windows to write accurate parsing functions. A 1 million token context is often necessary to see the full page structure. Gemini 3.0 Pro handles this well due to its context size.

Claude + Gemini Workflow

Use Claude to orchestrate the scrape and dump raw HTML to files. Feed that HTML to Gemini 3.0 Pro to generate the bs4 parsing function. Claude handles the workflow and execution while Gemini handles large document analysis.
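
A sketch of the hand-off, with hypothetical URLs: Claude saves the raw pages to disk, and those files become the input Gemini reads to write the parser:

from pathlib import Path

from curl_cffi import requests

urls = [
    "https://example.com/listing/1",  # hypothetical pages
    "https://example.com/listing/2",
]

out = Path("raw_html")
out.mkdir(exist_ok=True)

# Save each raw HTML response so the large-context model can see the full page structure.
for i, url in enumerate(urls):
    resp = requests.get(url, impersonate="chrome")
    (out / f"page_{i}.html").write_text(resp.text)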

REAL-WORLD EXAMPLE

Target: Realestate.com.au property listings across New South Wales.

Step 1: Capture Traffic

Opened DevTools Network tab. Searched properties in Sydney, filtered by price range, clicked through 10 listings, paginated through results. Exported to realestate.har (4.1 MB).

Step 2: Prompt Claude

Analyse ./realestate.har using Python HEREDOC.

Identify:
 - search_listings: paginated property results with price, bedrooms, location
 - property_details: full listing with agent, description, features, images

Create a scraper script that exports all NSW listings to properties.json

Step 3: Claude Output

Claude identified two API endpoints: one serving the paginated search results (search_listings) and one serving the full property details (property_details).

It wrote a Python script using curl_cffi, handled pagination across 500+ pages, and exported 52,341 listings to JSON in 45 minutes.
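
The core of that script is a pagination loop along these lines (a sketch with a hypothetical endpoint and parameters, not the site's real API):

import json

from curl_cffi import requests

listings = []
page = 1
while True:
    resp = requests.get(
        "https://example.com/api/search_listings",  # hypothetical endpoint
        params={"state": "NSW", "page": page},
        impersonate="chrome",
    )
    results = resp.json().get("results", [])
    if not results:
        break  # past the last page
    listings.extend(results)
    page += 1

with open("properties.json", "w") as f:
    json.dump({"properties": listings, "total": len(listings)}, f)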

Final Output Structure

{
  "properties": [
    {
      "id": "142857",
      "address": "12 Example St, Sydney NSW 2000",
      "price": 1250000,
      "bedrooms": 3,
      "bathrooms": 2,
      "parking": 1,
      "property_type": "House",
      "agent": "Smith Real Estate",
      "description": "...",
      "features": ["Air Conditioning", "Garden"],
      "images": [...]
    }
  ],
  "total": 52341,
  "scraped_at": "2025-01-15T14:32:00Z"
}

The barrier to web scraping has dropped dramatically. What used to require deep knowledge of HTTP, browser fingerprinting, and parsing libraries now requires a clear prompt and a HAR file. Claude handles the complexity while you focus on what to extract.

The workflow is simple: capture traffic, prompt Claude, get data. Start with curl_cffi, escalate to Pydoll if needed, fall back to HTML parsing as a last resort. Most sites fall to the first approach.

~_~

JASPER

Web developer, designer, and digital tinkerer. Building things on the internet since forever. Currently available for interesting projects. Say hi at hello@jasper.cat
