Browser fingerprinting identifies visitors based on browser and device characteristics. Unlike cookies, fingerprinting collects attributes that are difficult to modify: screen resolution, installed fonts, WebGL renderer, timezone, language settings, and dozens of other data points. Combined, these create a unique identifier that persists across sessions.
Standard Python libraries like requests produce fingerprints that look nothing like real browsers. The TLS handshake differs, HTTP headers are wrong, and there’s no JavaScript execution. Anti-bot systems detect this immediately.
WEB SCRAPING BOT DETECTION
Commercial anti-bot services:
- Cloudflare Bot Management - most common, millions of websites
- DataDome - aggressive detection, common on e-commerce
- Akamai Bot Manager - enterprise-level
- PerimeterX (HUMAN) - behavioral analysis focused
- Kasada - heavy JavaScript obfuscation
- Shape Security (F5) - enterprise financial sector
Fingerprinting techniques these services use:
- TLS Fingerprinting (JA3/JA4) - hashes TLS Client Hello parameters (cipher suites, extensions, curves) to identify the client. Python's `requests` produces a hash that identifies it as Python, not Chrome.
- HTTP/2 Fingerprinting - analyzes HTTP/2 settings (window size, header table size, max concurrent streams). Each browser has a unique pattern. Most HTTP libraries use defaults that match no real browser.
- JavaScript Fingerprinting - collects Navigator object properties, screen properties, WebGL renderer, audio context, and canvas data. Headless browsers expose `navigator.webdriver` and missing plugins.
- Canvas Fingerprinting - renders shapes/text to a canvas and hashes the pixel data. GPU, drivers, and OS create device-specific fingerprints.
- Behavioral Fingerprinting - tracks mouse movements, scroll patterns, and click timing. Bots click instantly and scroll linearly; humans have curves and pauses.
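To see how loudly a default client announces itself at the header layer alone, here is a minimal sketch using httpbin.org (a public request-echo service) as a stand-in target:

```python
import requests

# httpbin echoes back the headers it received. A default requests call
# announces itself as "python-requests/x.y.z" before TLS or HTTP/2
# fingerprinting even enters the picture.
response = requests.get("https://httpbin.org/headers")
print(response.json()["headers"]["User-Agent"])
# e.g. python-requests/2.31.0
```

Overriding the User-Agent fixes only this one layer; the TLS handshake underneath still hashes to Python, which is exactly what JA3/JA4 catches.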
METHODS TO BYPASS BOT DETECTION
I use two Python libraries depending on the target site:
- curl_cffi - when the site returns data directly in HTML or JSON responses
- pydoll - when the site requires JavaScript to render content or solve challenges
The key is matching fingerprints at every layer. A real Chrome user-agent means nothing if the TLS fingerprint identifies Python.
curl_cffi
Use when: API endpoints, server-rendered HTML, sites without JavaScript challenges.
Python binding for curl-impersonate. Mimics browser TLS and HTTP/2 fingerprints. No browser required.
```python
from curl_cffi import requests

# impersonate="chrome" applies Chrome's TLS and HTTP/2 fingerprints
response = requests.get(
    "https://example.com",
    impersonate="chrome",
)
```
With session and proxy:
```python
from curl_cffi import requests

session = requests.Session(impersonate="chrome")
proxies = {
    "http": "socks5://user:pass@proxy.example.com:1080",
    "https": "socks5://user:pass@proxy.example.com:1080",
}
response = session.get("https://example.com", proxies=proxies)
```
Fast, lightweight, handles most Cloudflare-protected APIs. Cannot execute JavaScript.
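To verify the impersonation actually works, you can compare fingerprints against a public echo service such as tls.peet.ws. A hedged sketch: the endpoint and JSON field names are assumptions about that third-party service, so inspect the raw response before relying on them:

```python
import requests as plain_requests
from curl_cffi import requests as cffi_requests

# tls.peet.ws echoes the TLS fingerprint it observed; the field names
# used below ("tls", "ja3_hash") are assumptions about its response shape.
URL = "https://tls.peet.ws/api/all"

plain = plain_requests.get(URL).json()
impersonated = cffi_requests.get(URL, impersonate="chrome").json()

# The hashes should differ: the first identifies Python's TLS stack,
# the second should match a real Chrome build.
print("requests JA3:  ", plain.get("tls", {}).get("ja3_hash"))
print("curl_cffi JA3: ", impersonated.get("tls", {}).get("ja3_hash"))
```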
pydoll
Use when: JavaScript-rendered content, CAPTCHA challenges, sites detecting headless browsers.
Browser automation via the DevTools Protocol. Controls real Chrome without automation flags, so `navigator.webdriver` returns false.
```python
import asyncio
from pydoll.browser import Browser

async def scrape():
    async with Browser() as browser:
        page = await browser.get_page()
        await page.go_to("https://example.com")
        # Wait for JavaScript to render the target content
        await page.wait_for_selector("div.content")
        html = await page.get_page_source()
        return html

html = asyncio.run(scrape())
```
Slower than curl_cffi but required for SPAs, React/Vue sites, and aggressive bot protection.
RATE LIMITING & THROTTLING
Aggressive scraping triggers detection regardless of fingerprint quality. Servers track requests per IP and flag unusual patterns.
Rate limit indicators:
- HTTP 429 (Too Many Requests)
- HTTP 503 (Service Unavailable)
- Cloudflare challenge pages
- CAPTCHA triggers
- Empty or error responses
Random delays:
- Adds variable wait time between requests (e.g., 2-5 seconds)
- Mimics human browsing speed
- Use when: scraping at moderate volume without hitting blocks
Exponential backoff:
- Doubles wait time after each failed attempt
- Prevents hammering servers when rate limited
- Use when: receiving 429/503 responses or challenge pages (see the sketch after this list)
Distributed scraping:
- Spreads requests across multiple IPs via proxy rotation
- Runs concurrent sessions with different fingerprints
- Use when: large-scale scraping or aggressive protection
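A minimal sketch combining random delays with exponential backoff on top of the curl_cffi session from earlier; the retry cap and delay ranges are illustrative choices, not canonical values:

```python
import random
import time

from curl_cffi import requests

session = requests.Session(impersonate="chrome")

def fetch(url: str, max_retries: int = 5):
    backoff = 2  # seconds; doubles after each rate-limited attempt
    for _ in range(max_retries):
        # Variable wait between requests mimics human browsing speed.
        time.sleep(random.uniform(2, 5))
        response = session.get(url)
        if response.status_code not in (429, 503):
            return response
        # Rate limited: back off exponentially before retrying.
        time.sleep(backoff)
        backoff *= 2
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```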
CAPTCHA SOLVING
CAPTCHAs appear when anti-bot systems aren’t confident a visitor is human. They trigger after behavioral flags, rate limits, or accessing sensitive pages.
CAPTCHA types:
- reCAPTCHA v2 - checkbox “I’m not a robot” plus image challenges
- reCAPTCHA v3 - invisible scoring based on behavior (0.0 to 1.0)
- hCaptcha - similar to reCAPTCHA v2, privacy-focused alternative
- Cloudflare Turnstile - lightweight challenge, often invisible
CAPTCHA triggers:
- Low reCAPTCHA v3 score (below 0.5)
- Suspicious TLS or HTTP fingerprint
- IP reputation issues
- Abnormal navigation patterns
- High request volume
Solving services:
- 2Captcha - ~$2.99 per 1000 solves, 20-40 second average
- Anti-Captcha - ~$2.00 per 1000 solves, slightly faster
How it works (see the sketch after this list):
- Submit CAPTCHA details (site key, page URL) to solving service API
- Service returns a token after human workers solve it
- Submit token with your form/request to bypass the challenge
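A minimal sketch of that flow against 2Captcha's HTTP API for reCAPTCHA v2; the API key, site key, and page URL are placeholders, and the polling interval is an arbitrary choice:

```python
import time

import requests

API_KEY = "YOUR_2CAPTCHA_KEY"  # placeholder

def solve_recaptcha_v2(site_key: str, page_url: str, timeout: int = 180) -> str:
    # 1. Submit the CAPTCHA details to the solving service.
    submit = requests.get("https://2captcha.com/in.php", params={
        "key": API_KEY, "method": "userrecaptcha",
        "googlekey": site_key, "pageurl": page_url, "json": 1,
    }).json()
    if submit["status"] != 1:
        raise RuntimeError(f"Submit failed: {submit['request']}")
    task_id = submit["request"]

    # 2. Poll until a human worker returns the token (20-40s on average).
    deadline = time.time() + timeout
    while time.time() < deadline:
        time.sleep(5)
        result = requests.get("https://2captcha.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id, "json": 1,
        }).json()
        if result["status"] == 1:
            return result["request"]  # 3. Submit this token with your request
        if result["request"] != "CAPCHA_NOT_READY":
            raise RuntimeError(f"Solve failed: {result['request']}")
    raise TimeoutError("CAPTCHA not solved in time")
```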
Cost considerations:
- Minimize triggers by improving fingerprinting first
- Failed solves still cost money
- reCAPTCHA v3 is cheaper (invisible, no image challenges)
- Batch operations where possible
PROXIES FOR BYPASSING IP-BASED BOT DETECTION
IP reputation is fundamental to bot detection. Websites track IPs that exhibit scraping behavior, and datacenter IP ranges carry poor reputations by default. Proxies distribute requests across many IPs.
Proxy types:
- Residential - IPs from real ISPs assigned to home users
- Datacenter - IPs from cloud providers and hosting companies
- Mobile - IPs from cellular carriers (4G/5G)
How proxies are graded:
- IP reputation - history of abuse, spam, or scraping
- Neighboring IPs - if nearby IPs are flagged, yours may be too
- Subnet diversity - IPs from the same /24 block share reputation (see the check after this list)
- ASN reputation - some hosting providers are heavily flagged
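Subnet diversity is easy to sanity-check yourself with the standard library. A quick sketch; the IP list is made up for illustration:

```python
from collections import Counter
from ipaddress import ip_network

# Hypothetical sample of IPs from a proxy provider's pool.
pool = ["203.0.113.5", "203.0.113.77", "198.51.100.23", "203.0.113.201"]

# Collapse each IP to its /24 block; reputation is shared within a block.
subnets = Counter(str(ip_network(f"{ip}/24", strict=False)) for ip in pool)
print(subnets)
# Counter({'203.0.113.0/24': 3, '198.51.100.0/24': 1})
# Four IPs but only two distinct /24s: one flagged block burns three of them.
```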
Residential Proxies
IPs from real home internet connections. Appear as regular users from ISPs like Comcast, AT&T, or Vodafone.
Pros:
- Highest trust level - IPs look like real users
- Difficult to detect and block
- Large pools available (millions of IPs)
Cons:
- Most expensive proxy type
- Speeds vary based on actual connection
- Some pools have overused IPs
Provider I use: CliProxy
- Pool size: 100+ million residential IPs
- Coverage: 195+ countries, 180+ regions
- Pricing: $0.72/GB standard, $0.58/GB enterprise, unlimited plans at $38.33/day
- Features: HTTP(s)/SOCKS5 support, 1-60 minute session rotation, city targeting, 99.9% availability
Datacenter Proxies
IPs from cloud providers and hosting companies. Faster and cheaper but easier to detect.
Pros:
- Fast and reliable connections
- Cheapest proxy type
- Consistent performance
Cons:
- IPs are known datacenter ranges
- Lower trust scores than residential
- Often blocked by sophisticated anti-bot systems
Provider I use: SmartProxy
- Pool size: 500K+ datacenter IPs
- Coverage: 200+ countries
- Pricing: Starting at $7.50/month for dedicated, $10/month for flexible shared
- Features: Lower latency, higher throughput, good for SEO audits and public data crawling
Mobile Proxies
IPs from cellular networks (4G/5G). Carriers use CGNAT, meaning thousands of users share the same IP. Extremely difficult to block without affecting real users.
Pros:
- Highest trust level available
- Shared IPs mean blocking affects real users
- Excellent for heavily protected sites
Cons:
- Most expensive option
- Slower than datacenter/residential
- Limited geographic targeting
Provider I use: MobileProxy.Space
- Coverage: 40 countries, 185 cities, 171 mobile operators
- Geo-targeting: easy geolocation switching
- Pricing: Starting at $4.90/day, discounts for longer plans (daily to yearly)
- Features: Private dedicated proxies (not shared), unlimited traffic, 24/7 support, free 2-hour trial
Proxy Type Comparison
| Target Site Protection | Recommended Proxy | Cost |
|---|---|---|
| None/Basic | Datacenter | Lowest |
| Cloudflare (standard) | Residential | Medium |
| Aggressive (DataDome) | Residential/Mobile | Medium-High |
| Maximum protection | Mobile | Highest |
Start with datacenter proxies and upgrade to residential or mobile when necessary. This optimizes cost while maintaining success rates.
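A minimal rotation sketch with curl_cffi, cycling requests across a small proxy list; the endpoints are placeholders, and many providers instead expose a single rotating gateway that handles this for you:

```python
import itertools

from curl_cffi import requests

# Placeholder endpoints -- substitute your provider's gateways.
PROXIES = [
    "http://user:pass@gw1.example.com:8000",
    "http://user:pass@gw2.example.com:8000",
    "http://user:pass@gw3.example.com:8000",
]
rotation = itertools.cycle(PROXIES)

def fetch(url: str):
    proxy = next(rotation)  # a different exit IP for each request
    return requests.get(
        url,
        impersonate="chrome",
        proxies={"http": proxy, "https": proxy},
    )
```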
Browser fingerprinting and IP reputation determine whether your scraper succeeds or gets blocked. Match fingerprints at every layer, use appropriate proxies for the protection level, and implement rate limiting to stay under the radar.
What sites are you scraping?