Scraping used to mean "fetch the URL with curl, add User-Agent: Mozilla/5.0". In 2026 this works on maybe 10% of commercial sites. The rest sit behind an anti-bot system that spots your bot within 200ms. This guide explains why.
Anti-bot detection does not check IF you are a bot. It checks HOW MUCH you look like one — and outputs a score. Above threshold = block.
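The scoring idea can be sketched in a few lines. This is a minimal illustration, not any vendor's actual model: the signal names, weights, and the 0.5 threshold are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    weight: float  # hypothetical importance of this check
    fired: bool    # True if the check flagged the request


def bot_score(signals: list[Signal]) -> float:
    """Weighted share of fired signals, normalised to the range 0..1."""
    total = sum(s.weight for s in signals)
    if total == 0:
        return 0.0
    return sum(s.weight for s in signals if s.fired) / total


def decide(score: float, threshold: float = 0.5) -> str:
    """Above the threshold means block, otherwise allow."""
    return "block" if score > threshold else "allow"


# Hypothetical request: bad IP and bad TLS fingerprint, everything else clean.
signals = [
    Signal("datacenter_ip", 3.0, True),
    Signal("tls_fingerprint_mismatch", 4.0, True),
    Signal("missing_sec_fetch_headers", 2.0, False),
    Signal("no_mouse_movement", 1.0, False),
]
score = bot_score(signals)
print(f"{score:.2f} -> {decide(score)}")  # 0.70 -> block
```

The point of the sketch: no single signal decides the outcome, so fixing one leak (for example the User-Agent) barely moves the score if the heavier signals still fire.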
4 main players
Most protected sites sit behind one of four systems:
- Cloudflare Bot Management: the default for 30%+ of the internet. Every request passes through their edge network, so they see all of a site's traffic. The JS challenge ("Checking your browser…") is their signature.
- Akamai Bot Manager: used by most Fortune 500 companies (banks, telecoms, premium retail). More aggressive than Cloudflare and more likely to hard-ban.
- Datadome: French player, popular in the EU. Specialty: e-commerce and ticketing.
- PerimeterX (HUMAN): used in sneakers, drops, gaming. Hardest to bypass.
Smaller players include Imperva (Incapsula), Kasada, and F5 Shape. All use similar techniques and differ mainly in aggressiveness and price.
What they actually check (40+ signals)
Each request leaks dozens of signals. Anti-bot systems look at:
1. Network layer (~10 signals)
- IP reputation (is the address in blacklists, is it datacenter or ISP)
- TLS fingerprint (cipher suite order, supported extensions — differs between curl, Python requests, Chrome)
- HTTP/2 fingerprint (pseudo-headers order, settings frame values)
- TCP fingerprint (window size, MSS, options order — OS leak)
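To make the TLS fingerprint concrete, here is a sketch of one well-known public scheme, JA3, which hashes the ClientHello fields in the order the client sent them. The article does not say which scheme any given vendor uses, so treat JA3 as a stand-in; the numeric values below are illustrative and not captured from real clients.

```python
import hashlib


def is_grease(value: int) -> bool:
    """True for GREASE placeholder values (0x0a0a, 0x1a1a, ... 0xfafa)."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Build the comma-separated JA3 string; list order is preserved on purpose."""
    lists = (ciphers, extensions, curves, point_formats)
    parts = ["-".join(str(v) for v in values if not is_grease(v)) for values in lists]
    return ",".join([str(version), *parts])


def ja3_hash(version, ciphers, extensions, curves, point_formats) -> str:
    """MD5 of the JA3 string, as hex."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Illustrative values only: the same cipher set offered in a different order.
client_a = ja3_hash(771, [0x0A0A, 4865, 4866, 49195], [0, 23, 10], [29, 23], [0])
client_b = ja3_hash(771, [49195, 4866, 4865], [0, 10, 23], [23, 29], [0])
print(client_a == client_b)  # False
```

Because ordering is part of the hash, two clients that support exactly the same ciphers still get different fingerprints, which is why curl, Python requests, and Chrome are distinguishable before any HTTP header is read.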
2. Browser identification (~15 signals)
- User-Agent + sec-ch-ua headers (client header consistency)
- Headers order (Chrome sends in a different order than Python)
- Headers presence (sec-fetch-dest, sec-fetch-mode: easy to add, hard to keep consistent with the real navigation context)
- Accept-Language / Accept-Encoding details
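The header checks above could look roughly like this on the server side. It is a sketch under stated assumptions: REFERENCE_ORDER is an illustrative ordering, not an authoritative dump of what Chrome sends, and the three rules are examples rather than a complete rule set.

```python
import re

# Illustrative relative order for a Chrome-like client (assumption).
REFERENCE_ORDER = [
    "sec-ch-ua", "user-agent", "accept", "sec-fetch-mode",
    "sec-fetch-dest", "accept-encoding", "accept-language",
]


def header_anomalies(headers: list[tuple[str, str]]) -> list[str]:
    """Return human-readable reasons why the headers look inconsistent."""
    names = [name.lower() for name, _ in headers]
    values = {name.lower(): value for name, value in headers}
    anomalies = []

    match = re.search(r"Chrome/(\d+)", values.get("user-agent", ""))
    if match:
        hints = values.get("sec-ch-ua")
        if hints is None:
            anomalies.append("Chrome User-Agent without sec-ch-ua")
        elif f'v="{match.group(1)}"' not in hints:
            anomalies.append("sec-ch-ua version differs from User-Agent")
        for required in ("sec-fetch-mode", "sec-fetch-dest"):
            if required not in values:
                anomalies.append(f"missing {required}")

    # Compare the relative order of known headers with the reference order.
    seen = [n for n in names if n in REFERENCE_ORDER]
    expected = [n for n in REFERENCE_ORDER if n in seen]
    if seen != expected:
        anomalies.append("unexpected header order")
    return anomalies


CHROME_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

# A script that only spoofs the User-Agent.
spoofed = [("User-Agent", CHROME_UA), ("Accept-Encoding", "gzip, deflate"), ("Accept", "*/*")]

# A consistent, browser-like header set.
browser_like = [
    ("sec-ch-ua", '"Chromium";v="120", "Google Chrome";v="120"'),
    ("User-Agent", CHROME_UA),
    ("Accept", "text/html"),
    ("Sec-Fetch-Mode", "navigate"),
    ("Sec-Fetch-Dest", "document"),
    ("Accept-Encoding", "gzip, deflate, br"),
    ("Accept-Language", "en-US,en;q=0.9"),
]

print(header_anomalies(spoofed))
print(header_anomalies(browser_like))  # []
```

The spoofed request trips four rules at once, which illustrates why changing only the User-Agent tends to make a client look more suspicious, not less.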
3. JavaScript fingerprinting (~20 signals)
- Canvas fingerprint — draw a pixel pattern, hash the result. Different per device.
- WebGL fingerprint — GPU vendor + renderer + supported extensions.
- Audio fingerprint — generate a sine wave, hash the output.
- Fonts enumeration — what fonts are installed.
- Plugins / MimeTypes — dying but still used.
- Screen resolution + colorDepth + pixelRatio
- Timezone + Intl.DateTimeFormat consistency vs IP geolocation.
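Once the browser-side script has collected these values, the server typically reduces them to one stable identifier and cross-checks them against what it knows from the network layer. A minimal sketch of both steps follows; the field names and sample values are hypothetical, and real systems collect the raw values in JavaScript, not Python.

```python
import hashlib
import json


def fingerprint_id(components: dict) -> str:
    """Hash the collected values into a short, order-independent identifier."""
    canonical = json.dumps(components, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def timezone_mismatch(components: dict, ip_timezone: str) -> bool:
    """True if the browser-reported timezone disagrees with the IP geolocation."""
    return components.get("timezone") != ip_timezone


# Hypothetical values as a fingerprinting script might report them.
device = {
    "canvas_hash": "9f2c1a",
    "webgl_renderer": "ANGLE (NVIDIA GeForce GTX 1060)",
    "audio_hash": "35.7383",
    "screen": [1920, 1080, 24, 1.0],
    "timezone": "Europe/Warsaw",
}

print(fingerprint_id(device))
print(timezone_mismatch(device, "America/New_York"))  # True
```

A fixed identifier that shows up from hundreds of IPs, or a timezone that never matches the proxy's country, is exactly the kind of inconsistency the score picks up.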
4. Behavioral (~10 signals)
- Mouse movement curves (humans = bezier-like, bots = linear)
- Typing rhythm on forms
- Scroll velocity and acceleration
- Time between page load and first interaction (bots = milliseconds, humans = seconds)
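Two of the behavioral signals above are easy to express as code: how straight the mouse path is, and how quickly the first interaction happens. The sketch below is illustrative only; the 0.98 straightness limit and the 300 ms minimum delay are invented thresholds, not values any vendor has published.

```python
import math


def path_straightness(points: list[tuple[float, float]]) -> float:
    """Straight-line distance divided by travelled distance (1.0 = perfectly straight)."""
    if len(points) < 2:
        return 1.0
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled


def looks_scripted(points, first_interaction_ms: float,
                   straightness_limit: float = 0.98,
                   min_delay_ms: float = 300) -> bool:
    """Flag near-linear paths or interactions that start implausibly fast."""
    too_straight = path_straightness(points) >= straightness_limit
    too_fast = first_interaction_ms < min_delay_ms
    return too_straight or too_fast


linear_path = [(0, 0), (50, 50), (100, 100)]
curved_path = [(0, 0), (30, 60), (80, 90), (100, 100)]

print(looks_scripted(linear_path, first_interaction_ms=1200))  # True
print(looks_scripted(curved_path, first_interaction_ms=1200))  # False
print(looks_scripted(curved_path, first_interaction_ms=5))     # True
```

Real systems look at many more features (acceleration, jitter, pauses), but the shape of the check is the same: measure something humans do imperfectly and flag inputs that are too clean.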
Why curl / requests are not enough
Python requests has a TLS fingerprint that is trivial to recognize, and it does not support HTTP/2 at all. The first three network-layer signals alone are enough for a block, and changing the User-Agent will not help.
Unmodified headless Chrome reports navigator.webdriver = true. Other giveaways: a missing chrome.runtime object, and inconsistencies such as navigator.plugins.length === 0 combined with navigator.platform === "Win32". All of these are checked.
How Playwright handles it
Vanilla Playwright = headless Chrome = detectable. Production scraping uses:
- playwright-extra + stealth plugin: patches 20+ headless detection flags
- Real fingerprint rotation: a generator of legitimate canvas/WebGL fingerprints
- Residential proxy pool: IPs from real ISPs, not datacenters
- Browser pool sharing: reuse sessions to look like a returning user
- CDP-based interaction: curve-based mouse movement, varied typing speed
Context matters too: requests arrive with a sensible referer chain, Accept-Language matches the IP geolocation, and sessions persist across multiple page navigations.
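Curve-based mouse movement can be sketched with a cubic Bezier curve whose control points are randomly offset from the straight line. This is one possible approach, not how any particular stealth library does it. The `page` argument is assumed to be a Playwright sync-API Page (only `page.mouse.move` and `page.wait_for_timeout` are used); the step count, spread factor, and delay range are arbitrary choices.

```python
import random


def bezier_path(start, end, steps=25, rng=None):
    """Points along a cubic Bezier curve from start to end with random control points."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # How far the control points may drift from the straight line (arbitrary factor).
    spread = max(abs(x3 - x0), abs(y3 - y0)) * 0.3
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)

    points = []
    for i in range(steps + 1):
        t = i / steps
        a, b, c, d = (1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3
        points.append((a * x0 + b * x1 + c * x2 + d * x3,
                       a * y0 + b * y1 + c * y2 + d * y3))
    return points


def move_like_human(page, start, end, rng=None):
    """Walk the mouse along the curve with small, uneven pauses between steps."""
    rng = rng or random.Random()
    for x, y in bezier_path(start, end, rng=rng):
        page.mouse.move(x, y)
        page.wait_for_timeout(rng.uniform(5, 25))  # milliseconds


path = bezier_path((0, 0), (400, 300), steps=10, rng=random.Random(42))
print(len(path), path[0], path[-1])
```

In a real session this would be called as `move_like_human(page, (0, 0), (400, 300))` before a click. Passing a seeded `random.Random` makes a path reproducible for debugging; in production you would leave it unseeded so no two paths are identical.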
Practical takeaways
What this means for a business commissioning scraping:
- "Cloudflare protected" does not mean "impossible" — possible, but 5-10× more expensive than an unprotected site.
- Datacenter proxies ($1-5/GB) suffice only for the weakest anti-bot. Bigger targets need residential ($5-15/GB).
- PerimeterX targets (sneakers, top retail) — sometimes economically infeasible for small projects.
- Realistic accuracy for protected sites: 70-95%. Never 100%. Plan for retry logic.
- ToS enforcement is rising. If your scraping borders on a legal grey area, assess the risk with a lawyer.
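The proxy prices above translate into a budget with simple arithmetic. The sketch below assumes an average of 2 MB of traffic per page and that a failed attempt consumes as much bandwidth as a successful one; both are assumptions for illustration, so measure your own targets before quoting numbers.

```python
def proxy_cost_usd(pages: int, mb_per_page: float, usd_per_gb: float,
                   success_rate: float = 1.0) -> float:
    """Bandwidth cost for a job; failed attempts still consume traffic."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    attempts = pages / success_rate          # expected number of requests
    gigabytes = attempts * mb_per_page / 1024
    return gigabytes * usd_per_gb


PAGES = 100_000
MB_PER_PAGE = 2.0  # assumption, varies widely between sites

for label, low, high in [("datacenter", 1, 5), ("residential", 5, 15)]:
    cheapest = proxy_cost_usd(PAGES, MB_PER_PAGE, low)
    priciest = proxy_cost_usd(PAGES, MB_PER_PAGE, high)
    print(f"{label}: ${cheapest:,.0f} to ${priciest:,.0f}")
```

Under these assumptions 100,000 pages is roughly 195 GB: about $195 to $977 on datacenter proxies and about $977 to $2,930 on residential ones. Lowering `success_rate` to 0.8 raises every figure by a quarter, which is why block rate matters as much as the per-GB price.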
The point
Anti-bot detection in 2026 is a multi-year arms race: sites protect their data, and scrapers keep adopting better techniques. Production scraping of protected targets is more expensive but possible. If your vendor promises "100% accuracy on Cloudflare for $500", they are lying or do not understand the problem.
Sensible expectations: 90-98% accuracy, retry logic, 3-15× higher infra cost than unprotected targets, and parser updates every few months. That is realistic, measurable, and sustainable for years.
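Retry logic is the cheapest of these to build. A minimal sketch with exponential backoff and jitter follows; `fetch` is a hypothetical callable you supply (it should raise on a block or error), and the delays and attempt count are arbitrary defaults. The helper `success_after_retries` assumes attempts are independent, which is optimistic because a blocked IP or fingerprint tends to stay blocked.

```python
import random
import time


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0,
                     sleep=time.sleep, rng=None):
    """Call fetch(url), retrying on any exception with exponential backoff plus jitter."""
    rng = rng or random.Random()
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                delay = base_delay * 2 ** attempt      # 1s, 2s, 4s, ...
                sleep(delay + rng.uniform(0, delay))   # jitter avoids a fixed rhythm
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts") from last_error


def success_after_retries(per_attempt: float, attempts: int) -> float:
    """Chance that at least one attempt succeeds, assuming independent attempts."""
    return 1 - (1 - per_attempt) ** attempts


print(round(success_after_retries(0.70, 3), 3))  # 0.973
```

In practice each retry should also change something (proxy, session, fingerprint), otherwise the independence assumption breaks down and the extra attempts only add cost.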