Anti-bot detection: what Cloudflare does and how Playwright handles it

Scraping used to mean "fetch the URL with curl and add User-Agent: Mozilla/5.0". In 2026 that works on perhaps 10% of commercial sites. The rest sit behind an anti-bot system that can flag your bot within about 200 ms. This guide explains why.

Anti-bot detection does not decide whether you are a bot. It measures how much you look like one and outputs a score. A score above the threshold means a block.
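The scoring idea can be sketched in a few lines. This is only an illustration: the signal names, weights, and threshold below are hypothetical, and real vendors do not publish theirs.

```python
# Hypothetical signal names and weights, chosen only for illustration.
WEIGHTS = {
    "datacenter_ip": 0.30,
    "non_browser_tls": 0.35,
    "webdriver_flag": 0.25,
    "linear_mouse": 0.10,
}
BLOCK_THRESHOLD = 0.5  # assumed cut-off; real systems tune this per site


def bot_score(signals: dict[str, bool]) -> float:
    """Sum the weights of every signal that fired."""
    return sum(w for name, w in WEIGHTS.items() if signals.get(name, False))


def decide(signals: dict[str, bool]) -> str:
    """Block only when the combined score exceeds the threshold."""
    return "block" if bot_score(signals) > BLOCK_THRESHOLD else "allow"
```

In this sketch no single signal is decisive: a datacenter IP alone stays under the threshold, while a datacenter IP combined with a non-browser TLS fingerprint crosses it.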

4 main players

Four vendors cover most of the market:

Cloudflare Bot Management: the default choice for a large share of the web (Cloudflare fronts roughly a fifth of all sites by common estimates). Every request passes through its edge network, so it sees everything. The JS challenge ("Checking your browser…") is its signature.
Akamai Bot Manager: common among Fortune 500 companies (banks, telecoms, premium retail). More aggressive than Cloudflare and more likely to hard-ban.
DataDome: French vendor, popular in the EU. Specialty: e-commerce and ticketing.
PerimeterX (now HUMAN): used for sneakers, drops, and gaming. Hardest to bypass.

Smaller players include Imperva (formerly Incapsula), Kasada, and F5 (Shape Security). All use similar techniques and differ mainly in aggressiveness and price.

What they actually check (40+ signals)

Every request leaks many identifying details. Anti-bot systems look at four layers:

1. Network layer (~10 signals)

IP reputation (is the address on a blocklist, is it datacenter or ISP)
TLS fingerprint (cipher suite order, supported extensions; differs between curl, Python requests, and Chrome)
HTTP/2 fingerprint (pseudo-header order, SETTINGS frame values)
TCP fingerprint (window size, MSS, option order; leaks the OS)
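To make the TLS fingerprint concrete, here is a minimal sketch of a JA3-style hash, one widely used public scheme: the ClientHello fields are joined into a string and hashed with MD5. The numeric values are illustrative only and were not captured from real clients.

```python
import hashlib


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style string and return its MD5 hex digest."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


# Same cipher suites, different order: the fingerprints differ.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
client_b = ja3_hash(771, [4867, 4866, 4865], [0, 23, 65281], [29, 23, 24], [0])
```

Because the order of cipher suites goes into the hash, two clients offering the same suites in a different order get different fingerprints. That is why a changed User-Agent does nothing to hide the underlying TLS library.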

2. Browser identification (~15 signals)

User-Agent + sec-ch-ua headers (client header consistency)
Header order (Chrome sends headers in a different order than Python)
Header presence (sec-fetch-dest, sec-fetch-mode; hard to fake consistently from a non-browser client)
Accept-Language / Accept-Encoding details
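A consistency check over these headers might look like the sketch below. The function name and the exact rules are assumptions for illustration. Header order is left out because the exact order varies by browser version.

```python
def header_inconsistencies(headers: list[tuple[str, str]]) -> list[str]:
    """Return reasons why a request claiming to be Chrome looks inconsistent."""
    names = [name.lower() for name, _ in headers]
    values = {name.lower(): value for name, value in headers}
    problems = []
    user_agent = values.get("user-agent", "")
    # A client that claims to be Chrome should also send the headers Chrome sends.
    if "Chrome/" in user_agent:
        for required in ("sec-ch-ua", "sec-fetch-dest", "sec-fetch-mode"):
            if required not in names:
                problems.append(f"missing {required}")
    if "accept-language" not in names:
        problems.append("missing accept-language")
    return problems
```

A curl request with a copied Chrome User-Agent would trip all four rules, since curl sends none of the companion headers unless told to.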

3. JavaScript fingerprinting (~20 signals)

Canvas fingerprint: draw text and shapes to a canvas, hash the resulting pixels. The output differs per device.
WebGL fingerprint: GPU vendor + renderer + supported extensions.
Audio fingerprint: generate a sine wave, hash the output.
Font enumeration: which fonts are installed.
Plugins / MimeTypes: largely obsolete but still checked.
Screen resolution + colorDepth + pixelRatio
Timezone + Intl.DateTimeFormat consistency vs IP geolocation.
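On the server side, the collected values are typically reduced to one identifier and cross-checked against the network layer. A minimal sketch follows. It assumes the attributes arrive as a dictionary, and it uses a hypothetical two-country timezone table where a real system would use a GeoIP database.

```python
import hashlib
import json

# Hypothetical lookup; a real system would query a GeoIP database.
COUNTRY_TIMEZONES = {
    "PL": {"Europe/Warsaw"},
    "FR": {"Europe/Paris"},
}


def device_fingerprint(attrs: dict) -> str:
    """Hash the collected attributes into one stable identifier."""
    canonical = json.dumps(attrs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def timezone_matches_ip(browser_tz: str, ip_country: str) -> bool:
    """True when the browser's timezone is plausible for the IP's country."""
    return browser_tz in COUNTRY_TIMEZONES.get(ip_country, set())
```

A browser reporting America/New_York behind a Polish residential IP fails the second check even if every individual fingerprint value looks legitimate.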

4. Behavioral (~10 signals)

Mouse movement curves (humans = bezier-like, bots = linear)
Typing rhythm on forms
Scroll velocity and acceleration
Time between page load and first interaction (bots = milliseconds, humans = seconds)
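Two of these behavioral checks are easy to sketch: how far a mouse path bends away from a straight line, and how quickly the first interaction arrives. The thresholds below are hypothetical placeholders, not values taken from any vendor.

```python
import math


def max_deviation(points):
    """Largest perpendicular distance of any point from the start-to-end line."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    chord = math.hypot(x1 - x0, y1 - y0)
    if chord == 0:
        return 0.0
    return max(
        abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0)) / chord
        for x, y in points
    )


def looks_scripted(points, first_interaction_ms, min_curve_px=2.0, min_delay_ms=300):
    """Flag paths that are nearly straight or interactions that come too fast."""
    too_straight = max_deviation(points) < min_curve_px
    too_fast = first_interaction_ms < min_delay_ms
    return too_straight or too_fast
```

A perfectly straight path is flagged regardless of timing, and a curved path is still flagged if the first interaction arrives within milliseconds of page load.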

Why curl / requests are not enough

Python requests has a TLS fingerprint that is trivial to spot, and it does not support HTTP/2 at all. The first three network-layer signals alone are enough for a block. Changing the User-Agent will not help here.

Unmodified headless Chrome reports navigator.webdriver === true. Other tells include a missing chrome.runtime and inconsistent combinations such as navigator.plugins.length === 0 alongside navigator.platform === "Win32". All of these are checked.
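The checks named above can be sketched as a function over the values a detection script reports back to the server. The dictionary keys are hypothetical names for this illustration; the real checks run as JavaScript inside the page.

```python
def headless_flags(nav: dict) -> list[str]:
    """List the headless tells present in the reported navigator values."""
    flags = []
    if nav.get("webdriver") is True:
        flags.append("navigator.webdriver is true")
    if not nav.get("has_chrome_runtime", False):
        flags.append("chrome.runtime missing")
    # Desktop Chrome on Windows normally reports at least one plugin.
    if nav.get("plugins_length", 0) == 0 and nav.get("platform") == "Win32":
        flags.append("no plugins on Win32")
    return flags
```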

How Playwright handles it

Vanilla Playwright drives headless Chromium, which is detectable out of the box. Production scraping adds:

playwright-extra + stealth plugin: patches 20+ headless detection flags
Fingerprint rotation: a generator of realistic canvas/WebGL fingerprints
Residential proxy pool: IPs from real ISPs, not datacenters
Browser pool sharing: reuse sessions to look like a returning user
CDP-based interaction: curve-based mouse movement, varied typing speed

Context matters too: requests arrive with a sensible referer chain, Accept-Language matches the IP's geolocation, and sessions persist across multiple page navigations.
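Curve-based mouse movement can be sketched as a cubic Bezier path with randomized control points and uneven delays between steps. The function names, the spread, and the delay range are assumptions for illustration, not the internals of any stealth plugin.

```python
import random


def bezier_path(start, end, steps=25, spread=80.0, rng=None):
    """Points along a cubic Bezier curve from start to end."""
    if rng is None:
        rng = random.Random()
    (x0, y0), (x3, y3) = start, end
    # Control points sit a third and two thirds along the way, nudged randomly.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)
    points = []
    for i in range(steps + 1):
        t = i / steps
        a, b, c, d = (1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3
        points.append((a * x0 + b * x1 + c * x2 + d * x3,
                       a * y0 + b * y1 + c * y2 + d * y3))
    return points


def step_delays(count, low=0.005, high=0.030, rng=None):
    """Uneven pauses in seconds, one per step of the path."""
    if rng is None:
        rng = random.Random()
    return [rng.uniform(low, high) for _ in range(count)]
```

With Playwright for Python, each point could be passed to page.mouse.move(x, y), sleeping for the matching delay between calls. Whether that is enough to pass a given behavioral model depends on the target.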

Practical takeaways

What this means for a business commissioning scraping:

"Cloudflare protected" does not mean "impossible". It is possible, but 5-10× more expensive than an unprotected site.
Datacenter proxies ($1-5/GB) suffice only for the weakest anti-bot systems. Bigger targets need residential proxies ($5-15/GB).
PerimeterX targets (sneakers, top retail) are sometimes economically infeasible for small projects.
Realistic accuracy for protected sites: 70-95%. Never 100%. Plan for retry logic.
Enforcement of ToS violations is rising. If the scraping borders on a legal grey area, assess the risk with a lawyer.
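Retry logic in practice usually means a bounded number of attempts with growing pauses between them. A minimal sketch follows. It assumes a caller-supplied fetch function that returns None when blocked. The expected_attempts helper additionally assumes attempts succeed independently, which is a simplification.

```python
import time


def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None result or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result, attempt + 1
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
    return None, max_attempts


def expected_attempts(success_rate: float) -> float:
    """Mean number of tries until the first success, if attempts are independent."""
    return 1.0 / success_rate
```

Under that independence assumption, a 70% per-attempt success rate means about 1.43 attempts per page on average, which feeds directly into proxy bandwidth cost.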

The point

Anti-bot detection in 2026 is a multi-year arms race. Sites protect their data, and scrapers use ever better techniques. Production scraping of protected targets is more expensive but possible. If your vendor promises "100% accuracy on Cloudflare for $500", they are either lying or do not understand the problem.

Sensible expectations: 90-98% accuracy, retry logic, 3-15× higher infra cost than unprotected targets, and accepting that parsers need updating every few months. These targets are realistic, measurable, and hold up for years.

We ship most of these techniques to production.