AX/G/006

Anti-bot detection: what Cloudflare does and how Playwright handles it

A guide for non-developers: how sites detect bots, what the major systems check, and why "just add a user-agent" stopped working long ago.

Scraping used to mean "fetch the URL with curl, add User-Agent: Mozilla/5.0". In 2026 this works on maybe 10% of commercial sites. The rest run an anti-bot system that can flag your bot within about 200 ms. This guide explains why.

Anti-bot detection does not decide IF you are a bot. It measures HOW MUCH you look like one and outputs a score. A score above the threshold means a block.
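
As a rough illustration of the scoring idea, the sketch below adds up weighted signals and compares the total to a threshold. The signal names, weights, and the 0.7 cutoff are invented for this example; real vendors do not publish theirs and typically rely on far more signals.

```python
# Hypothetical signal weights, chosen only for illustration.
WEIGHTS = {
    "datacenter_ip": 0.35,
    "non_browser_tls": 0.40,
    "webdriver_flag": 0.30,
    "no_mouse_movement": 0.15,
}
BLOCK_THRESHOLD = 0.7  # assumed cutoff, not a real vendor value


def bot_score(signals: dict[str, bool]) -> float:
    """Sum the weights of every signal that fired, capped at 1.0."""
    total = sum(w for name, w in WEIGHTS.items() if signals.get(name, False))
    return min(total, 1.0)


def decide(signals: dict[str, bool]) -> str:
    """Return 'block' when the score exceeds the threshold, else 'allow'."""
    return "block" if bot_score(signals) > BLOCK_THRESHOLD else "allow"
```

In this toy model no single signal is enough for a block, but a datacenter IP combined with a non-browser TLS handshake already crosses the line.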

4 main players

Most protected sites run one of these four systems:

  • Cloudflare Bot Management: the default for 30%+ of the internet. Every request passes through Cloudflare's edge network, so they see all of the traffic. The JS challenge ("Checking your browser…") is their signature.
  • Akamai Bot Manager: used by most of the Fortune 500 (banks, telecoms, premium retail). More aggressive than Cloudflare and quicker to hard-ban.
  • DataDome: a French vendor, popular in the EU. Specialty: e-commerce and ticketing.
  • PerimeterX (now HUMAN): used in sneakers, drops, and gaming. The hardest to bypass.

Smaller players include Imperva (Incapsula), Kasada, and F5 Shape. All use similar techniques and differ mainly in aggressiveness and price.

What they actually check (40+ signals)

Each request carries hundreds of leaks. Anti-bot systems look at:

1. Network layer (~10 signals)

  • IP reputation (is the address in blacklists, is it datacenter or ISP)
  • TLS fingerprint (cipher suite order, supported extensions — differs between curl, Python requests, Chrome)
  • HTTP/2 fingerprint (pseudo-headers order, settings frame values)
  • TCP fingerprint (window size, MSS, options order — OS leak)
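
One publicly documented way to turn the TLS fields above into a fingerprint is JA3, which joins five ClientHello fields into a string and takes its MD5 hash. The sketch below follows that scheme with made-up field values; whether a given vendor uses JA3 or a proprietary variant is an assumption.

```python
import hashlib


def ja3_hash(version: int, ciphers: list[int], extensions: list[int],
             curves: list[int], point_formats: list[int]) -> str:
    """Build a JA3-style string from ClientHello fields and return its MD5 hex digest."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


# Made-up values: the same cipher suites offered in a different order.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 11, 10], [29, 23, 24], [0])
client_b = ja3_hash(771, [4867, 4866, 4865], [0, 11, 10], [29, 23, 24], [0])
```

The two clients offer identical ciphers, yet their hashes differ because only the order changed. This is why a TLS library can be told apart from a browser before any HTTP header is read.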

2. Browser identification (~15 signals)

  • User-Agent + sec-ch-ua headers (client header consistency)
  • Header order (Chrome sends headers in a different order than Python)
  • Header presence (sec-fetch-dest, sec-fetch-mode; hard to fake consistently)
  • Accept-Language / Accept-Encoding details
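
A consistency check over these headers might look like the sketch below. The rules are simplified assumptions for illustration (a real system would also compare header order and many more fields), and the function name is hypothetical.

```python
import re


def header_inconsistencies(headers: dict[str, str]) -> list[str]:
    """Return reasons why a header set does not look like a real Chrome request."""
    h = {k.lower(): v for k, v in headers.items()}
    problems = []
    ua_match = re.search(r"Chrome/(\d+)", h.get("user-agent", ""))
    if ua_match:
        # Chromium-based browsers send client hints alongside the User-Agent.
        hints = h.get("sec-ch-ua")
        if hints is None:
            problems.append("Chrome UA without sec-ch-ua")
        elif f'v="{ua_match.group(1)}"' not in hints:
            problems.append("sec-ch-ua version differs from User-Agent")
        for name in ("sec-fetch-dest", "sec-fetch-mode"):
            if name not in h:
                problems.append(f"missing {name}")
    if "accept-language" not in h:
        problems.append("missing accept-language")
    return problems


# A script that only copies a Chrome User-Agent string:
spoofed = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0 Safari/537.36"}
issues = header_inconsistencies(spoofed)
```

For the spoofed example, every check fails except the User-Agent itself, which is the pattern a copied User-Agent string produces.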

3. JavaScript fingerprinting (~20 signals)

  • Canvas fingerprint: draw a pixel pattern, hash the result. Differs per device.
  • WebGL fingerprint: GPU vendor + renderer + supported extensions.
  • Audio fingerprint: generate a sine wave, hash the output.
  • Font enumeration: which fonts are installed.
  • Plugins / MimeTypes: dying out but still checked.
  • Screen resolution + colorDepth + pixelRatio
  • Timezone + Intl.DateTimeFormat consistency vs IP geolocation.
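
Once the page script has collected these values, they are typically reduced to one identifier. The sketch below shows the general idea on the server side, assuming the attributes arrive as a dictionary; the field names and values are made up, and real systems weigh individual signals rather than relying on a single hash.

```python
import hashlib
import json


def device_fingerprint(attrs: dict) -> str:
    """Hash a canonical JSON encoding of the collected browser attributes."""
    canonical = json.dumps(attrs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical values a page script might report.
sample = {
    "canvas_hash": "a1b2c3",
    "webgl_renderer": "ANGLE (Example GPU)",
    "fonts": ["Arial", "Calibri"],
    "screen": [1920, 1080, 24],
    "timezone": "Europe/Warsaw",
}
fingerprint = device_fingerprint(sample)
```

Sorting the keys makes the hash independent of the order in which signals were collected, so the same device yields the same identifier across visits while any changed attribute yields a different one.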

4. Behavioral (~10 signals)

  • Mouse movement curves (humans = bezier-like, bots = linear)
  • Typing rhythm on forms
  • Scroll velocity and acceleration
  • Time between page load and first interaction (bots = milliseconds, humans = seconds)
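
The "linear vs curved" distinction for mouse movement can be measured with a simple ratio: straight-line distance divided by distance actually travelled. This is a minimal sketch of one such metric, not a description of any vendor's model; the function name and sample coordinates are illustrative.

```python
import math


def path_straightness(points: list[tuple[float, float]]) -> float:
    """Ratio of straight-line distance to travelled distance (1.0 = perfectly straight)."""
    if len(points) < 2:
        return 1.0
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled


bot_like = path_straightness([(0, 0), (50, 50), (100, 100)])
human_like = path_straightness([(0, 0), (30, 60), (100, 100)])
```

A naive script that moves the pointer directly to its target scores at or very near 1.0, while a hand-driven pointer wanders and scores lower.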

Why curl / requests are not enough

Python requests has a TLS fingerprint that is easy to spot, and it does not support HTTP/2 at all. The first three network-layer signals (IP reputation, TLS, HTTP/2) are enough for a block. Changing the User-Agent will not help here.

Headless Chrome without modifications has navigator.webdriver = true. Other quirks: a missing chrome.runtime, and navigator.plugins.length === 0 combined with navigator.platform === "Win32", which is inconsistent for a real Windows browser. All of these are checked.
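
The checks named above can be expressed as a few rules over a snapshot of browser properties. In this sketch the dictionary keys are hypothetical names for values a detection script would read in the page; the rules mirror only the three quirks mentioned here.

```python
def headless_red_flags(nav: dict) -> list[str]:
    """Inspect a snapshot of browser properties and list the suspicious ones."""
    flags = []
    if nav.get("webdriver") is True:
        flags.append("navigator.webdriver is true")
    if not nav.get("has_chrome_runtime", False):
        flags.append("chrome.runtime is missing")
    if nav.get("platform") == "Win32" and nav.get("plugins_length", 0) == 0:
        flags.append("Win32 platform with zero plugins")
    return flags


# Snapshot resembling an unmodified automated browser (illustrative values).
unmodified = {"webdriver": True, "has_chrome_runtime": False,
              "platform": "Win32", "plugins_length": 0}
red_flags = headless_red_flags(unmodified)
```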

How Playwright handles it

Vanilla Playwright = headless Chrome = detectable. Production scraping uses:

  • playwright-extra + stealth plugin: patches 20+ headless detection flags
  • Real fingerprint rotation: a generator of legitimate canvas/WebGL fingerprints
  • Residential proxy pool: IPs from real ISPs, not datacenters
  • Browser pool sharing: reuses sessions to look like a returning user
  • CDP-based interaction: curved mouse movements, varied typing speed

Context matters too: each request arrives with a sensible referer chain, Accept-Language matches the IP geolocation, and sessions persist across multiple page navigations.
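
To make "curve-based" mouse movement concrete, here is a minimal sketch that generates points along a cubic Bezier curve between two screen positions. The function name, step count, and jitter value are assumptions for illustration; production tools typically also vary speed along the path.

```python
import random


def bezier_path(start, end, steps=25, jitter=80.0, rng=None):
    """Return points along a cubic Bezier curve from start to end.

    Two control points are offset randomly so the path bends instead of
    running in a straight line.
    """
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # Control points near 1/3 and 2/3 of the way, pushed off the direct line.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points
```

With Playwright's Python API, each point could then be passed to page.mouse.move(x, y), with a short randomized pause between calls.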

Practical takeaways

What this means for a business commissioning scraping:

  1. "Cloudflare protected" does not mean "impossible". It is possible, but 5-10× more expensive than an unprotected site.
  2. Datacenter proxies ($1-5/GB) suffice only against the weakest anti-bot systems. Bigger targets need residential proxies ($5-15/GB).
  3. PerimeterX targets (sneakers, top retail) are sometimes economically infeasible for small projects.
  4. Realistic accuracy for protected sites: 70-95%. Never 100%. Plan for retry logic.
  5. Enforcement of ToS violations is rising. If the scraping borders on a legal grey area, assess the risk with a lawyer.
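
The effect of retry logic can be estimated with simple arithmetic. The sketch below assumes each attempt succeeds independently with the same probability, which is optimistic, since in practice a blocked IP or fingerprint tends to stay blocked. The function names are illustrative.

```python
def success_within(attempts: int, per_attempt_rate: float) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - per_attempt_rate) ** attempts


def expected_requests_per_page(per_attempt_rate: float) -> float:
    """Average number of requests needed per successfully fetched page."""
    return 1 / per_attempt_rate
```

Under this assumption, a 70% per-attempt rate reaches roughly 97% after three attempts, at a cost of about 1.4 requests per page on average.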

The point

Anti-bot detection in 2026 is a multi-year arms race. Sites protect their data, and scrapers use ever better techniques. Production scraping of protected targets is more expensive but possible. If your vendor promises "100% accuracy on Cloudflare for $500", they are lying or do not understand the problem.

Sensible expectations: 90-98% accuracy, retry logic, infrastructure costs 3-15× higher than for unprotected targets, and acceptance that parser updates are needed every few months. These are realistic, measurable results that keep working for years.
