Web scraping is the process of automatically retrieving data from websites and structuring it for further use. Two main approaches:
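- HTTP scraping: direct requests (curl, Python requests, axios). Fast and cheap, but it only sees what the server returns, so it works for static pages and misses content rendered client-side by JavaScript.
- Browser scraping: browser automation (Playwright, Puppeteer). Slower and more expensive, but it executes JavaScript and so handles dynamic pages; anti-bot systems can still block it.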
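As a rough illustration of the HTTP approach, the sketch below downloads a page with Python's standard library and pulls link text out of the raw HTML. The names `fetch_html`, `LinkExtractor`, and `extract_links`, the User-Agent string, and the choice of `<a>` tags are assumptions made for illustration; real projects often use `requests` with a dedicated HTML parser instead. Nothing here executes JavaScript, so client-rendered content would simply be missing from the result.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page as text (illustrative helper, no JS execution)."""
    # Hypothetical User-Agent; identify your own client honestly.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> List[Tuple[str, str]]:
    """Parse an HTML string and return the links found in it."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

A typical call would be `extract_links(fetch_html(url))`. If the list comes back empty for a page that clearly shows links in a browser, that is usually a sign the content is rendered by JavaScript and the browser approach is needed.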
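Legality: scraping publicly accessible business data is usually legal (in hiQ Labs v. LinkedIn, the US Ninth Circuit held that scraping public pages likely does not violate the CFAA). Personal data, private content, and ToS violations are grey zones or clearly illegal, depending on jurisdiction. See our GDPR vs scraping guide.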
Typical challenges:
- Anti-bot detection (Cloudflare, Akamai, Datadome, PerimeterX)
- Rate limiting and IP blocking
- Selector drift (parser breakage when the site changes)
- JavaScript-rendered content
- CAPTCHA
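One common way to reduce rate limiting and IP blocks is to slow down voluntarily. The sketch below shows a per-host throttle that enforces a minimum gap between requests; the `HostThrottle` class and the 2-second default are hypothetical and not taken from any particular library, and the right interval depends on the target site.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time the last request was allowed

    def wait(self, url):
        """Block until the URL's host may be contacted again; return seconds slept."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record the moment the request is actually allowed to go out.
        self._last_request[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` immediately before each request keeps traffic to any single host below one request per `min_interval` seconds, while requests to different hosts are not delayed by each other.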
Production-grade scraping requires retry logic, monitoring, proxy rotation, and schema validation — "fetch + parse" is not enough.
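Two of these pieces, retry logic and schema validation, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a complete pipeline: `TransientError`, `with_retries`, `validate`, and the `PRODUCT_SCHEMA` fields are all hypothetical names, and the fetcher is assumed to raise `TransientError` for failures worth retrying.

```python
import random
import time


class TransientError(Exception):
    """Raised by a fetcher for failures worth retrying (timeouts, HTTP 429/5xx)."""


def with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially with jitter."""
    for attempt in range(attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the failure reach monitoring
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))


# Hypothetical schema for a scraped product record: field name -> expected type.
PRODUCT_SCHEMA = {"title": str, "price": float, "url": str}


def validate(record, schema=PRODUCT_SCHEMA):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
        elif expected is str and not record[field].strip():
            problems.append(f"{field}: empty string")
    return problems
```

Validation matters because a broken selector rarely raises an error on its own; it tends to return empty or wrongly typed fields. Counting the records that fail `validate` gives monitoring a concrete signal that the site's markup has changed.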