AX/G/004

GDPR and scraping: what you can, what you cannot

A practical guide for a small business without in-house legal. No blanket "consult a lawyer": concrete rules you can apply.

Disclaimer: we are not lawyers, this is not legal advice. This guide summarises practices we have used in production for 4 years for clients in PL/EU. For high-risk projects (medical, finance, sensitive data) consult an attorney.

GDPR does not ban scraping. It bans specific operations on personal data. Most scraping does not touch personal data at all.

Rule 1: Public data ≠ free for any use

"Anything public I can scrape" is wrong. Public availability is not consent to any use. A product price in an online shop is public, but that does not mean you can republish it, even with attribution to the shop. The site's ToS may prohibit scraping, and in the EU copyright and the database right can protect a database's arrangement and contents.

Practical rule: public data for internal use (analytics, benchmarking, monitoring) = safe. Publication unchanged = risky. Commercial redistribution = usually requires an agreement with the source.

Rule 2: Personal data requires a legal basis

GDPR Art. 6 lists six legal bases for processing personal data. In a scraping context, three are most often relevant:

  • Art. 6(1)(a), consent: the person agreed to the processing. Not obtainable at scrape scale.
  • Art. 6(1)(b), performance of a contract: applies only if the person is your client or contractor.
  • Art. 6(1)(f), legitimate interest: a business purpose justifies the processing, but this requires an LIA (Legitimate Interest Assessment), i.e. a documented weighing of your interest against the person's rights.

Practically: scraping professional data (LinkedIn business pages, company directories) usually rests on 6(1)(f). Scraping private data (personal profiles, private addresses, sensitive data) has no legal basis in 99% of cases.

Rule 3: Publication context matters

A name and email published on the company "Contact us" page as a business contact = intended for business use. The same data on a personal Facebook profile = intended for social use.

First test: in what context is the data published? If it is a public business contact, scraping for B2B sales is usually OK under 6(1)(f). If it comes from a personal context, do NOT use it for B2B sales: that fails the legitimate-interest balancing test.
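
One way to make this test enforceable is to record the publication context at collection time and check it before any use. A minimal sketch in Python, assuming hypothetical names (`SourceContext`, `ScrapedContact`, `usable_for_b2b_outreach`). Treating an unknown context as not usable is our conservative assumption, not a GDPR rule:

```python
from dataclasses import dataclass
from enum import Enum


class SourceContext(Enum):
    BUSINESS = "business"   # e.g. company "Contact us" page, company directory
    PERSONAL = "personal"   # e.g. personal social media profile
    UNKNOWN = "unknown"     # context was not recorded or could not be determined


@dataclass(frozen=True)
class ScrapedContact:
    email: str
    source_url: str
    context: SourceContext  # set by the scraper when the record is collected


def usable_for_b2b_outreach(contact: ScrapedContact) -> bool:
    """Allow B2B use only for data published in a business context."""
    return contact.context is SourceContext.BUSINESS
```

The check does not replace the LIA. It only keeps personal-context records out of the sales pipeline by default.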

Rule 4: Sectoral exclusions

Some data categories are specially protected (Art. 9 GDPR):

  • Health data
  • Racial or ethnic origin
  • Political opinions, religious or philosophical beliefs
  • Sex life or sexual orientation
  • Genetic data and biometric data used to identify a person (e.g. facial recognition)
  • Trade union membership

These require the explicit consent of the individual OR one of the strictly defined legal bases in Art. 9(2) (medical research, legal proceedings, etc.). Scraping these for marketing/sales = high risk regardless of source.

Rule 5: Information obligations (Art. 13 and 14)

If you collect personal data (from any source, including scraping), you must fulfil the information obligation:

  • Who is the controller and contact
  • Purpose of processing and legal basis
  • Retention period
  • Rights of the person (access, rectification, erasure)

Art. 14 (data not collected from the person) requires informing the person within a reasonable period, at the latest one month after collection, unless providing the information is impossible or would require disproportionate effort (Art. 14(5)(b)).

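Practically: when scraping B2B data (e.g. a list of 5000 companies), publish a privacy policy explaining how you collect and process the data. This is how we handle Art. 14 at a scale where individual notification would be disproportionate. The exemption is read narrowly, though: if you are going to contact the person anyway, include the Art. 14 information in that first message at the latest.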

Rule 6: Right to be forgotten

A person has the right to demand erasure of their data (Art. 17). In practice: if the email 'john@company.com' is in your database from a scrape and John writes asking to be removed, you must erase it and respond without undue delay, at the latest within one month (Art. 12(3)).

Operational implementation: a dedicated address or form for erasure requests (e.g. privacy@axsolutions.pl), a tracking system for requests, and a decided hard-delete vs soft-delete strategy. Without these you risk a fine.
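
A minimal sketch of such an erasure flow, assuming a SQLite store and hypothetical table and function names (`contacts`, `suppressed`, `erasure_requests`, `handle_erasure`, `store_contact`). It hard-deletes the record, logs the request with its one-month deadline, and keeps only a SHA-256 hash of the address so the next scrape run does not re-add the person. Keeping that hash is a design choice, not a requirement: a hashed email is still pseudonymised personal data, so document why you keep it.

```python
import calendar
import hashlib
import sqlite3
from datetime import date

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE contacts (email TEXT PRIMARY KEY, source_url TEXT);
CREATE TABLE suppressed (email_hash TEXT PRIMARY KEY);
CREATE TABLE erasure_requests (
    id INTEGER PRIMARY KEY, email_hash TEXT, received TEXT, due TEXT, completed TEXT
);
"""


def normalise(email):
    return email.strip().lower()


def email_hash(email):
    return hashlib.sha256(normalise(email).encode("utf-8")).hexdigest()


def due_date(received):
    """One calendar month after receipt, clamped to the end of a shorter month."""
    year = received.year + received.month // 12
    month = received.month % 12 + 1
    day = min(received.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def handle_erasure(db, email, received, today):
    """Hard-delete the contact, suppress re-collection, record the request."""
    digest = email_hash(email)
    with db:  # one transaction: commit on success, roll back on error
        db.execute("DELETE FROM contacts WHERE email = ?", (normalise(email),))
        db.execute("INSERT OR IGNORE INTO suppressed VALUES (?)", (digest,))
        db.execute(
            "INSERT INTO erasure_requests (email_hash, received, due, completed) "
            "VALUES (?, ?, ?, ?)",
            (digest, received.isoformat(), due_date(received).isoformat(),
             today.isoformat()),
        )


def store_contact(db, email, source_url):
    """Called by the scraper; returns False if the address was erased earlier."""
    hit = db.execute(
        "SELECT 1 FROM suppressed WHERE email_hash = ?", (email_hash(email),)
    ).fetchone()
    if hit:
        return False
    with db:
        db.execute("INSERT OR IGNORE INTO contacts VALUES (?, ?)",
                   (normalise(email), source_url))
    return True
```

The request log stores only the hash and the dates, so the tracking system itself does not keep the erased address in clear text.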

Rule 7: Target ToS can be an additional layer

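Independent of GDPR, the target site may have ToS prohibiting scraping. That is a civil contract matter, not a criminal one, but a ToS violation can end in an IP ban, a cease & desist letter, or in extreme cases a lawsuit. hiQ Labs v. LinkedIn (a US case) is the key precedent: hiQ won on the computer-fraud question on appeal, but after 5 years of dispute the court found it had breached LinkedIn's User Agreement, and the case ended in 2022 with a judgment against hiQ.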

Practical rule:

  • E-commerce shop monitoring: ToS often prohibits but enforcement is minimal. Risk acceptable for most scenarios.
  • LinkedIn / Facebook: ToS aggressively enforced. Enterprise scale = real legal risk.
  • Gov portals (BIP, EU procurement): public data by definition. No ToS issue.

Rule 8: Rate limiting as engineering ethics

GDPR does not require it, but good scraping always uses:

  • Comfortable rate (1 request per 5-30 seconds per target, no hammering)
  • Respecting robots.txt (informational, not a legal obligation, but a signal of good practice)
  • Identifiable User-Agent (no pretending to be a regular browser)
  • Backing off on 429/503
  • Off-hours scraping when possible

This increases resilience (fewer bans, more uptime) and reduces legal risk (clear lack of destabilisation intent).

Checklist for a scrape project

Before going to production:

  1. Data identification: am I collecting personal data? If yes, which categories?
  2. Legal basis: for personal data, identify which GDPR Art. 6 basis applies.
  3. LIA if 6(1)(f): documented business justification vs individual rights.
  4. Privacy policy published, contains Art. 13/14 required info.
  5. Erasure mechanism: privacy@ address or form, response within one month.
  6. Retention: defined erasure terms (e.g. 24 months from last contact).
  7. Rate limiting: comfortable, respecting the target.
  8. Logging: timestamp + source + purpose of every operation (audit trail).
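
Items 6 and 8 are the easiest to automate. A minimal sketch, assuming a JSON-lines audit file and hypothetical names (`audit_entry`, `append_audit`, `is_expired`). The 730-day window approximates the 24-month example above and is not a legal requirement:

```python
import json
from datetime import datetime, timedelta, timezone

# Assumption: 24 months approximated as 730 days; set your own retention term.
RETENTION = timedelta(days=730)


def audit_entry(operation, source, purpose, legal_basis, now=None):
    """Build one audit-trail line: timestamp + source + purpose of an operation."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "operation": operation,      # e.g. "scrape", "export", "erase"
            "source": source,            # URL or dataset the data came from
            "purpose": purpose,          # why the data is processed
            "legal_basis": legal_basis,  # e.g. "6(1)(f)"
        },
        sort_keys=True,
    )


def append_audit(path, entry):
    """Append the entry to an append-only JSON-lines file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(entry + "\n")


def is_expired(last_contact, now, retention=RETENTION):
    """True when the record is past its retention term and should be erased."""
    return now - last_contact > retention
```

A scheduled job can then iterate over stored records, erase those for which `is_expired` returns True, and write an "erase" entry for each.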

The point

Scraping public business data for B2B sales/marketing/research is legal in 95% of cases if:

  • you have a published privacy policy that provides the Art. 14 information
  • you have a data erasure mechanism
  • you do not scrape specially-protected data (Art. 9)
  • you use comfortable rate and a transparent User-Agent
  • you have a documented LIA when relying on 6(1)(f)

The rest are edge cases: regulated sectors, enterprise scale, unclear ToS. There, consultation with an attorney is unavoidable. For a typical small business doing price monitoring, lead intel or market research, the checklist above suffices.
