AX/J/002

Why production scraping looks nothing like the tutorials

Tutorials show BeautifulSoup, a loop and fifty lines. Production demands a pipeline with idempotency, dead-letter queues, schema validation and observability. Here is why that gap is so wide.

You open a tutorial. Fifty lines of Python. The requests library, BeautifulSoup, one loop over a listing, write to CSV. It works. Comments: 'great, thanks!'. Comments three weeks later from people who tried to ship it to production: silence.

The gap between what tutorials show and what production demands is so wide that we're effectively talking about two different disciplines. Tutorials show a script. Production demands a pipeline.

The border between those two words is where most of the invisible work happens.

What production demands and tutorials skip

A list of things that are mandatory in production and absent from tutorials:

Idempotency

Tutorial: the scrape runs once, writes rows to CSV, ends. Production: the scrape runs every day at 04:00, sometimes hangs halfway, and sometimes a retry delivers the same batch twice, slipping past Kafka deduplication. If your write path is not idempotent (meaning a replay always produces the same end state and never creates duplicates), you have a database full of garbage by the end of month one.

The solution isn't complicated: INSERT ... ON CONFLICT DO UPDATE in Postgres, content-hash deduplication for objects in S3, idempotency keys at the queue level. But it has to be DESIGNED. You won't get that from the tutorial.
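A minimal sketch of an idempotent write path. To stay self-contained it uses the standard-library sqlite3 module as a stand-in for Postgres. SQLite accepts the same INSERT ... ON CONFLICT DO UPDATE shape from version 3.24 onward. The table layout, the choice of source_url as the natural key and the helper names are assumptions made for illustration.

```python
import hashlib
import sqlite3


def content_hash(record: dict) -> str:
    """Stable hash over the record's fields, independent of key order."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_db() -> sqlite3.Connection:
    """In-memory database with a hypothetical products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE products (
            source_url   TEXT PRIMARY KEY,
            price        TEXT NOT NULL,
            content_hash TEXT NOT NULL
        )
        """
    )
    return conn


def upsert(conn: sqlite3.Connection, source_url: str, price: str) -> None:
    """Insert the row, or overwrite it if the same URL was already stored."""
    digest = content_hash({"source_url": source_url, "price": price})
    conn.execute(
        """
        INSERT INTO products (source_url, price, content_hash)
        VALUES (?, ?, ?)
        ON CONFLICT(source_url) DO UPDATE SET
            price = excluded.price,
            content_hash = excluded.content_hash
        """,
        (source_url, price, digest),
    )
    conn.commit()
```

Calling upsert twice with the same arguments leaves exactly one row, which is the property a replayed batch relies on. The stored hash lets a later job tell a real change from a harmless replay.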

Schema validation at every boundary

The page you scrape today returns the price field as the string '1299.00 PLN'. Next month they change it to a number without currency. Your scrape will still trudge through — BeautifulSoup has no idea. Every number from that day forward is wrong, until someone notices.

Production: every object exiting the scrape passes through Zod, JSON Schema or custom validation. Records that fail validation do not go to the database — they go to a dead-letter queue with the full event context (URL, HTML, timestamp, parser version). A week later someone reviews the DLQ and sees: ah, the site changed format. Without a DLQ, you find out only when the customer calls.
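As a rough illustration of the "custom validation" option, here is a sketch in plain Python. It assumes the price arrives as '<amount> <currency>' (as in the '1299.00 PLN' example) and treats anything else as a contract break. The Product shape, the PARSER_VERSION label and the in-memory lists standing in for the database and the DLQ are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal, InvalidOperation

PARSER_VERSION = "1.3"  # hypothetical label, stored with every DLQ entry


@dataclass
class Product:
    url: str
    price: Decimal
    currency: str


class ValidationError(ValueError):
    """Raised when a scraped record does not match the expected shape."""


def validate(raw: dict) -> Product:
    url = raw.get("url")
    if not isinstance(url, str) or not url:
        raise ValidationError("url is missing or not a string")
    price_field = raw.get("price")
    if not isinstance(price_field, str):
        raise ValidationError(f"price must be a string, got {type(price_field).__name__}")
    parts = price_field.split()
    if len(parts) != 2:
        raise ValidationError(f"expected '<amount> <currency>', got {price_field!r}")
    amount, currency = parts
    try:
        price = Decimal(amount)
    except InvalidOperation:
        raise ValidationError(f"amount is not a number: {amount!r}") from None
    if not price.is_finite() or price < 0:
        raise ValidationError(f"amount out of range: {amount!r}")
    if len(currency) != 3 or not currency.isalpha() or not currency.isupper():
        raise ValidationError(f"currency code looks wrong: {currency!r}")
    return Product(url=url, price=price, currency=currency)


def process(raw: dict, html: str, accepted: list, dead_letters: list) -> None:
    """Valid records move on; invalid ones go to the DLQ with full context."""
    try:
        accepted.append(validate(raw))
    except ValidationError as exc:
        dead_letters.append(
            {
                "url": raw.get("url"),
                "html": html,
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "parser_version": PARSER_VERSION,
                "error": str(exc),
            }
        )
```

With this in place, the day the site switches to a bare number the record lands in dead_letters with an error message that names the problem, instead of reaching the database.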

Rate limiting and anti-bot

Tutorial: one loop, sleep 1s between requests. Production: residential proxy pool rotation, fingerprint rotation, custom headers, timing randomisation, CAPTCHA handling (rarely manual, more often via 2captcha or an anti-CAPTCHA service), 429 responses with exponential backoff, per-IP block monitoring, sometimes browser engine rotation when detection rises. This is a whole discipline, not a parameter.
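Of everything in that list, the 429 handling is the easiest piece to show in isolation. The sketch below assumes a "full jitter" variant of exponential backoff (a random wait between zero and a doubling ceiling) and a hypothetical fetch callable that returns a (status, body) tuple. Proxy and fingerprint rotation are deliberately left out.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=None) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    rng = rng or random
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def fetch_with_backoff(fetch, max_attempts: int = 5, sleep=time.sleep, rng=None):
    """Call fetch() until it stops returning 429 or attempts run out.

    fetch is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Wait before the next try; the ceiling doubles with each attempt.
            sleep(backoff_delay(attempt, rng=rng))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

The sleep and rng parameters exist so the function can be tested without real waiting. The randomisation also keeps many workers from retrying in lockstep against the same source.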

Observability

Tutorial: print('done'). Production: structured logs (JSON, with trace ID, request ID, parser version, runtime in ms), Prometheus metrics (success rate, latency percentiles, error rate per source), alerts (PagerDuty when success rate drops below 95% for 15 minutes), a dashboard (Grafana with weekly trends), an audit log (who, when, which parser version touched which record).

Without it, you do not know the scrape is broken until it is too late.
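The structured-log part can be sketched with nothing but the standard library. The formatter below emits one JSON object per line and copies a fixed set of context fields when they are present. The field names mirror the list above, while JsonFormatter and build_logger are hypothetical helper names. Metrics, alerts and dashboards are out of scope here.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    # Context fields passed via logger.info(..., extra={...}).
    FIELDS = ("trace_id", "request_id", "parser_version", "runtime_ms", "source")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for name in self.FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "scraper", stream=None) -> logging.Logger:
    """Logger that writes JSON lines to the given stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as log.info("fetch ok", extra={"trace_id": "t-1", "runtime_ms": 412}) then produces a line that a log pipeline can filter by trace ID or parser version without regex archaeology.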

Parser versioning

The site changes layout. The parser must change. Old data extracted with parser v1.2 has a different field set than data parsed by v1.3. You store the parser version with every record. Backfilling v1.3 over v1.2 archives is a separate job that runs offline. All of this requires PROCESS, not just code.
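A small sketch of what "store the parser version with every record" can look like. It assumes the raw HTML is archived next to the extracted fields, which is what makes an offline backfill possible at all, and that versions are dotted integers such as 1.2 and 1.3. StoredRecord, needs_backfill and backfill are illustrative names.

```python
from dataclasses import dataclass


def version_key(version: str) -> tuple:
    """Turn '1.10' into (1, 10) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


@dataclass
class StoredRecord:
    url: str
    raw_html: str          # archived source, assumed to be kept for backfills
    fields: dict           # what the parser extracted
    parser_version: str    # version of the parser that produced `fields`


def needs_backfill(records: list, target_version: str) -> list:
    """Records extracted by a parser older than target_version."""
    target = version_key(target_version)
    return [r for r in records if version_key(r.parser_version) < target]


def backfill(records: list, target_version: str, parse) -> int:
    """Re-parse stale records from archived HTML; returns how many changed."""
    stale = needs_backfill(records, target_version)
    for record in stale:
        record.fields = parse(record.raw_html)
        record.parser_version = target_version
    return len(stale)
```

Comparing versions as tuples avoids the classic trap where the string "1.10" sorts before "1.3".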

The "I'll just add retries" trap

The most common reaction from a developer hitting their first production failure: 'I'll just add retries.' Three of them. With exponential backoff if they're feeling ambitious.

It does not solve the problem. Retries paper over the symptom. The real question is: why did this request fail, and what should happen if it fails fifty times in a row?

  • Network timeout? Retry, but after 3 attempts route to DLQ.
  • A 429 through a proxy? Switch proxy and retry, but if 5 IPs in a row return 429, raise an alarm: the pool is being blocked.
  • Empty response with HTTP 200? Something in the anti-bot has changed. Don't retry — alarm.
  • HTML the parser cannot parse? Don't retry — DLQ plus alarm.

Each of these needs a different handling strategy. A generic retry is an admission that you don't know why something happened and hope it heals itself. Sometimes it does. More often it clutters the DLQ with entries that needed to reach someone two days ago.
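The four cases above can be written down as an explicit decision function instead of a blanket retry. This is a sketch under assumptions: the Outcome shape and the Action names are invented for illustration, attempts counts tries made so far including the failed one, and consecutive_429 counts proxies in a row that returned 429.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"
    SWITCH_PROXY = "switch_proxy_and_retry"
    DLQ = "dlq"
    ALARM = "alarm"
    DLQ_AND_ALARM = "dlq_and_alarm"
    PROCEED = "proceed"


@dataclass
class Outcome:
    timed_out: bool = False
    status: Optional[int] = None
    body: str = ""
    parse_failed: bool = False


def decide(outcome: Outcome, attempts: int, consecutive_429: int) -> Action:
    """Map a fetch/parse outcome to a handling strategy."""
    if outcome.timed_out:
        # Transient network trouble: retry, but give up after 3 attempts.
        return Action.RETRY if attempts < 3 else Action.DLQ
    if outcome.status == 429:
        # One blocked IP is routine; five in a row means the pool is burning.
        return Action.ALARM if consecutive_429 >= 5 else Action.SWITCH_PROXY
    if outcome.status == 200 and not outcome.body.strip():
        # Empty 200: likely an anti-bot change, retrying will not help.
        return Action.ALARM
    if outcome.parse_failed:
        # The layout changed under the parser: keep the evidence, tell a human.
        return Action.DLQ_AND_ALARM
    return Action.PROCEED
```

Keeping the policy in one pure function makes it cheap to test and easy to review when a new failure mode shows up.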

The right model: pipeline, not script

Think of scraping as an ETL process with an extra "acquisition" layer. Stages are clearly separated, each has its own metrics, each has its own dead-letter, each is independently scalable.

  1. Trigger: cron, webhook, queue event. The starting point.
  2. Fetch: acquire HTML/JSON from the source. Anti-bot, proxies, retries live here.
  3. Parse: extract data from the raw response. Versioned parser.
  4. Validate: schema. Valid data moves on, invalid data routes to DLQ.
  5. Persist: idempotent write. Dedup on content hash.
  6. Notify: webhook to customer, event to queue, alarm if something went wrong.

Each of these steps leaves a trail in observability. Each can fail independently. Each can be re-run from any point in the pipeline.
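To make the stage separation concrete, here is a compressed sketch with stages 2 to 6 as pluggable callables. The trigger is simply whoever calls run(). Everything is in memory: a dict stands in for the database, per-stage lists for the dead-letter queues, a counter dict for metrics. The Pipeline class and its field names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Pipeline:
    fetch: Callable[[str], str]            # URL -> raw response
    parse: Callable[[str], dict]           # raw response -> record
    validate: Callable[[dict], bool]       # record -> passes schema?
    notify: Callable[[str, dict], None]    # (event kind, payload)
    store: dict = field(default_factory=dict)         # content hash -> record
    dead_letters: dict = field(default_factory=dict)  # stage -> failed items
    metrics: dict = field(default_factory=dict)       # counter name -> count

    def _count(self, name: str) -> None:
        self.metrics[name] = self.metrics.get(name, 0) + 1

    def _dead_letter(self, stage: str, url: str, detail: str) -> None:
        self.dead_letters.setdefault(stage, []).append({"url": url, "detail": detail})
        self._count(f"{stage}.failed")
        self.notify("alarm", {"stage": stage, "url": url})

    def run(self, url: str) -> bool:
        try:
            raw = self.fetch(url)
        except Exception as exc:
            self._dead_letter("fetch", url, repr(exc))
            return False
        self._count("fetch.ok")

        try:
            record = self.parse(raw)
        except Exception as exc:
            self._dead_letter("parse", url, repr(exc))
            return False
        self._count("parse.ok")

        serialized = json.dumps(record, sort_keys=True, default=str)
        if not self.validate(record):
            self._dead_letter("validate", url, serialized)
            return False
        self._count("validate.ok")

        # Idempotent persist: identical content hashes to the same key.
        digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
        if digest in self.store:
            self._count("persist.duplicate")
        else:
            self.store[digest] = record
            self._count("persist.ok")
            self.notify("record.stored", {"url": url, "hash": digest})
        return True
```

In a real deployment each stage would more likely be its own worker behind a queue, which is what makes stages independently scalable and replayable. The sketch only shows the boundaries and where each failure ends up.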

When NOT to scrape

The most often forgotten answer. Before you start writing a scrape, check:

  • Does the provider have an API? Often yes, just paid. A paid API usually beats a scrape on total cost of ownership (TCO) over a 12-month window.
  • Is the data in a public dataset? GovInfo, Eurostat, Companies House, KRS, ONS — much of what people scrape is available as a CSV dump.
  • Does the provider permit scraping in their ToS? If not, legal risk is real — EU case law is growing.
  • Does data frequency justify the engineering? Daily prices for 5,000 products from a provider without an API is a scrape. Annual balance-sheet data for three companies — maybe someone retypes it once a year.

The point

Scraping is software engineering. Not a weekend project, not a script, not 'we'll do it in a day'. Everything a tutorial omits is what decides whether the pipeline lives a year or dies in a week.

Tutorials teach how to grab data. Production demands the ability to deliver it. Those two skills have surprisingly little in common.
