OPS-24-B1 · Marketplace Scraping

Real-time pricing intelligence across 1,200 marketplaces

Resilient scraping with anti-bot routing, SKU normalization and 5-minute price-change webhooks into the client's repricing engine.

2.1M products/day
Sector: E-commerce
Surfaces: Browser · Data · Webhooks
Runtime: 22 months autonomous
Published: 2024-03-04

Challenge

The client, a large e-commerce company with a private portfolio of 50 own brands, needed a view of its own prices and those of direct competitors across 1,200+ marketplaces in the EU and US. It had previously bought this data from two providers, with four problems: a 24h delay (the client's pricing cycle runs in hours), low SKU coverage (~65%), no visibility into data quality, and a cost of €38k/mo.

Goal: monitor 2M+ products per day, with a delay of under 5 minutes from a price change at the source to the webhook arriving in the repricing engine, SKU coverage above 95%, and operating cost below €15k/mo.

Approach

Three-layer architecture. Scraping layer: 480 Playwright workers in Kubernetes across three regions (EU-West, EU-Central, US-East), each with an isolated fingerprint and its own residential proxy pool. Work is distributed per marketplace and tuned to each marketplace's rate limits and detection patterns.
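One common way to respect per-marketplace rate limits is a token bucket per marketplace. The sketch below is illustrative only: the `TokenBucket` class, the marketplace names and the limits in `HYPOTHETICAL_LIMITS` are assumptions, not the configuration used in this project.

```python
class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = 0.0      # timestamp of the last refill, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request may be sent at time `now`."""
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical limits: marketplace name -> (requests per second, burst size).
HYPOTHETICAL_LIMITS = {
    "marketplace-a": (2.0, 5.0),
    "marketplace-b": (0.5, 1.0),
}

buckets = {
    name: TokenBucket(rate, capacity)
    for name, (rate, capacity) in HYPOTHETICAL_LIMITS.items()
}
```

A worker would call `buckets[name].allow(time.monotonic())` before each request and requeue the job when it returns `False`. Passing the clock in as an argument keeps the limiter deterministic under test.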

Normalization layer: SKU matching via a combination of EAN/UPC/MPN identifiers and fuzzy name matching (Levenshtein, plus embedding similarity for atypical cases), backed by a canonical product graph in Postgres with 18M nodes. Every new record hits this graph, either as a match to an existing SKU or as a new node.
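The matching order described above (identifiers first, fuzzy name matching second) can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory `catalog` dict stands in for the Postgres product graph, the 0.85 threshold and the field names are hypothetical, and the embedding-similarity fallback is left out.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace before comparing names."""
    return " ".join(name.lower().split())


def match_sku(record: Dict[str, str], catalog: Dict[str, Dict[str, str]],
              min_similarity: float = 0.85) -> Optional[str]:
    """Return the matching SKU id, or None if the record should become a new node."""
    # 1. Exact match on any shared product identifier.
    for key in ("ean", "upc", "mpn"):
        value = record.get(key)
        if value:
            for sku_id, node in catalog.items():
                if node.get(key) == value:
                    return sku_id
    # 2. Fuzzy match on the normalized product name.
    name = normalize(record.get("name", ""))
    best_id, best_score = None, 0.0
    for sku_id, node in catalog.items():
        other = normalize(node.get("name", ""))
        longest = max(len(name), len(other))
        if longest == 0:
            continue
        score = 1.0 - levenshtein(name, other) / longest
        if score > best_score:
            best_id, best_score = sku_id, score
    return best_id if best_score >= min_similarity else None
```

A linear scan like this only works for small catalogs. At 18M nodes the identifier lookup would be an indexed query and the fuzzy step would run against a shortlist of candidates.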

Delivery layer: change detection on every record, webhooks delivered within 5 minutes of a change (most in under 90 seconds), an hourly batch export to S3 in Parquet, and a Next.js dashboard for the client's analysts.
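Change detection on every record can be as simple as comparing each observed price with the last one seen for the same SKU and marketplace. The sketch below assumes an in-memory `last_seen` dict and a hypothetical payload shape (`event`, `old_price`, `new_price` and so on). The real webhook schema is not described in this case study.

```python
from decimal import Decimal
from typing import Dict, Optional, Tuple

PriceKey = Tuple[str, str]  # (canonical SKU id, marketplace)


def detect_change(last_seen: Dict[PriceKey, Decimal], sku: str, marketplace: str,
                  price: Decimal, observed_at: str) -> Optional[dict]:
    """Record the observation and return a webhook payload if the price changed.

    Assumption: the first observation of a SKU on a marketplace sets the
    baseline and does not emit an event.
    """
    key = (sku, marketplace)
    previous = last_seen.get(key)
    last_seen[key] = price
    if previous is None or previous == price:
        return None
    return {
        "event": "price.changed",       # hypothetical event name
        "sku": sku,
        "marketplace": marketplace,
        "old_price": str(previous),     # strings avoid float rounding in JSON
        "new_price": str(price),
        "observed_at": observed_at,
    }
```

The returned dict would then be serialized and posted to the repricing engine. Using `Decimal` rather than `float` keeps a price such as 19.99 from registering as a spurious change because of rounding.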

Outcome

SKU coverage: 97.2%. Median price-change detection time: 84 seconds. Operating cost: €11,800/mo (proxies, infra, LLM for edge cases).

The client saved €26k/mo versus the prior providers, cut latency 23×, and raised SKU coverage by over 30 percentage points.

The system survived two Cloudflare anti-bot engine updates in 2024 (each patched within 6h of detection), and a full Amazon EU UI migration in August 2025 (patched within 18h via an API backup path).

Stack

Playwright · Kubernetes · Bright Data + Oxylabs · Postgres · S3 + Parquet · Temporal · Next.js · Grafana

Metrics

  • 2.1M Products/day
  • 1,200+ Marketplaces
  • 84s Median latency
  • 97.2% SKU coverage
  • −68% Saving vs prior
  • 99.94% Uptime

Similar problem in your business?

Every project is different, but patterns repeat.

If you recognise pieces of this case study in your own situation, write to us. We can usually tell in the first call whether it is an hours-per-week job or months of infrastructure work.