Continuous lead intelligence system for a B2B SaaS sales team
Daily multi-source enrichment of 40k+ accounts, intent signals, decision-maker mapping. Hands-off since Q2 2024.
Sub-30-second refresh, 800+ trading pairs, 14 platforms behind Cloudflare Enterprise and similar anti-bot protection. In 24/7 production for 18 months.
A fintech client (Series B, trading platform for retail crypto and forex traders) competed on one fundamental metric: how quickly their UI displayed current pricing data from competing venues. Their research showed that traders deciding whether to buy or sell watch 3–4 platforms simultaneously, and whoever shows pricing first retains user attention.
State in 2024: their scrape pipeline used Selenium + datacenter proxies, refreshed every 5–10 minutes, and had a success rate below 75% due to blocks on Cloudflare-protected venues. In practice their pricing lagged the largest competitors by 4–6 minutes. Every minute of delay was a measurable hit to user retention.
The internal team attempted an upgrade twice. The first attempt (residential proxies + Playwright) raised the success rate to 89% but did not improve latency because too many retries ran synchronously. The second attempt (a rewrite onto Go workers + Kafka) increased throughput but did not solve anti-bot detection during peak hours.
We designed a real-time architecture with three layers of parallelism: spatial (14 venues scanned simultaneously), temporal (every venue with a dedicated worker pool refreshing every 25–30s), redundancy (every critical pair scraped from 3 venues with cross-validation).
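The cross-validation step can be illustrated with a minimal sketch. It assumes each critical pair arrives as a mapping of venue name to quoted price and that a quote counts as agreeing when it sits within a relative tolerance of the median. The function name cross_validate, the 0.5% tolerance and the two-venue quorum are illustrative assumptions, not values from the production system.

```python
from statistics import median
from typing import Optional


def cross_validate(quotes: dict[str, float], max_rel_dev: float = 0.005) -> Optional[float]:
    """Return a consensus price from per-venue quotes, or None without a quorum.

    quotes maps a venue name to the latest price scraped for one pair.
    max_rel_dev is an assumed tolerance (0.5%) around the median.
    """
    if len(quotes) < 2:
        return None  # a single venue cannot be cross-checked

    mid = median(quotes.values())
    if mid <= 0:
        return None  # guard against bad scrapes producing non-positive prices

    # Keep only quotes that agree with the median within the tolerance.
    agreeing = [p for p in quotes.values() if abs(p - mid) / mid <= max_rel_dev]
    if len(agreeing) < 2:
        return None  # venues disagree too much to trust any of them

    return median(agreeing)
```

With three venues, one stale or poisoned quote is discarded and the other two still produce a price. With one venue offline, the remaining two can still form a quorum, which is the property the redundancy layer relies on.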
Critical decisions: residential mobile proxy pool for peak hours (when datacenter becomes useless), browser farm with 60–80 concurrent sessions per venue, dedicated fingerprint pool per venue (each venue gets its own anti-bot tuning), real-time event stream to the client via Kafka.
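A per-venue session cap combined with scanning all venues at once can be sketched with asyncio. This is an assumption-laden illustration: fetch stands in for whatever drives a browser session, and the names refresh_venue and refresh_all and the default cap of 60 are hypothetical, chosen only to mirror the 60–80 sessions mentioned above.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical fetcher signature: (venue, pair) -> price.
Fetch = Callable[[str, str], Awaitable[float]]


async def refresh_venue(
    venue: str, pairs: list[str], fetch: Fetch, max_sessions: int = 60
) -> dict[str, Optional[float]]:
    """Refresh every pair on one venue with at most max_sessions in flight."""
    sem = asyncio.Semaphore(max_sessions)

    async def one(pair: str) -> tuple[str, Optional[float]]:
        async with sem:  # blocks while the venue's session budget is used up
            try:
                return pair, await fetch(venue, pair)
            except Exception:
                return pair, None  # a failed scrape must not sink the batch

    return dict(await asyncio.gather(*(one(p) for p in pairs)))


async def refresh_all(
    venues: dict[str, list[str]], fetch: Fetch, max_sessions: int = 60
) -> dict[str, dict[str, Optional[float]]]:
    """Scan all venues concurrently, each with its own session cap."""
    snapshots = await asyncio.gather(
        *(refresh_venue(v, pairs, fetch, max_sessions) for v, pairs in venues.items())
    )
    return dict(zip(venues, snapshots))
```

Each venue gets its own semaphore, so a slow or heavily blocked venue exhausts only its own budget and does not delay the others.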
Architecture: Temporal as orchestrator (continuous workflows, not batch), a Playwright pool in Kubernetes autoscaled on a latency metric, ClickHouse for time-series price storage, and a custom WebSocket gateway for real-time delivery to the client UI. Persistence follows two paths: the primary path writes to Redis (sub-millisecond reads), the secondary to Postgres for historical analytics.
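The two persistence paths can be sketched as a writer that updates the low-latency store on every tick and batches writes to the historical store. The sketch deliberately hides the real clients behind two small interfaces. Tick, LatestStore, HistoryStore, DualPathWriter and the batch size are hypothetical names and values, and in practice the first interface would wrap a Redis client and the second a Postgres one.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Tick:
    pair: str
    price: float
    ts: float  # epoch seconds


class LatestStore(Protocol):
    """Hot path, e.g. a thin wrapper around a Redis client (assumption)."""
    def set_latest(self, tick: Tick) -> None: ...


class HistoryStore(Protocol):
    """Cold path, e.g. a thin wrapper around a Postgres client (assumption)."""
    def append_many(self, ticks: list[Tick]) -> None: ...


class DualPathWriter:
    def __init__(self, latest: LatestStore, history: HistoryStore, batch_size: int = 500):
        self._latest = latest
        self._history = history
        self._batch_size = batch_size
        self._pending: list[Tick] = []

    def write(self, tick: Tick) -> None:
        self._latest.set_latest(tick)  # readers see the new price immediately
        self._pending.append(tick)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Push buffered ticks to history; keep them for retry on failure."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        try:
            self._history.append_many(batch)
        except Exception:
            self._pending = batch + self._pending  # history outage never blocks the hot path
```

The design choice shown here is that a failure on the analytics path is absorbed by the buffer, so the price a trader sees is never held back by historical storage.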
Anti-detection: per-venue persona engineering — every worker has a stable identity (browser fingerprint, IP geography, user agent, cookies) preserved for 6–12 hours. Rotation only when CAPTCHA rate exceeds threshold. Behavioural simulation per venue tuned to typical user patterns on that platform.
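The rotation rule (keep an identity stable, rotate only when the CAPTCHA rate crosses a threshold) can be sketched as a sliding-window monitor attached to each worker. The class name CaptchaRateMonitor, the window size, the 5% threshold and the minimum sample count are illustrative assumptions. The source does not state the actual threshold.

```python
from collections import deque


class CaptchaRateMonitor:
    """Tracks the share of recent requests that hit a CAPTCHA for one persona."""

    def __init__(self, window: int = 200, threshold: float = 0.05, min_samples: int = 50):
        self._outcomes: deque[bool] = deque(maxlen=window)  # oldest results fall off
        self._threshold = threshold
        self._min_samples = min_samples

    def record(self, hit_captcha: bool) -> None:
        self._outcomes.append(hit_captcha)

    @property
    def rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_rotate(self) -> bool:
        # Require enough samples so one early CAPTCHA does not burn an identity.
        return len(self._outcomes) >= self._min_samples and self.rate > self._threshold

    def reset(self) -> None:
        """Call after the worker has been given a fresh persona."""
        self._outcomes.clear()
```

A worker would call record after every page load and swap its fingerprint, IP and cookies together only when should_rotate returns True, which keeps identities long-lived under normal conditions.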
Pricing lead time vs largest competitors: 7 minutes ahead on average (the client displays a price 7 minutes before the largest competitors do). Measured by a third-party benchmarking service over 6 months.
Success rate aggregated across 14 venues: 97.8% (vs a baseline below 75%). End-to-end P95 latency (from venue source to client UI): 1.8 seconds.
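For readers who want to reproduce metrics of this kind on their own pipeline, here is a minimal sketch of both calculations. It assumes the nearest-rank definition of P95 and an attempt-weighted aggregate across venues. The source does not say which definitions the benchmark used, and the function names are illustrative.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples (seconds)."""
    if not samples:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def aggregate_success_rate(per_venue: dict[str, tuple[int, int]]) -> float:
    """Attempt-weighted success rate.

    per_venue maps a venue name to (successful_requests, total_requests).
    """
    ok = sum(success for success, _ in per_venue.values())
    total = sum(attempts for _, attempts in per_venue.values())
    if total == 0:
        raise ValueError("no requests recorded")
    return ok / total
```

Weighting by attempts means a high-volume venue influences the aggregate more than a small one. Averaging the per-venue percentages instead would give a different number, so the choice is worth stating alongside any reported figure.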
User retention metric (return-within-7-days for active traders): +28% post-deployment. Client attributes this to lead time advantage based on a controlled rollout.
System running 18 months 24/7 at 99.7% uptime. Two major outages in the period — both recovered in <30 min thanks to multi-venue redundancy (if venue X is down, cross-validation from 2 others covers the gap).
Resilient scraping with anti-bot routing, SKU normalization and 5-minute price-change webhooks into the client's repricing engine.
Goal-driven agent crawling filings, press, social and internal sources — producing structured analyst briefings every morning before 7 AM ET.
If you recognise pieces of this case study in your own situation, write to us. We can usually tell in the first call whether the work is a matter of hours per week or months of infrastructure.