OPS-26-I5 · Real Estate Data

Market intelligence platform for proptech B2B

11 real estate portals across 6 EU countries, 800k active listings monitored daily, cross-portal dedup, AVM. SaaS platform.

faster deal sourcing
Sector
Proptech SaaS · 5M+ ARR
Surfaces
Browser · 11 portals · Canonical schema · AVM
Runtime
11 months production
Published
2025-12-08

Challenge

A proptech SaaS client (founded 2022, $5M ARR) served real estate investment funds across 6 European countries (PL, DE, AT, ES, IT, PT). Each fund client paid them $2–15k/month for "market intelligence" — daily listings, comparable sales analytics, deal alerts, AVM for underwriting.

State in 2024: a scrape platform built by an earlier team on top of Apify, costing $25k/month to operate and covering only 3 countries (PL, DE, AT). Three attempts to expand to southern Europe (the Idealista portal) failed: anti-bot systems blocked the configuration within 2–3 weeks every time.

Clients signalled growing dissatisfaction: data lag was 24–48h (vs 2–4h at competitors), deduplication accuracy was low (clients reported duplicates in their dashboards), and AVM coverage reached only 60% of potential addresses.

Approach

We took over the existing infrastructure with a mandate: expand to 6 countries, reduce data lag to <2h, achieve 95%+ dedup accuracy, push AVM coverage to 95%.

Critical decisions: rebuild the scrape layer from scratch on Playwright (Apify dominates the Polish market but is weak against anti-bot defences on harder EU targets), a parser per portal with parser_version tagging, a canonical schema layer separating raw extraction from normalisation, and a dedicated deduplication service based on photo hashes and geocoded addresses.

Architecture: Temporal as orchestrator (11 portals on different schedules: Otodom every 30 min, Idealista every 2h to stay within its anti-bot tolerance), a Playwright pool in Kubernetes (Hetzner for cost efficiency), PostgreSQL partitioned per portal per month, ClickHouse for time-series price tracking, and an AVM model in Python (hedonic regression with geo features).
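The per-portal cadence can be illustrated in plain Python. This is only a sketch of the scheduling rule, not Temporal's API: in the described setup Temporal owns the schedules. The two intervals come from the text; the fallback interval, the function name and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Otodom and Idealista intervals are taken from the text.
CRAWL_INTERVALS = {
    "otodom": timedelta(minutes=30),
    "idealista": timedelta(hours=2),
}
DEFAULT_INTERVAL = timedelta(hours=1)  # hypothetical fallback for other portals


def due_portals(last_run, now):
    """Return the portals whose crawl interval has elapsed, sorted by name."""
    due = []
    for portal, finished_at in last_run.items():
        interval = CRAWL_INTERVALS.get(portal, DEFAULT_INTERVAL)
        if now - finished_at >= interval:
            due.append(portal)
    return sorted(due)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
last_run = {
    "otodom": now - timedelta(minutes=35),     # past its 30 min interval
    "idealista": now - timedelta(minutes=90),  # still inside its 2h window
    "fotocasa": now - timedelta(minutes=61),   # past the assumed 1h default
}
print(due_portals(last_run, now))  # ['fotocasa', 'otodom']
```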

AVM expansion approach: training data from scraped historical listings (3+ years retrospective) plus public land registry data where available (KRN for Poland, Grundbuch for DE), enrichment with geocoded amenities (POI density, transport access, schools from OpenStreetMap), and a quarterly model retrain per market.
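To make "hedonic regression with geo features" concrete, here is a small sketch: ordinary least squares on log price against log area and two geo features, solved through the normal equations in pure Python. The feature set, the coefficients used to generate the synthetic data, and all function names are assumptions for illustration; the production model's specification is not described in this case study.

```python
import math


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def features(area_m2, poi_density, km_to_transit):
    """Design-matrix row: intercept, log area, and two geo features."""
    return [1.0, math.log(area_m2), poi_density, km_to_transit]


def fit_hedonic(rows):
    """OLS fit of log(price) on the feature row via the normal equations."""
    xs = [features(r["area_m2"], r["poi_density"], r["km_to_transit"]) for r in rows]
    ys = [math.log(r["price_eur"]) for r in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, area_m2, poi_density, km_to_transit):
    """Predicted price in EUR for one property."""
    row = features(area_m2, poi_density, km_to_transit)
    return math.exp(sum(c * f for c, f in zip(coef, row)))


# Synthetic, noise-free data generated from invented coefficients.
TRUE_COEF = [8.0, 0.9, 0.02, -0.05]
rows = []
for area in (40, 60, 85, 120):
    for poi in (5, 20):
        for km in (0.3, 2.0):
            log_price = sum(c * f for c, f in zip(TRUE_COEF, features(area, poi, km)))
            rows.append({"area_m2": area, "poi_density": poi,
                         "km_to_transit": km, "price_eur": math.exp(log_price)})

coef = fit_hedonic(rows)  # recovers TRUE_COEF up to rounding error
```

On real listings the fit would include noise, many more features and a per-market split; the sketch only shows how location attributes enter the price equation alongside the property's own characteristics.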

Outcome

Coverage expanded to 11 portals across 6 countries: Otodom + Domiporta (PL), ImmoScout24 + Immowelt (DE), willhaben + Immobilienscout24.at (AT), Idealista + Fotocasa (ES), Immobiliare.it + Casa.it (IT), Idealista.pt (PT). Each has a dedicated parser plus drift detection.
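The case study does not say how drift detection works. One common approach, sketched here purely as an assumption, is to compare per-field fill rates of a fresh batch against a baseline and flag fields whose fill rate collapses, which usually means the portal changed its markup. The threshold, field names and sample records are hypothetical.

```python
def field_fill_rates(records, fields):
    """Share of records in which each field is present and non-empty."""
    total = len(records)
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / total
        for f in fields
    }


def drifted_fields(baseline, current, max_drop=0.2):
    """Fields whose fill rate fell by more than max_drop versus the baseline."""
    return sorted(
        f for f in baseline
        if baseline[f] - current.get(f, 0.0) > max_drop
    )


baseline = {"price": 1.0, "area": 0.95}  # hypothetical historical fill rates
batch = [
    {"price": 100, "area": 50},
    {"price": 200, "area": None},
    {"price": 300},
    {"price": 400, "area": ""},
]
current = field_fill_rates(batch, ["price", "area"])
print(drifted_fields(baseline, current))  # ['area']
```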

Data lag: average 47 minutes (vs the 24–48h baseline), P95 92 minutes. This let the client position the product as "real-time market intelligence" and open a new pricing tier for premium clients.
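For reference, average and P95 lag can be computed as below. This sketch assumes lag is measured from a listing's publication on the portal to its ingestion into the platform and uses the nearest-rank percentile definition; the case study specifies neither, and the sample values are made up.

```python
from datetime import datetime, timedelta
from statistics import mean


def lag_minutes(published_at, ingested_at):
    """Minutes between publication on the portal and ingestion."""
    return (ingested_at - published_at).total_seconds() / 60


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


base = datetime(2025, 1, 1, 8, 0)
sample_lags = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # minutes, illustrative
events = [(base, base + timedelta(minutes=m)) for m in sample_lags]

lags = [lag_minutes(published, ingested) for published, ingested in events]
avg_lag = mean(lags)                          # 55.0
p95_lag = percentile_nearest_rank(lags, 95)   # 100.0
```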

Deduplication accuracy: 96.4% measured via manual sampling of 1,000 records monthly. False positive rate (different properties linked as same) <1%. Confidence scoring enables downstream applications to make trust-aware decisions.
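A sketch of how a photo-hash plus geocoded-address match could yield a confidence score. It assumes 64-bit perceptual hashes compared by Hamming distance and coordinates compared by haversine distance; the weights, thresholds and field names are hypothetical and not taken from the actual dedup service.

```python
import math


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    radius_m = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(h))


def match_confidence(a, b, max_bits=10, max_metres=50.0):
    """Score in [0, 1]: 60% photo similarity, 40% geographic proximity (assumed weights)."""
    photo_sim = max(0.0, 1 - hamming(a["photo_hash"], b["photo_hash"]) / max_bits)
    distance = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
    geo_sim = max(0.0, 1 - distance / max_metres)
    return round(0.6 * photo_sim + 0.4 * geo_sim, 3)


MASK_64 = (1 << 64) - 1
listing_a = {"photo_hash": 0xF0F0F0F0F0F0F0F0, "lat": 52.2297, "lon": 21.0122}
# Same coordinates, one bit of the photo hash flipped (e.g. a re-compressed image).
listing_b = {"photo_hash": 0xF0F0F0F0F0F0F0F1, "lat": 52.2297, "lon": 21.0122}
# Completely different photo in a different city.
listing_c = {"photo_hash": ~0xF0F0F0F0F0F0F0F0 & MASK_64, "lat": 40.4168, "lon": -3.7038}

score_ab = match_confidence(listing_a, listing_b)
score_ac = match_confidence(listing_a, listing_c)
```

Downstream consumers could then apply their own cut-off, for example merging automatically above a high score and routing mid-range pairs to review, rather than receiving a hard yes/no.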

AVM coverage: 94.2% of addresses with confidence interval <15%. Top 3 lending clients started using AVM in their underwriting decision flow.
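The coverage figure can be read as the share of addresses that receive a valuation whose confidence interval is narrower than 15%. The sketch below assumes the width is measured relative to the point estimate, (upper - lower) / estimate; the case study does not define it, and the sample valuations are invented.

```python
def avm_coverage(valuations, max_rel_width=0.15):
    """Share of addresses with a valuation whose relative CI width is below the limit.

    `valuations` holds one entry per address: a dict with estimate/lower/upper,
    or None when the model produced no valuation.
    """
    if not valuations:
        return 0.0
    covered = 0
    for v in valuations:
        if v is None:
            continue
        rel_width = (v["upper"] - v["lower"]) / v["estimate"]
        if rel_width < max_rel_width:
            covered += 1
    return covered / len(valuations)


sample = [
    {"estimate": 200_000, "lower": 190_000, "upper": 210_000},  # 10% wide
    {"estimate": 300_000, "lower": 270_000, "upper": 330_000},  # 20% wide
    {"estimate": 100_000, "lower": 93_000, "upper": 107_000},   # 14% wide
    None,                                                        # no valuation
]
coverage = avm_coverage(sample)  # 0.5
```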

Cost overall: $18k/month operational (down from the $25k baseline), a 4-person team taken over (technical lead + 3 engineers), and a maintenance retainer for further expansion. Client revenue grew 2.4× in the 11 months after project completion, thanks to faster deal sourcing enabled by the expanded platform.

Stack

Temporal · Playwright · Kubernetes (Hetzner) · PostgreSQL · ClickHouse · Python AVM model · Bright Data residential · OpenStreetMap · Custom dedup service

Metrics

  • 11 Portals integrated
  • 6 Countries covered
  • 800k+ Active listings monitored
  • 47 min Data lag (avg)
  • 96.4% Dedup accuracy
  • 94.2% AVM coverage
  • Deal sourcing speedup