Published: May 8, 2026 · 16 min read

How to build an Otodom scraper (and other real estate portals)

Listing aggregation, cross-portal deduplication, GPS geocoding, alerts on new listings. For agents, flippers, investors.

Intermediate · Playwright · TypeScript · PostgreSQL + PostGIS · Google Maps API · Datacenter proxy

Real estate scraping has its own challenges: the same listing appears on 5-10 portals at once (so you need deduplication), listings change often (daily refreshes), addresses must be geocoded (address → coordinates), and agents ask for "alerts on anything new in price range X-Y in location Z".

This tutorial shows how to build a pipeline that scrapes 4 portals, deduplicates listings across portals, geocodes addresses, and sends alerts for listings that match saved searches.

What you need
  • Node.js 20+ or Python 3.11+
  • PostgreSQL with PostGIS extension (for geo queries)
  • Datacenter proxy (real estate sites usually do not run aggressive anti-bot protection)
  • Google Maps API key (geocoding) or Nominatim (free)
Steps
  1. Unified schema for cross-portal listings

    Each portal has its own structure — we normalize to one:

    // schema/listing.ts
    import { z } from 'zod';
    
    export const Listing = z.object({
      externalId: z.string(),       // ID from portal
      portal: z.enum(['portal1', 'portal2', 'portal3', 'portal4']),
      url: z.string().url(),
      title: z.string(),
      type: z.enum(['apartment', 'house', 'land', 'commercial']),
      transaction: z.enum(['sale', 'rent']),
      price: z.number().positive(),
      currency: z.enum(['EUR', 'GBP', 'USD']),
      area: z.number().positive(),  // m²
      rooms: z.number().int().nullable(),
      floor: z.number().int().nullable(),
      address: z.object({
        raw: z.string(),
        city: z.string(),
        district: z.string().nullable(),
        street: z.string().nullable(),
        lat: z.number().nullable(),
        lng: z.number().nullable(),
      }),
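      description: z.string(),
      phone: z.string().nullable(), // contact phone if visible; used for deduplication in step 3
      images: z.array(z.string().url()),
      scrapedAt: z.date(),
    });

    // Static type inferred from the schema, for use in function signatures.
    export type Listing = z.infer<typeof Listing>;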
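
Normalizing to one schema mostly means small, portal-specific parsers. As an illustration, here is a minimal sketch of turning a display price into the `price` and `currency` fields. The helper name `parseDisplayPrice` and the supported formats are assumptions for this example, not something the portals document.

```typescript
type Currency = 'EUR' | 'GBP' | 'USD';

// Hypothetical symbol table; extend it if a portal prints "EUR" or "PLN" as text.
const CURRENCY_SYMBOLS: Array<[string, Currency]> = [
  ['€', 'EUR'],
  ['£', 'GBP'],
  ['$', 'USD'],
];

function parseDisplayPrice(raw: string): { price: number; currency: Currency } | null {
  let currency: Currency | null = null;
  for (const [symbol, code] of CURRENCY_SYMBOLS) {
    if (raw.indexOf(symbol) !== -1) {
      currency = code;
      break;
    }
  }
  if (currency === null) return null; // e.g. "Price on request"

  // Drop a two-digit decimal part (",00" or ".00"), then keep digits only.
  // Assumes "," "." or spaces are used as thousands separators.
  const withoutCents = raw.replace(/[.,]\d{2}(?!\d)/, '');
  const digits = withoutCents.replace(/[^0-9]/g, '');
  if (digits.length === 0) return null;

  const price = Number(digits);
  return price > 0 ? { price, currency } : null;
}
```

Returning `null` instead of throwing lets the caller decide whether to skip the listing or store it for manual review.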
  2. Pagination and listing index

    Each portal has listing search URLs with pagination:

    // portals/portal1.ts
    import { chromium } from 'playwright';

    const BASE_URL = 'https://example.com';

    // Phase 1: walk the paginated search results and collect detail-page URLs.
    export async function listPortal1Listings(filters: { city: string }): Promise<string[]> {
      const browser = await chromium.launch({ headless: true });
      const urls: string[] = [];

      try {
        const page = await browser.newPage();
        let pageNum = 1;
        let hasMore = true;

        while (hasMore && pageNum <= 100) { // safety limit
          const url = `${BASE_URL}/search?city=${encodeURIComponent(filters.city)}&page=${pageNum}`;
          await page.goto(url, { waitUntil: 'networkidle' });

          const items = await page.locator('[data-cy="listing-item"]').all();
          for (const item of items) {
            const href = await item.locator('a').first().getAttribute('href');
            // Skip items without a link; resolve relative and absolute hrefs alike.
            if (href) urls.push(new URL(href, BASE_URL).toString());
          }

          const nextBtn = page.locator('[data-cy="pagination.next-page"]');
          hasMore = (await nextBtn.count()) > 0 && !(await nextBtn.isDisabled());
          pageNum++;
        }
      } finally {
        await browser.close(); // release the browser even if a page fails
      }

      return urls;
    }

    We first collect the listing URLs, then scrape the details of each one in a separate pass. This lets the job resume where it left off if the process dies.
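
The two-phase approach can be sketched as a small task queue. The version below keeps the queue in memory to stay self-contained; in a real pipeline the same fields would more likely live in a database table. The names `CrawlTask`, `enqueueUrls`, `pendingTasks`, and `recordResult` are hypothetical.

```typescript
interface CrawlTask {
  url: string;
  status: 'pending' | 'done' | 'failed';
  attempts: number;
}

// Phase 1 output: add newly collected URLs without resetting tasks we already know.
function enqueueUrls(queue: CrawlTask[], urls: string[]): CrawlTask[] {
  const known = new Set(queue.map((task) => task.url));
  const merged = queue.slice();
  for (const url of urls) {
    if (!known.has(url)) {
      known.add(url);
      merged.push({ url, status: 'pending', attempts: 0 });
    }
  }
  return merged;
}

// Phase 2 input: everything not yet scraped, with failed tasks retried up to maxAttempts.
function pendingTasks(queue: CrawlTask[], maxAttempts = 3): CrawlTask[] {
  return queue.filter((task) => task.status !== 'done' && task.attempts < maxAttempts);
}

// Record the outcome of one detail-page scrape.
function recordResult(task: CrawlTask, ok: boolean): CrawlTask {
  return { url: task.url, status: ok ? 'done' : 'failed', attempts: task.attempts + 1 };
}
```

After a crash, the detail scraper simply asks for `pendingTasks` again; finished pages are not fetched a second time.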

  3. Cross-portal deduplication

    The same property is often listed on 3-5 portals. Deduplicate by address + area + price (with a tolerance), or by phone number if it is visible.

    async function findDuplicates(listing) {
      // Strategy 1: exact phone match
      if (listing.phone) {
        const byPhone = await db.query(
          'SELECT id, portal FROM listings WHERE phone=$1 AND active=true',
          [listing.phone]
        );
        if (byPhone.rows.length > 0) return byPhone.rows;
      }
      
      // Strategy 2: address + area + price (with 5% tolerance)
      const byAttrs = await db.query(`
        SELECT id, portal, price FROM listings
        WHERE address_normalized = $1
          AND area BETWEEN $2 AND $3
          AND price BETWEEN $4 AND $5
          AND active = true
      `, [
        normalizeAddress(listing.address.raw),
        listing.area * 0.95, listing.area * 1.05,
        listing.price * 0.95, listing.price * 1.05,
      ]);
      
      return byAttrs.rows;
    }
    
    async function ingest(listing) {
      const dups = await findDuplicates(listing);
      if (dups.length > 0) {
        // Link as alternative listing
        await db.query(
          'INSERT INTO listing_alternatives (canonical_id, alt_url, portal) VALUES ($1, $2, $3)',
          [dups[0].id, listing.url, listing.portal]
        );
      } else {
        await db.query('INSERT INTO listings (...) VALUES (...)', [...]);
      }
    }
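
The snippet above relies on a `normalizeAddress` helper that is not shown. One possible sketch follows, together with the 5% tolerance check expressed in TypeScript. The abbreviation and unit-word tables are assumptions for illustration and would need extending per country; the function is ASCII-only, so accented letters are treated as separators.

```typescript
// Hypothetical lookup tables; extend them per country and language.
const STREET_ABBREVIATIONS = new Map<string, string>([
  ['st', 'street'],
  ['str', 'street'],
  ['rd', 'road'],
  ['ave', 'avenue'],
]);
const UNIT_WORDS = new Set<string>(['apt', 'apartment', 'flat', 'unit']);

function normalizeAddress(raw: string): string {
  const tokens = raw
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, ' ') // punctuation, including "/" and ",", becomes a space
    .split(' ')
    .filter((token) => token.length > 0 && !UNIT_WORDS.has(token))
    .map((token) => STREET_ABBREVIATIONS.get(token) ?? token);
  // Sorting makes the key independent of token order.
  return tokens.sort().join(' ');
}

// True when candidate lies within ±tolerance of reference (0.05 = 5%).
function withinTolerance(candidate: number, reference: number, tolerance = 0.05): boolean {
  return candidate >= reference * (1 - tolerance) && candidate <= reference * (1 + tolerance);
}
```

Sorting tokens is deliberately aggressive: it can merge distinct addresses that share the same words and numbers, which is why the query also compares area and price.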
  4. Geocoding and PostGIS queries

    Geocode address → lat/lng. Google Maps API ($5/1000 requests) or Nominatim (free, slower):

    async function geocode(address) {
      const cached = await db.query(
        'SELECT lat, lng FROM geocode_cache WHERE address=$1', [address]
      );
      if (cached.rows.length) return cached.rows[0];
      
      const url = `https://maps.googleapis.com/maps/api/geocode/json?address=${encodeURIComponent(address)}&key=${API_KEY}`;
      const res = await fetch(url).then(r => r.json());
      if (res.status !== 'OK') return null;
      
      const { lat, lng } = res.results[0].geometry.location;
      await db.query(
        'INSERT INTO geocode_cache (address, lat, lng) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING',
        [address, lat, lng]
      );
      return { lat, lng };
    }
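
For the Nominatim route, the request URL and response parsing differ from Google's. The sketch below covers only those two pieces and leaves the HTTP call out so that it stays self-contained; the helper names are hypothetical. It assumes the public endpoint at nominatim.openstreetmap.org, whose search results carry `lat` and `lon` as strings. The public server's usage policy also expects an identifying User-Agent header and at most one request per second.

```typescript
interface LatLng {
  lat: number;
  lng: number;
}

// Build a search URL for the public Nominatim instance (one result, JSON output).
function buildNominatimUrl(address: string): string {
  return (
    'https://nominatim.openstreetmap.org/search?format=jsonv2&limit=1&q=' +
    encodeURIComponent(address)
  );
}

// Parse the decoded JSON body of a search response into the same shape geocode() returns.
function parseNominatimBody(body: unknown): LatLng | null {
  if (!Array.isArray(body) || body.length === 0) return null; // no match
  const first: unknown = body[0];
  if (typeof first !== 'object' || first === null) return null;

  const record = first as { lat?: unknown; lon?: unknown };
  if (typeof record.lat !== 'string' || typeof record.lon !== 'string') return null;

  const lat = Number(record.lat);
  const lng = Number(record.lon);
  return isFinite(lat) && isFinite(lng) ? { lat, lng } : null;
}
```

Keeping the same `{ lat, lng }` return shape means the cache table and the PostGIS update below work unchanged with either provider.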

    Set up PostGIS for geo queries:

    CREATE EXTENSION IF NOT EXISTS postgis;
    ALTER TABLE listings ADD COLUMN location GEOGRAPHY(POINT, 4326);
    UPDATE listings SET location = ST_MakePoint(lng, lat)::geography WHERE lat IS NOT NULL;
    CREATE INDEX idx_listings_location ON listings USING GIST (location);
    
    -- Find apartments within 1km of a point
    SELECT * FROM listings
    WHERE ST_DWithin(location, ST_MakePoint(-0.1, 51.5)::geography, 1000)
      AND price < 800000;
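
On a `geography` column, `ST_DWithin` measures distance in meters. For unit tests or a quick sanity check outside the database, the same radius test can be approximated with the haversine formula. This is a spherical approximation, so small differences from PostGIS results are expected; the function name is an assumption for this example.

```typescript
const EARTH_RADIUS_M = 6371000; // mean Earth radius, spherical approximation

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance in meters between two lat/lng points.
function haversineMeters(lat1: number, lng1: number, lat2: number, lng2: number): number {
  const dLat = toRadians(lat2 - lat1);
  const dLng = toRadians(lng2 - lng1);
  const sinLat = Math.sin(dLat / 2);
  const sinLng = Math.sin(dLng / 2);
  const a =
    sinLat * sinLat +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * sinLng * sinLng;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// In-memory equivalent of the 1 km filter in the query above.
function isWithinRadius(
  lat: number,
  lng: number,
  centerLat: number,
  centerLng: number,
  radiusM: number,
): boolean {
  return haversineMeters(lat, lng, centerLat, centerLng) <= radiusM;
}
```

Note the argument order: `ST_MakePoint` takes longitude first, while this helper takes latitude first.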
  5. Saved searches and alerts

    The user defines a saved search, e.g. "apartment, London, 60-80m², up to £850k". After each scrape we match new listings against the saved searches.

    CREATE TABLE saved_searches (
      id BIGSERIAL PRIMARY KEY,
      user_id TEXT NOT NULL,
      filters JSONB NOT NULL,
      notification_channel TEXT, -- 'email' or 'slack'
      created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
    );
    
    -- Match function
    CREATE OR REPLACE FUNCTION matches_search(l listings, s saved_searches) RETURNS BOOLEAN AS $$
    BEGIN
      IF s.filters->>'type' IS NOT NULL AND l.type != s.filters->>'type' THEN
        RETURN FALSE;
      END IF;
      IF s.filters->>'max_price' IS NOT NULL AND l.price > (s.filters->>'max_price')::numeric THEN
        RETURN FALSE;
      END IF;
      -- ... rest
      RETURN TRUE;
    END;
    $$ LANGUAGE plpgsql;
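
The same matching logic can also live in application code, which is easier to unit-test. The sketch below mirrors the `type` and `max_price` checks from the SQL function. The remaining filter keys (`city`, `min_area`, `max_area`) are assumptions derived from the example search, not a definition of what the elided part of the function contains.

```typescript
// Hypothetical shape of the JSONB `filters` column.
interface SearchFilters {
  type?: string;
  city?: string;
  min_area?: number;
  max_area?: number;
  max_price?: number;
}

// Only the listing fields the matcher needs.
interface ListingRow {
  type: string;
  city: string;
  area: number;
  price: number;
}

// A missing filter key means "no constraint", as in the SQL version.
function matchesSearch(listing: ListingRow, filters: SearchFilters): boolean {
  if (filters.type !== undefined && listing.type !== filters.type) return false;
  if (filters.city !== undefined && listing.city.toLowerCase() !== filters.city.toLowerCase()) {
    return false;
  }
  if (filters.min_area !== undefined && listing.area < filters.min_area) return false;
  if (filters.max_area !== undefined && listing.area > filters.max_area) return false;
  if (filters.max_price !== undefined && listing.price > filters.max_price) return false;
  return true;
}

// The example search from the text: apartment, London, 60-80 m², up to £850k.
const exampleSearch: SearchFilters = {
  type: 'apartment',
  city: 'London',
  min_area: 60,
  max_area: 80,
  max_price: 850000,
};
```

Running the matcher only over listings inserted since the previous scrape keeps alerts limited to new listings.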
What it costs to run

Run cost for 4 portals, ~10k listings/day:

  • Datacenter proxy (~30GB/month): $30-60/month
  • VPS (Hetzner CX32): €10/month
  • PostgreSQL + PostGIS (Supabase Pro or Hetzner managed): $25/month
  • Google Maps geocoding (~3000 unique addresses/month, cached): $15-30/month (or Nominatim for $0)

Total: ~$80-125/month (counting the €10 VPS as roughly $10).

Common pitfalls
  • No deduplication — the agent sees 4× the same apartment from 4 portals. UX killer.
  • Address normalization — "15/2 Main St" vs "15 Main Street Apt 2" — same place, different strings. Normalize aggressively.
  • Stale listings — apartments disappear from portals after sale. Mark inactive if URL gives 404 or "listing expired".
  • Geocoding rate limits: the Google Maps Geocoding API allows about 50 req/sec, so its quota is rarely the bottleneck. The public Nominatim server allows only 1 req/sec (free).
  • Gross/net price inconsistency — some portals show with VAT, others without. Validate.
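
For the geocoding rate limits above, a minimum-interval limiter is usually enough. The sketch below takes the current time as an argument, so it has no timers of its own and is easy to test; the class name and API are hypothetical. The caller is expected to wait for the returned number of milliseconds before sending the request.

```typescript
// Spaces requests at least minIntervalMs apart (1000 for a 1 req/sec limit).
class MinIntervalLimiter {
  private nextFreeAtMs = 0;
  private minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserve the next slot and return how long to wait (in ms) from nowMs.
  reserve(nowMs: number): number {
    const startAtMs = Math.max(nowMs, this.nextFreeAtMs);
    this.nextFreeAtMs = startAtMs + this.minIntervalMs;
    return startAtMs - nowMs;
  }
}
```

In the pipeline this would wrap the geocoding call, for example by waiting `limiter.reserve(Date.now())` milliseconds before each uncached request. Cache hits do not need a slot.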
Build yourself or hire?

Real estate scraping is a common request — we helped 3 companies (a property management firm, a real estate office, a solo flipper). All use a variant of the above pipeline. The complicated parts: dynamic auctions, local portals, AI photo classification for condition scoring.

If you want this done production-grade, write to us.

Frequently asked questions
Can I resell scraped property listings?
Usually no: photos and descriptions are the copyright of the agency or seller. You can use them for internal analytics, client dashboards, and alerts, but you cannot publish them as your own. Redistribution requires a licensing deal with the portals (most major ones have API partner programs).
How long does a full scrape across all portals take?
~10k listings across the 4 portals × ~5s per detail page (with politeness delay) ≈ 14h when run sequentially. With parallelism (10 workers) that drops to about 1.5h. In practice the full scrape runs overnight (cron, 2-5am).
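
The estimate can be reproduced with a one-line calculation. The sketch assumes roughly 10k detail pages in total and perfectly even parallelism, so the result is a lower bound; the function name is hypothetical.

```typescript
// Wall-clock hours for a full scrape, assuming work splits evenly across workers.
function estimateScrapeHours(listings: number, secondsPerPage: number, workers: number): number {
  return (listings * secondsPerPage) / workers / 3600;
}

const sequentialHours = estimateScrapeHours(10000, 5, 1); // about 13.9 hours
const parallelHours = estimateScrapeHours(10000, 5, 10); // about 1.4 hours
```

Retries, slow pages, and per-portal rate limits push the real figure above this bound, which is why the 10-worker run is budgeted at about 1.5h.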
How to detect when an apartment was sold or disappeared?
Re-check each active listing daily: if the URL returns 404 or shows "expired", mark it inactive. Alternatively, run a full re-scrape weekly and mark listings that are missing from the new snapshot as sold.
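
The snapshot approach boils down to a set difference. The sketch below adds one safeguard that is an assumption of this example, not part of the approach described above: if the new snapshot is much smaller than the active set, the scrape probably failed and nothing is marked.

```typescript
// Returns active URLs that are missing from the latest snapshot.
// minCoverage guards against a partial or failed scrape wiping out the whole index.
function findDisappeared(
  activeUrls: string[],
  snapshotUrls: string[],
  minCoverage = 0.5,
): string[] {
  if (snapshotUrls.length < activeUrls.length * minCoverage) {
    return []; // snapshot looks incomplete; do not mark anything inactive
  }
  const seen = new Set(snapshotUrls);
  return activeUrls.filter((url) => !seen.has(url));
}
```

The returned URLs are candidates for `active = false`. A missing listing may have been sold, withdrawn, or simply expired, so the snapshot alone cannot say which.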
Do real estate portals block scrapers?
Major real estate portals usually have basic rate limiting but not aggressive anti-bot. Datacenter proxy + Playwright with normal headers suffices for 5-10k pages/day. Above that — residential proxy + larger delays. More restrictive: portals where listings are behind member registration.