AX/G/007

Slack alerts for automation: practical setup

Webhook vs Bot API, routing per priority, the "47 alerts per day" anti-pattern. Concrete templates that work from day one.

A first retainer typically starts with one rule: "all alerts go to #automation-alerts". Two weeks later the channel gets 200 messages a day, no one reads it, and a critical incident sits for 4 hours before anyone notices. This guide explains how to avoid that trap.

An alert no one reads is not an alert; it is noise. Good observability means fewer alerts, each routed precisely to the people who need it.

Webhook vs Bot API: when to use which

Slack has two main APIs for sending messages:

  • Incoming Webhook (simplest): URL → POST JSON → message. Setup: 5 min. Limitation: one URL per channel, no interaction.
  • Bot API (fuller): bot token + scopes → post to any channel the bot has been added to, reply in threads, add reactions, send interactive buttons. Setup: 15-30 min. Installing the app may require workspace admin approval.

Rule of thumb: if you only push information, use a webhook. If you need interaction (Ack, Mute 1h, Reassign buttons), use the Bot API.
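
A minimal sketch of the difference, using only the Python standard library. It assumes you already have an Incoming Webhook URL and, for the Bot API, a bot token with the `chat:write` scope; the URL, token, and channel values are placeholders. The webhook request carries only the text (the URL fixes the channel), while `chat.postMessage` names the channel in every call.

```python
import json
import urllib.request

# Slack Web API method for posting as a bot.
POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def webhook_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Incoming Webhook: the URL itself decides the channel, so only text is sent."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def bot_request(token: str, channel: str, text: str) -> urllib.request.Request:
    """Bot API: one token, channel chosen per message (the bot must be in that channel)."""
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        POST_MESSAGE_URL,
        data=body,
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send(req: urllib.request.Request) -> str:
    """Send the request over the network and return the raw response body."""
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")
```

Usage: `send(webhook_request(WEBHOOK_URL, "🔴 [OPS-25-K7] Parser failed | nike.com"))`. Note that `chat.postMessage` usually answers with HTTP 200 even when the post fails, and reports success in the `ok` field of its JSON body, so check `ok` rather than the status code alone.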

Routing per priority

Instead of 1 channel, use 3:

  • #ax-alerts-critical: pipeline down, data corrupted, security incident. Ping @here or @oncall. Goal: response in <15 min.
  • #ax-alerts-high: degraded performance, partial failures, a parser starting to fail. No ping. Goal: response within 4 business hours.
  • #ax-logs: normal operations, daily summaries, completed scheduled jobs. Muted by default, read-only. Goal: a reference when needed.

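Critical threshold: use critical only when action is required within 1 hour. Everything else goes to high or logs. This discipline is what makes the 15-minute goal realistic: because critical stays rare, the team can respond to every critical alert within 15 minutes.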
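
The routing rule fits in a small lookup table. This is an illustrative sketch, not a prescribed implementation: `classify` encodes the threshold above (critical only if action is needed within 60 minutes), and the channel names come from the list. `<!here>` is Slack's message syntax for an @here mention; pinging an @oncall user group would instead use the `<!subteam^ID>` syntax with your group's ID, which is not shown.

```python
from enum import Enum
from typing import Optional, Tuple


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    INFO = "info"


# Channel and mention prefix per severity, following the routing list above.
ROUTES = {
    Severity.CRITICAL: ("#ax-alerts-critical", "<!here> "),  # the only channel that pings
    Severity.HIGH: ("#ax-alerts-high", ""),                  # no ping
    Severity.INFO: ("#ax-logs", ""),                         # muted, read-only
}


def classify(action_within_min: Optional[int]) -> Severity:
    """Critical only if someone must act within 60 minutes."""
    if action_within_min is None:
        return Severity.INFO  # nothing to do: logs only
    if action_within_min <= 60:
        return Severity.CRITICAL
    return Severity.HIGH  # action needed, but it can wait for business hours


def route(severity: Severity, text: str) -> Tuple[str, str]:
    """Return (channel, message text) for a given severity."""
    channel, mention = ROUTES[severity]
    return channel, mention + text
```

Keeping the mention inside the routing table, rather than in each alert, means no individual alert can decide on its own to ping the whole channel.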

Anatomy of a good alert

Every alert message should contain:

  1. Severity icon: 🔴 critical, 🟡 high, 🔵 info (readable at a glance)
  2. System ID: e.g. OPS-25-K7, so it is immediately clear whose system is affected
  3. One-line summary: what broke (not "error", but "parser failed on nike.com, selector .price-now changed")
  4. Impact: what this means operationally ("0 SKUs collected in the last 2h, retry queue: 47")
  5. Direct link: to the dashboard / logs / runbook
  6. Recommended action: the simplest next step
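
One way to enforce this anatomy is to make every field mandatory in code, so an alert without impact or action cannot even be built. A sketch, assuming a hypothetical `Alert` dataclass of our own (not a Slack type); the rendered text follows the shape of the critical template below.

```python
from dataclasses import dataclass
from typing import Dict

ICONS = {"critical": "🔴", "high": "🟡", "info": "🔵"}


@dataclass
class Alert:
    severity: str          # "critical", "high" or "info"
    system_id: str         # e.g. "OPS-25-K7"
    title: str             # what broke, e.g. "Parser failed"
    target: str            # where, e.g. "nike.com"
    detail: str            # one-line cause
    impact: str            # operational consequence
    action: str            # simplest next step
    links: Dict[str, str]  # label -> URL (dashboard, logs, runbook)

    def __post_init__(self) -> None:
        # Reject alerts that would land in the "alert without context" anti-pattern.
        if self.severity not in ICONS:
            raise ValueError(f"unknown severity: {self.severity}")
        for name in ("impact", "action"):
            if not getattr(self, name).strip():
                raise ValueError(f"alert is missing {name}")
        if not self.links:
            raise ValueError("alert needs at least one link")

    def render(self) -> str:
        links = " | ".join(f"{label}: {url}" for label, url in self.links.items())
        return "\n".join([
            f"{ICONS[self.severity]} [{self.system_id}] {self.title} | {self.target}",
            self.detail,
            f"Impact: {self.impact}",
            f"Action: {self.action}",
            links,
        ])


example = Alert(
    severity="critical",
    system_id="OPS-25-K7",
    title="Parser failed",
    target="nike.com",
    detail="14 selectors broken, likely site redesign",
    impact="0/247 SKUs collected last 90 min",
    action="fix parser within 4h",
    links={"📊 Dashboard": "ax.io/ops-25-k7", "📚 Runbook": "ax.io/rb/parser-fail"},
)
print(example.render())
```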

Practical templates

Critical alert (parser failed)

🔴 [OPS-25-K7] Parser failed | nike.com
14 selectors broken — likely site redesign
Impact: 0/247 SKUs collected last 90 min
Action: fix parser within 4h
📊 Dashboard: ax.io/ops-25-k7 | 📚 Runbook: ax.io/rb/parser-fail

Daily summary (info)

🔵 [OPS-25-K7] Daily summary | 2026-03-25
✅ 247/247 SKUs collected (100%)
📈 12 prices changed (3 down, 9 up)
⏱️ Total runtime: 4m 32s
📊 Dashboard: ax.io/ops-25-k7
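
The daily summary is where per-row events belong: a whole run is aggregated into one message instead of one post per SKU. A sketch, assuming a hypothetical `SkuResult` record produced by your parser; with 247 collected SKUs, 12 price changes (3 down, 9 up) and a 272-second runtime, it produces the template above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SkuResult:
    sku: str
    collected: bool
    old_price: Optional[float] = None
    new_price: Optional[float] = None


def daily_summary(system_id: str, day: str, results: List[SkuResult],
                  runtime_s: int, dashboard: str) -> str:
    """Aggregate one run into a single message instead of one per SKU."""
    total = len(results)
    ok = sum(1 for r in results if r.collected)
    changed = [
        r for r in results
        if r.collected and r.old_price is not None
        and r.new_price is not None and r.new_price != r.old_price
    ]
    down = sum(1 for r in changed if r.new_price < r.old_price)
    up = len(changed) - down
    pct = round(100 * ok / total) if total else 0
    minutes, seconds = divmod(runtime_s, 60)
    status = "✅" if total and ok == total else "⚠️"  # warn on any gap
    return "\n".join([
        f"🔵 [{system_id}] Daily summary | {day}",
        f"{status} {ok}/{total} SKUs collected ({pct}%)",
        f"📈 {len(changed)} prices changed ({down} down, {up} up)",
        f"⏱️ Total runtime: {minutes}m {seconds}s",
        f"📊 Dashboard: {dashboard}",
    ])
```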

Anti-patterns

What not to do:

  • Alert per row: 47 separate "new price found" messages instead of one "Found 47 new prices". Aggregate.
  • Alert without context: "Error" or "Failed task". What failed? Where? What are the consequences?
  • "All OK" alerts: "system OK" every minute. No one reads them, and they bury the real alerts.
  • @channel for normal events: pings are reserved for critical. Otherwise the team stops reacting.
  • The same alert every 5 min: when a parser is down for an hour, send 1 alert and escalate, not 12.
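
For the last anti-pattern, a minimal deduplication sketch: alerts are keyed by incident (here a hypothetical `"system:target"` string), the first failure produces one alert, repeats stay silent, and an incident still open after an hour produces a single escalation. Timestamps are passed in explicitly so the logic is easy to test; in production you would pass something like `time.time()`.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Incident:
    first_seen: float
    escalated: bool = False


class AlertDeduper:
    """One alert per incident, silent repeats, one escalation if it stays open."""

    def __init__(self, escalate_after_s: float = 3600.0) -> None:
        self.escalate_after_s = escalate_after_s
        self.open: Dict[str, Incident] = {}

    def on_failure(self, key: str, now: float) -> Optional[str]:
        """Return "alert", "escalate", or None (stay quiet) for a failed check."""
        incident = self.open.get(key)
        if incident is None:
            self.open[key] = Incident(first_seen=now)
            return "alert"
        if not incident.escalated and now - incident.first_seen >= self.escalate_after_s:
            incident.escalated = True
            return "escalate"
        return None

    def on_recovery(self, key: str) -> bool:
        """Close the incident; True if one was open (e.g. to post a recovery note)."""
        return self.open.pop(key, None) is not None
```

With a check every 5 minutes for an hour, this yields exactly one alert at the start and one escalation at the 60-minute mark; everything in between is suppressed.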

Snooze + acknowledgment flow

The Bot API supports interactive buttons. A practical set:

  • [Ack] — "I know about the problem, working on it". Suspends re-alerts for 1h.
  • [Mute 4h] — "I know, fix is coming, do not spam". Mutes for 4h.
  • [Resolved] — incident closed, opens post-mortem template.
  • [Escalate] — pings on-call + creates Linear ticket.

Without this, the team lives with a constant "has anyone seen this yet?". With buttons, ownership is explicit: everyone can see who picked up the incident and what state it is in.
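
A sketch of both halves of the button flow, assuming the Bot API route. `alert_blocks` builds a Block Kit layout (a `section` block plus an `actions` block with four `button` elements) that you would send as the `blocks` field of `chat.postMessage`. `handle_action` and `should_realert` are hypothetical helpers operating on a plain-dict incident state of our own design; the escalation side effects (on-call ping, Linear ticket) are left as comments because they depend on your integrations.

```python
from typing import Any, Dict, List

# Suppression windows per button, in seconds (Ack 1h, Mute 4h).
SUPPRESS_S = {"ack": 3600, "mute_4h": 4 * 3600}

BUTTONS = [("Ack", "ack"), ("Mute 4h", "mute_4h"),
           ("Resolved", "resolved"), ("Escalate", "escalate")]


def alert_blocks(text: str, incident_id: str) -> List[Dict[str, Any]]:
    """Block Kit layout: the alert text plus one row of buttons."""
    buttons = []
    for label, action_id in BUTTONS:
        button: Dict[str, Any] = {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "action_id": action_id,
            "value": incident_id,  # tells the handler which incident was clicked
        }
        if action_id == "escalate":
            button["style"] = "danger"  # rendered as a red button
        buttons.append(button)
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": text}},
        {"type": "actions", "elements": buttons},
    ]


def handle_action(action_id: str, now: float, state: Dict[str, Any]) -> Dict[str, Any]:
    """Return the incident's new state after a click (hypothetical state shape)."""
    new_state = dict(state)
    if action_id in SUPPRESS_S:
        new_state["suppressed_until"] = now + SUPPRESS_S[action_id]
    elif action_id == "resolved":
        new_state["status"] = "resolved"  # next: open the post-mortem template
    elif action_id == "escalate":
        new_state["escalated"] = True  # next: ping on-call, create a Linear ticket
    return new_state


def should_realert(state: Dict[str, Any], now: float) -> bool:
    """Re-alert only for open incidents whose suppression window has passed."""
    if state.get("status") == "resolved":
        return False
    return now >= state.get("suppressed_until", 0)
```

Slack delivers button clicks to your app's interactivity Request URL and expects an HTTP 200 acknowledgment within 3 seconds, so keep slow work (such as creating a ticket) out of the request path. When posting `blocks`, also include a plain `text` field as the fallback for notifications.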

The point

Three channels (critical / high / logs), a severity-first format, one alert per incident rather than per row, and acknowledgment buttons. Setting up a single pipeline takes ~30 min; an entire stack (5-10 pipelines) takes about half a day. ROI: in the first 2 weeks of production, the team saves 10+ hours it would otherwise spend figuring out what is happening.

Hitting a similar problem?

We ship most of these techniques to production.

If this article resonates with something you are trying to solve, write to us. The initial project assessment is free.