A first retainer typically starts with "all alerts to #automation-alerts". After 2 weeks the channel gets 200 messages a day, no one reads it, and a critical incident sits for 4 hours before anyone notices. This guide explains how to avoid that trap.
An alert no one reads is not an alert, it is noise. Good observability means fewer channels, but more precisely routed ones.
Webhook vs Bot API: which to use when
Slack has two main APIs for sending messages:
- Incoming Webhook (simplest): URL → POST JSON → message. Setup: 5 min. Limitation: one URL per channel, no interaction.
- Bot API (fuller): bot token + scopes → send to any channel the bot can post in, threads, reactions, interactive buttons. Setup: 15-30 min. May require workspace admin approval, depending on workspace settings.
Rule: if you only send information — webhook. If you need interaction (Ack, Mute 1h, Reassign buttons) — Bot API.
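The webhook path can be sketched in a few lines of standard-library Python. This is a minimal sketch, not a definitive client: the function names `build_payload` and `send_via_webhook` are illustrative, and the webhook URL is whatever Slack generated for your channel.

```python
import json
import urllib.request


def build_payload(text: str) -> bytes:
    """Encode a plain-text message as the JSON body an incoming webhook accepts."""
    return json.dumps({"text": text}).encode("utf-8")


def send_via_webhook(webhook_url: str, text: str, timeout: float = 5.0) -> int:
    """POST the message to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Because one webhook URL is bound to one channel, routing to several channels this way means keeping one URL per channel. With a bot token, the same message would instead go through the `chat.postMessage` Web API method with the channel passed as a parameter.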
Routing per priority
Instead of 1 channel, use 3:
- #ax-alerts-critical — pipeline down, data corrupted, security incident. Ping @here or @oncall. Goal: response <15 min.
- #ax-alerts-high — degraded performance, partial failures, parser starts to fail. No ping. Goal: response in business hours <4h.
- #ax-logs — normal operations, daily summary, scheduled job completed. Read-only, muted by default. Goal: a reference when needed.
Critical threshold: an alert is critical only when action is required in <1h. Everything else is high or logs. The discipline that goes with it: respond to every critical within 15 min.
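The routing rule above could be expressed as a small lookup plus a classifier. This is a sketch under assumptions: the `Route` type, the `classify` inputs (`action_required_within_minutes`, `degraded`) and the choice of `@here` over `@oncall` are illustrative and not prescribed by the guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    channel: str
    mention: str  # empty string means "no ping"


# Channel names come from the list above. In Slack message markup the
# @here mention is written as "<!here>".
ROUTES = {
    "critical": Route("#ax-alerts-critical", "@here"),
    "high": Route("#ax-alerts-high", ""),
    "info": Route("#ax-logs", ""),
}


def classify(action_required_within_minutes: Optional[float], degraded: bool) -> str:
    """Critical only when action is needed in under 1h; otherwise high or info."""
    if action_required_within_minutes is not None and action_required_within_minutes < 60:
        return "critical"
    if degraded:
        return "high"
    return "info"


def route(severity: str) -> Route:
    """Look up the channel and mention for a severity level."""
    return ROUTES[severity]
```

Keeping the threshold in one function makes it harder for individual pipelines to quietly promote their own alerts to critical.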
Anatomy of a good alert
Every alert message should contain:
- Severity icon — 🔴 critical, 🟡 high, 🔵 info (one glance enough)
- System ID — e.g. OPS-25-K7 — immediately clear whose system
- 1-line summary — what broke (not "error", but "parser failed on nike.com — selector .price-now changed")
- Impact — what this means operationally ("0 SKUs collected last 2h, retry queue: 47")
- Direct link — to dashboard / logs / runbook
- Recommended action — simplest next step
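The checklist above maps naturally onto a small data structure and a formatter. The sketch below is one possible shape; the `Alert` class, its field names and `format_alert` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

SEVERITY_ICONS = {"critical": "🔴", "high": "🟡", "info": "🔵"}


@dataclass
class Alert:
    severity: str      # "critical", "high" or "info"
    system_id: str     # e.g. "OPS-25-K7"
    summary: str       # one line: what broke
    impact: str        # what this means operationally
    action: str        # simplest next step
    links: dict = field(default_factory=dict)  # label -> URL


def format_alert(alert: Alert) -> str:
    """Render the alert as a severity-first, multi-line text message."""
    lines = [
        f"{SEVERITY_ICONS[alert.severity]} [{alert.system_id}] {alert.summary}",
        f"Impact: {alert.impact}",
        f"Action: {alert.action}",
    ]
    if alert.links:
        lines.append(" | ".join(f"{label}: {url}" for label, url in alert.links.items()))
    return "\n".join(lines)
```

Making every field required (apart from links) means an alert without impact or a recommended action cannot be constructed in the first place.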
Practical templates
Critical alert (parser failed)
🔴 [OPS-25-K7] Parser failed | nike.com
14 selectors broken — likely site redesign
Impact: 0/247 SKUs collected last 90 min
Action: fix parser within 4h
📊 Dashboard: ax.io/ops-25-k7 | 📚 Runbook: ax.io/rb/parser-fail
Daily summary (info)
🔵 [OPS-25-K7] Daily summary | 2026-03-25
✅ 247/247 SKUs collected (100%)
📈 12 prices changed (3 down, 9 up)
⏱️ Total runtime: 4m 32s
📊 Dashboard: ax.io/ops-25-k7
Anti-patterns
What not to do:
- Alert per row — "Found 47 new prices" as 47 separate messages. Aggregate.
- Alert without context — "Error" or "Failed task". What? Where? What are the consequences?
- "All OK" alert — every minute "system OK". No one reads it, hides real alerts.
- @channel for normal events — pings reserved for critical. Otherwise the team stops reacting.
- Same alert every 5 min — when a parser fails for an hour, 1 alert + escalate, not 12.
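The last anti-pattern, one alert plus an escalation instead of a repeat every 5 minutes, can be sketched as a small in-memory deduplicator. The class name, the incident key format and the 1h escalation default are assumptions for illustration; a production version would persist its state.

```python
class IncidentDeduplicator:
    """Emit one alert per incident, one escalation later, and suppress the rest."""

    def __init__(self, escalate_after_s: float = 3600.0):
        self.escalate_after_s = escalate_after_s
        self._first_seen = {}    # incident key -> timestamp of first failure
        self._escalated = set()  # incident keys already escalated

    def on_failure(self, key: str, now: float) -> str:
        """Return 'alert', 'escalate' or 'suppress' for a failing check."""
        if key not in self._first_seen:
            self._first_seen[key] = now
            return "alert"
        overdue = now - self._first_seen[key] >= self.escalate_after_s
        if overdue and key not in self._escalated:
            self._escalated.add(key)
            return "escalate"
        return "suppress"

    def on_recovery(self, key: str) -> None:
        """Forget the incident so the next failure alerts again."""
        self._first_seen.pop(key, None)
        self._escalated.discard(key)


# A parser failing on every 5-minute check for one hour:
dedup = IncidentDeduplicator()
decisions = [dedup.on_failure("OPS-25-K7:nike.com", t) for t in range(0, 3601, 300)]
```

In this run the 13 checks produce one alert at the start, one escalation at the one-hour mark, and 11 suppressed repeats.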
Snooze + acknowledgment flow
Bot API allows interactive buttons. Practical implementation:
- [Ack] — "I know about the problem, working on it". Suspends re-alerts for 1h.
- [Mute 4h] — "I know, fix is coming, do not spam". Mutes for 4h.
- [Resolved] — incident closed, opens post-mortem template.
- [Escalate] — pings on-call + creates Linear ticket.
Without this, the team lives in the stress of "has someone seen this yet?". With buttons — clear ownership flow.
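As a rough sketch of the button flow, the message can carry a Block Kit `actions` block, and the click handler can compute how long re-alerts stay silent. The action IDs (`ack`, `mute_4h`, `resolved`, `escalate`) and both function names are hypothetical; receiving the click also needs an interactivity endpoint, which is not shown here.

```python
from datetime import datetime, timedelta

# Hypothetical action IDs mapped to how long re-alerts are suppressed.
SUPPRESS_FOR = {"ack": timedelta(hours=1), "mute_4h": timedelta(hours=4)}

BUTTONS = [
    ("Ack", "ack"),
    ("Mute 4h", "mute_4h"),
    ("Resolved", "resolved"),
    ("Escalate", "escalate"),
]


def build_action_blocks(incident_id: str) -> list:
    """Build a Block Kit 'actions' block holding the four buttons."""
    return [{
        "type": "actions",
        "elements": [
            {
                "type": "button",
                "text": {"type": "plain_text", "text": label},
                "action_id": action_id,
                "value": incident_id,  # echoed back in the click payload
            }
            for label, action_id in BUTTONS
        ],
    }]


def suppressed_until(action_id: str, clicked_at: datetime):
    """Return when re-alerts may resume, or None if the action does not mute."""
    window = SUPPRESS_FOR.get(action_id)
    return clicked_at + window if window else None
```

The list returned by `build_action_blocks` would be passed as the `blocks` field of the message. Resolved and Escalate return None here because they trigger follow-up work (post-mortem template, on-call ping and ticket) rather than a mute window.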
The point
3 channels (critical / high / logs), a severity-first format, 1 alert per incident rather than per row, and acknowledge buttons. Setup of a single pipeline = ~30 min. Setup of an entire stack (5-10 pipelines) = half a day. ROI: in the first 2 weeks of production the team saves 10+ hours it would otherwise spend working out what is happening.