Kafka Alerting Best Practices: Beyond Threshold Monitoring
A well-designed alert is a precise instrument. A poorly designed one is a noise generator that trains engineers to ignore it. This guide covers alert rule design, severity levels, notification routing, and proven techniques to reduce alert fatigue without missing incidents.
The Alert Fatigue Problem
Research consistently shows that on-call engineers who receive more than 5–10 alerts per shift begin ignoring them. When every alert is urgent, no alert is urgent. The goal of a good alerting strategy is not to catch every anomaly — it is to ensure that every alert that fires demands and deserves immediate human attention.
The Alert Quality Test
Before creating any alert rule, answer these three questions:
1. Does the condition indicate current or imminent user-visible impact?
2. Is there a concrete action a responder can take right now?
3. Does the response require a human rather than automation?
If you cannot answer yes to all three, the alert belongs in a dashboard — not a PagerDuty notification.
Severity Level Design
Critical (P1) — Page immediately
User-visible impact is current or imminent. Requires immediate response at any hour.
High (P2) — Notify within 30 min
Degraded performance or growing risk. Requires attention during business hours or within 30 minutes if sustained.
Warning (P3) — Next business day
Informational signals that may indicate emerging issues. Review during working hours.
Multi-Condition Alert Rules
Single-metric thresholds are the leading source of false positives. Combining conditions dramatically improves precision — a lag of 50,000 records during a known batch window is fine; 50,000 records that have been growing for 20 minutes is not.
Pattern 1: Lag + Rate of Change
# Fire only when lag is high AND still growing
# (prevents false alarms during normal batch catch-up)
ALERT ConsumerLagCritical
IF consumer_lag > 50000
AND deriv(consumer_lag[10m]) > 500 # gauge still growing, by > 500 records/sec
FOR 5m
LABELS { severity="critical" }
ANNOTATIONS {
summary = "{{ $labels.consumer_group }} lag growing rapidly",
description = "Lag for {{ $labels.consumer_group }} is {{ $value }} records and still rising"
}

Pattern 2: Disk + Growth Rate
# Alert when disk is high AND will fill within 6 hours
ALERT BrokerDiskCritical
IF broker_disk_used_percent > 80
AND predict_linear(broker_disk_used_bytes[2h], 6 * 3600) > broker_disk_total_bytes
FOR 10m
LABELS { severity="critical" }

Pattern 3: Sustained Deviation
# Alert when produce latency p99 is 3× baseline for 5 consecutive minutes
# (Not just a momentary spike)
ALERT ProduceLatencyHigh
IF produce_latency_p99 > (
avg_over_time(produce_latency_p99[1h] offset 1h) * 3
)
FOR 5m
LABELS { severity="high" }

Notification Routing Strategy
| Severity | Channel | Escalation |
|---|---|---|
| Critical (P1) | PagerDuty page + Slack #incidents | Auto-escalate to manager after 10 min no ack |
| High (P2) | Slack #kafka-alerts + email | Escalate to P1 if sustained for 30 min |
| Warning (P3) | Slack #kafka-alerts (low priority) | Review in weekly operations meeting |
| Info | Dashboard only (no notification) | No escalation |
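A routing table like this maps naturally onto an Alertmanager routing tree. Below is a minimal sketch; the receiver names are illustrative, and the time-based escalation in the table (manager after 10 minutes without acknowledgment) lives in PagerDuty's escalation policy, not in Alertmanager itself:

```yaml
route:
  receiver: dashboard-only              # default: Info-level alerts notify no one
  group_by: [alertname, consumer_group]
  routes:
    - matchers: ['severity = "critical"']
      receiver: pagerduty-incidents     # page + post to Slack #incidents
    - matchers: ['severity = "high"']
      receiver: slack-kafka-alerts      # Slack #kafka-alerts + email
    - matchers: ['severity = "warning"']
      receiver: slack-kafka-alerts
receivers:
  - name: pagerduty-incidents
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: slack-kafka-alerts
    slack_configs:
      - channel: "#kafka-alerts"
  - name: dashboard-only                # no configs: notifications are dropped
```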
Route by Namespace, Not Globally
Route alerts to the team that owns the affected topic or consumer group — not to a single shared on-call rotation. The payments team should receive alerts about the payments.orders consumer group, not the platform team.
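In Alertmanager terms, that means matching on a team or namespace label attached by the alert rules (for example, derived from consumer group naming conventions) rather than on severity alone. A minimal sketch — label values and receiver names here are illustrative:

```yaml
route:
  routes:
    - matchers: ['team = "payments"']   # payments.orders alerts go to payments on-call
      receiver: payments-oncall
    - matchers: ['team = "platform"']   # broker/cluster-level alerts stay with platform
      receiver: platform-oncall
```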
Alert Anti-Patterns to Avoid
Alerting on every metric
Not every metric deviation requires human intervention. Broker CPU at 40% is fine. Alerting on it trains engineers to ignore all alerts.
Zero-duration alerts (no FOR clause)
Transient spikes lasting 30 seconds should not page an engineer at 3 AM. Always add a minimum sustained duration (5–10 minutes) before firing.
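In current Prometheus (2.x+) rule files, this guard is the for field — the same idea the legacy ALERT ... FOR syntax above expresses. A minimal sketch, assuming the same consumer_lag gauge:

```yaml
groups:
  - name: kafka-consumer-alerts
    rules:
      - alert: ConsumerLagCritical
        expr: consumer_lag > 50000
        for: 5m          # condition must hold for 5 consecutive minutes;
                         # without 'for', a single bad scrape can page someone
        labels:
          severity: critical
```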
Duplicate alerts from overlapping rules
If a P1 rule fires and a P2 rule fires for the same condition, engineers receive two notifications for the same incident. Group conditions into a single rule.
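Where overlapping rules cannot be merged, Alertmanager's inhibition rules are a fallback: suppress the lower-severity notification while the higher-severity one is firing for the same scope. A sketch, assuming alerts carry a consumer_group label:

```yaml
inhibit_rules:
  - source_matchers: ['severity = "critical"']  # while a P1 is firing...
    target_matchers: ['severity = "high"']      # ...silence the matching P2
    equal: [consumer_group]                     # for the same consumer group
```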
Static thresholds without business context
A consumer lag of 5,000 might be catastrophic for a real-time fraud detection system and completely normal for a weekly reporting pipeline. Context must be in the threshold.
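One way to encode that context without a separate rule per pipeline is to publish the threshold itself as a metric and join against it. A hypothetical sketch — consumer_lag_threshold is an assumed recorded series, not a standard Kafka exporter metric:

```yaml
groups:
  - name: kafka-lag-budgets
    rules:
      # Per-pipeline lag budgets, recorded as a labeled metric
      - record: consumer_lag_threshold
        expr: vector(1000)
        labels:
          consumer_group: fraud-detection   # real-time: tight budget
      - record: consumer_lag_threshold
        expr: vector(5000000)
        labels:
          consumer_group: weekly-reports    # batch: generous budget
      # Compare each group's lag against its own budget
      - alert: ConsumerLagOverBudget
        expr: consumer_lag > on(consumer_group) group_left consumer_lag_threshold
        for: 10m
        labels:
          severity: high
```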
Key Takeaways
- A page-worthy alert must be urgent, actionable, and tied to user impact; everything else belongs on a dashboard.
- Combine conditions (level + rate of change, usage + forecast, deviation + duration) to cut false positives.
- Always require a sustained duration before firing; transient spikes should never page anyone.
- Route by severity and by owning team, not through a single global channel.
- Audit rules regularly for the anti-patterns above: over-alerting, zero-duration rules, duplicate notifications, and context-free thresholds.
Intelligent Alerting Built Into KLogic
KLogic includes pre-built alert rules for all critical Kafka metrics, multi-condition rule support, namespace-based routing, and anomaly-detection-powered thresholds that adapt to your traffic patterns automatically.