
Kafka Alerting Best Practices: Beyond Threshold Monitoring

A well-designed alert is a precise instrument. A poorly-designed one is a noise generator that trains engineers to ignore it. This guide covers alert rule design, severity levels, notification routing, and proven techniques to reduce alert fatigue without missing incidents.

Published: January 2, 2025 • 11 min read • Alerting & Observability

The Alert Fatigue Problem

Research consistently shows that on-call engineers who receive more than 5–10 alerts per shift begin ignoring them. When every alert is urgent, no alert is urgent. The goal of a good alerting strategy is not to catch every anomaly — it is to ensure that every alert that fires demands and deserves immediate human attention.

The Alert Quality Test

Before creating any alert rule, answer these three questions:

1. Is there a clear action the on-call engineer must take when this fires?
2. Would missing this alert cause user-visible impact within 15 minutes?
3. Is the threshold precise enough that false positives are rare (< 5% of firings)?

If you cannot answer yes to all three, the alert belongs in a dashboard — not a PagerDuty notification.
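As a concrete illustration, the three-question test can be encoded as a tiny triage function. This is a sketch — the function and parameter names are hypothetical, not part of any tool's API:

```python
# Hypothetical triage helper encoding the three-question alert quality test.
def alert_destination(has_clear_action: bool,
                      user_impact_within_15_min: bool,
                      false_positive_rate: float) -> str:
    """Return 'page' only when all three questions are answered yes;
    otherwise the signal belongs on a dashboard."""
    if (has_clear_action
            and user_impact_within_15_min
            and false_positive_rate < 0.05):
        return "page"
    return "dashboard"
```

Running every proposed rule through a gate like this keeps the paging channel reserved for alerts that pass all three questions.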

Severity Level Design

Critical (P1) — Page immediately

User-visible impact is current or imminent. Requires immediate response at any hour.

Consumer lag > 30 minutes sustained for 5 min
Under-replicated partitions > 0 for 2 min
Active controller count != 1
Produce error rate > 1% for 3 min
All brokers for any partition offline
Consumer group in DEAD state

High (P2) — Notify within 30 min

Degraded performance or growing risk. Requires attention during business hours or within 30 minutes if sustained.

Consumer lag growing steadily for 15 min
Broker disk usage > 80%
Produce p99 latency > 2× baseline
Replication lag > 10,000 messages
Consumer group rebalancing for > 5 min
Network errors on any broker

Warning (P3) — Next business day

Informational signals that may indicate emerging issues. Review during working hours.

Broker disk usage > 65%
Any topic has zero consumers
Consumer lag increasing slowly
Partition count imbalance > 20%
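Taken together, the disk-usage tiers above amount to a small classifier. A minimal sketch (function name hypothetical):

```python
def disk_severity(used_percent: float) -> str:
    """Map broker disk usage to the severity tiers above."""
    if used_percent > 80:
        return "high"     # P2: notify within 30 min
    if used_percent > 65:
        return "warning"  # P3: next business day
    return "ok"           # dashboard only
```

Keeping the tier boundaries in one place like this makes them easy to review and adjust as retention or traffic changes.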

Multi-Condition Alert Rules

Single-metric thresholds produce the most false positives. Combining conditions dramatically improves precision — a lag of 50,000 records during a known batch window is fine; 50,000 records that have been growing for 20 minutes is not.

Pattern 1: Lag + Rate of Change

# Fire only when lag is high AND still growing
# (prevents false alarms during normal batch catch-up)

ALERT ConsumerLagCritical
  IF consumer_lag > 50000
     AND deriv(consumer_lag[10m]) > 500  # still growing, faster than 500 records/sec
                                         # (deriv, not rate: lag is a gauge, not a counter)
  FOR 5m
  LABELS { severity="critical" }
  ANNOTATIONS {
    summary = "{{ $labels.consumer_group }} lag growing rapidly",
    description = "Lag: {{ $value }} records and still climbing"
  }

A batch job that creates 100K lag and catches up within 10 min will never fire this alert.

Pattern 2: Disk + Growth Rate

# Alert when disk is high AND will fill within 6 hours
ALERT BrokerDiskCritical
  IF broker_disk_used_percent > 80
     AND predict_linear(broker_disk_used_bytes[2h], 6 * 3600) > broker_disk_total_bytes
  FOR 10m
  LABELS { severity="critical" }
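PromQL's predict_linear() is a least-squares linear extrapolation; the same projection can be reproduced outside Prometheus, for example to sanity-check a threshold before deploying it. A sketch assuming hourly (timestamp_seconds, bytes_used) samples:

```python
def predict_linear(samples, horizon_s):
    """Least-squares linear extrapolation over (t, value) samples,
    mirroring PromQL's predict_linear(): projected value horizon_s
    seconds after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_s)

# Disk growing 1 GiB per hour from 800 GiB, on a 1 TiB volume:
samples = [(h * 3600, (800 + h) * 2**30) for h in range(3)]  # 2h of hourly samples
projected = predict_linear(samples, 6 * 3600)  # project 6h ahead: ~808 GiB, below 1 TiB
```

At this growth rate the projection stays under the volume size, so the rule above would correctly stay silent despite disk usage being high.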

Pattern 3: Sustained Deviation

# Alert when produce latency p99 is 3× baseline for 5 consecutive minutes
# (Not just a momentary spike)
ALERT ProduceLatencyHigh
  IF produce_latency_p99 > (
    avg_over_time(produce_latency_p99[1h] offset 1h) * 3
  )
  FOR 5m
  LABELS { severity="high" }
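The baseline comparison in this rule is simply "current p99 versus 3× the hour-ago average". Sketched in plain Python (names hypothetical):

```python
def latency_breached(current_p99_ms: float, baseline_samples_ms: list) -> bool:
    """True when p99 exceeds 3x the hour-ago baseline average,
    mirroring the avg_over_time(...) * 3 condition above."""
    baseline = sum(baseline_samples_ms) / len(baseline_samples_ms)
    return current_p99_ms > 3 * baseline
```

Because the baseline moves with your traffic, the same rule works for a cluster whose normal p99 is 5 ms and one whose normal p99 is 50 ms.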

Notification Routing Strategy

Severity      | Channel                            | Escalation
Critical (P1) | PagerDuty page + Slack #incidents  | Auto-escalate to manager after 10 min with no ack
High (P2)     | Slack #kafka-alerts + email        | Escalate to P1 if sustained for 30 min
Warning (P3)  | Slack #kafka-alerts (low priority) | Review in weekly operations meeting
Info          | Dashboard only (no notification)   | No escalation
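The routing table maps directly onto a small data structure. A sketch — the channel identifiers are invented for illustration, not KLogic configuration:

```python
# Severity -> notification channels and escalation deadline (minutes).
ROUTES = {
    "critical": {"channels": ["pagerduty", "slack:#incidents"], "escalate_after_min": 10},
    "high":     {"channels": ["slack:#kafka-alerts", "email"],  "escalate_after_min": 30},
    "warning":  {"channels": ["slack:#kafka-alerts"],           "escalate_after_min": None},
    "info":     {"channels": [],                                "escalate_after_min": None},
}

def channels_for(severity: str) -> list:
    """Look up where a notification of the given severity should be sent."""
    return ROUTES[severity]["channels"]
```

Keeping routing declarative like this makes it auditable: one table answers "who gets paged for what, and when does it escalate."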

Route by Namespace, Not Globally

Route alerts to the team that owns the affected topic or consumer group — not to a single shared on-call rotation. The payments team should receive alerts about the payments.orders consumer group, not the platform team.
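One way to implement ownership-based routing is a prefix match on the consumer group name. A minimal sketch — the team names and prefixes here are invented for illustration:

```python
# Invented ownership map: consumer-group prefix -> owning team.
OWNERS = {
    "payments.": "team-payments",
    "fraud.":    "team-fraud",
}

def owning_team(consumer_group: str, default: str = "team-platform") -> str:
    """Route e.g. payments.orders to team-payments instead of a global rotation."""
    for prefix, team in OWNERS.items():
        if consumer_group.startswith(prefix):
            return team
    return default
```

Unowned resources fall through to a default team, which doubles as a worklist: anything routed there needs an owner assigned.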

Alert Anti-Patterns to Avoid

Alerting on every metric

Not every metric deviation requires human intervention. Broker CPU at 40% is fine. Alerting on it ensures engineers start ignoring all alerts.

Zero-duration alerts (no FOR clause)

Transient spikes lasting 30 seconds should not page an engineer at 3 AM. Always add a minimum sustained duration (5–10 minutes) before firing.
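This is exactly what the FOR clause provides; the same debounce can be expressed as a tiny state machine, for example in a custom alert evaluator. A sketch:

```python
class SustainedCondition:
    """Fire only after the condition has held continuously for `for_seconds`
    (a plain-Python equivalent of a FOR clause)."""

    def __init__(self, for_seconds: float):
        self.for_seconds = for_seconds
        self.pending_since = None  # timestamp when the breach started

    def observe(self, breached: bool, now: float) -> bool:
        if not breached:
            self.pending_since = None  # any recovery resets the clock
            return False
        if self.pending_since is None:
            self.pending_since = now
        return now - self.pending_since >= self.for_seconds
```

A 30-second spike resets the clock before the duration elapses, so no notification is ever sent.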

Duplicate alerts from overlapping rules

If a P1 rule fires and a P2 rule fires for the same condition, engineers receive two notifications for the same incident. Group conditions into a single rule.

Static thresholds without business context

A consumer lag of 5,000 might be catastrophic for a real-time fraud detection system and completely normal for a weekly reporting pipeline. That business context must be encoded in the threshold itself, per topic or per consumer group.
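One way to encode that context is a per-pipeline threshold table instead of one global number. A sketch — the group names and values are invented for illustration:

```python
# Invented per-pipeline lag thresholds (records) instead of one global number.
LAG_THRESHOLDS = {
    "fraud.detection":  1_000,      # real-time: seconds of delay matter
    "payments.orders":  10_000,
    "reporting.weekly": 5_000_000,  # batch: large lag is expected
}

def lag_breached(consumer_group: str, lag: int, default: int = 50_000) -> bool:
    """Compare lag against the owning pipeline's threshold, not a global one."""
    return lag > LAG_THRESHOLDS.get(consumer_group, default)
```

The same lag value now fires for the fraud pipeline and stays silent for the reporting pipeline, which is the behavior the business actually wants.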

Key Takeaways

Every alert must have a clear required action — if there is no action, it belongs in a dashboard.
Multi-condition rules (lag AND rate of change) cut false positives by 60–80% compared to single-metric thresholds.
Use a minimum sustained duration (FOR 5m) to filter transient spikes before firing.
Route alerts to the team that owns the affected resource, not to a single global rotation.
Review alert firing history monthly — any rule firing more than 20% false positives needs a threshold adjustment.
Silence windows for known batch jobs and maintenance windows are not workarounds — they are good engineering.
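On the last point, a silence window can be a first-class, reviewable object rather than a disabled rule. A sketch with an invented batch window:

```python
from datetime import datetime, timezone

# Invented silence windows for known batch jobs (UTC hours, group prefix).
SILENCES = [
    {"group_prefix": "reporting.", "start_hour": 2, "end_hour": 4},
]

def is_silenced(consumer_group: str, at: datetime) -> bool:
    """Suppress alerts during known batch windows instead of widening thresholds."""
    for s in SILENCES:
        if (consumer_group.startswith(s["group_prefix"])
                and s["start_hour"] <= at.hour < s["end_hour"]):
            return True
    return False
```

Because the window is explicit data, it shows up in review alongside the alert rules instead of living in someone's head.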

Intelligent Alerting Built Into KLogic

KLogic includes pre-built alert rules for all critical Kafka metrics, multi-condition rule support, namespace-based routing, and anomaly-detection-powered thresholds that adapt to your traffic patterns automatically.

Pre-built critical alert rules
Multi-condition rule editor
Slack, PagerDuty, email routing
Anomaly-based adaptive thresholds