
Kafka Alerting Best Practices: Beyond Threshold Monitoring

A well-designed alert is a precise instrument. A poorly-designed one is a noise generator that trains engineers to ignore it. This guide covers alert rule design, severity levels, notification routing, and proven techniques to reduce alert fatigue without missing incidents.

Published: January 2, 2025 • 11 min read • Alerting & Observability

The Alert Fatigue Problem

Research consistently shows that on-call engineers who receive more than 5–10 alerts per shift begin ignoring them. When every alert is urgent, no alert is urgent. The goal of a good alerting strategy is not to catch every anomaly — it is to ensure that every alert that fires demands and deserves immediate human attention.

The Alert Quality Test

Before creating any alert rule, answer these three questions:

1. Is there a clear action the on-call engineer must take when this fires?
2. Would missing this alert cause user-visible impact within 15 minutes?
3. Is the threshold precise enough that false positives are rare (< 5% of firings)?

If you cannot answer yes to all three, the alert belongs in a dashboard — not a PagerDuty notification.
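As a concrete illustration, the three-question test can be encoded as a tiny triage function. This is a sketch — the function and parameter names are hypothetical, not part of any tool's API:

```python
# Hypothetical triage helper encoding the three-question alert quality test.
def alert_destination(has_clear_action: bool,
                      user_impact_within_15_min: bool,
                      false_positive_rate: float) -> str:
    """Return 'page' only when all three questions are answered yes;
    otherwise the signal belongs on a dashboard."""
    if (has_clear_action
            and user_impact_within_15_min
            and false_positive_rate < 0.05):
        return "page"
    return "dashboard"
```

Running every proposed rule through a gate like this keeps the paging channel reserved for alerts that pass all three questions.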

Severity Level Design

Critical (P1) — Page immediately

User-visible impact is current or imminent. Requires immediate response at any hour.

Consumer lag > 30 minutes sustained for 5 min
Under-replicated partitions > 0 for 2 min
Active controller count != 1
Produce error rate > 1% for 3 min
All brokers for any partition offline
Consumer group in DEAD state

High (P2) — Notify within 30 min

Degraded performance or growing risk. Requires attention during business hours or within 30 minutes if sustained.

Consumer lag growing steadily for 15 min
Broker disk usage > 80%
Produce p99 latency > 2× baseline
Replication lag > 10,000 messages
Consumer group rebalancing for > 5 min
Network errors on any broker

Warning (P3) — Next business day

Informational signals that may indicate emerging issues. Review during working hours.

Broker disk usage > 65%
Any topic has zero consumers
Consumer lag increasing slowly
Partition count imbalance > 20%
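Taken together, the disk-usage tiers above amount to a small classifier. A minimal sketch (function name hypothetical):

```python
def disk_severity(used_percent: float) -> str:
    """Map broker disk usage to the severity tiers above."""
    if used_percent > 80:
        return "high"     # P2: notify within 30 min
    if used_percent > 65:
        return "warning"  # P3: next business day
    return "ok"           # dashboard only
```

Keeping the tier boundaries in one place like this makes them easy to review and adjust as retention or traffic changes.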

Multi-Condition Alert Rules

Single-metric thresholds produce the most false positives. Combining conditions dramatically improves precision — a lag of 50,000 records during a known batch window is fine; 50,000 records that have been growing for 20 minutes is not.

Pattern 1: Lag + Rate of Change

# Fire only when lag is high AND still growing
# (prevents false alarms during normal batch catch-up)

ALERT ConsumerLagCritical
  IF consumer_lag > 50000
     AND deriv(consumer_lag[10m]) > 500  # still growing, faster than 500 records/sec
                                         # (deriv, not rate: lag is a gauge, not a counter)
  FOR 5m
  LABELS { severity="critical" }
  ANNOTATIONS {
    summary = "{{ $labels.consumer_group }} lag growing rapidly",
    description = "Lag: {{ $value }} records and still climbing"
  }

A batch job that creates 100K lag and catches up within 10 min will never fire this alert.

Pattern 2: Disk + Growth Rate

# Alert when disk is high AND will fill within 6 hours
ALERT BrokerDiskCritical
  IF broker_disk_used_percent > 80
     AND predict_linear(broker_disk_used_bytes[2h], 6 * 3600) > broker_disk_total_bytes
  FOR 10m
  LABELS { severity="critical" }
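PromQL's predict_linear() is a least-squares linear extrapolation; the same projection can be reproduced outside Prometheus, for example to sanity-check a threshold before deploying it. A sketch assuming hourly (timestamp_seconds, bytes_used) samples:

```python
def predict_linear(samples, horizon_s):
    """Least-squares linear extrapolation over (t, value) samples,
    mirroring PromQL's predict_linear(): projected value horizon_s
    seconds after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_s)

# Disk growing 1 GiB per hour from 800 GiB, on a 1 TiB volume:
samples = [(h * 3600, (800 + h) * 2**30) for h in range(3)]  # 2h of hourly samples
projected = predict_linear(samples, 6 * 3600)  # project 6h ahead: ~808 GiB, below 1 TiB
```

At this growth rate the projection stays under the volume size, so the rule above would correctly stay silent despite disk usage being high.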

Pattern 3: Sustained Deviation

# Alert when produce latency p99 is 3× baseline for 5 consecutive minutes
# (Not just a momentary spike)
ALERT ProduceLatencyHigh
  IF produce_latency_p99 > (
    avg_over_time(produce_latency_p99[1h] offset 1h) * 3
  )
  FOR 5m
  LABELS { severity="high" }
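The baseline comparison in this rule is simply "current p99 versus 3× the hour-ago average". Sketched in plain Python (names hypothetical):

```python
def latency_breached(current_p99_ms: float, baseline_samples_ms: list) -> bool:
    """True when p99 exceeds 3x the hour-ago baseline average,
    mirroring the avg_over_time(...) * 3 condition above."""
    baseline = sum(baseline_samples_ms) / len(baseline_samples_ms)
    return current_p99_ms > 3 * baseline
```

Because the baseline moves with your traffic, the same rule works for a cluster whose normal p99 is 5 ms and one whose normal p99 is 50 ms.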

Notification Routing Strategy

Severity      | Channel                            | Escalation
Critical (P1) | PagerDuty page + Slack #incidents  | Auto-escalate to manager after 10 min with no ack
High (P2)     | Slack #kafka-alerts + email        | Escalate to P1 if sustained for 30 min
Warning (P3)  | Slack #kafka-alerts (low priority) | Review in weekly operations meeting
Info          | Dashboard only (no notification)   | No escalation
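The routing table maps directly onto a small data structure. A sketch — the channel identifiers are invented for illustration, not KLogic configuration:

```python
# Severity -> notification channels and escalation deadline (minutes).
ROUTES = {
    "critical": {"channels": ["pagerduty", "slack:#incidents"], "escalate_after_min": 10},
    "high":     {"channels": ["slack:#kafka-alerts", "email"],  "escalate_after_min": 30},
    "warning":  {"channels": ["slack:#kafka-alerts"],           "escalate_after_min": None},
    "info":     {"channels": [],                                "escalate_after_min": None},
}

def channels_for(severity: str) -> list:
    """Look up where a notification of the given severity should be sent."""
    return ROUTES[severity]["channels"]
```

Keeping routing declarative like this makes it auditable: one table answers "who gets paged for what, and when does it escalate."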

Route by Namespace, Not Globally

Route alerts to the team that owns the affected topic or consumer group — not to a single shared on-call rotation. The payments team should receive alerts about the payments.orders consumer group, not the platform team.
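One way to implement ownership-based routing is a prefix match on the consumer group name. A minimal sketch — the team names and prefixes here are invented for illustration:

```python
# Invented ownership map: consumer-group prefix -> owning team.
OWNERS = {
    "payments.": "team-payments",
    "fraud.":    "team-fraud",
}

def owning_team(consumer_group: str, default: str = "team-platform") -> str:
    """Route e.g. payments.orders to team-payments instead of a global rotation."""
    for prefix, team in OWNERS.items():
        if consumer_group.startswith(prefix):
            return team
    return default
```

Unowned resources fall through to a default team, which doubles as a worklist: anything routed there needs an owner assigned.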

Alert Anti-Patterns to Avoid

Alerting on every metric

Not every metric deviation requires human intervention. Broker CPU at 40% is fine. Alerting on it ensures engineers start ignoring all alerts.

Zero-duration alerts (no FOR clause)

Transient spikes lasting 30 seconds should not page an engineer at 3 AM. Always add a minimum sustained duration (5–10 minutes) before firing.
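This is exactly what the FOR clause provides; the same debounce can be expressed as a tiny state machine, for example in a custom alert evaluator. A sketch:

```python
class SustainedCondition:
    """Fire only after the condition has held continuously for `for_seconds`
    (a plain-Python equivalent of a FOR clause)."""

    def __init__(self, for_seconds: float):
        self.for_seconds = for_seconds
        self.pending_since = None  # timestamp when the breach started

    def observe(self, breached: bool, now: float) -> bool:
        if not breached:
            self.pending_since = None  # any recovery resets the clock
            return False
        if self.pending_since is None:
            self.pending_since = now
        return now - self.pending_since >= self.for_seconds
```

A 30-second spike resets the clock before the duration elapses, so no notification is ever sent.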

Duplicate alerts from overlapping rules

If a P1 rule fires and a P2 rule fires for the same condition, engineers receive two notifications for the same incident. Group conditions into a single rule.

Static thresholds without business context

A consumer lag of 5,000 might be catastrophic for a real-time fraud detection system and completely normal for a weekly reporting pipeline. That business context must be encoded in the threshold itself, per topic or per consumer group.
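One way to encode that context is a per-pipeline threshold table instead of one global number. A sketch — the group names and values are invented for illustration:

```python
# Invented per-pipeline lag thresholds (records) instead of one global number.
LAG_THRESHOLDS = {
    "fraud.detection":  1_000,      # real-time: seconds of delay matter
    "payments.orders":  10_000,
    "reporting.weekly": 5_000_000,  # batch: large lag is expected
}

def lag_breached(consumer_group: str, lag: int, default: int = 50_000) -> bool:
    """Compare lag against the owning pipeline's threshold, not a global one."""
    return lag > LAG_THRESHOLDS.get(consumer_group, default)
```

The same lag value now fires for the fraud pipeline and stays silent for the reporting pipeline, which is the behavior the business actually wants.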

Key Takeaways

Every alert must have a clear required action — if there is no action, it belongs in a dashboard.
Multi-condition rules (lag AND rate of change) cut false positives by 60–80% compared to single-metric thresholds.
Use a minimum sustained duration (FOR 5m) to filter transient spikes before firing.
Route alerts to the team that owns the affected resource, not to a single global rotation.
Review alert firing history monthly — any rule firing more than 20% false positives needs a threshold adjustment.
Silence windows for known batch jobs and maintenance windows are not workarounds — they are good engineering.
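On the last point, a silence window can be a first-class, reviewable object rather than a disabled rule. A sketch with an invented batch window:

```python
from datetime import datetime, timezone

# Invented silence windows for known batch jobs (UTC hours, group prefix).
SILENCES = [
    {"group_prefix": "reporting.", "start_hour": 2, "end_hour": 4},
]

def is_silenced(consumer_group: str, at: datetime) -> bool:
    """Suppress alerts during known batch windows instead of widening thresholds."""
    for s in SILENCES:
        if (consumer_group.startswith(s["group_prefix"])
                and s["start_hour"] <= at.hour < s["end_hour"]):
            return True
    return False
```

Because the window is explicit data, it shows up in review alongside the alert rules instead of living in someone's head.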

Intelligent Alerting Built Into KLogic

KLogic includes pre-built alert rules for all critical Kafka metrics, multi-condition rule support, namespace-based routing, and anomaly-detection-powered thresholds that adapt to your traffic patterns automatically.

Pre-built critical alert rules
Multi-condition rule editor
Slack, PagerDuty, email routing
Anomaly-based adaptive thresholds