Anomaly Detection for Kafka Clusters
Static thresholds miss the anomalies that matter most. Learn how MAD, Z-score, and IQR algorithms automatically adapt to your cluster's baseline, catch subtle degradations early, and fire alerts before users notice anything is wrong.
Why Static Thresholds Are Not Enough
A static alert rule like "alert if consumer lag exceeds 10,000" sounds sensible until you realize that 10,000 records might be completely normal at 3 AM on a Monday but catastrophically high at 2 PM on a Friday. Static thresholds either fire too often (alert fatigue) or miss real problems in low-traffic windows.
The Three Core Algorithms
MAD (Median Absolute Deviation)
MAD measures how far each data point deviates from the median, then computes the median of those deviations. Unlike standard deviation, MAD is robust to outliers — a single extreme value does not distort the baseline.
# MAD formula
MAD = median(|Xi - median(X)|)
# Anomaly score
score = |current_value - median| / (1.4826 * MAD)
# Alert if score > threshold (typically 3.5)
if score > 3.5:
    fire_alert(metric, current_value)

Best for:
Consumer lag, produce rate, message throughput — metrics that spike transiently and have skewed distributions.
Lookback window:
7–14 days of historical data provides a stable median while remaining sensitive to gradual shifts.
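Putting the pieces together, a minimal MAD detector can be sketched in plain Python (the `mad_score` helper and the toy lag history are illustrative, not part of any library):

```python
import statistics

def mad_score(history, current_value):
    """Robust anomaly score based on Median Absolute Deviation."""
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    if mad == 0:
        # Constant history: any deviation from the median is fully anomalous
        return float("inf") if current_value != med else 0.0
    # 1.4826 scales MAD to match standard deviation for normal data
    return abs(current_value - med) / (1.4826 * mad)

# 7-14 days of hourly consumer-lag samples would go here; toy data:
lag_history = [900, 1100, 1000, 950, 1050, 1020, 980, 1010]
score = mad_score(lag_history, 5000)  # sustained spike scores far above 3.5
```

Note how the single computation stays robust: even if `lag_history` contained one extreme outlier, the median of deviations would barely move.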
Z-Score (Standard Score)
Z-score expresses how many standard deviations a value is from the mean. It works best on metrics that follow approximately normal distributions and do not have heavy outliers that would inflate the standard deviation.
# Z-score formula
z = (current_value - mean) / std_dev
# Alert by severity, most severe first
if abs(z) > 5.0:
    fire_emergency_alert()
elif abs(z) > 3.5:
    fire_critical_alert()
elif abs(z) > 2.5:
    fire_warning_alert()
Best for:
Request error rates, broker CPU utilization, replication lag — metrics with stable variance around a predictable mean.
Caution:
Sensitive to outliers in the training window. Exclude known maintenance events from the baseline calculation.
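A sketch of the baseline calculation with exclusion, where the `exclude` parameter is a hypothetical stand-in for however maintenance samples are marked in your pipeline:

```python
import statistics

def z_score(history, current_value, exclude=()):
    """Z-score with known maintenance samples excluded from the baseline."""
    baseline = [x for i, x in enumerate(history) if i not in exclude]
    mean = statistics.fmean(baseline)
    std = statistics.stdev(baseline)
    return (current_value - mean) / std

# Toy error-rate history; index 3 was a planned failover test
error_rates = [0.010, 0.012, 0.011, 0.300, 0.009, 0.011]
z = z_score(error_rates, 0.05, exclude={3})      # clearly anomalous
z_raw = z_score(error_rates, 0.05)               # outlier inflates std_dev
```

With the failover sample left in, the inflated standard deviation hides a genuine fivefold error-rate jump; with it excluded, the same value scores well past the critical threshold.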
IQR (Interquartile Range)
IQR defines the "normal" range as the middle 50% of historical values (between the 25th and 75th percentiles). Values outside the fence — typically 1.5×IQR beyond each quartile — are flagged as anomalies. IQR is intuitive and requires no distributional assumptions.
# IQR formula
IQR = Q3 - Q1
lower_fence = Q1 - (1.5 * IQR)
upper_fence = Q3 + (1.5 * IQR)
# Strict anomaly (Tukey extreme outlier)
lower_extreme = Q1 - (3.0 * IQR)
upper_extreme = Q3 + (3.0 * IQR)
if current_value > upper_extreme:
    fire_critical_alert()

Best for:
Partition count changes, replica counts, topic creation rate — operational metrics that change infrequently but matter when they do.
Lookback window:
30 days gives a stable quartile range for slowly changing operational metrics.
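A minimal fence calculation using only the standard library (`iqr_fences` is an illustrative helper; note that `statistics.quantiles` uses the exclusive method by default, so exact cut points can differ slightly from other tools):

```python
import statistics

def iqr_fences(history, k=1.5):
    """Tukey fences around the middle 50% of historical values."""
    q1, _, q3 = statistics.quantiles(history, n=4)  # quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# 30 days of daily partition-count samples would go here; toy data:
partition_counts = [120, 121, 120, 122, 121, 120, 123, 121]
lower, upper = iqr_fences(partition_counts, k=3.0)  # extreme-outlier fence

def is_anomaly(value):
    return value < lower or value > upper
```

Because the fences come from percentiles, no assumption about the distribution's shape is needed, which suits step-like operational metrics.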
Handling Seasonal Traffic Patterns
Most Kafka clusters exhibit strong seasonal patterns: higher throughput during business hours, traffic spikes on weekdays, batch jobs at midnight. Ignoring seasonality causes anomaly detectors to fire false positives every Monday morning and miss real anomalies on quiet weekends.
Time-of-Day Bucketing
Maintain separate baseline statistics for each hour of the week (168 buckets). Compare Monday 9 AM traffic only against previous Monday 9 AM measurements, not against an all-time average that includes weekend quiet periods.
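One way to sketch the 168-bucket scheme is to key baseline storage by `(weekday, hour)`; the helper names below are illustrative:

```python
from collections import defaultdict
from datetime import datetime, timezone

# 168 baselines, one per hour of the week, keyed by (weekday, hour)
buckets = defaultdict(list)

def bucket_key(ts: datetime):
    return (ts.weekday(), ts.hour)  # weekday(): 0 = Monday

def record(ts, value):
    buckets[bucket_key(ts)].append(value)

def baseline_for(ts):
    """History to score against: same weekday and hour only."""
    return buckets[bucket_key(ts)]

record(datetime(2024, 5, 6, 9, tzinfo=timezone.utc), 50_000)   # Monday 9 AM
record(datetime(2024, 5, 13, 9, tzinfo=timezone.utc), 52_000)  # next Monday 9 AM
monday_9am = baseline_for(datetime(2024, 5, 20, 9, tzinfo=timezone.utc))
```

Any of the three algorithms above can then run per bucket instead of over a single global window.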
Trend-Adjusted Baselines
If your cluster is growing 15% month-over-month, a static baseline will incorrectly flag normal growth as anomalous. Apply exponential smoothing or linear detrending to separate growth trend from genuine anomalies.
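Linear detrending can be sketched with an ordinary least-squares slope; the residuals, not the raw values, are what the detector should score (a minimal sketch, no library assumed):

```python
def detrend(series):
    """Remove a linear growth trend via least squares, return residuals."""
    n = len(series)
    xs = range(n)
    x_mean, y_mean = (n - 1) / 2, sum(series) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, series)) \
        / sum((x - x_mean) ** 2 for x in xs)
    return [y - slope * x for x, y in zip(xs, series)]

# Steady linear growth flattens out; only genuine deviations would remain
residuals = detrend([100, 110, 120, 130, 140, 150])
```

On a metric that is pure growth, the residuals are flat, so steady expansion no longer trips the anomaly score.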
Anomaly Score Smoothing
A single high anomaly score may be noise. Apply a rolling average over 3–5 consecutive anomaly scores before firing an alert. This dramatically reduces false positives from one-off metric spikes while preserving sensitivity to sustained deviations.
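A rolling-average gate along these lines (the `SmoothedAlert` class is illustrative):

```python
from collections import deque

class SmoothedAlert:
    """Fire only when the rolling mean of recent scores stays high."""

    def __init__(self, window=5, threshold=3.5):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, score):
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        # Require a full window so one early spike cannot fire alone
        return len(self.scores) == self.scores.maxlen and mean > self.threshold

alert = SmoothedAlert(window=3)
suppressed_spike = alert.observe(9.0)  # single spike: no alert yet
```

A one-off score of 9.0 is absorbed, while three consecutive scores above the threshold still fire promptly.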
Planned Event Suppression
Suppress anomaly alerts during planned maintenance windows, deployments, and known batch job windows. Feed these suppression windows back into the baseline calculator so maintenance events do not pollute historical training data.
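A minimal sketch of both halves, skipping alerts inside a window and filtering those samples out of the baseline (the window list and helper names are hypothetical):

```python
from datetime import datetime, timezone

# Hypothetical planned windows as (start, end) UTC pairs
MAINTENANCE = [
    (datetime(2024, 5, 4, 2, tzinfo=timezone.utc),
     datetime(2024, 5, 4, 4, tzinfo=timezone.utc)),
]

def suppressed(ts):
    return any(start <= ts < end for start, end in MAINTENANCE)

def should_alert(ts, anomalous):
    """Half one: never page during a planned window."""
    return anomalous and not suppressed(ts)

def clean_history(samples):
    """Half two: drop maintenance samples so they never enter the baseline."""
    return [(ts, v) for ts, v in samples if not suppressed(ts)]
```

Feeding `clean_history` output into the baseline calculators above keeps a maintenance spike from widening tomorrow's "normal" range.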
Predictive Alerting
Reactive anomaly detection tells you something is wrong now. Predictive alerting tells you something will be wrong in 2 hours — giving your team time to act before users are affected.
Disk Exhaustion Forecasting
Apply linear regression to the last 24 hours of disk utilization to project when a broker will run out of storage. Alert when projected exhaustion is within 48 hours, giving time to adjust retention settings or expand storage before data loss occurs.
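The two-point slope shown next is the simplest projection; fitting all 24 hourly samples by least squares is less sensitive to noise in the endpoint readings. A sketch, with illustrative function and variable names:

```python
def hours_to_full(samples, max_disk_bytes):
    """Project hours until disk full from (hour, bytes_used) samples.

    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    xs = [t for t, _ in samples]
    ys = [b for _, b in samples]
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)  # bytes per hour
    if slope <= 0:
        return None
    return (max_disk_bytes - ys[-1]) / slope

# Toy broker filling ~100 MB/hour against a 10 GB volume
samples = [(h, 4_000_000_000 + h * 100_000_000) for h in range(24)]
eta = hours_to_full(samples, 10_000_000_000)
```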
# Simple linear projection
slope = (latest_disk_usage - oldest_disk_usage) / time_window_hours  # bytes per hour
hours_to_full = (max_disk_bytes - current_bytes) / slope
if hours_to_full < 12:
    fire_critical("Broker {id} disk full in {hours_to_full:.0f}h")
elif hours_to_full < 48:
    fire_warning("Broker {id} disk full in {hours_to_full:.0f}h")

Consumer Lag Trend Detection
A consumer group is in danger not only when lag is high, but when lag is growing consistently — even if it has not crossed a threshold yet. Track the derivative (rate of change) of consumer lag alongside the absolute value.
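A stateful version of this check that counts consecutive growth samples might look like the following (the class name and sample cadence are illustrative):

```python
class LagVelocityMonitor:
    """Warn when consumer lag grows for N consecutive samples."""

    def __init__(self, consecutive=3):
        self.last_lag = None
        self.growth_streak = 0
        self.consecutive = consecutive

    def observe(self, lag):
        growing = self.last_lag is not None and lag > self.last_lag
        self.growth_streak = self.growth_streak + 1 if growing else 0
        self.last_lag = lag
        return self.growth_streak >= self.consecutive

monitor = LagVelocityMonitor(consecutive=3)
# Lag is low in absolute terms but climbing every sample
fired = [monitor.observe(lag) for lag in (100, 150, 210, 280)]
```

Even though 280 records of lag would never trip a static threshold, the unbroken growth streak does.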
# Compute lag velocity (records per minute)
lag_velocity = (lag_now - lag_5min_ago) / 5
# Alert on sustained positive velocity, even if lag is low
if all(v > 0 for v in last_15_lag_velocities):  # sustained for 15 minutes
    fire_warning("Consumer group {group} lag growing at +{lag_velocity}/min")
Anomaly Detection Built Into KLogic
KLogic runs MAD, Z-score, and IQR analysis continuously across all your cluster metrics. Get predictive alerts before problems escalate — with zero configuration required.