ML-Based Anomaly Detection for Kafka Clusters
Static thresholds miss gradual degradation and fire false alarms during traffic spikes. Statistical anomaly detection — MAD, Z-score, and IQR — adapts to your cluster's baseline automatically and surfaces genuine incidents with less noise.
Why Static Thresholds Fall Short
A consumer lag threshold of 10,000 records sounds reasonable — until you realize your overnight batch job produces 500,000 records and your lag routinely hits 50,000 at 2 AM before fully catching up. Static thresholds create alert fatigue that trains engineers to ignore notifications, masking the alerts that matter.
False Positives
Alerts fire during known traffic spikes, end-of-month batches, and marketing campaigns. Engineers learn to ignore them, and real incidents are buried in the noise.
False Negatives
A gradual drift in produce latency from 50 ms to 800 ms over two weeks never crosses a 1,000 ms threshold, yet it represents a serious degradation that no one notices.
Median Absolute Deviation (MAD)
MAD is the most robust of the three algorithms. Unlike standard deviation, it is resistant to outliers — a single extreme value does not inflate the baseline and suppress future alerts.
MAD Formula
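The statistic and the modified Z-score built on it are:

```latex
\mathrm{MAD} = \operatorname{median}_i \lvert x_i - \tilde{x} \rvert,
\qquad
M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\mathrm{MAD}}
```

where the tilde denotes the sample median. The constant 0.6745 is the 75th percentile of the standard normal distribution; it scales MAD so the score is comparable to a standard deviation, and |M| > 3.5 is the conventional cutoff.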
Python Implementation
```python
import numpy as np
from typing import List, Tuple

def mad_anomaly_score(
    values: List[float],
    threshold: float = 3.5
) -> Tuple[float, bool]:
    """
    Compute MAD-based anomaly score for the latest data point.
    Returns (modified_z_score, is_anomaly).
    """
    arr = np.array(values)
    median = np.median(arr)
    mad = np.median(np.abs(arr - median))
    if mad == 0:
        return 0.0, False
    current = arr[-1]
    modified_z = 0.6745 * (current - median) / mad
    return modified_z, abs(modified_z) > threshold

# Example: consumer lag over last 30 minutes (1-min samples)
lag_samples = [120, 115, 130, 125, 118, 122, 119, 4200]  # spike at end
score, is_anomaly = mad_anomaly_score(lag_samples)
print(f"Score: {score:.2f}, Anomaly: {is_anomaly}")
# Score: 786.08, Anomaly: True
```

Best for: Metrics with occasional extreme spikes (consumer lag, produce errors). The median-based calculation is not skewed by those spikes when computing future baselines.
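In production the same scoring is usually applied over a sliding window as each sample arrives. A minimal sketch of that pattern, where the class name, window size, and threshold are illustrative choices rather than part of any particular library:

```python
from collections import deque

import numpy as np

class RollingMADDetector:
    """Score each new sample against a sliding window using MAD.

    Illustrative sketch: window size and threshold are assumptions,
    not tuned recommendations.
    """

    def __init__(self, window: int = 30, threshold: float = 3.5):
        self.buf: deque[float] = deque(maxlen=window)  # drops oldest sample automatically
        self.threshold = threshold

    def update(self, value: float) -> bool:
        """Add a sample and report whether it is anomalous."""
        self.buf.append(value)
        arr = np.array(self.buf)
        median = np.median(arr)
        mad = np.median(np.abs(arr - median))
        if mad == 0:  # flat window: no baseline to compare against
            return False
        score = 0.6745 * (value - median) / mad
        return abs(score) > self.threshold

detector = RollingMADDetector(window=30)
flags = [detector.update(v) for v in [120, 115, 130, 125, 118, 122, 119, 4200]]
print(flags)  # only the final spike is flagged
```

Because the deque evicts old samples automatically, a spike that has scrolled out of the window stops influencing the baseline, which is exactly the self-healing behavior static thresholds lack.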
Z-Score Detection
The Z-score measures how many standard deviations a value is from the mean of a rolling window. It is computationally cheap and effective for metrics that follow a roughly normal distribution without heavy outlier contamination.
Z-Score Formula
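In symbols, for the most recent observation against a rolling window w:

```latex
z_t = \frac{x_t - \mu_w}{\sigma_w}
```

where \mu_w and \sigma_w are the mean and standard deviation of the window. Under normality, |z| > 3 flags only about 0.3% of points.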
```python
import numpy as np

def zscore_anomaly(
    values: list[float],
    window: int = 60,
    threshold: float = 3.0
) -> tuple[float, bool]:
    """Rolling Z-score on a window of recent observations."""
    # Baseline excludes the current point so an extreme value
    # cannot inflate the standard deviation and mask itself.
    window_data = np.array(values[-window:-1])
    mean = np.mean(window_data)
    std = np.std(window_data)
    if std == 0:
        return 0.0, False
    z = (values[-1] - mean) / std
    return z, abs(z) > threshold

# Example: broker network bytes/sec (60-minute window)
# Normal range 80-120 MB/s, sudden drop to 5 MB/s
metrics = [95, 102, 98, 105, 100, 97, 5]  # last value is anomalous
z, is_anomaly = zscore_anomaly(metrics, window=60)
print(f"Z-score: {z:.2f}, Anomaly: {is_anomaly}")
```

Best for: Broker throughput, request rates, and partition leader distribution — metrics that are normally distributed around a stable mean.
Interquartile Range (IQR)
IQR-based detection defines "normal" as the middle 50% of observed values. Anything outside 1.5×IQR below Q1 or above Q3 is flagged. It is parameter-free and works well for skewed distributions common in Kafka latency metrics.
IQR Bounds Formula
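With Q1 and Q3 the 25th and 75th percentiles, the bounds are:

```latex
\mathrm{IQR} = Q_3 - Q_1,
\qquad
\text{normal range} = \bigl[\, Q_1 - 1.5\,\mathrm{IQR},\; Q_3 + 1.5\,\mathrm{IQR} \,\bigr]
```

For normally distributed data the 1.5× fences fall near ±2.7 standard deviations, so roughly 0.7% of points are flagged.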
```python
import numpy as np

def iqr_anomaly(
    values: list[float],
    multiplier: float = 1.5
) -> tuple[float, float, bool]:
    """
    IQR-based outlier detection.
    Returns (lower_fence, upper_fence, is_anomaly).
    """
    arr = np.array(values[:-1])  # baseline excludes current point
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    current = values[-1]
    return lower, upper, not (lower <= current <= upper)

# Example: produce latency p99 in ms
latency_history = [12, 14, 11, 13, 15, 12, 14, 11, 13, 142]
lo, hi, anomaly = iqr_anomaly(latency_history)
print(f"Normal range: [{lo:.1f}, {hi:.1f}] ms, Anomaly: {anomaly}")
```

Best for: Latency percentiles (p50, p99), replication lag, and partition size metrics that have natural lower bounds of zero.
Choosing the Right Algorithm
| Algorithm | Outlier Resistant | Best Metric Type | Kafka Use Case |
|---|---|---|---|
| MAD | High | Spiky / long-tailed | Consumer lag, error rates |
| Z-Score | Medium | Normally distributed | Broker throughput, request rate |
| IQR | High | Skewed, bounded | Latency percentiles, partition size |
Production Recommendation
Use MAD for consumer lag and error metrics, IQR for latency, and Z-score for smooth throughput metrics. Running all three in parallel and requiring 2-of-3 agreement before firing significantly reduces false positives.
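A minimal sketch of the 2-of-3 vote described above, reusing simplified versions of the three detectors with the default thresholds from earlier sections (function names and the vote wiring are illustrative, not a production design):

```python
import numpy as np

def mad_flag(values: list[float], threshold: float = 3.5) -> bool:
    arr = np.array(values)
    med = np.median(arr)
    mad = np.median(np.abs(arr - med))
    return bool(mad > 0 and abs(0.6745 * (arr[-1] - med) / mad) > threshold)

def zscore_flag(values: list[float], threshold: float = 3.0) -> bool:
    base = np.array(values[:-1])  # baseline excludes the current point
    return bool(base.std() > 0 and abs((values[-1] - base.mean()) / base.std()) > threshold)

def iqr_flag(values: list[float], multiplier: float = 1.5) -> bool:
    q1, q3 = np.percentile(values[:-1], [25, 75])
    iqr = q3 - q1
    return not (q1 - multiplier * iqr <= values[-1] <= q3 + multiplier * iqr)

def ensemble_anomaly(values: list[float], min_votes: int = 2) -> bool:
    """Fire only when at least min_votes of the three detectors agree."""
    votes = [mad_flag(values), zscore_flag(values), iqr_flag(values)]
    return sum(votes) >= min_votes

lag = [120, 115, 130, 125, 118, 122, 119, 4200]
print(ensemble_anomaly(lag))  # True: all three detectors flag the spike
```

Raising min_votes to 3 trades sensitivity for precision; a single detector misfiring on a borderline sample can no longer page anyone on its own.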
Key Takeaways
- Static thresholds produce both alert fatigue (false positives during known spikes) and silent failures (gradual drift that never crosses the line).
- MAD is the most outlier-resistant choice and suits spiky metrics such as consumer lag and error rates.
- Z-score is cheap and effective for roughly normally distributed metrics like broker throughput.
- IQR is parameter-free and handles the skewed, zero-bounded distributions typical of latency percentiles.
- Running all three and requiring 2-of-3 agreement significantly reduces false positives.
Anomaly Detection Built Into KLogic
KLogic runs MAD, Z-score, and IQR anomaly detection on all key Kafka metrics out of the box. No configuration needed — just connect your cluster and start getting intelligent alerts that adapt to your traffic patterns automatically.