ML-Based Anomaly Detection for Kafka Clusters
Static thresholds miss gradual degradation and fire false alarms during traffic spikes. Statistical anomaly detection — MAD, Z-score, and IQR — adapts to your cluster's baseline automatically and surfaces genuine incidents with less noise.
Why Static Thresholds Fall Short
A consumer lag threshold of 10,000 records sounds reasonable — until you realize your overnight batch job produces 500,000 records and your lag routinely hits 50,000 at 2 AM before fully catching up. Static thresholds create alert fatigue that trains engineers to ignore notifications, masking the alerts that matter.
False Positives
Alerts fire during known traffic spikes, end-of-month batches, and marketing campaigns. Engineers learn to ignore them, and real incidents are buried in the noise.
False Negatives
A gradual drift in produce latency from 50 ms to 800 ms over two weeks never crosses a 1,000 ms threshold, yet it represents a serious degradation that no one notices.
Median Absolute Deviation (MAD)
MAD is the most robust of the three algorithms. Unlike standard deviation, it is resistant to outliers — a single extreme value does not inflate the baseline and suppress future alerts.
MAD Formula
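The statistic and the modified Z-score built on it are:

```latex
\mathrm{MAD} = \operatorname{median}_i \lvert x_i - \tilde{x} \rvert,
\qquad
M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\mathrm{MAD}}
```

where the tilde denotes the sample median. The constant 0.6745 is the 75th percentile of the standard normal distribution; it scales MAD so the score is comparable to a standard deviation, and |M| > 3.5 is the conventional cutoff.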
Python Implementation
```python
import numpy as np
from typing import List, Tuple

def mad_anomaly_score(
    values: List[float],
    threshold: float = 3.5
) -> Tuple[float, bool]:
    """
    Compute MAD-based anomaly score for the latest data point.
    Returns (modified_z_score, is_anomaly).
    """
    arr = np.array(values)
    median = np.median(arr)
    mad = np.median(np.abs(arr - median))
    if mad == 0:
        return 0.0, False
    current = arr[-1]
    modified_z = 0.6745 * (current - median) / mad
    return modified_z, abs(modified_z) > threshold

# Example: consumer lag over last 30 minutes (1-min samples)
lag_samples = [120, 115, 130, 125, 118, 122, 119, 4200]  # spike at end
score, is_anomaly = mad_anomaly_score(lag_samples)
print(f"Score: {score:.2f}, Anomaly: {is_anomaly}")
# Score: 786.08, Anomaly: True
```

Best for: Metrics with occasional extreme spikes (consumer lag, produce errors). The median-based calculation is not skewed by those spikes when computing future baselines.
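In production the same scoring is usually applied over a sliding window as each sample arrives. A minimal sketch of that pattern, where the class name, window size, and threshold are illustrative choices rather than part of any particular library:

```python
from collections import deque

import numpy as np

class RollingMADDetector:
    """Score each new sample against a sliding window using MAD.

    Illustrative sketch: window size and threshold are assumptions,
    not tuned recommendations.
    """

    def __init__(self, window: int = 30, threshold: float = 3.5):
        self.buf: deque[float] = deque(maxlen=window)  # drops oldest sample automatically
        self.threshold = threshold

    def update(self, value: float) -> bool:
        """Add a sample and report whether it is anomalous."""
        self.buf.append(value)
        arr = np.array(self.buf)
        median = np.median(arr)
        mad = np.median(np.abs(arr - median))
        if mad == 0:  # flat window: no baseline to compare against
            return False
        score = 0.6745 * (value - median) / mad
        return abs(score) > self.threshold

detector = RollingMADDetector(window=30)
flags = [detector.update(v) for v in [120, 115, 130, 125, 118, 122, 119, 4200]]
print(flags)  # only the final spike is flagged
```

Because the deque evicts old samples automatically, a spike that has scrolled out of the window stops influencing the baseline, which is exactly the self-healing behavior static thresholds lack.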
Z-Score Detection
The Z-score measures how many standard deviations a value is from the mean of a rolling window. It is computationally cheap and effective for metrics that follow a roughly normal distribution without heavy outlier contamination.
Z-Score Formula
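In symbols, for the most recent observation against a rolling window w:

```latex
z_t = \frac{x_t - \mu_w}{\sigma_w}
```

where \mu_w and \sigma_w are the mean and standard deviation of the window. Under normality, |z| > 3 flags only about 0.3% of points.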
```python
import numpy as np

def zscore_anomaly(
    values: list[float],
    window: int = 60,
    threshold: float = 3.0
) -> tuple[float, bool]:
    """Rolling Z-score on a window of recent observations."""
    # Baseline excludes the current point so an extreme value
    # cannot inflate the standard deviation and mask itself.
    window_data = np.array(values[-window:-1])
    mean = np.mean(window_data)
    std = np.std(window_data)
    if std == 0:
        return 0.0, False
    z = (values[-1] - mean) / std
    return z, abs(z) > threshold

# Example: broker network bytes/sec (60-minute window)
# Normal range 80-120 MB/s, sudden drop to 5 MB/s
metrics = [95, 102, 98, 105, 100, 97, 5]  # last value is anomalous
z, is_anomaly = zscore_anomaly(metrics, window=60)
print(f"Z-score: {z:.2f}, Anomaly: {is_anomaly}")
```

Best for: Broker throughput, request rates, and partition leader distribution — metrics that are normally distributed around a stable mean.
Interquartile Range (IQR)
IQR-based detection defines "normal" as the middle 50% of observed values. Anything outside 1.5×IQR below Q1 or above Q3 is flagged. It is parameter-free and works well for skewed distributions common in Kafka latency metrics.
IQR Bounds Formula
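With Q1 and Q3 the 25th and 75th percentiles, the bounds are:

```latex
\mathrm{IQR} = Q_3 - Q_1,
\qquad
\text{normal range} = \bigl[\, Q_1 - 1.5\,\mathrm{IQR},\; Q_3 + 1.5\,\mathrm{IQR} \,\bigr]
```

For normally distributed data the 1.5× fences fall near ±2.7 standard deviations, so roughly 0.7% of points are flagged.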
```python
import numpy as np

def iqr_anomaly(
    values: list[float],
    multiplier: float = 1.5
) -> tuple[float, float, bool]:
    """
    IQR-based outlier detection.
    Returns (lower_fence, upper_fence, is_anomaly).
    """
    arr = np.array(values[:-1])  # baseline excludes current point
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    current = values[-1]
    return lower, upper, not (lower <= current <= upper)

# Example: produce latency p99 in ms
latency_history = [12, 14, 11, 13, 15, 12, 14, 11, 13, 142]
lo, hi, anomaly = iqr_anomaly(latency_history)
print(f"Normal range: [{lo:.1f}, {hi:.1f}] ms, Anomaly: {anomaly}")
```

Best for: Latency percentiles (p50, p99), replication lag, and partition size metrics that have natural lower bounds of zero.
Choosing the Right Algorithm
| Algorithm | Outlier Resistant | Best Metric Type | Kafka Use Case |
|---|---|---|---|
| MAD | High | Spiky / long-tailed | Consumer lag, error rates |
| Z-Score | Medium | Normally distributed | Broker throughput, request rate |
| IQR | High | Skewed, bounded | Latency percentiles, partition size |
Production Recommendation
Use MAD for consumer lag and error metrics, IQR for latency, and Z-score for smooth throughput metrics. Running all three in parallel and requiring 2-of-3 agreement before firing significantly reduces false positives.
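A minimal sketch of the 2-of-3 vote described above, reusing simplified versions of the three detectors with the default thresholds from earlier sections (function names and the vote wiring are illustrative, not a production design):

```python
import numpy as np

def mad_flag(values: list[float], threshold: float = 3.5) -> bool:
    arr = np.array(values)
    med = np.median(arr)
    mad = np.median(np.abs(arr - med))
    return bool(mad > 0 and abs(0.6745 * (arr[-1] - med) / mad) > threshold)

def zscore_flag(values: list[float], threshold: float = 3.0) -> bool:
    base = np.array(values[:-1])  # baseline excludes the current point
    return bool(base.std() > 0 and abs((values[-1] - base.mean()) / base.std()) > threshold)

def iqr_flag(values: list[float], multiplier: float = 1.5) -> bool:
    q1, q3 = np.percentile(values[:-1], [25, 75])
    iqr = q3 - q1
    return not (q1 - multiplier * iqr <= values[-1] <= q3 + multiplier * iqr)

def ensemble_anomaly(values: list[float], min_votes: int = 2) -> bool:
    """Fire only when at least min_votes of the three detectors agree."""
    votes = [mad_flag(values), zscore_flag(values), iqr_flag(values)]
    return sum(votes) >= min_votes

lag = [120, 115, 130, 125, 118, 122, 119, 4200]
print(ensemble_anomaly(lag))  # True: all three detectors flag the spike
```

Raising min_votes to 3 trades sensitivity for precision; a single detector misfiring on a borderline sample can no longer page anyone on its own.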
Key Takeaways
- Static thresholds produce both alert fatigue (false positives during known spikes) and silent failures (gradual drift that never crosses the line).
- MAD is the most outlier-resistant choice and suits spiky metrics such as consumer lag and error rates.
- Z-score is cheap and effective for roughly normally distributed metrics like broker throughput.
- IQR is parameter-free and handles the skewed, zero-bounded distributions typical of latency percentiles.
- Running all three and requiring 2-of-3 agreement significantly reduces false positives.
Anomaly Detection Built Into KLogic
KLogic runs MAD, Z-score, and IQR anomaly detection on all key Kafka metrics out of the box. No configuration needed — just connect your cluster and start getting intelligent alerts that adapt to your traffic patterns automatically.