KLogic
Anomaly Detection

ML-Based Anomaly Detection for Kafka Clusters

Static thresholds miss gradual degradation and fire false alarms during traffic spikes. Statistical anomaly detection — MAD, Z-score, and IQR — adapts to your cluster's baseline automatically and surfaces genuine incidents with less noise.

Published: January 10, 2025 • 11 min read • Observability & AI

Why Static Thresholds Fall Short

A consumer lag threshold of 10,000 records sounds reasonable — until you realize your overnight batch job produces 500,000 records and your lag routinely hits 50,000 at 2 AM before fully catching up. Static thresholds create alert fatigue that trains engineers to ignore notifications, masking the alerts that matter.

False Positives

Alerts fire during known traffic spikes, end-of-month batches, and marketing campaigns. Engineers learn to ignore them, and real incidents are buried in the noise.

False Negatives

A gradual drift in produce latency from 50 ms to 800 ms over two weeks never crosses a 1,000 ms threshold, yet it represents a serious degradation that no one notices.

Median Absolute Deviation (MAD)

MAD is the most robust of the three algorithms. Unlike standard deviation, it is resistant to outliers — a single extreme value does not inflate the baseline and suppress future alerts.

MAD Formula

MAD = median(|Xᵢ - median(X)|)
modified_z = 0.6745 × (Xᵢ - median(X)) / MAD
Anomaly threshold: |modified_z| > 3.5

Python Implementation

import numpy as np
from typing import List, Tuple

def mad_anomaly_score(
    values: List[float],
    threshold: float = 3.5
) -> Tuple[float, bool]:
    """
    Compute MAD-based anomaly score for the latest data point.
    Returns (modified_z_score, is_anomaly).
    """
    arr = np.array(values)
    median = np.median(arr)
    mad = np.median(np.abs(arr - median))

    if mad == 0:
        return 0.0, False

    current = arr[-1]
    modified_z = 0.6745 * (current - median) / mad
    return modified_z, abs(modified_z) > threshold


# Example: recent consumer lag samples (1-minute intervals)
lag_samples = [120, 115, 130, 125, 118, 122, 119, 4200]  # spike at end
score, is_anomaly = mad_anomaly_score(lag_samples)
print(f"Score: {score:.2f}, Anomaly: {is_anomaly}")
# Score: 786.08, Anomaly: True

Best for: Metrics with occasional extreme spikes (consumer lag, produce errors). The median-based calculation is not skewed by those spikes when computing future baselines.
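To make that last point concrete, here is a small sketch comparing how a single spike distorts a standard-deviation baseline versus a MAD baseline (the sample values are illustrative, not from a real cluster):

```python
import numpy as np

clean = [120, 115, 130, 125, 118, 122, 119]
with_spike = clean + [4200]  # one transient spike enters the window

def mad(values):
    """Median absolute deviation of a sample."""
    arr = np.array(values)
    return np.median(np.abs(arr - np.median(arr)))

# Standard deviation balloons once the spike joins the baseline...
std_clean, std_spike = np.std(clean), np.std(with_spike)

# ...while MAD barely moves, so later genuine anomalies still stand out.
mad_clean, mad_spike = mad(clean), mad(with_spike)

print(f"std: {std_clean:.1f} -> {std_spike:.1f}")
print(f"MAD: {mad_clean:.1f} -> {mad_spike:.1f}")
```

After the spike, the standard deviation grows by orders of magnitude while MAD only moves from 2.0 to 3.5, which is why a MAD baseline keeps flagging subsequent spikes instead of absorbing them.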

Z-Score Detection

The Z-score measures how many standard deviations a value is from the mean of a rolling window. It is computationally cheap and effective for metrics that follow a roughly normal distribution without heavy outlier contamination.

Z-Score Formula

z = (X - μ) / σ
Anomaly threshold: |z| > 3.0

Python Implementation
import numpy as np

def zscore_anomaly(
    values: list[float],
    window: int = 60,
    threshold: float = 3.0
) -> tuple[float, bool]:
    """
    Rolling Z-score: the baseline is the window of points before
    the current one, so the current value cannot inflate its own
    baseline and mask itself.
    """
    baseline = np.array(values[:-1][-window:])
    mean = np.mean(baseline)
    std = np.std(baseline)

    if std == 0:
        return 0.0, False

    z = (values[-1] - mean) / std
    return z, abs(z) > threshold


# Example: broker network bytes/sec (60-minute window)
# Normal range 80-120 MB/s, sudden drop to 5 MB/s
metrics = [95, 102, 98, 105, 100, 97, 5]  # last value is anomalous
z, is_anomaly = zscore_anomaly(metrics, window=60)
print(f"Z-score: {z:.2f}, Anomaly: {is_anomaly}")
# Z-score: -28.60, Anomaly: True

Best for: Broker throughput, request rates, and partition leader distribution — metrics that are normally distributed around a stable mean.

Interquartile Range (IQR)

IQR-based detection defines "normal" as the middle 50% of observed values. Anything outside 1.5×IQR below Q1 or above Q3 is flagged. It is parameter-free and works well for skewed distributions common in Kafka latency metrics.

IQR Bounds Formula

IQR = Q3 - Q1
lower_fence = Q1 - 1.5 × IQR
upper_fence = Q3 + 1.5 × IQR
Anomaly: value < lower_fence OR value > upper_fence

Python Implementation
import numpy as np

def iqr_anomaly(
    values: list[float],
    multiplier: float = 1.5
) -> tuple[float, float, bool]:
    """
    IQR-based outlier detection.
    Returns (lower_fence, upper_fence, is_anomaly).
    """
    arr = np.array(values[:-1])  # baseline excludes current point
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1

    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr

    current = values[-1]
    return lower, upper, not (lower <= current <= upper)


# Example: produce latency p99 in ms
latency_history = [12, 14, 11, 13, 15, 12, 14, 11, 13, 142]
lo, hi, anomaly = iqr_anomaly(latency_history)
print(f"Normal range: [{lo:.1f}, {hi:.1f}] ms, Anomaly: {anomaly}")

Best for: Latency percentiles (p50, p99), replication lag, and partition size metrics that have natural lower bounds of zero.

Choosing the Right Algorithm

Algorithm | Outlier Resistant | Best Metric Type     | Kafka Use Case
MAD       | High              | Spiky / long-tailed  | Consumer lag, error rates
Z-Score   | Medium            | Normally distributed | Broker throughput, request rate
IQR       | High              | Skewed, bounded      | Latency percentiles, partition size

Production Recommendation

Use MAD for consumer lag and error metrics, IQR for latency, and Z-score for smooth throughput metrics. Running all three in parallel and requiring 2-of-3 agreement before firing significantly reduces false positives.
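A minimal sketch of the 2-of-3 vote, using compact re-implementations of the three detectors described in this article (the function names and the `min_votes` parameter are illustrative, not a KLogic API):

```python
import numpy as np

def mad_flag(values, threshold=3.5):
    """MAD-based modified Z-score on the latest point."""
    arr = np.array(values)
    med = np.median(arr)
    mad = np.median(np.abs(arr - med))
    if mad == 0:
        return False
    return abs(0.6745 * (arr[-1] - med) / mad) > threshold

def z_flag(values, threshold=3.0):
    """Z-score of the latest point against the preceding baseline."""
    base = np.array(values[:-1])
    std = np.std(base)
    if std == 0:
        return False
    return abs((values[-1] - np.mean(base)) / std) > threshold

def iqr_flag(values, multiplier=1.5):
    """Tukey-fence check of the latest point against the baseline."""
    base = np.array(values[:-1])
    q1, q3 = np.percentile(base, [25, 75])
    iqr = q3 - q1
    return not (q1 - multiplier * iqr <= values[-1] <= q3 + multiplier * iqr)

def ensemble_anomaly(values, min_votes=2):
    """Alert only when at least min_votes of the three detectors agree."""
    votes = [mad_flag(values), z_flag(values), iqr_flag(values)]
    return bool(sum(votes) >= min_votes)


lag = [120, 115, 130, 125, 118, 122, 119, 4200]
print(ensemble_anomaly(lag))  # all three detectors agree: True
```

Requiring agreement means a borderline value that trips only one detector (for example, a point just past the Z-score fence in a skewed distribution) stays quiet.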

Key Takeaways

Static thresholds create alert fatigue; adaptive algorithms reduce noise by 60–80% in practice.
MAD is the most robust choice for spiky metrics like consumer lag and error rates.
Z-score works well for smooth, normally distributed metrics like broker network throughput.
IQR is parameter-free and handles skewed latency distributions naturally.
Requiring 2-of-3 algorithm agreement before alerting dramatically reduces false positives.
Always use a rolling window (60–120 minutes) to capture recent traffic patterns rather than all-time history.
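The rolling-window takeaway can be sketched with a bounded deque: old samples age out automatically, so the baseline always reflects recent traffic (the 120-sample default and class name below are illustrative choices, not part of any library):

```python
from collections import deque

import numpy as np

class RollingMadDetector:
    """Scores each new sample with MAD over a bounded window of
    recent observations; samples beyond the window fall off."""

    def __init__(self, window_size: int = 120, threshold: float = 3.5):
        self.window = deque(maxlen=window_size)  # drops oldest automatically
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        self.window.append(value)
        arr = np.array(self.window)
        med = np.median(arr)
        mad = np.median(np.abs(arr - med))
        if mad == 0:
            return False
        return bool(abs(0.6745 * (value - med) / mad) > self.threshold)


detector = RollingMadDetector()
samples = [120, 115, 130, 125, 118, 122, 119, 4200]
flags = [detector.observe(v) for v in samples]
print(flags)  # only the final spike is flagged
```

With a 120-sample window of 1-minute metrics, the baseline covers the last two hours, which matches the 60-120 minute guidance above.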

Anomaly Detection Built Into KLogic

KLogic runs MAD, Z-score, and IQR anomaly detection on all key Kafka metrics out of the box. No configuration needed — just connect your cluster and start getting intelligent alerts that adapt to your traffic patterns automatically.

Three-algorithm ensemble detection
Per-metric adaptive baselines
Alert suppression during known maintenance windows
Anomaly history and trend visualization
Request a Demo