Chaos Engineering for Kafka: Testing Resilience
Your Kafka cluster has never experienced the failures you expect it to handle. Chaos engineering fixes that — controlled fault injection reveals weaknesses before production does. This guide covers the experiments that matter most and how to measure their impact.
Chaos Engineering Principles for Kafka
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. For Kafka, that means intentionally killing brokers, introducing network latency, and corrupting consumer offsets — in a controlled, observable way.
The Chaos Experiment Loop
Every experiment follows the same loop: define the steady state (normal lag, latency, and error-rate baselines), form a hypothesis ("killing broker-2 will not raise consumer lag above 5× baseline"), inject the fault, observe, and either fix what broke or widen the blast radius. Each scenario below follows that structure.
Fault Injection Scenarios
Scenario 1: Broker Crash
High Impact: The most common production failure. Tests partition leader election, in-sync replica (ISR) shrinkage, and client reconnection behavior.
```shell
# Stop one broker abruptly (SIGKILL, no graceful shutdown)
docker kill kafka-broker-2

# Monitor partition leadership reassignment
watch -n1 'kafka-topics.sh --bootstrap-server broker-1:9092 \
  --describe --topic orders | grep -E "Leader|Isr"'

# Observe consumer group behavior
kafka-consumer-groups.sh --bootstrap-server broker-1:9092 \
  --describe --group order-processor

# Restore broker
docker start kafka-broker-2
```
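The first scorecard metric, time until zero under-replicated partitions, is easy to capture with a small polling helper. A hedged sketch (`measure_recovery` is a hypothetical name; assumes `kafka-topics.sh` is on the PATH):

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll for under-replicated partitions (URPs) and print
# how long the cluster took to return to zero URPs after the fault.
measure_recovery() {
  local bootstrap="$1"
  local start now urp
  start=$(date +%s)
  while :; do
    # --under-replicated-partitions prints one line per URP, nothing when healthy
    urp=$(kafka-topics.sh --bootstrap-server "$bootstrap" \
            --describe --under-replicated-partitions | wc -l)
    if [ "$urp" -eq 0 ]; then
      now=$(date +%s)
      echo "Recovered in $((now - start))s"
      return 0
    fi
    sleep 1
  done
}

# Usage: start this immediately after stopping the broker, e.g.:
# measure_recovery broker-1:9092
```

Run it in a second terminal the moment you kill the broker; the printed duration goes straight into your experiment report.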
Scenario 2: Network Partition
High Impact: Simulates a network split between Kafka zones. Tests split-brain handling, controller failover, and producer retries under connectivity loss.
```shell
# Using tc (traffic control) to add packet loss between brokers
# Run on the host of broker-2
sudo tc qdisc add dev eth0 root netem loss 100%

# Optional: introduce latency instead of a full partition
# (use "change" here; a second "add" on the root qdisc fails)
sudo tc qdisc change dev eth0 root netem delay 500ms 100ms

# Restore normal network
sudo tc qdisc del dev eth0 root
```
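Dropping 100% of traffic on eth0 partitions the host from everything, including your monitoring. To keep the blast radius to a single broker link, netem can be attached to one band of a prio qdisc and a u32 filter can steer only one peer's traffic into it. A hedged sketch (`partition_peer`, the device, and the IP are illustrative; `DRY_RUN=1` prints the commands instead of executing them):

```shell
# Hypothetical helper: partition this host from ONE peer IP only, rather than
# blackholing the whole interface. Set DRY_RUN=1 to print the commands
# (the real ones need root and the tc binary).
partition_peer() {
  local dev="$1" peer="$2"
  local run="${DRY_RUN:+echo}"
  # prio qdisc with three bands; band 1:3 gets the netem blackhole
  $run sudo tc qdisc add dev "$dev" root handle 1: prio
  $run sudo tc qdisc add dev "$dev" parent 1:3 handle 30: netem loss 100%
  # steer only traffic destined for the peer into band 1:3
  $run sudo tc filter add dev "$dev" protocol ip parent 1:0 prio 3 \
    u32 match ip dst "$peer"/32 flowid 1:3
}

DRY_RUN=1 partition_peer eth0 10.0.0.12   # review the commands first
```

Remove `DRY_RUN=1` to apply the rules, and restore with `sudo tc qdisc del dev eth0 root` as above.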
Scenario 3: Slow Disk I/O
Medium Impact: Disk saturation is the most common cause of Kafka tail latency. Tests whether produce latency p99 remains acceptable under I/O pressure.
```shell
# Throttle disk writes to 10 MB/s on the Kafka data device
# (cgroup v1 blkio; 8:0 is the major:minor of the data disk)
sudo cgcreate -g blkio:/kafka-throttle
echo "8:0 10485760" | sudo tee \
  /sys/fs/cgroup/blkio/kafka-throttle/blkio.throttle.write_bps_device

# Start Kafka inside the throttled group
sudo cgexec -g blkio:/kafka-throttle kafka-server-start.sh config/server.properties

# Monitor produce request total time via JMX
kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi \
  --object-name "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce"
```
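The blkio controller is cgroup v1; on hosts running the unified cgroup v2 hierarchy (most modern distros) the equivalent knob is `io.max`. A hedged config sketch (assumes the v2 hierarchy is mounted at /sys/fs/cgroup, the data disk is device 8:0, and `$KAFKA_PID` holds the broker's PID):

```shell
# cgroup v2 equivalent of the blkio throttle above
sudo mkdir -p /sys/fs/cgroup/kafka-throttle
echo "8:0 wbps=10485760" | sudo tee /sys/fs/cgroup/kafka-throttle/io.max
# move the running broker into the throttled group
echo "$KAFKA_PID" | sudo tee /sys/fs/cgroup/kafka-throttle/cgroup.procs
```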
Scenario 4: Consumer Group Rebalance Storm
Medium Impact: Rolling restarts of consumer instances trigger repeated group rebalances. Tests whether your rebalance strategy (eager vs. cooperative) causes unacceptable lag.
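Whether the rolling restart causes a lag spike depends heavily on the assignor: the eager default revokes every partition on each rebalance, while the cooperative-sticky assignor moves only the partitions that must move. A hedged consumer-config sketch (the `group.instance.id` value is illustrative):

```properties
# consumer.properties: opt into incremental (cooperative) rebalancing so a
# rolling restart does not revoke every partition on every rebalance
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
# static membership: a unique group.instance.id per pod lets a quick restart
# rejoin within the session timeout without triggering a rebalance at all
group.instance.id=order-processor-0
```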
```shell
# Simulate a rolling restart of consumer pods (Kubernetes)
kubectl rollout restart deployment/order-processor

# Watch for rebalance events in consumer logs
kubectl logs -l app=order-processor -f | grep -i "rebalance\|partitions assigned"

# Measure lag during the rolling restart
# (LAG is column 6 of the kafka-consumer-groups.sh output)
watch -n5 'kafka-consumer-groups.sh \
  --bootstrap-server broker-1:9092 \
  --describe --group order-processor | awk "NR>1 {sum+=\$6} END {print \"Total lag:\", sum}"'
```
Measuring Experiment Impact
Resilience Scorecard
| Metric | Target (Acceptable) | How to Measure |
|---|---|---|
| Recovery time after broker restart | < 30 seconds | Time until 0 under-replicated partitions |
| Max consumer lag during fault | < 5× normal baseline | Peak lag during the experiment window |
| Producer error rate during fault | < 0.1% | record-error-rate / record-send-rate producer metrics |
| Produce p99 latency during fault | < 3× normal p99 | JMX RequestMetrics TotalTimeMs (request=Produce), 99thPercentile |
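The scorecard thresholds are ratios against a pre-experiment baseline, so they are easy to check mechanically. A minimal sketch (`within`, `check_scorecard`, and the sample numbers are illustrative):

```shell
# Hypothetical scorecard check: compare experiment peaks against baselines
# captured before the fault, using the ratios from the table above.
within() {  # within <value> <limit> -> exit 0 if value <= limit
  awk -v v="$1" -v lim="$2" 'BEGIN { exit !(v <= lim) }'
}

check_scorecard() {
  local base_lag="$1" peak_lag="$2" base_p99="$3" peak_p99="$4" err_rate="$5"
  local fail=0
  within "$peak_lag" "$(awk -v b="$base_lag" 'BEGIN{print 5*b}')" \
    || { echo "FAIL: peak lag ${peak_lag} exceeds 5x baseline"; fail=1; }
  within "$peak_p99" "$(awk -v b="$base_p99" 'BEGIN{print 3*b}')" \
    || { echo "FAIL: p99 ${peak_p99}ms exceeds 3x baseline"; fail=1; }
  within "$err_rate" 0.001 \
    || { echo "FAIL: error rate ${err_rate} exceeds 0.1%"; fail=1; }
  [ "$fail" -eq 0 ] && echo "PASS: all scorecard targets met"
  return "$fail"
}

# Example: baseline lag 1000, peak 3500; baseline p99 40ms, peak 95ms;
# 0.02% producer error rate
check_scorecard 1000 3500 40 95 0.0002
```

Feed it the peaks from your monitoring snapshots and the exit code can gate an automated experiment pipeline.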
Use Monitoring Snapshots
Take a monitoring snapshot (screenshot or metrics export) immediately before starting a chaos experiment and immediately after recovery. The delta is your ground truth for the experiment report and for setting future alert thresholds.
Chaos Engineering Best Practices
Start in Staging
Run every experiment in a staging environment that mirrors production topology before attempting it in prod. Build a playbook for each scenario.
Always Have a Kill Switch
Every experiment needs an immediate rollback procedure. Document it before you start and have a second engineer ready to execute it.
Limit Blast Radius
Start with the smallest possible fault scope: one partition, one consumer instance, 5% packet loss. Increase scope only after validating recovery at smaller scale.
Automate and Schedule
Manual chaos experiments happen once at launch and never again. Automate them in CI/CD pipelines or schedule them weekly during low-traffic windows.
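As a concrete example of scheduling, a single crontab entry can drive a weekly run during the low-traffic window (the script path, schedule, and log location are illustrative):

```shell
# Hypothetical crontab entry: run the broker-crash experiment every Tuesday
# at 03:00, appending results so recovery trends are reviewable over time
0 3 * * 2 /opt/chaos/broker-crash-experiment.sh >> /var/log/chaos/results.log 2>&1
```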
Key Takeaways
- Inject the failures you expect to survive — broker crashes, network partitions, slow disks, and rebalance storms — in a controlled, observable way.
- Measure every experiment against the resilience scorecard: recovery time, peak lag, producer error rate, and produce p99 latency.
- Start in staging, limit the blast radius, and keep a documented kill switch with a second engineer ready.
- Automate and schedule experiments so resilience is verified continuously, not once at launch.
Validate Your Kafka Resilience with KLogic
KLogic's real-time dashboards and anomaly detection make chaos experiments measurable. See exactly how your cluster responds to fault injection and track recovery metrics over every experiment in your history.
Request a Demo