Chaos Engineering for Kafka: Testing Resilience
Your Kafka cluster has never experienced the failures you expect it to handle. Chaos engineering fixes that — controlled fault injection reveals weaknesses before production does. This guide covers the experiments that matter most and how to measure their impact.
Chaos Engineering Principles for Kafka
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. For Kafka, that means intentionally killing brokers, introducing network latency, and corrupting consumer offsets — in a controlled, observable way.
The Chaos Experiment Loop
Every experiment follows the same loop: define the steady state (normal lag, latency, and error-rate baselines), form a hypothesis ("killing broker-2 will not raise consumer lag above 5× baseline"), inject the fault, observe, and either fix what broke or widen the blast radius. Each scenario below follows that structure.
Fault Injection Scenarios
Scenario 1: Broker Crash
High Impact: The most common production failure. Tests partition leader election, in-sync replica (ISR) shrinkage, and client reconnection behavior.
```shell
# Stop one broker abruptly (SIGKILL, no graceful shutdown)
docker kill kafka-broker-2

# Monitor partition leadership reassignment
watch -n1 'kafka-topics.sh --bootstrap-server broker-1:9092 \
  --describe --topic orders | grep -E "Leader|Isr"'

# Observe consumer group behavior
kafka-consumer-groups.sh --bootstrap-server broker-1:9092 \
  --describe --group order-processor

# Restore broker
docker start kafka-broker-2
```
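The first scorecard metric, time until zero under-replicated partitions, is easy to capture with a small polling helper. A hedged sketch (`measure_recovery` is a hypothetical name; assumes `kafka-topics.sh` is on the PATH):

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll for under-replicated partitions (URPs) and print
# how long the cluster took to return to zero URPs after the fault.
measure_recovery() {
  local bootstrap="$1"
  local start now urp
  start=$(date +%s)
  while :; do
    # --under-replicated-partitions prints one line per URP, nothing when healthy
    urp=$(kafka-topics.sh --bootstrap-server "$bootstrap" \
            --describe --under-replicated-partitions | wc -l)
    if [ "$urp" -eq 0 ]; then
      now=$(date +%s)
      echo "Recovered in $((now - start))s"
      return 0
    fi
    sleep 1
  done
}

# Usage: start this immediately after stopping the broker, e.g.:
# measure_recovery broker-1:9092
```

Run it in a second terminal the moment you kill the broker; the printed duration goes straight into your experiment report.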
Scenario 2: Network Partition
High Impact: Simulates a network split between Kafka zones. Tests split-brain handling, controller failover, and producer retries under connectivity loss.
```shell
# Using tc (traffic control) to add packet loss between brokers
# Run on the host of broker-2
sudo tc qdisc add dev eth0 root netem loss 100%

# Optional: introduce latency instead of a full partition
# (use "change" here; a second "add" on the root qdisc fails)
sudo tc qdisc change dev eth0 root netem delay 500ms 100ms

# Restore normal network
sudo tc qdisc del dev eth0 root
```
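Dropping 100% of traffic on eth0 partitions the host from everything, including your monitoring. To keep the blast radius to a single broker link, netem can be attached to one band of a prio qdisc and a u32 filter can steer only one peer's traffic into it. A hedged sketch (`partition_peer`, the device, and the IP are illustrative; `DRY_RUN=1` prints the commands instead of executing them):

```shell
# Hypothetical helper: partition this host from ONE peer IP only, rather than
# blackholing the whole interface. Set DRY_RUN=1 to print the commands
# (the real ones need root and the tc binary).
partition_peer() {
  local dev="$1" peer="$2"
  local run="${DRY_RUN:+echo}"
  # prio qdisc with three bands; band 1:3 gets the netem blackhole
  $run sudo tc qdisc add dev "$dev" root handle 1: prio
  $run sudo tc qdisc add dev "$dev" parent 1:3 handle 30: netem loss 100%
  # steer only traffic destined for the peer into band 1:3
  $run sudo tc filter add dev "$dev" protocol ip parent 1:0 prio 3 \
    u32 match ip dst "$peer"/32 flowid 1:3
}

DRY_RUN=1 partition_peer eth0 10.0.0.12   # review the commands first
```

Remove `DRY_RUN=1` to apply the rules, and restore with `sudo tc qdisc del dev eth0 root` as above.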
Scenario 3: Slow Disk I/O
Medium Impact: Disk saturation is the most common cause of Kafka tail latency. Tests whether produce latency p99 remains acceptable under I/O pressure.
```shell
# Throttle disk writes to 10 MB/s on the Kafka data device
# (cgroup v1 blkio; 8:0 is the major:minor of the data disk)
sudo cgcreate -g blkio:/kafka-throttle
echo "8:0 10485760" | sudo tee \
  /sys/fs/cgroup/blkio/kafka-throttle/blkio.throttle.write_bps_device

# Start Kafka inside the throttled group
sudo cgexec -g blkio:/kafka-throttle kafka-server-start.sh config/server.properties

# Monitor produce request total time via JMX
kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi \
  --object-name "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce"
```
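The blkio controller is cgroup v1; on hosts running the unified cgroup v2 hierarchy (most modern distros) the equivalent knob is `io.max`. A hedged config sketch (assumes the v2 hierarchy is mounted at /sys/fs/cgroup, the data disk is device 8:0, and `$KAFKA_PID` holds the broker's PID):

```shell
# cgroup v2 equivalent of the blkio throttle above
sudo mkdir -p /sys/fs/cgroup/kafka-throttle
echo "8:0 wbps=10485760" | sudo tee /sys/fs/cgroup/kafka-throttle/io.max
# move the running broker into the throttled group
echo "$KAFKA_PID" | sudo tee /sys/fs/cgroup/kafka-throttle/cgroup.procs
```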
Scenario 4: Consumer Group Rebalance Storm
Medium Impact: Rolling restarts of consumer instances trigger repeated group rebalances. Tests whether your rebalance strategy (eager vs. cooperative) causes unacceptable lag.
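Whether the rolling restart causes a lag spike depends heavily on the assignor: the eager default revokes every partition on each rebalance, while the cooperative-sticky assignor moves only the partitions that must move. A hedged consumer-config sketch (the `group.instance.id` value is illustrative):

```properties
# consumer.properties: opt into incremental (cooperative) rebalancing so a
# rolling restart does not revoke every partition on every rebalance
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
# static membership: a unique group.instance.id per pod lets a quick restart
# rejoin within the session timeout without triggering a rebalance at all
group.instance.id=order-processor-0
```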
```shell
# Simulate a rolling restart of consumer pods (Kubernetes)
kubectl rollout restart deployment/order-processor

# Watch for rebalance events in consumer logs
kubectl logs -l app=order-processor -f | grep -i "rebalance\|partitions assigned"

# Measure lag during the rolling restart
# (LAG is column 6 of the kafka-consumer-groups.sh output)
watch -n5 'kafka-consumer-groups.sh \
  --bootstrap-server broker-1:9092 \
  --describe --group order-processor | awk "NR>1 {sum+=\$6} END {print \"Total lag:\", sum}"'
```
Measuring Experiment Impact
Resilience Scorecard
| Metric | Target (Acceptable) | How to Measure |
|---|---|---|
| Recovery time after broker restart | < 30 seconds | Time until 0 under-replicated partitions |
| Max consumer lag during fault | < 5× normal baseline | Peak lag during the experiment window |
| Producer error rate during fault | < 0.1% | record-error-rate / record-send-rate producer metrics |
| Produce p99 latency during fault | < 3× normal p99 | JMX RequestMetrics TotalTimeMs (request=Produce), 99thPercentile |
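The scorecard thresholds are ratios against a pre-experiment baseline, so they are easy to check mechanically. A minimal sketch (`within`, `check_scorecard`, and the sample numbers are illustrative):

```shell
# Hypothetical scorecard check: compare experiment peaks against baselines
# captured before the fault, using the ratios from the table above.
within() {  # within <value> <limit> -> exit 0 if value <= limit
  awk -v v="$1" -v lim="$2" 'BEGIN { exit !(v <= lim) }'
}

check_scorecard() {
  local base_lag="$1" peak_lag="$2" base_p99="$3" peak_p99="$4" err_rate="$5"
  local fail=0
  within "$peak_lag" "$(awk -v b="$base_lag" 'BEGIN{print 5*b}')" \
    || { echo "FAIL: peak lag ${peak_lag} exceeds 5x baseline"; fail=1; }
  within "$peak_p99" "$(awk -v b="$base_p99" 'BEGIN{print 3*b}')" \
    || { echo "FAIL: p99 ${peak_p99}ms exceeds 3x baseline"; fail=1; }
  within "$err_rate" 0.001 \
    || { echo "FAIL: error rate ${err_rate} exceeds 0.1%"; fail=1; }
  [ "$fail" -eq 0 ] && echo "PASS: all scorecard targets met"
  return "$fail"
}

# Example: baseline lag 1000, peak 3500; baseline p99 40ms, peak 95ms;
# 0.02% producer error rate
check_scorecard 1000 3500 40 95 0.0002
```

Feed it the peaks from your monitoring snapshots and the exit code can gate an automated experiment pipeline.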
Use Monitoring Snapshots
Take a monitoring snapshot (screenshot or metrics export) immediately before starting a chaos experiment and immediately after recovery. The delta is your ground truth for the experiment report and for setting future alert thresholds.
Chaos Engineering Best Practices
Start in Staging
Run every experiment in a staging environment that mirrors production topology before attempting it in prod. Build a playbook for each scenario.
Always Have a Kill Switch
Every experiment needs an immediate rollback procedure. Document it before you start and have a second engineer ready to execute it.
Limit Blast Radius
Start with the smallest possible fault scope: one partition, one consumer instance, 5% packet loss. Increase scope only after validating recovery at smaller scale.
Automate and Schedule
Manual chaos experiments happen once at launch and never again. Automate them in CI/CD pipelines or schedule them weekly during low-traffic windows.
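As a concrete example of scheduling, a single crontab entry can drive a weekly run during the low-traffic window (the script path, schedule, and log location are illustrative):

```shell
# Hypothetical crontab entry: run the broker-crash experiment every Tuesday
# at 03:00, appending results so recovery trends are reviewable over time
0 3 * * 2 /opt/chaos/broker-crash-experiment.sh >> /var/log/chaos/results.log 2>&1
```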
Key Takeaways
- Inject the failures you expect to survive — broker crashes, network partitions, slow disks, and rebalance storms — in a controlled, observable way.
- Measure every experiment against the resilience scorecard: recovery time, peak lag, producer error rate, and produce p99 latency.
- Start in staging, limit the blast radius, and keep a documented kill switch with a second engineer ready.
- Automate and schedule experiments so resilience is verified continuously, not once at launch.
Validate Your Kafka Resilience with KLogic
KLogic's real-time dashboards and anomaly detection make chaos experiments measurable. See exactly how your cluster responds to fault injection and track recovery metrics over every experiment in your history.
Request a Demo