Break Kafka on Purpose with Khaos Monkey
Simulate broker crashes, network partitions, disk pressure, and GC spikes against your real Kafka cluster. Measure the impact, verify your alerting, and build confidence in your recovery procedures — before an incident does it for you.
You Should Not Wait for a Real Incident to Test Recovery
Teams that have never rehearsed failure scenarios are caught off guard when a real one happens
Untested Runbooks
Most Kafka runbooks are written once and never executed. When a broker fails at 3 AM, the first time engineers follow the steps is during the incident itself.
Silent Alert Gaps
Alert rules look correct in code review but fail to fire in practice because of misconfigured thresholds, stale metrics, or missing notification channels.
Unknown Recovery Time
Without measured recovery experiments, SLA commitments are guesses. Teams cannot quote a realistic RTO until they have actually timed a recovery.
Chaos vs. No Chaos Testing
❌ Without Chaos Testing
- First real failure is a surprise
- Alert gaps discovered during incidents
- RTO is a guess, not a measurement
- Engineers panic under pressure
✅ With Khaos Monkey
- Failures rehearsed before they matter
- Alert delivery verified every sprint
- Measured RTO in the report
- Team responds with confidence
Six Fault Injection Scenarios
Real failure modes, safely simulated with automatic recovery
Broker-Level Faults
Broker Crash
Hard-stop a broker process to test partition leader election, consumer reconnection, and ISR recovery under realistic conditions
Broker Pause
Suspend a broker with SIGSTOP to simulate a frozen JVM or unresponsive host without losing the process, then auto-resume after the test window
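Mechanically, a broker pause is plain POSIX job-control signalling. A minimal sketch of the idea, using a throwaway `sleep` process as a stand-in for the broker JVM (KLogic's actual implementation is not shown here):

```shell
# Sketch of the pause mechanism, with a `sleep` process standing in
# for the broker JVM (PID handling and durations are illustrative).
sleep 60 &
pid=$!
kill -STOP "$pid"                              # freeze: alive but unresponsive
state=$(ps -o stat= -p "$pid" | tr -d ' ' | cut -c1)
echo "state=$state"                            # T = stopped process
kill -CONT "$pid"                              # auto-resume after the window
kill "$pid"                                    # clean up the stand-in
```

Because SIGSTOP cannot be caught or ignored, the process freezes without exiting, which is exactly what makes it a good proxy for a hung JVM.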
GC Spike Simulation
Inject CPU starvation to simulate a full GC pause and measure how your consumers handle a broker that is alive but unresponsive
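A GC-spike fault boils down to briefly pinning the host's cores. A hedged sketch of that mechanism with a hypothetical burner count and duration, run against the local machine rather than a broker host:

```shell
# Sketch: brief CPU starvation to mimic a stop-the-world GC pause.
# Burner count and duration are hypothetical knobs.
BURNERS=2
DURATION=1                                   # seconds; keep short
for _ in $(seq "$BURNERS"); do
  ( end=$((SECONDS + DURATION))
    while [ "$SECONDS" -lt "$end" ]; do :; done ) &   # tight loop pins a core
done
wait                                         # burners exit on their own: self-restoring
status=done
echo "cpu-burn complete"
```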
Memory Leak Simulation
Gradually increase heap pressure to observe how the broker degrades under memory exhaustion and when alerts fire
All faults auto-restore
Every experiment rolls back automatically at the configured duration — no manual cleanup needed
Network & Infrastructure Faults
Network Partition
Isolate a broker from its peers and the ZooKeeper/KRaft quorum to test how your cluster handles split-brain scenarios
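In iptables terms, a partition of this kind typically amounts to DROP rules in both directions. A sketch under that assumption — the peer address is hypothetical, and the commands are shown in dry-run form because applying them needs root:

```shell
# Sketch of the rules a network-partition fault might apply.
# DRY_RUN=1 prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
PEER=10.0.0.12                                # hypothetical peer broker address
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

run iptables -A INPUT  -s "$PEER" -j DROP     # drop traffic arriving from the peer
run iptables -A OUTPUT -d "$PEER" -j DROP     # drop traffic leaving for the peer
# ... experiment window ...
run iptables -D INPUT  -s "$PEER" -j DROP     # auto-restore: delete both rules
run iptables -D OUTPUT -d "$PEER" -j DROP
```

Deleting the exact rules that were added, rather than flushing the chain, keeps the restore step from disturbing unrelated firewall state.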
Disk Full Simulation
Simulate a full disk at a configurable threshold to verify producer backpressure, error handling, and alerting before it happens in production
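The arithmetic behind a threshold-based fill is simple: allocate enough filler to leave only (100 − threshold)% of the disk free. A sketch with a hypothetical mount point and threshold; the commented `dd`/`rm` lines mark where a real injection and restore would happen:

```shell
# Sketch: size a filler file to push a mount to a target usage level.
# Mount point and threshold are hypothetical.
MOUNT=/tmp
TARGET_PCT=95
avail_kb=$(df -kP "$MOUNT" | awk 'NR==2 {print $4}')   # free KB now
total_kb=$(df -kP "$MOUNT" | awk 'NR==2 {print $2}')   # total KB
keep_kb=$(( total_kb * (100 - TARGET_PCT) / 100 ))     # free space to leave
fill_kb=$(( avail_kb - keep_kb ))
echo "filler needed: ${fill_kb}KB"
# dd if=/dev/zero of="$MOUNT/khaos_fill" bs=1024 count="$fill_kb"   # inject
# rm -f "$MOUNT/khaos_fill"                                         # auto-restore
```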
Integrated Impact Measurement
KLogic monitoring captures baseline and in-experiment metrics automatically, generating a post-experiment report with time-to-detect, time-to-alert, and recovery time
Measure Real Impact, Not Assumptions
Every experiment generates a structured report tied to your monitoring data
Time to Alert
Measure exactly how many seconds elapse between fault injection and the first alert notification reaching your on-call channel.
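The metric itself is plain timestamp arithmetic. A sketch with hypothetical sample timestamps (GNU `date`):

```shell
# Sketch: time-to-alert = first alert timestamp - injection timestamp.
# Timestamps are hypothetical examples.
inject_ts=$(date -u -d '2024-01-01 03:00:00' +%s)
alert_ts=$(date -u -d '2024-01-01 03:00:47' +%s)
tta=$(( alert_ts - inject_ts ))
echo "time-to-alert: ${tta}s"                 # 47s for these sample values
```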
Lag Impact
Track how consumer group lag grows during the experiment and how quickly it recovers after auto-restore, giving you a measured RTO.
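Recovery time can be read off a series of lag samples: the first moment lag returns to zero after having grown. A sketch over hypothetical sample data:

```shell
# Sketch: derive recovery time from sampled lag readings (data hypothetical).
# Input lines: <seconds since injection> <consumer group lag>
recovery=$(awk '$2 > 0 { grew = 1 }                       # lag climbed during the fault
               grew && $2 == 0 { print "recovered after " $1 "s"; exit }' <<'EOF'
0 0
10 5400
30 9100
60 3200
90 0
120 0
EOF
)
echo "$recovery"
```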
Replication Health
Monitor under-replicated partition counts and leader election frequency throughout the experiment to quantify resilience gaps.
Build Resilience Systematically
Chaos engineering outcomes from teams using Khaos Monkey
Frequently Asked Questions
What is Khaos Monkey?
Khaos Monkey is KLogic's built-in chaos engineering module inspired by Netflix's Chaos Monkey. It lets you simulate broker failures, network partitions, disk pressure, and resource exhaustion on a running Kafka cluster to verify that your consumers, producers, and alerting pipelines behave correctly under adverse conditions.
Do I need to clean up manually after an experiment?
No. Every experiment has a configurable duration and an auto-restoration step. When the experiment completes (or if you manually stop it early), KLogic automatically reverses the injected fault — restarting paused brokers, releasing simulated disk pressure, and restoring network connectivity.
Which fault scenarios are supported?
KLogic supports: broker crash (hard stop), broker pause (SIGSTOP equivalent), GC spike simulation (CPU starvation), memory leak simulation (heap pressure), network partition (isolate a broker from peers), and disk full simulation (fill disk to a configurable threshold).
How is the impact of an experiment measured?
KLogic's monitoring integration captures baseline metrics before the experiment starts, then tracks consumer lag, producer error rate, request latency, and replication health throughout the experiment. A post-experiment report shows exactly how long it took for your cluster to detect, alert on, and recover from the injected fault.
Is it safe to run in production?
Khaos Monkey is designed for production use where it is safe to do so — for example, testing a single broker in a multi-broker cluster with adequate replication. We recommend starting with staging environments and using short durations (30–60 seconds) until you understand your cluster's recovery behaviour.
Will experiments trigger my real alerts?
Yes. Chaos experiments trigger the same alert rules configured in KLogic. This means your Slack, PagerDuty, or webhook integrations will fire during an experiment, letting you verify end-to-end alert delivery as part of the test. Alerts generated during an experiment are tagged with the experiment ID for easy filtering.
Break Things Before They Break You
Run controlled Kafka failure experiments, verify your alerting, measure your RTO, and build the confidence to handle real incidents calmly.
Free 14-day trial • All fault scenarios included • Auto-restore after every test