KLogic
💥 Chaos Engineering

Break Kafka on Purpose with Khaos Monkey

Simulate broker crashes, network partitions, disk pressure, and GC spikes against your real Kafka cluster. Measure the impact, verify your alerting, and build confidence in your recovery procedures — before an incident does it for you.

You Should Not Wait for a Real Incident to Test Recovery

Teams that have never tested failure scenarios are always surprised when one happens

Untested Runbooks

Most Kafka runbooks are written once and never executed. When a broker fails at 3 AM, the first time engineers follow the steps is during the incident itself.

Silent Alert Gaps

Alert rules look correct in code review but fail to fire in practice because of misconfigured thresholds, stale metrics, or missing notification channels.

Unknown Recovery Time

Without measured recovery experiments, SLA commitments are guesses. Teams cannot quote a realistic RTO until they have actually timed a recovery.

Chaos vs. No Chaos Testing

❌ Without Chaos Testing

  • First real failure is a surprise
  • Alert gaps discovered during incidents
  • RTO is a guess, not a measurement
  • Engineers panic under pressure

✅ With Khaos Monkey

  • Failures rehearsed before they matter
  • Alert delivery verified every sprint
  • Measured RTO in the report
  • Team responds with confidence

Six Fault Injection Scenarios

Real failure modes, safely simulated with automatic recovery

Broker-Level Faults

Broker Crash

Hard-stop a broker process to test partition leader election, consumer reconnection, and ISR recovery under realistic conditions

Broker Pause

Suspend a broker with SIGSTOP to simulate a frozen JVM or unresponsive host without losing the process, then auto-resume after the test window

GC Spike Simulation

Inject CPU starvation to simulate a full GC pause and measure how your consumers handle a broker that is alive but unresponsive

Memory Leak Simulation

Gradually increase heap pressure to observe how the broker degrades under memory exhaustion and when alerts fire
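
As an illustration of the Broker Pause scenario above, here is a minimal sketch of the pause-then-resume pattern on a POSIX host. It is not KLogic's implementation: the function name `pause_process` and its injectable `kill` and `sleep` parameters are hypothetical, chosen so the logic can be exercised without touching a real broker.

```python
import os
import signal
import time


def pause_process(pid, seconds, kill=os.kill, sleep=time.sleep):
    """Freeze a process with SIGSTOP, then always resume it with SIGCONT.

    Hypothetical helper for illustration. `kill` and `sleep` default to the
    standard library calls and can be swapped for fakes in a dry run.
    """
    kill(pid, signal.SIGSTOP)      # process stays alive but stops running
    try:
        sleep(seconds)             # the test window
    finally:
        kill(pid, signal.SIGCONT)  # resume even if the wait is interrupted


# Dry run: record the signals instead of sending them.
sent = []
pause_process(4242, 0, kill=lambda pid, sig: sent.append(sig), sleep=lambda s: None)
```

The `try`/`finally` is the important design choice: the resume signal is sent even when the wait is cut short, which mirrors the auto-resume behaviour the scenario describes.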

Active experiment: Broker Crash on broker-2 (running)
Duration: 60 s / 120 s
Consumer lag: +45,210
Leader elections: 3
Auto-restore in: 60 s
Time to first alert: 12.4 s
Alerts fired: 3
Network & Disk Faults

Network Partition

Isolate a broker from its peers to test split-brain handling and unclean leader election settings

Disk Full Simulation

Fill broker disk to a configurable threshold (e.g. 95%) and observe producer errors, log segment failures, and recovery

All faults auto-restore

Every experiment rolls back automatically at the configured duration — no manual cleanup needed
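
To make the "configurable threshold" in the Disk Full Simulation concrete, the sketch below works out how many bytes would have to be written for a volume to reach a target usage percentage. This is an assumption about how such a fill could be sized, not a description of KLogic internals; `bytes_to_fill` and `bytes_to_fill_path` are hypothetical names.

```python
import shutil


def bytes_to_fill(total, used, threshold_pct=95):
    """Bytes to write so usage reaches threshold_pct percent (0 if already there)."""
    if not 0 < threshold_pct <= 100:
        raise ValueError("threshold_pct must be in (0, 100]")
    target_used = total * threshold_pct // 100  # integer math avoids float rounding
    return max(0, target_used - used)


def bytes_to_fill_path(path, threshold_pct=95):
    """Same calculation for a real mount point, read via shutil.disk_usage."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free
    return bytes_to_fill(usage.total, usage.used, threshold_pct)
```

For a 1,000-byte volume with 500 bytes in use, reaching 95% means writing 450 more bytes; a volume already at 99% needs nothing written.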

Network & Infrastructure Faults

Network Partition

Isolate a broker from its peers and the ZooKeeper/KRaft quorum to test how your cluster handles split-brain scenarios

Disk Full Simulation

Simulate a full disk at a configurable threshold to verify producer backpressure, error handling, and alerting before it happens in production

Integrated Impact Measurement

KLogic monitoring captures baseline and in-experiment metrics automatically, generating a post-experiment report with time-to-detect, time-to-alert, and recovery time

Measure Real Impact, Not Assumptions

Every experiment generates a structured report tied to your monitoring data

Time to Alert

Measure exactly how many seconds elapse between fault injection and the first alert notification reaching your on-call channel.

Verify alert rule sensitivity and notification delivery
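
The time-to-alert figure is simple arithmetic over two timestamps. A minimal sketch, assuming you have the fault injection time and a list of alert notification times (the function name and the input shape are hypothetical, not a KLogic report format):

```python
from datetime import datetime, timedelta, timezone


def time_to_first_alert(injected_at, alert_times):
    """Seconds from fault injection to the earliest alert at or after it.

    Returns None when no alert fired after injection, which is itself a
    finding: it points at a silent alert gap.
    """
    after = [t for t in alert_times if t >= injected_at]
    if not after:
        return None
    return (min(after) - injected_at).total_seconds()


# Illustrative timestamps only.
injected = datetime(2024, 1, 1, 3, 0, 0, tzinfo=timezone.utc)
alerts = [injected + timedelta(seconds=30), injected + timedelta(seconds=12.4)]
seconds = time_to_first_alert(injected, alerts)
```

Alerts that fired before the injection time are ignored so that unrelated earlier notifications do not produce a misleadingly short result.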

Lag Impact

Track how consumer group lag grows during the experiment and how quickly it recovers after auto-restore, giving you a measured RTO.

Turn RTO from a guess into a measured number
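
One way to turn lag samples into a recovery number is to find the first sample after auto-restore at which lag is back at its baseline. The sketch below assumes lag is sampled as `(timestamp_seconds, lag)` pairs sorted by time; the function names, the `tolerance` parameter, and the sample values are illustrative assumptions.

```python
def recovery_seconds(samples, restored_at, baseline_lag, tolerance=0):
    """Seconds from restore until lag first returns to baseline (+ tolerance).

    samples: list of (timestamp_seconds, lag) pairs sorted by timestamp.
    Returns None if lag never recovered within the sampled window.
    """
    for ts, lag in samples:
        if ts >= restored_at and lag <= baseline_lag + tolerance:
            return ts - restored_at
    return None


def peak_lag_increase(samples, baseline_lag):
    """Largest lag growth over the baseline seen during the experiment."""
    return max(lag for _, lag in samples) - baseline_lag


# Illustrative samples: fault injected at t=0, auto-restore at t=60.
samples = [(0, 100), (60, 45310), (90, 20000), (120, 900), (150, 100)]
rto = recovery_seconds(samples, restored_at=60, baseline_lag=100)
peak = peak_lag_increase(samples, baseline_lag=100)
```

A non-zero `tolerance` is useful when baseline lag fluctuates, since requiring an exact return to the pre-experiment value would overstate recovery time.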

Replication Health

Monitor under-replicated partition counts and leader election frequency throughout the experiment to quantify resilience gaps.

Expose misconfigured replication factors safely
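
In Kafka, a partition is under-replicated when its in-sync replica set (ISR) is smaller than its assigned replica list. The sketch below applies that definition to plain dictionaries; the dictionary shape and function names are assumptions for illustration, and in practice the data would come from your cluster metadata.

```python
def under_replicated(partitions):
    """Return (topic, partition) pairs whose ISR is smaller than the replica list.

    partitions: iterable of dicts with keys 'topic', 'partition',
    'replicas', and 'isr' (the last two are lists of broker ids).
    """
    return [
        (p["topic"], p["partition"])
        for p in partitions
        if len(p["isr"]) < len(p["replicas"])
    ]


def single_replica(partitions):
    """Return partitions with replication factor 1, which cannot survive a broker loss."""
    return [(p["topic"], p["partition"]) for p in partitions if len(p["replicas"]) == 1]


# Illustrative snapshot taken while broker 2 is down.
snapshot = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 3]},
    {"topic": "orders", "partition": 1, "replicas": [1, 3, 4], "isr": [1, 3, 4]},
    {"topic": "audit", "partition": 0, "replicas": [4], "isr": [4]},
]
```

Counting the first list at each sample gives the under-replicated partition trend during an experiment; the second list flags topics whose replication factor leaves no room for a failure.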

Build Resilience Systematically

Chaos engineering outcomes from teams using Khaos Monkey

  • 6 fault injection scenarios
  • Auto-restore after every experiment
  • 70% faster incident recovery
  • 3x alert gaps found before prod

Frequently Asked Questions

Khaos Monkey is KLogic's built-in chaos engineering module inspired by Netflix's Chaos Monkey. It lets you simulate broker failures, network partitions, disk pressure, and resource exhaustion on a running Kafka cluster to verify that your consumers, producers, and alerting pipelines behave correctly under adverse conditions.

Injected faults are not permanent. Every experiment has a configurable duration and an auto-restoration step. When the experiment completes (or if you manually stop it early), KLogic automatically reverses the injected fault: it restarts paused brokers, releases simulated disk pressure, and restores network connectivity.

KLogic supports: broker crash (hard stop), broker pause (SIGSTOP equivalent), GC spike simulation (CPU starvation), memory leak simulation (heap pressure), network partition (isolate a broker from peers), and disk full simulation (fill disk to a configurable threshold).

KLogic's monitoring integration captures baseline metrics before the experiment starts, then tracks consumer lag, producer error rate, request latency, and replication health throughout the experiment. A post-experiment report shows exactly how long it took for your cluster to detect, alert on, and recover from the injected fault.

Khaos Monkey is designed for production use where it is safe to do so — for example, testing a single broker in a multi-broker cluster with adequate replication. We recommend starting with staging environments and using short durations (30–60 seconds) until you understand your cluster's recovery behaviour.
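
One possible reading of "adequate replication" is that every partition would still have at least `min.insync.replicas` in-sync replicas with the target broker removed, so producers using `acks=all` can keep writing. The pre-flight check below is a sketch under that assumption; `safe_to_stop` and the partition dictionary shape are hypothetical and not part of KLogic.

```python
def safe_to_stop(broker_id, partitions, min_insync_replicas):
    """True if every partition keeps enough in-sync replicas without broker_id.

    partitions: iterable of dicts with an 'isr' list of broker ids.
    min_insync_replicas: the value of the min.insync.replicas setting.
    """
    for p in partitions:
        remaining = [b for b in p["isr"] if b != broker_id]
        if len(remaining) < min_insync_replicas:
            return False  # this partition would reject acks=all writes
    return True


# Illustrative check before targeting broker 2.
healthy = [{"isr": [1, 2, 3]}, {"isr": [2, 3, 4]}]
degraded = [{"isr": [1, 2, 3]}, {"isr": [1, 2]}]
ok_to_run = safe_to_stop(2, healthy, min_insync_replicas=2)
blocked = safe_to_stop(2, degraded, min_insync_replicas=2)
```

Here the healthy cluster passes, while the degraded one fails because its second partition would be left with a single in-sync replica.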

Chaos experiments trigger the same alert rules configured in KLogic. This means your Slack, PagerDuty, or webhook integrations will fire during an experiment, letting you verify end-to-end alert delivery as part of the test. Alerts generated during an experiment are tagged with the experiment ID for easy filtering.

Break Things Before They Break You

Run controlled Kafka failure experiments, verify your alerting, measure your RTO, and build the confidence to handle real incidents calmly.

Free 14-day trial • All fault scenarios included • Auto-restore after every test