🐛 Troubleshooting Guide

Troubleshoot Kafka & ZooKeeper Issues

A guide to diagnosing and resolving common Kafka and ZooKeeper issues, including connection problems, performance degradation, and cluster instability.


Common Kafka & ZooKeeper Issues

Identify and resolve the most frequent problems in Kafka and ZooKeeper deployments.

Connection Issues

Network connectivity, authentication failures, and session timeout problems.

Performance Problems

High latency, throughput bottlenecks, and resource contention issues.

Data Consistency

Replication lag, partition leadership, and data corruption scenarios.

Systematic Diagnostic Approach

Step-by-step methodology for diagnosing complex Kafka and ZooKeeper issues.

Initial Assessment

Log Analysis

# Check Kafka logs
tail -f /var/log/kafka/server.log
grep ERROR /var/log/kafka/server.log

# Check ZooKeeper logs
tail -f /var/log/zookeeper/zookeeper.log
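When a log holds many errors, grouping them by logger shows which component is failing most often. The following is a minimal sketch, assuming Kafka's default log4j layout ("[%d] %p %m (%c)%n"), where the logger name is the last parenthesized token on each line. The function name summarize_errors is hypothetical, and a custom log layout would need a different pattern.

```shell
# Summarize ERROR lines by logger name, most frequent first.
# Assumption: default Kafka log4j layout, e.g.
#   [2024-01-01 10:00:00,000] ERROR <message> (kafka.server.ReplicaManager)
summarize_errors() {
  grep ' ERROR ' "$1" \
    | sed -n 's/.*(\([A-Za-z0-9_.$]*\))[[:space:]]*$/\1/p' \
    | sort | uniq -c | sort -rn
}

# Usage (path taken from the commands above):
#   summarize_errors /var/log/kafka/server.log
```

Stack trace continuation lines do not contain " ERROR " and are ignored, so each error is counted once.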

Health Checks

• Verify service status and ports
• Check disk space and memory usage
• Validate network connectivity
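The checks above can be scripted with standard tools. This is a rough sketch, not a complete health check: the function names are hypothetical, and the hosts, ports (9092 for Kafka, 2181 for ZooKeeper), and data directory in the usage comments are assumptions that should be replaced with the values from your deployment.

```shell
# Return success if a TCP port accepts connections (nc -z only probes the port).
port_open() {
  nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# Print the usage percentage (digits only) of the filesystem holding a path.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage (hosts, ports, and paths are assumptions):
#   port_open localhost 9092 && echo "Kafka port reachable"
#   port_open localhost 2181 && echo "ZooKeeper port reachable"
#   disk_used_pct /var/lib/kafka
#   free -m    # memory usage on Linux
```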

Deep Dive Analysis

JVM Monitoring

# GC Analysis
jstat -gc -t PID 1s

# Thread Dump
jstack PID > thread_dump.txt

# Heap Dump (pauses the JVM while the dump is written)
jmap -dump:format=b,file=heap.hprof PID
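A single thread dump rarely shows whether a thread is stuck or just busy, so it often helps to take a few dumps several seconds apart and compare them. The sketch below wraps the jstack command shown above; the function name capture_thread_dumps, the output file naming, and the JSTACK override variable are illustrative assumptions.

```shell
# Capture COUNT thread dumps of process PID, INTERVAL seconds apart,
# into numbered files in the current directory.
# Set JSTACK to a full path if jstack is not on PATH.
capture_thread_dumps() {
  pid=$1; count=${2:-3}; interval=${3:-5}
  i=1
  while [ "$i" -le "$count" ]; do
    "${JSTACK:-jstack}" "$pid" > "thread_dump_${pid}_${i}.txt" || return 1
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
}

# Usage (assumes the broker was started with the kafka.Kafka main class):
#   capture_thread_dumps "$(pgrep -f kafka.Kafka | head -n 1)" 3 5
```

Threads that show the same stack in every dump are the first candidates to investigate.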

Cluster State

• Leader election status
• ISR health and replica lag
• Consumer group stability
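For consumer groups, total lag is a quick stability signal. The sketch below sums the LAG column from the output of kafka-consumer-groups.sh --describe. The function name total_lag is hypothetical, and the bootstrap address and group name in the usage comment are assumptions.

```shell
# Sum the LAG column of `kafka-consumer-groups.sh --describe` output on stdin.
# The column is located by its header name; rows whose lag is not a number
# (shown as "-" for partitions without a committed offset) are skipped.
total_lag() {
  awk '
    col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    $col ~ /^[0-9]+$/ { sum += $col }
    END { print sum + 0 }
  '
}

# Usage (address and group name are assumptions):
#   kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#     --describe --group my-group | total_lag
```

Sampling this value repeatedly shows whether the group is catching up or steadily falling behind.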

Common Issue Resolutions

Proven solutions for frequently encountered Kafka and ZooKeeper problems.

ZooKeeper Session Timeouts

Symptoms

• Frequent "Session 0x... for server ... expired" errors
• Broker disconnections and reconnections
• Controller election storms

Solutions

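• Increase zookeeper.session.timeout.ms on the brokers, keeping it within the range the ensemble allows (2 to 20 times tickTime by default)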
• Tune JVM GC settings to reduce pauses
• Monitor network latency between nodes
• Check for resource contention
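Raising the broker's session timeout has no effect beyond what the ZooKeeper ensemble will grant. The sketch below computes the timeout that would be negotiated, assuming minSessionTimeout and maxSessionTimeout are left at their defaults of 2 and 20 times tickTime. The function name is hypothetical, and all values are in milliseconds.

```shell
# Print the session timeout ZooKeeper would grant for a requested value.
# Assumption: minSessionTimeout and maxSessionTimeout are not overridden
# in zoo.cfg, so they default to 2 x tickTime and 20 x tickTime.
negotiated_session_timeout() {
  requested=$1; tick=$2
  min=$((tick * 2)); max=$((tick * 20))
  if [ "$requested" -lt "$min" ]; then
    echo "$min"
  elif [ "$requested" -gt "$max" ]; then
    echo "$max"
  else
    echo "$requested"
  fi
}

# With tickTime=2000, a requested 60000 ms is capped:
#   negotiated_session_timeout 60000 2000    # prints 40000
```

If the printed value is lower than the one configured on the broker, the ensemble's tickTime or maxSessionTimeout needs to change as well.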

High Replica Lag

Symptoms

• Under-replicated partitions alerts
• Followers consistently behind leader
• ISR shrinkage events

Solutions

• Increase replica.fetch.max.bytes (more data per fetch) or num.replica.fetchers (more fetcher threads per source broker)
• Optimize disk I/O performance
• Check network bandwidth capacity
• Review producer batch settings (batch.size, linger.ms)
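Before tuning, it helps to see which partitions are actually affected. This sketch reads the output of kafka-topics.sh --describe and lists partitions whose ISR is smaller than the replica list. The function name shrunken_isr is hypothetical, the bootstrap address in the usage comment is an assumption, and the parser relies on the "Replicas:" and "Isr:" labels that the tool prints.

```shell
# Print "topic partition missing_replicas" for every partition whose ISR
# is smaller than its replica list. Reads `kafka-topics.sh --describe`
# output on stdin; topic summary lines (no "Partition:" field) are skipped.
shrunken_isr() {
  awk '
    {
      topic = part = reps = isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:") topic = $(i + 1)
        else if ($i == "Partition:") part = $(i + 1)
        else if ($i == "Replicas:") reps = $(i + 1)
        else if ($i == "Isr:") isr = $(i + 1)
      }
      if (part == "" || reps == "") next
      if (isr ~ /:$/) isr = ""   # empty ISR followed by another label
      n = split(reps, r, ",")
      m = (isr == "" ? 0 : split(isr, s, ","))
      if (m < n) print topic, part, n - m
    }
  '
}

# Usage (address is an assumption):
#   kafka-topics.sh --bootstrap-server localhost:9092 --describe | shrunken_isr
```

The built-in --under-replicated-partitions flag of kafka-topics.sh gives a similar list without the count of missing replicas.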

Proactive Issue Prevention

Monitoring strategies and alerts to prevent issues before they impact your systems.

Critical Health Metrics

ZooKeeper Ensemble Health: 3/3 Online

Session Timeout Rate: 0.01%

Under-replicated Partitions: 0

Max Replica Lag: 42 ms

Alert Thresholds

GC Pause Time: > 100 ms

Disk Usage: > 80%

Network Latency: > 50 ms

Connection Pool: > 90%
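Thresholds like these can be wired into a simple cron or CI check while a full alerting system is being set up. The sketch below compares an integer measurement against a limit; the function name check_threshold and the metric names in the usage comments are hypothetical, and collecting the measurements is left to your monitoring tooling.

```shell
# Compare an integer measurement against a limit.
# Prints a status line and returns 1 when the value is strictly above the limit.
check_threshold() {
  name=$1; value=$2; limit=$3
  if [ "$value" -gt "$limit" ]; then
    echo "ALERT $name: $value exceeds $limit"
    return 1
  fi
  echo "OK $name: $value"
}

# Usage with the thresholds listed above (measured values are examples):
#   check_threshold gc_pause_ms 120 100
#   check_threshold disk_used_pct 64 80
#   check_threshold network_latency_ms 12 50
#   check_threshold connection_pool_pct 95 90
```

The non-zero return code lets a wrapper script or scheduler treat any breached threshold as a failure.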

Prevent Issues with KLogic Monitoring

Get proactive monitoring and automated issue detection for your Kafka and ZooKeeper clusters.
