Kafka Troubleshooting Guide
A guide to diagnosing and fixing common Kafka issues in production environments, from consumer lag to broker failures.
Systematic Troubleshooting Approach
When Kafka issues occur, a systematic approach saves time and prevents further problems. Follow this methodology to diagnose issues effectively and implement lasting fixes.
Identify Symptoms
Gather information about what's not working and when it started
Check Basics
Verify broker status, network connectivity, and resource availability
Analyze Logs
Review broker, producer, and consumer logs for error patterns
Implement Fix
Apply targeted solutions and monitor for improvement
Most Common Kafka Issues
The top issues encountered in production Kafka deployments and how to fix them
Consumer Lag Issues
High consumer lag is one of the most common Kafka issues, often indicating processing bottlenecks or consumer group problems.
Symptoms
- Increasing consumer lag metrics
- Delayed message processing
- Frequent consumer group rebalances
- Timeouts in downstream applications
Diagnostic Commands
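Lag for a group is typically inspected with `kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <group>`. As a minimal sketch, the Python below totals the LAG column per topic from that command's text output. It assumes the header row names its columns TOPIC and LAG, as recent Kafka versions print them; the function name and the sample rows are made up for illustration.

```python
from collections import defaultdict

def lag_by_topic(describe_output: str) -> dict[str, int]:
    """Sum the LAG column per topic from `kafka-consumer-groups.sh --describe` text."""
    totals: dict[str, int] = defaultdict(int)
    columns = None
    for line in describe_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "TOPIC" in fields and "LAG" in fields:
            columns = fields  # header row; locate columns by name, not position
            continue
        if columns is None:
            continue
        row = dict(zip(columns, fields))
        lag = row.get("LAG", "-")
        # Partitions without a committed offset show "-" and are skipped.
        if "TOPIC" in row and lag.isdigit():
            totals[row["TOPIC"]] += int(lag)
    return dict(totals)

# Made-up sample output for a hypothetical group named "orders".
SAMPLE_DESCRIBE = """
GROUP   TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST       CLIENT-ID
orders  events  0          100             150             50   c1           /10.0.0.1  app
orders  events  1          200             230             30   c1           /10.0.0.1  app
orders  audit   0          -               10              -    -            -          -
"""

print(lag_by_topic(SAMPLE_DESCRIBE))
```

Running the command a few times and comparing the totals shows whether lag is growing, stable, or draining, which matters more than any single reading.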
Common Causes & Solutions
Slow Message Processing
Consumer takes too long to process each message.
Insufficient Consumers
Not enough consumer instances for the number of partitions.
Resource Constraints
Consumer host lacks CPU, memory, or network resources.
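Slow processing and frequent rebalancing are often linked: a consumer that does not call `poll()` again within `max.poll.interval.ms` is removed from the group, which triggers a rebalance. The sketch below is a rough back-of-the-envelope check, not a sizing tool. The function names are hypothetical and the rates are whatever you measure in your own system.

```python
import math

def poll_budget_ok(max_poll_records: int, per_message_ms: float,
                   max_poll_interval_ms: int) -> bool:
    """True if a full batch can be processed before max.poll.interval.ms expires."""
    return max_poll_records * per_message_ms < max_poll_interval_ms

def consumers_needed(incoming_per_sec: float, per_consumer_per_sec: float,
                     partitions: int) -> int:
    """Consumers required to match the incoming rate, capped at the partition count.

    Within one group a partition is read by at most one consumer, so
    instances beyond the partition count stay idle.
    """
    wanted = math.ceil(incoming_per_sec / per_consumer_per_sec)
    return min(wanted, partitions)

# Illustrative numbers only: 500 records per poll at 100 ms each, 5-minute limit.
print(poll_budget_ok(500, 100, 300_000))
print(consumers_needed(5000, 800, 12))
```

If the uncapped estimate exceeds the partition count, adding consumers will not help; the options are more partitions or faster per-message processing.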
Broker Performance Issues
Broker performance problems can cascade across the entire cluster, affecting producers and consumers.
High CPU Usage
Cause: Inefficient compression, too many small batches
Fix: Tune compression settings, increase batch sizes
Disk I/O Bottlenecks
Cause: Slow storage, insufficient IOPS
Fix: Upgrade to SSDs, optimize log retention
Memory Issues
Cause: Inadequate JVM heap, memory leaks
Fix: Tune JVM settings, monitor GC performance
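For the high CPU case, it can help to estimate how many batches a producer sends per second at a given `batch.size`. The sketch below is a deliberately simple model: it assumes every batch fills completely before it is sent, which real producers only approach when `linger.ms` gives them time to accumulate records. The function name and the tuning values are illustrative assumptions, not recommendations.

```python
import math

def batches_per_second(messages_per_sec: float, message_bytes: int,
                       batch_size_bytes: int) -> int:
    """Rough count of batches sent per second to one partition,
    assuming each batch fills up to batch_size_bytes."""
    per_batch = max(1, batch_size_bytes // message_bytes)
    return math.ceil(messages_per_sec / per_batch)

# Standard producer property names; the values are examples to adapt.
PRODUCER_TUNING = {
    "batch.size": 65536,        # bytes per batch, per partition
    "linger.ms": 10,            # wait briefly so batches can fill
    "compression.type": "lz4",  # one of: none, gzip, snappy, lz4, zstd
}

# Compare a 16 KiB batch with the larger value above for 1 KiB messages.
print(batches_per_second(10_000, 1024, 16_384))
print(batches_per_second(10_000, 1024, PRODUCER_TUNING["batch.size"]))
```

Fewer, larger batches mean fewer requests for the broker to handle and better compression ratios, at the cost of slightly higher latency per message.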
Connection & Network Issues
Network connectivity problems between clients and brokers or between brokers themselves.
Connection Timeouts
Clients frequently time out when connecting to brokers.
Broker Connectivity
Brokers cannot communicate with each other properly.
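A quick first check for both cases is whether a plain TCP connection to each broker port opens at all. The sketch below uses only the Python standard library; the function names are hypothetical, and the broker addresses in the usage comment are placeholders for your own.

```python
import socket

def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False

def unreachable(brokers: list[str], timeout_s: float = 3.0) -> list[str]:
    """Return the 'host:port' entries that did not accept a TCP connection."""
    failed = []
    for entry in brokers:
        host, _, port = entry.rpartition(":")
        if not can_connect(host, int(port), timeout_s):
            failed.append(entry)
    return failed

# Usage (placeholder addresses):
#   unreachable(["broker1.example.internal:9092", "broker2.example.internal:9092"])
```

A successful connection only proves the port is reachable. Clients connect to the addresses brokers return in `advertised.listeners`, so run the check from the client host against those addresses, not just the bootstrap address.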
Essential Diagnostic Tools
Tools and commands for diagnosing Kafka issues effectively
Command Line Tools
- kafka-topics.sh: Topic management
- kafka-consumer-groups.sh: Consumer monitoring
- kafka-log-dirs.sh: Log directory analysis
- kafka-broker-api-versions.sh: Version compatibility
JMX Metrics
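A small set of broker MBeans covers most health questions. The sketch below lists three commonly watched ones with the value a healthy cluster is expected to show when readings are summed across brokers, and flags any that differ. How the readings are collected (a JMX exporter, a monitoring agent, and so on) is outside this sketch, and the function name is hypothetical.

```python
# MBean names as exposed by Kafka brokers; values are the expected
# cluster-wide totals for a healthy cluster.
EXPECTED = {
    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions": 0,
    "kafka.controller:type=KafkaController,name=OfflinePartitionsCount": 0,
    "kafka.controller:type=KafkaController,name=ActiveControllerCount": 1,
}

def violations(cluster_totals: dict[str, int]) -> dict[str, int]:
    """Return metrics whose cluster-wide total differs from the healthy value.

    Metrics missing from the input are skipped rather than flagged.
    """
    return {
        name: value
        for name, value in cluster_totals.items()
        if name in EXPECTED and value != EXPECTED[name]
    }

# Made-up readings: three under-replicated partitions, one active controller.
readings = {
    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions": 3,
    "kafka.controller:type=KafkaController,name=ActiveControllerCount": 1,
}
print(violations(readings))
```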
Log Analysis
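Broker logs can be summarized by level before reading them in detail. The sketch below assumes the log4j layout Kafka ships with, `[%d] %p %m (%c)%n`, where each entry starts with a bracketed timestamp followed by the level; adjust the pattern if your layout differs. The sample lines are invented for illustration.

```python
import re
from collections import Counter

# Matches "[timestamp] LEVEL message" at the start of a log entry.
LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL) (?P<msg>.*)$"
)

def count_levels(lines) -> Counter:
    """Count log entries per level; continuation lines such as stack traces are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.rstrip("\n"))
        if match:
            counts[match.group("level")] += 1
    return counts

# Invented sample lines in the assumed layout.
SAMPLE_LOG = [
    "[2024-05-01 10:15:30,123] INFO [KafkaServer id=1] started (kafka.server.KafkaServer)",
    "[2024-05-01 10:16:02,456] WARN [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] "
    "Error in response for fetch request (kafka.server.ReplicaFetcherThread)",
    "[2024-05-01 10:16:05,789] ERROR [ReplicaManager broker=1] Error processing append "
    "operation on partition orders-0 (kafka.server.ReplicaManager)",
    "java.io.IOException: No space left on device",
]

print(count_levels(SAMPLE_LOG))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which keeps memory use flat on large logs.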
Performance Troubleshooting Checklist
Step-by-step checklist for diagnosing performance issues
Check Broker Health
- Verify all brokers are online and responsive
- Check CPU, memory, and disk utilization
- Review recent GC activity and JVM performance
- Examine broker logs for errors or warnings
Analyze Producer Performance
- Monitor producer throughput and latency metrics
- Check for failed send attempts and retries
- Verify batch size and compression settings
- Review buffer memory usage and blocking time
Examine Consumer Behavior
- Measure consumer lag across all partitions
- Check consumer group stability and rebalancing
- Monitor fetch performance and processing time
- Verify proper offset management and commits
Review Topic Configuration
- Validate partition count and distribution
- Check replication factor and ISR health
- Review retention and cleanup policies
- Examine topic-level metrics and hotspots
Emergency Response Procedures
Critical procedures for handling severe Kafka outages
Broker Failure Response
Immediate Actions
- Verify the broker is actually down (not just unreachable because of a network issue)
- Check if other brokers can handle the load
- Monitor producer and consumer error rates
- Prepare for potential partition leadership changes
Recovery Steps
- Restart broker service if possible
- Check disk space and file system integrity
- Verify log directory permissions and accessibility
- Monitor cluster health after broker rejoins
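The disk space step can be scripted with the Python standard library. This is a minimal sketch: the path in the usage comment is a placeholder for whatever `log.dirs` points to in the broker's `server.properties`, and the 15% threshold is an arbitrary example.

```python
import shutil

def disk_free_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

def low_on_space(path: str, min_free: float = 0.15) -> bool:
    """True if free space on the filesystem holding `path` is below min_free."""
    return disk_free_fraction(path) < min_free

# Usage (placeholder path; use the directories from log.dirs):
#   low_on_space("/var/lib/kafka/data")
```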
Data Loss Prevention
Before Restart
- Back up log directories if possible
- Document current state and error messages
- Check for corrupted log segments
- Verify replication status of critical topics
After Recovery
- Verify all partitions have proper leadership
- Check ISR status for all topics
- Monitor replication lag
- Test producer/consumer functionality
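Leadership and ISR status can be read from `kafka-topics.sh --bootstrap-server <broker> --describe`. The sketch below parses that output and reports partitions with no leader or with fewer in-sync replicas than assigned replicas. It assumes the per-partition line format `Topic: … Partition: … Leader: … Replicas: … Isr: …`, and it treats a leader of `none` or `-1` as missing; both are assumptions to verify against your Kafka version. The sample text is made up.

```python
import re

PARTITION = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:[ ]?(?P<isr>[\d,]*)"
)

def partition_problems(describe_output: str) -> list[str]:
    """List partitions without a leader or with a shrunken ISR."""
    problems = []
    for m in PARTITION.finditer(describe_output):
        name = f"{m['topic']}-{m['partition']}"
        replicas = m["replicas"].split(",")
        isr = [broker for broker in m["isr"].split(",") if broker]
        if m["leader"] in ("none", "-1"):
            problems.append(f"{name}: no leader")
        if len(isr) < len(replicas):
            problems.append(f"{name}: ISR {len(isr)}/{len(replicas)}")
    return problems

# Made-up sample; the summary line is ignored because it has no "Partition:" field.
SAMPLE_TOPICS = (
    "Topic: orders\tPartitionCount: 2\tReplicationFactor: 3\tConfigs: \n"
    "\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
    "\tTopic: orders\tPartition: 1\tLeader: none\tReplicas: 2,3,1\tIsr: 2\n"
)

print(partition_problems(SAMPLE_TOPICS))
```

An empty result after the broker rejoins is a reasonable signal that replication has caught up; `kafka-topics.sh` also accepts `--under-replicated-partitions` to filter the output on the broker side.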
Simplify Kafka Troubleshooting
KLogic automatically diagnoses common Kafka issues and provides intelligent recommendations to fix problems before they impact your applications.