
Kafka Troubleshooting Guide

Comprehensive guide to diagnosing and fixing common Kafka issues in production environments. From consumer lag to broker failures, get your Kafka cluster back on track quickly.

Published: August 3, 2025 • 10 min read • Troubleshooting Guide

Systematic Troubleshooting Approach

When Kafka issues occur, a systematic approach saves time and prevents further problems. Follow this methodology to diagnose issues effectively and implement lasting fixes.

Identify Symptoms

Gather information about what's not working and when it started

Check Basics

Verify broker status, network connectivity, and resource availability

Analyze Logs

Review broker, producer, and consumer logs for error patterns

Implement Fix

Apply targeted solutions and monitor for improvement

Most Common Kafka Issues

The top issues encountered in production Kafka deployments and how to fix them

Consumer Lag Issues

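High consumer lag, the gap between a partition's latest offset and the offset the consumer group has committed, is one of the most common Kafka issues. It usually points to a processing bottleneck or an unstable consumer group.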

Symptoms

Increasing consumer lag metrics
Delayed message processing
Frequent consumer group rebalances
Timeouts in downstream applications

Diagnostic Commands

# Check consumer group lag
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe --group my-group
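The output of the command above is a table with one row per partition, which gets hard to read for large groups. As a minimal sketch, assuming the output includes a header row with TOPIC and LAG columns (the exact column set varies between Kafka versions), the lag can be summed per topic like this. The function name and the sample text are hypothetical and only illustrate the shape of the input.

```python
from collections import defaultdict


def lag_by_topic(describe_output: str) -> dict[str, int]:
    """Sum the LAG column per topic from `--describe` style text."""
    lines = [ln for ln in describe_output.splitlines() if ln.strip()]
    # Find the header by column name instead of position, because the
    # column set is assumed to differ between Kafka versions.
    header_idx = next(
        (i for i, ln in enumerate(lines) if {"TOPIC", "LAG"} <= set(ln.split())),
        None,
    )
    if header_idx is None:
        return {}
    header = lines[header_idx].split()
    topic_col, lag_col = header.index("TOPIC"), header.index("LAG")

    totals: dict[str, int] = defaultdict(int)
    for ln in lines[header_idx + 1:]:
        fields = ln.split()
        if len(fields) <= max(topic_col, lag_col):
            continue
        # Rows without a committed offset show "-" instead of a number;
        # those rows (and repeated header rows) are skipped.
        if fields[lag_col].isdigit():
            totals[fields[topic_col]] += int(fields[lag_col])
    return dict(totals)


# Made-up sample text, not captured from a real cluster.
SAMPLE = """
GROUP     TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST       CLIENT-ID
my-group  orders  0          1500            1620            120  consumer-1   /10.0.0.5  client-1
my-group  orders  1          900             930             30   consumer-2   /10.0.0.6  client-2
my-group  audit   0          -               45              -    -            -          -
"""

print(lag_by_topic(SAMPLE))
```

Sampling this total every minute or so shows whether lag is growing, flat, or draining, which matters more than any single reading.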

Common Causes & Solutions

Slow Message Processing

Consumer takes too long to process each message.

Solution: Optimize processing logic, increase consumer instances, or implement batch processing.

Insufficient Consumers

Not enough consumer instances for the number of partitions.

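Solution: Add consumer instances until the group has one per partition. Instances beyond the partition count sit idle, so increase the partition count if more parallelism is needed.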

Resource Constraints

Consumer host lacks CPU, memory, or network resources.

Solution: Monitor resource usage and scale infrastructure.
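To decide between optimizing processing and adding consumers, it helps to put rough numbers on the causes above. The following is a back-of-the-envelope sketch, not a capacity model: it assumes steady produce and processing rates, and every function and parameter name here is hypothetical.

```python
import math
from typing import Optional


def consumers_needed(produce_rate: float, per_consumer_rate: float,
                     partitions: int) -> int:
    """Estimate consumer instances needed to keep pace with producers.

    Rates are in messages per second. The result is capped at the
    partition count, because within one consumer group a partition is
    read by at most one consumer.
    """
    if per_consumer_rate <= 0:
        raise ValueError("per_consumer_rate must be positive")
    wanted = math.ceil(produce_rate / per_consumer_rate)
    return max(1, min(wanted, partitions))


def drain_seconds(lag: int, consumers: int, per_consumer_rate: float,
                  produce_rate: float) -> Optional[float]:
    """Seconds needed to clear the backlog, or None if it never clears."""
    surplus = consumers * per_consumer_rate - produce_rate
    return lag / surplus if surplus > 0 else None


# Illustrative numbers: 5000 msg/s produced, 800 msg/s per consumer.
print(consumers_needed(5000, 800, partitions=12))
print(drain_seconds(60_000, consumers=7, per_consumer_rate=800, produce_rate=5000))
```

If the estimate hits the partition cap and the backlog still cannot drain, scaling consumers alone will not help; the options left are faster processing or more partitions.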

Broker Performance Issues

Broker performance problems can cascade across the entire cluster, affecting producers and consumers.

High CPU Usage

Cause: Inefficient compression, too many small batches

Fix: Tune compression settings, increase batch sizes

Disk I/O Bottlenecks

Cause: Slow storage, insufficient IOPS

Fix: Upgrade to SSDs, optimize log retention

Memory Issues

Cause: Inadequate JVM heap, memory leaks

Fix: Tune JVM settings, monitor GC performance
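For the high CPU case, the batch size and compression fixes are applied on the producer side. The sketch below shows the relevant settings as a plain dictionary keyed by the standard producer configuration names (batch.size, linger.ms, compression.type), plus a rough estimate of how batch size changes the number of batches the broker has to handle. The values are illustrative assumptions, not recommendations, and the helper names are hypothetical.

```python
import math

# Illustrative producer overrides; tune the values against your own workload.
producer_overrides = {
    "batch.size": 65536,         # max bytes per partition batch (client default is 16384)
    "linger.ms": 10,             # wait up to 10 ms for a batch to fill
    "compression.type": "lz4",   # compress whole batches on the producer
}


def records_per_batch(batch_size_bytes: int, avg_record_bytes: int) -> int:
    """Rough number of uncompressed records that fit in one batch."""
    return max(1, batch_size_bytes // avg_record_bytes)


def min_batches_per_sec(records_per_sec: int, batch_size_bytes: int,
                        avg_record_bytes: int) -> int:
    """Lower bound on batches per second, assuming every batch fills up."""
    per_batch = records_per_batch(batch_size_bytes, avg_record_bytes)
    return math.ceil(records_per_sec / per_batch)


# 10,000 records/s of about 512 bytes each, default vs. larger batch size.
print(min_batches_per_sec(10_000, 16384, 512))
print(min_batches_per_sec(10_000, producer_overrides["batch.size"], 512))
```

Fewer, larger batches mean fewer requests for the broker to parse and better compression ratios, at the cost of a little added latency from linger.ms.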

Connection & Network Issues

Network connectivity problems between clients and brokers or between brokers themselves.

Connection Timeouts

Clients frequently time out when connecting to brokers.

Check network latency between client and broker

Verify firewall rules and port accessibility

Increase timeout values in client configuration (for example request.timeout.ms) once network causes have been ruled out

Broker Connectivity

Brokers cannot communicate with each other properly.

Verify inter-broker listener configuration (listeners, advertised.listeners, inter.broker.listener.name)

Check DNS resolution for broker hostnames

Validate security group and network ACL settings
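The DNS and port checks above can be scripted so they run the same way from every client host. This is a minimal sketch using only the Python standard library; the function name and result fields are hypothetical. A successful TCP connect only shows that the port is reachable. It does not validate the addresses a broker advertises to clients, so run it against the advertised hostnames as well as the bootstrap address.

```python
import socket
import time


def check_broker(host: str, port: int, timeout: float = 3.0) -> dict:
    """Resolve a broker hostname and time a plain TCP connect to it."""
    result = {"host": host, "port": port, "resolved": [],
              "reachable": False, "connect_ms": None, "error": None}

    # Step 1: DNS resolution for the broker hostname.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result

    # Step 2: TCP connect, which exercises firewall rules and latency.
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["reachable"] = True
            result["connect_ms"] = round((time.monotonic() - start) * 1000, 1)
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result
```

Calling check_broker("localhost", 9092) returns a dictionary whose error field separates a DNS failure from a blocked or closed port, which narrows down which of the checks above to pursue.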

Essential Diagnostic Tools

Tools and commands for diagnosing Kafka issues effectively

Command Line Tools

kafka-topics.sh - Topic management

kafka-consumer-groups.sh - Consumer monitoring

kafka-log-dirs.sh - Log directory analysis

kafka-broker-api-versions.sh - Version compatibility

JMX Metrics

MessagesInPerSec - Incoming message rate

BytesInPerSec - Incoming data rate

TotalTimeMs - Request processing time

NetworkProcessorAvgIdlePercent - Fraction of time the network threads are idle (low values mean the network layer is saturated)
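The metrics above are exposed as JMX MBeans on each broker. The sketch below lists the MBean names as given in the Kafka monitoring documentation and applies two simple checks to values you have already collected. How the values are collected (a JMX exporter, an agent, and so on) is left out, the find_warnings helper is hypothetical, and the latency threshold is an arbitrary assumption. The 0.3 idle threshold follows the Kafka documentation's guidance that the idle ratio should ideally stay above 0.3.

```python
# MBean names for the metrics listed above. TotalTimeMs is reported per
# request type; Produce is used here as an example.
MBEANS = {
    "MessagesInPerSec": "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",
    "BytesInPerSec": "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec",
    "TotalTimeMs": "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce",
    "NetworkProcessorAvgIdlePercent":
        "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent",
}


def find_warnings(samples: dict, min_idle: float = 0.3,
                  max_total_time_ms: float = 100.0) -> list[str]:
    """Flag sampled metric values that suggest a saturated broker.

    `samples` maps the short metric names above to numeric readings.
    """
    warnings = []
    idle = samples.get("NetworkProcessorAvgIdlePercent")
    if idle is not None and idle < min_idle:
        warnings.append(
            f"NetworkProcessorAvgIdlePercent is {idle:.2f}, below {min_idle}"
        )
    total = samples.get("TotalTimeMs")
    if total is not None and total > max_total_time_ms:
        warnings.append(f"TotalTimeMs is {total:.0f} ms, above {max_total_time_ms:.0f} ms")
    return warnings


# Illustrative readings, not measurements from a real broker.
print(find_warnings({"NetworkProcessorAvgIdlePercent": 0.12, "TotalTimeMs": 40.0}))
```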

Log Analysis

server.log - Broker operations and errors

state-change.log - Partition leadership changes

controller.log - Cluster coordination

log-cleaner.log - Compaction process
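When these files are large, counting warnings and errors by the class that logged them is a quick way to spot a pattern. The sketch below assumes the broker still uses the default log layout, in which each entry looks like "[timestamp] LEVEL message (logger)". If the layout has been customized, the regular expression needs to change. The sample lines are made up for illustration and the summarize helper is hypothetical.

```python
import re
from collections import Counter

# Matches "[timestamp] LEVEL message (logger.name)" on a single line.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] "
    r"(?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL) "
    r"(?P<msg>.*) \((?P<logger>[\w.$]+)\)\s*$"
)


def summarize(lines, levels=("WARN", "ERROR", "FATAL")) -> Counter:
    """Count log entries per (level, logger) for the given levels."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        # Stack trace lines and other continuations do not match and are skipped.
        if match and match.group("level") in levels:
            counts[(match.group("level"), match.group("logger"))] += 1
    return counts


# Made-up sample entries in the assumed layout, not real broker output.
SAMPLE_LOG = """\
[2025-08-03 10:15:02,114] INFO sample message one (kafka.server.KafkaServer)
[2025-08-03 10:16:40,870] WARN sample message two (kafka.server.ReplicaFetcherThread)
[2025-08-03 10:16:41,002] ERROR sample message three (kafka.server.ReplicaManager)
[2025-08-03 10:16:42,310] ERROR sample message four (kafka.server.ReplicaManager)
"""

for (level, logger), count in summarize(SAMPLE_LOG.splitlines()).most_common():
    print(level, logger, count)
```

For a real file, pass an open file handle instead of the sample lines, for example summarize(open("server.log")).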

Performance Troubleshooting Checklist

Step-by-step checklist for diagnosing performance issues

Check Broker Health

• Verify all brokers are online and responsive
• Check CPU, memory, and disk utilization
• Review recent GC activity and JVM performance
• Examine broker logs for errors or warnings

Analyze Producer Performance

• Monitor producer throughput and latency metrics
• Check for failed send attempts and retries
• Verify batch size and compression settings
• Review buffer memory usage and blocking time

Examine Consumer Behavior

• Measure consumer lag across all partitions
• Check consumer group stability and rebalancing
• Monitor fetch performance and processing time
• Verify proper offset management and commits

Review Topic Configuration

• Validate partition count and distribution
• Check replication factor and ISR health
• Review retention and cleanup policies
• Examine topic-level metrics and hotspots

Emergency Response Procedures

Critical procedures for handling severe Kafka outages

Broker Failure Response

Immediate Actions

Verify the broker is actually down (not just unreachable because of a network issue)
Check if other brokers can handle the load
Monitor producer and consumer error rates
Prepare for potential partition leadership changes

Recovery Steps

Restart broker service if possible
Check disk space and file system integrity
Verify log directory permissions and accessibility
Monitor cluster health after broker rejoins
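The disk space and permission checks in the recovery steps can be run as a quick pre-flight before restarting the broker. This is a minimal sketch using the Python standard library. The function name and the 10% free-space threshold are assumptions, and the script must run as the same OS user as the broker process for the permission check to be meaningful. Point it at each directory listed in the broker's log.dirs setting.

```python
import os
import shutil


def check_log_dir(path: str, min_free_ratio: float = 0.10) -> list[str]:
    """Return a list of problems found with a Kafka log directory."""
    if not os.path.isdir(path):
        return [f"{path}: missing or not a directory"]

    problems = []
    # The broker needs to list, read, and write segment files here.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        problems.append(f"{path}: current user lacks read/write/execute access")

    usage = shutil.disk_usage(path)
    if usage.total > 0:
        free_ratio = usage.free / usage.total
        if free_ratio < min_free_ratio:
            problems.append(
                f"{path}: only {free_ratio:.1%} of the volume is free"
            )
    return problems
```

An empty list means none of these basic checks failed; it does not rule out file system corruption, which needs the operating system's own tools.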

Data Loss Prevention

Before Restart

• Back up log directories if possible
• Document current state and error messages
• Check for corrupted log segments
• Verify replication status of critical topics

After Recovery

• Verify all partitions have proper leadership
• Check ISR status for all topics
• Monitor replication lag
• Test producer/consumer functionality
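The leadership and ISR checks above can be automated by scanning the per-partition lines printed by kafka-topics.sh --describe. The sketch below assumes each partition line contains "Topic:", "Partition:", "Leader:", "Replicas:" and "Isr:" fields in that order, and that a partition without a leader is shown as "none" or "-1" (this differs between versions, so treat both as assumptions). The sample text is made up and the find_unhealthy helper is hypothetical.

```python
import re

PARTITION_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]*)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def find_unhealthy(describe_output: str) -> list[tuple[str, str]]:
    """Return (topic-partition, problem) pairs for partitions needing attention."""
    issues = []
    for line in describe_output.splitlines():
        match = PARTITION_RE.search(line)
        if not match:
            continue  # topic summary lines and blank lines
        name = f"{match['topic']}-{match['partition']}"
        replicas = [r for r in match["replicas"].split(",") if r]
        isr = [r for r in match["isr"].split(",") if r]
        if match["leader"] in ("none", "-1"):
            issues.append((name, "no leader"))
        elif len(isr) < len(replicas):
            issues.append((name, "under-replicated"))
    return issues


# Made-up sample text, not captured from a real cluster.
SAMPLE_DESCRIBE = """\
Topic: orders  PartitionCount: 3  ReplicationFactor: 3
  Topic: orders  Partition: 0  Leader: 1     Replicas: 1,2,3  Isr: 1,2,3
  Topic: orders  Partition: 1  Leader: 2     Replicas: 2,3,1  Isr: 2,3
  Topic: orders  Partition: 2  Leader: none  Replicas: 3,1,2  Isr:
"""

print(find_unhealthy(SAMPLE_DESCRIBE))
```

Rerunning the check until it returns an empty list gives a simple signal that the restarted broker has caught up and rejoined the ISR for its partitions.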
