KLogic

Kafka Troubleshooting Guide

Comprehensive guide to diagnosing and fixing common Kafka issues in production environments. From consumer lag to broker failures, get your Kafka cluster back on track quickly.

Published: August 3, 2025 • 20 min read • Troubleshooting Guide

Systematic Troubleshooting Approach

When Kafka issues occur, a systematic approach saves time and prevents further problems. Follow this methodology to diagnose issues effectively and implement lasting fixes.

  1. Identify Symptoms: Gather information about what's not working and when it started.
  2. Check Basics: Verify broker status, network connectivity, and resource availability.
  3. Analyze Logs: Review broker, producer, and consumer logs for error patterns.
  4. Implement Fix: Apply targeted solutions and monitor for improvement.

Most Common Kafka Issues

The top issues encountered in production Kafka deployments and how to fix them

Consumer Lag Issues

High consumer lag is one of the most common Kafka issues, often indicating processing bottlenecks or consumer group problems.

Symptoms

  • Increasing consumer lag metrics
  • Delayed message processing
  • Frequent consumer group rebalances
  • Timeouts in downstream applications

Diagnostic Commands

# Check consumer group lag
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe --group my-group
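
The describe output lists one row per partition, which gets hard to read for large topics. As a rough sketch, the helper below (`lag_by_topic` is a name made up for this guide) sums the LAG column per topic. It assumes the output has a header row containing TOPIC and LAG columns, as current Kafka releases print, and it skips rows where the lag is shown as "-".

```shell
# Sum the LAG column per topic from `kafka-consumer-groups.sh --describe`
# output read on stdin. Columns are located by header name, not by position.
lag_by_topic() {
  awk '
    # Until the header row is found, look for the TOPIC and LAG columns.
    lag_col == 0 {
      for (i = 1; i <= NF; i++) {
        if ($i == "TOPIC") topic_col = i
        if ($i == "LAG")   lag_col = i
      }
      next
    }
    # Data row: add numeric lag values to the per-topic total.
    $lag_col ~ /^[0-9]+$/ { total[$topic_col] += $lag_col }
    END { for (t in total) print t, total[t] }
  ' | sort
}
```

To use it, pipe the command above into the function: `kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group | lag_by_topic`.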

Common Causes & Solutions

Slow Message Processing

Consumer takes too long to process each message.

Solution: Optimize processing logic, increase consumer instances, or implement batch processing.

Insufficient Consumers

Not enough consumer instances for the number of partitions.

Solution: Add consumer instances, up to one per partition. Consumers beyond the partition count sit idle.

Resource Constraints

Consumer host lacks CPU, memory, or network resources.

Solution: Monitor resource usage and scale infrastructure.
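
Before adding consumers, it helps to know how many partitions and active consumers the group already has. The sketch below reads the same describe output and counts both. `consumer_capacity` is a hypothetical helper name, and the parsing assumes the header includes TOPIC, PARTITION, LAG, and CONSUMER-ID columns, with "-" shown for unassigned partitions.

```shell
# Count distinct partitions and distinct active consumers in
# `kafka-consumer-groups.sh --describe` output read on stdin.
consumer_capacity() {
  awk '
    # Until the header row is found, record the column positions.
    lag_col == 0 {
      for (i = 1; i <= NF; i++) {
        if ($i == "TOPIC")       topic_col = i
        if ($i == "PARTITION")   part_col = i
        if ($i == "CONSUMER-ID") consumer_col = i
        if ($i == "LAG")         lag_col = i
      }
      next
    }
    # Data row: remember the partition and, if assigned, its consumer.
    $part_col ~ /^[0-9]+$/ {
      partitions[$topic_col "/" $part_col] = 1
      if ($consumer_col != "-") consumers[$consumer_col] = 1
    }
    END {
      p = 0; for (k in partitions) p++
      c = 0; for (k in consumers) c++
      printf "partitions=%d consumers=%d\n", p, c
    }
  '
}
```

If the consumer count is already equal to the partition count, adding instances will not help, and the options are to speed up processing or add partitions.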

Broker Performance Issues

Broker performance problems can cascade across the entire cluster, affecting producers and consumers.

High CPU Usage

Cause: Inefficient compression, too many small batches

Fix: Tune compression settings, increase batch sizes
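
Batching and compression are set on the producer side, so this fix usually means changing client configuration. The sketch below writes an illustrative producer properties file. `compression.type`, `batch.size`, and `linger.ms` are standard producer settings, but the values and the helper name `write_producer_tuning` are assumptions to test against your own workload.

```shell
# Write an illustrative producer config that favors larger, compressed batches.
# The values are assumptions to tune, not recommendations.
write_producer_tuning() {
  cat > "$1" <<'EOF'
# Compress whole batches; lz4 is a common low-CPU choice.
compression.type=lz4
# Upper bound on a per-partition batch, in bytes (Kafka default: 16384).
batch.size=65536
# Wait up to 10 ms for a batch to fill before sending (adds latency).
linger.ms=10
EOF
}
```

Larger batches mean fewer requests for the broker to handle, at the cost of a little extra latency from `linger.ms`.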

Disk I/O Bottlenecks

Cause: Slow storage, insufficient IOPS

Fix: Upgrade to SSDs, optimize log retention
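
Retention is commonly adjusted per topic through the `retention.ms` setting, which is expressed in milliseconds. As a small sketch, the helpers below convert days to milliseconds and print the matching `kafka-configs.sh` command without running it. The helper names, the topic name, and the default bootstrap address are placeholders.

```shell
# Convert a retention period in days to the millisecond value that the
# topic-level retention.ms setting expects.
retention_ms_for_days() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# Print (do not run) the kafka-configs.sh command that would apply it.
# Arguments: topic name, retention in days, bootstrap server (default localhost:9092).
print_retention_command() {
  printf 'kafka-configs.sh --bootstrap-server %s --entity-type topics --entity-name %s --alter --add-config retention.ms=%s\n' \
    "${3:-localhost:9092}" "$1" "$(retention_ms_for_days "$2")"
}
```

Printing the command first gives you a chance to review it before changing a production topic.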

Memory Issues

Cause: Inadequate JVM heap, memory leaks

Fix: Tune JVM settings, monitor GC performance
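
Kafka's start scripts read the heap size from the `KAFKA_HEAP_OPTS` environment variable, so heap tuning is often a one-line change. The snippet below is a sketch: the 6 GB value is an assumed starting point, and `watch_broker_gc` is a hypothetical wrapper around the JDK's `jstat` tool.

```shell
# Heap settings picked up by kafka-server-start.sh when the broker starts.
# Equal -Xms and -Xmx avoid heap resizing; 6g is an assumption, size it to your host.
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"

# Print GC utilization for a running broker once per second.
# Argument: the broker's JVM process ID.
watch_broker_gc() {
  jstat -gcutil "$1" 1000
}
```

Steadily climbing old-generation usage or frequent full collections in the `jstat` output are signs that the heap is too small or that something is leaking.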

Connection & Network Issues

Connectivity problems can occur between clients and brokers or among the brokers themselves.

Connection Timeouts

Clients frequently time out when connecting to brokers.

  • Check network latency between client and broker
  • Verify firewall rules and port accessibility
  • Increase timeout values in client configuration
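
To check port accessibility broker by broker, it is handy to split the client's `bootstrap.servers` value into individual endpoints first. The sketch below does only that. `split_bootstrap` is a hypothetical helper, and it assumes plain `host:port` entries (no IPv6 literals).

```shell
# Split a bootstrap.servers string into one "host port" pair per line so each
# broker endpoint can be probed individually.
split_bootstrap() {
  printf '%s\n' "$1" | tr -d ' ' | tr ',' '\n' | awk -F: 'NF == 2 { print $1, $2 }'
}
```

Assuming `nc` is installed, each endpoint can then be probed with `split_bootstrap "broker1:9092,broker2:9092" | while read -r host port; do nc -vz -w 5 "$host" "$port"; done`. If the ports are reachable but clients still time out, `request.timeout.ms` and `socket.connection.setup.timeout.ms` are the client settings most often raised.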

Broker Connectivity

Brokers cannot communicate with each other properly.

  • Verify inter-broker listener configuration
  • Check DNS resolution for broker hostnames
  • Validate security group and network ACL settings
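
A quick way to review the inter-broker listener setup is to pull the listener-related keys out of each broker's properties file and compare them across brokers. The sketch below assumes the standard property names `listeners`, `advertised.listeners`, `inter.broker.listener.name`, and `listener.security.protocol.map`. The helper name is made up for this guide.

```shell
# Print the listener-related settings from a broker properties file.
# Commented-out lines are ignored.
show_listener_config() {
  grep -E '^[[:space:]]*(listeners|advertised\.listeners|inter\.broker\.listener\.name|listener\.security\.protocol\.map)[[:space:]]*=' "$1"
}
```

For example, `show_listener_config config/server.properties` (the path is an assumption) shows whether the listener named by `inter.broker.listener.name` is actually defined and advertised with a hostname the other brokers can resolve.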

Essential Diagnostic Tools

Tools and commands for diagnosing Kafka issues effectively

Command Line Tools

  • kafka-topics.sh - Topic management
  • kafka-consumer-groups.sh - Consumer monitoring
  • kafka-log-dirs.sh - Log directory analysis
  • kafka-broker-api-versions.sh - Version compatibility

JMX Metrics

  • MessagesInPerSec - Incoming message rate
  • BytesInPerSec - Incoming data rate
  • TotalTimeMs - Total time to serve a request, reported per request type
  • NetworkProcessorAvgIdlePercent - Average idle fraction of the network processor threads (lower means busier)
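
JMX tools ask for the full MBean object name, not the short metric name. The lookup below is a sketch that maps the four metrics above to the object names used in Kafka's monitoring documentation. `mbean_for` is a hypothetical helper, and the default request type of Produce for TotalTimeMs is an assumption.

```shell
# Map a short broker metric name to its full JMX object name.
# Optional second argument: request type for TotalTimeMs (default Produce).
mbean_for() {
  case "$1" in
    MessagesInPerSec|BytesInPerSec)
      echo "kafka.server:type=BrokerTopicMetrics,name=$1" ;;
    TotalTimeMs)
      # Reported per request type, for example Produce or FetchConsumer.
      echo "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=${2:-Produce}" ;;
    NetworkProcessorAvgIdlePercent)
      echo "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent" ;;
    *)
      echo "unknown metric: $1" >&2
      return 1 ;;
  esac
}
```

The printed name can be pasted into JConsole or whichever JMX exporter you already run.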

Log Analysis

  • server.log - Broker operations and errors
  • state-change.log - Partition leadership and state changes
  • controller.log - Cluster coordination
  • log-cleaner.log - Compaction process
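
A first pass over any of these logs is often just a count of warnings and errors. The sketch below assumes Kafka's default log4j layout, where each entry starts with a bracketed timestamp followed by the level. `count_log_levels` is a hypothetical helper, and a customized log pattern would need a different field position.

```shell
# Count WARN, ERROR, and FATAL entries in a broker log file.
# Assumes the default layout: "[date time] LEVEL message (logger)".
count_log_levels() {
  awk '
    # Only lines that start with a bracketed timestamp are log entries;
    # stack trace lines are skipped.
    $1 ~ /^\[/ && ($3 == "WARN" || $3 == "ERROR" || $3 == "FATAL") { n[$3]++ }
    END { printf "WARN=%d ERROR=%d FATAL=%d\n", n["WARN"], n["ERROR"], n["FATAL"] }
  ' "$1"
}
```

Running it against the same log before and after an incident window makes a sudden jump in errors easy to spot.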

Performance Troubleshooting Checklist

Step-by-step checklist for diagnosing performance issues

1. Check Broker Health

  • Verify all brokers are online and responsive
  • Check CPU, memory, and disk utilization
  • Review recent GC activity and JVM performance
  • Examine broker logs for errors or warnings

2. Analyze Producer Performance

  • Monitor producer throughput and latency metrics
  • Check for failed send attempts and retries
  • Verify batch size and compression settings
  • Review buffer memory usage and blocking time

3. Examine Consumer Behavior

  • Measure consumer lag across all partitions
  • Check consumer group stability and rebalancing
  • Monitor fetch performance and processing time
  • Verify proper offset management and commits

4. Review Topic Configuration

  • Validate partition count and distribution
  • Check replication factor and ISR health
  • Review retention and cleanup policies
  • Examine topic-level metrics and hotspots
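
For the ISR health check, one approach is to compare each partition's in-sync replica set with its full replica list. The sketch below parses `kafka-topics.sh --describe` output, assuming the usual `Topic: ... Partition: ... Replicas: ... Isr: ...` line format. `under_replicated` is a name invented for this guide.

```shell
# Read `kafka-topics.sh --describe` output on stdin and print every partition
# whose in-sync replica set is smaller than its replica list.
under_replicated() {
  awk '
    {
      topic = part = replicas = isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      if (part == "" || replicas == "") next   # topic summary or unrelated line
      if (isr ~ /:$/) isr = ""                 # empty ISR: next token is another label
      r = split(replicas, a, ",")
      s = split(isr, b, ",")
      if (s < r) print topic, part, "isr=" s "/" r
    }
  '
}
```

`kafka-topics.sh` also has a built-in `--under-replicated-partitions` filter for `--describe`. The parser is mainly useful when you want the ISR-to-replica ratio in the output or are working from saved output.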

Emergency Response Procedures

Critical procedures for handling severe Kafka outages

Broker Failure Response

Immediate Actions

  1. Verify that the broker is actually down and not just unreachable over the network
  2. Check if other brokers can handle the load
  3. Monitor producer and consumer error rates
  4. Prepare for potential partition leadership changes

Recovery Steps

  1. Restart broker service if possible
  2. Check disk space and file system integrity
  3. Verify log directory permissions and accessibility
  4. Monitor cluster health after broker rejoins
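
The disk space and permission checks in the recovery steps can be scripted so they run the same way on every broker. The sketch below is one way to do it. `check_log_dir` is a hypothetical helper, and it should be run as the user the broker runs as so that the permission test is meaningful.

```shell
# Report whether a Kafka log directory exists, is readable and writable by the
# current user, and how full its filesystem is.
check_log_dir() {
  dir=$1
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -r "$dir" ] && [ -w "$dir" ] || { echo "not readable/writable: $dir"; return 1; }
  # df -P prints one POSIX-format line per filesystem; column 5 is the
  # capacity in percent (for example 42%).
  used=$(df -P "$dir" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
  echo "ok: $dir (filesystem ${used}% full)"
}
```

Call it once for each directory listed in the broker's `log.dirs` setting.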

Data Loss Prevention

Before Restart

  • Back up log directories if possible
  • Document current state and error messages
  • Check for corrupted log segments
  • Verify replication status of critical topics

After Recovery

  • Verify all partitions have proper leadership
  • Check ISR status for all topics
  • Monitor replication lag
  • Test producer/consumer functionality
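
The leadership check can be sketched as a filter over `kafka-topics.sh --describe` output. The helper below (`leaderless_partitions`, a made-up name) assumes that a partition without a leader is shown as `Leader: none` or `Leader: -1`, which varies by Kafka version, so confirm what your release prints.

```shell
# Read `kafka-topics.sh --describe` output on stdin and print partitions that
# currently have no leader (assumed to appear as "none" or "-1").
leaderless_partitions() {
  awk '
    {
      topic = part = leader = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Leader:")    leader = $(i + 1)
      }
      if (part != "" && (leader == "none" || leader == "-1")) print topic, part
    }
  '
}
```

Empty output means every partition in the input has a leader. `kafka-topics.sh --describe --unavailable-partitions` gives a similar view directly from the CLI.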
