Kafka Troubleshooting Guide

Comprehensive guide to diagnosing and resolving common Apache Kafka issues including broker failures, consumer lag, performance problems, and network connectivity.

Published: August 3, 2025 • 30 min read • Troubleshooting Guide

Quick Diagnostic Checklist

Start with these essential checks when troubleshooting any Kafka issue. These cover the most common root causes of Kafka problems.

Broker Health

All brokers online?
Controller elected?
Under-replicated partitions?

Resource Usage

CPU utilization < 80%?
Memory usage < 85%?
Disk space available?

Consumer Health

Consumer lag normal?
Rebalancing frequent?
Processing errors?

Network & Connectivity

Inter-broker connectivity?
Client connectivity?
DNS resolution working?
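
The resource checks in this list can be scripted. The following is a minimal sketch in Python, not part of any Kafka tooling. The 80% CPU and 85% memory limits come from the checklist above; the function names and the 15% minimum free disk value are assumptions to adjust for your environment.

```python
import shutil

# CPU and memory limits are taken from the checklist above.
CPU_LIMIT_PCT = 80.0
MEMORY_LIMIT_PCT = 85.0
# Assumed value: the checklist does not state a disk threshold.
MIN_DISK_FREE_PCT = 15.0


def check_resources(cpu_pct, memory_pct, disk_free_pct):
    """Return a list of warnings; an empty list means every check passed."""
    warnings = []
    if cpu_pct >= CPU_LIMIT_PCT:
        warnings.append(f"CPU at {cpu_pct:.0f}% (limit {CPU_LIMIT_PCT:.0f}%)")
    if memory_pct >= MEMORY_LIMIT_PCT:
        warnings.append(f"memory at {memory_pct:.0f}% (limit {MEMORY_LIMIT_PCT:.0f}%)")
    if disk_free_pct < MIN_DISK_FREE_PCT:
        warnings.append(f"only {disk_free_pct:.0f}% disk free (minimum {MIN_DISK_FREE_PCT:.0f}%)")
    return warnings


def disk_free_pct(path):
    """Percentage of free space on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total
```

CPU and memory percentages have to come from your own metrics source. For the disk check, point `disk_free_pct` at the directory configured in the broker's log.dirs setting.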

Common Kafka Issues & Solutions

Detailed troubleshooting for the most frequent Kafka problems

Broker Failures & Connectivity Issues

Broker failures are among the most critical Kafka issues because they often cause cascading problems throughout the cluster.

Symptoms:

• Under-replicated partitions > 0
• Offline partitions present
• Producer/consumer connection failures
• High request latency

Diagnostic Steps:

1. Check broker logs for errors and exceptions
2. Verify network connectivity between brokers
3. Check system resources (CPU, memory, disk)
4. Validate ZooKeeper connectivity and health (on KRaft clusters, check the controller quorum instead)

Common Solutions:

Resource Issues: Increase heap size, add disk space, or scale horizontally
Network Issues: Check firewall rules, DNS resolution, and advertised.listeners
Configuration: Verify replica.lag.time.max.ms and unclean.leader.election.enable
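
Connectivity between brokers (diagnostic step 2) can be verified with a plain TCP connection test. This is a minimal sketch using only the Python standard library. The function names are illustrative, and a successful connection only proves that the port is reachable, not that the broker behind it is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False


def unreachable(listeners, timeout=3.0):
    """Return the (host, port) pairs from `listeners` that cannot be reached."""
    return [(host, port) for host, port in listeners
            if not can_connect(host, port, timeout)]
```

Run the check from each broker host against the addresses the other brokers publish in advertised.listeners, since those are the addresses that clients and peers actually dial.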

Consumer Lag & Processing Issues

Consumer lag indicates that consumers cannot keep up with the rate of incoming messages. This leads to processing delays and, if messages expire under the topic's retention policy before they are consumed, to data loss.

Symptoms:

• Increasing consumer lag over time
• Frequent consumer group rebalancing
• Processing timeouts and errors
• Downstream system delays

Root Cause Analysis:

• Processing Speed: Consumer processing slower than producer rate
• Resource Constraints: Consumer instances under-resourced
• Partitioning: Uneven partition distribution or hotspots
• Configuration: Suboptimal consumer configuration settings
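
To tell a general slowdown from a partition hotspot, it helps to compute lag per partition. Lag is the partition's log end offset minus the group's committed offset. The sketch below assumes you have already collected both offsets into dictionaries keyed by partition number; the function names and the "three times the median" rule are illustrative choices, not Kafka conventions.

```python
import statistics


def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition: log end offset minus the group's committed offset."""
    return {
        partition: max(end - committed_offsets[partition], 0)
        for partition, end in log_end_offsets.items()
    }


def hot_partitions(lag_by_partition, factor=3.0):
    """Partitions whose lag is positive and above `factor` times the median lag."""
    threshold = factor * statistics.median(lag_by_partition.values())
    return sorted(p for p, lag in lag_by_partition.items()
                  if lag > 0 and lag > threshold)
```

If lag is spread evenly, the consumers are too slow overall. If one or two partitions dominate, look at the partitioning key first.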

Optimization Strategies:

Scale Consumers: Add more consumer instances
Optimize Processing: Improve consumer logic efficiency
Batch Processing: Increase max.poll.records, keeping per-batch processing time within max.poll.interval.ms
Parallel Processing: Use multiple threads per consumer
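
Before adding consumers, a rough estimate shows how many are needed and whether the topic has enough partitions to use them. Within one consumer group, each partition is read by at most one consumer, so instances beyond the partition count sit idle. The sketch below is back-of-the-envelope arithmetic with hypothetical function names; the rates are values you measure yourself, in messages per second.

```python
import math


def consumers_needed(produce_rate, per_consumer_rate, partitions):
    """Return (useful_instances, partition_limited).

    useful_instances is capped at the partition count; partition_limited is
    True when keeping up would require more consumers than partitions.
    """
    if per_consumer_rate <= 0:
        raise ValueError("per_consumer_rate must be positive")
    needed = math.ceil(produce_rate / per_consumer_rate)
    return min(needed, partitions), needed > partitions


def drain_seconds(lag, produce_rate, consume_rate):
    """Seconds to clear `lag`, or None if consumption does not outpace production."""
    if consume_rate <= produce_rate:
        return None
    return lag / (consume_rate - produce_rate)
```

When `partition_limited` is True, scaling consumers alone will not help. The options are faster processing per message or more partitions.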

Performance & Throughput Problems

Performance issues manifest as high latency, low throughput, or resource bottlenecks that prevent Kafka from operating at optimal capacity.

Performance Metrics:

Request latency (p99 > 100ms)
Throughput below the expected baseline
High CPU/memory usage
Disk I/O bottlenecks

Common Causes:

• Small batch sizes
• Unoptimized JVM settings
• Too many small partitions
• Storage I/O limitations

Optimization:

Increase batch.size
Tune JVM heap/GC
Optimize partitioning
Use SSD storage
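
The effect of a larger batch.size is easiest to see as arithmetic. The producer's batch.size is a per-partition limit in bytes, so larger batches mean fewer batches for the same message rate. The sketch below is a simplified model under stated assumptions: a single partition, uncompressed records of uniform size, and batches that always fill completely. The function name is hypothetical.

```python
def batches_per_second(records_per_sec, record_bytes, batch_size_bytes):
    """Estimate batches sent per second to one partition.

    Assumes every batch fills completely; a record larger than the batch
    size is sent in a batch of its own.
    """
    records_per_batch = max(1, batch_size_bytes // record_bytes)
    return records_per_sec / records_per_batch


# 10,000 records/s of 1 KiB each:
small = batches_per_second(10_000, 1024, 16_384)   # 16 records per batch
large = batches_per_second(10_000, 1024, 65_536)   # 64 records per batch
```

In practice batches fill only if records arrive fast enough or the producer is allowed to wait for them (linger.ms), so treat the result as a lower bound on batch count rather than a prediction.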

Essential Diagnostic Commands

Command-line tools and scripts for diagnosing Kafka issues

Cluster Health Commands

Check Broker Status

kafka-broker-api-versions.sh --bootstrap-server localhost:9092

List Under-Replicated Partitions

kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

Check Topic Configuration

kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --entity-type topics --entity-name {topic-name}

Consumer Diagnostic Commands

Check Consumer Group Lag

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group {group-name} --describe

List Consumer Groups

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --list

Reset Consumer Offsets

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group {group-name} --reset-offsets --to-latest \
  --topic {topic-name} --execute
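
Resetting to the latest offset skips every message the group has not yet consumed, and the group must have no active members when the command runs. Because the change is hard to undo, one cautious pattern is to preview it with the tool's --dry-run option before using --execute. The helper below is a hypothetical wrapper that only builds the argument list; it defaults to a dry run.

```python
def reset_offsets_command(group, topic, bootstrap="localhost:9092", execute=False):
    """Build the kafka-consumer-groups.sh argument list for an offset reset.

    Defaults to --dry-run, which prints the planned offsets without applying them.
    """
    return [
        "kafka-consumer-groups.sh",
        "--bootstrap-server", bootstrap,
        "--group", group,
        "--reset-offsets", "--to-latest",
        "--topic", topic,
        "--execute" if execute else "--dry-run",
    ]


# Example (group and topic names are placeholders):
# import subprocess
# subprocess.run(reset_offsets_command("my-group", "my-topic"), check=True)
```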

Proactive Monitoring Setup

Prevent issues before they occur with comprehensive monitoring

Essential Monitoring Checklist

Critical Alerts

Broker Down

Alert immediately when a broker becomes unreachable

Under-Replicated Partitions

Alert when any partition has fewer in-sync replicas than its replication factor

Offline Partitions

Critical alert for partitions without leaders
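
The two partition alerts above reduce to simple conditions on a partition's leader, replica list, and in-sync replica (ISR) list. The sketch below shows that logic. How the partition metadata is obtained is left open, and representing a missing leader as `None` or `-1` is an assumption about your input format.

```python
def classify_partition(leader, replicas, isr):
    """Classify one partition as 'offline', 'under-replicated', or 'healthy'.

    leader:   broker id of the leader, or None / -1 when there is no leader
    replicas: broker ids assigned to the partition
    isr:      broker ids currently in sync with the leader
    """
    if leader is None or leader < 0:
        return "offline"
    if len(isr) < len(replicas):
        return "under-replicated"
    return "healthy"
```

An offline partition cannot serve reads or writes at all, which is why it warrants a higher severity than an under-replicated one.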

Performance Monitoring

Consumer Lag Trends

Track lag patterns and growth over time

Request Latency

Monitor p95/p99 latency for early detection

Resource Utilization

Track CPU, memory, and disk usage trends
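
Both latency percentiles and lag trends are simple to compute from raw samples. The following sketch uses the nearest-rank percentile method and a least-squares slope. The function names are illustrative, and a monitoring system will usually provide its own equivalents.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def lag_growth_rate(points):
    """Least-squares slope of (timestamp_seconds, lag) points, in messages per second."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_lag = sum(lag for _, lag in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points)
    if denom == 0:
        raise ValueError("timestamps must not all be equal")
    return sum((t - mean_t) * (lag - mean_lag) for t, lag in points) / denom
```

A slope that stays positive across several measurement windows means the group is falling further behind, even if the absolute lag still looks small.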

KLogic Integration Benefits

Pre-configured alerts for all critical Kafka issues
AI-powered anomaly detection for early problem identification
Automated troubleshooting suggestions and remediation steps
Comprehensive dashboards with drill-down capabilities

Troubleshooting Best Practices

Proven strategies for effective Kafka issue resolution

1. Start with Cluster Health

Always begin troubleshooting by checking overall cluster health, broker status, and partition replication before diving into specific issues.

2. Check Logs Systematically

Review broker logs, application logs, and system logs in chronological order to understand the sequence of events leading to the issue.

3. Isolate the Problem

Use a binary search approach: test with simplified configurations, single partitions, or isolated consumers to narrow down the root cause.

4. Monitor Resource Usage

Resource exhaustion is a common cause of Kafka issues. Always check CPU, memory, disk, and network utilization during troubleshooting.

5. Test Configuration Changes

Before applying configuration changes to production, test them in a staging environment and measure their impact on performance and stability.

6. Document Solutions

Maintain a troubleshooting playbook documenting common issues, their symptoms, root causes, and proven solutions for future reference.

Simplify Kafka Troubleshooting

KLogic automates issue detection and provides intelligent troubleshooting recommendations, reducing mean time to resolution from hours to minutes.

Free 14-day trial • Automated issue detection • AI-powered recommendations • Expert troubleshooting guides