Setting Up Kafka Alerts

Learn how to configure effective Kafka alerting that catches real issues without overwhelming your team with false positives and noise.

Essential Kafka Alert Categories

The critical alerts every Kafka deployment should have

Broker Health Alerts

Monitor broker availability, resource usage, and performance degradation.

Broker Down (Critical)

High CPU (Warning)

Disk Full (Critical)

Consumer Lag Alerts

Track consumer performance and detect processing bottlenecks.

High Lag (Warning)

Consumer Down (Critical)

Lag Trend (Info)

Performance Alerts

Monitor throughput, latency, and system performance metrics.

High Latency (Warning)

Low Throughput (Info)

Error Rate (Critical)

Setting Effective Thresholds

Guidelines for setting alert thresholds that catch real issues

Threshold Best Practices

Use Historical Baselines

Set thresholds based on historical performance data, not arbitrary numbers

Account for Business Patterns

Consider daily, weekly, and seasonal variations in your thresholds

Use Multiple Conditions

Combine multiple conditions to reduce false positives

Include Duration Checks

Require conditions to persist for a minimum duration before alerting
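As a rough illustration of the first and third practices, the sketch below derives a lag threshold from historical samples and only fires when a second condition also holds. The function names, the "mean plus k standard deviations" rule, and the growth-rate condition are assumptions chosen for the example, not settings from any particular monitoring tool.

```python
from statistics import mean, stdev


def baseline_threshold(history, k=3.0):
    """Threshold = mean + k sample standard deviations of past values.

    `k` is an assumed tuning knob; a percentile would work equally well.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return mean(history) + k * stdev(history)


def lag_alert(current_lag, lag_history, lag_growth_per_min, k=3.0):
    """Fire only when lag is above its baseline AND is still growing.

    Combining the two conditions filters out spikes that are already draining.
    """
    above_baseline = current_lag > baseline_threshold(lag_history, k)
    still_growing = lag_growth_per_min > 0
    return above_baseline and still_growing


# Hypothetical lag samples (messages) from a normal week.
history = [1000, 1200, 900, 1100, 1000, 800]
threshold = baseline_threshold(history)
```

With these sample values the threshold lands a little above 1,400 messages, so a lag of 2,000 that is still growing would alert, while the same lag that is already shrinking would not.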

Example Thresholds

Consumer Lag

Warning: Lag > 10,000 messages for 5+ minutes

Critical: Lag > 100,000 messages for 2+ minutes

Broker CPU

Warning: CPU > 80% for 10+ minutes

Critical: CPU > 95% for 5+ minutes

Disk Usage

Warning: Disk > 85% for 15+ minutes

Critical: Disk > 95% for 5+ minutes
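One way to read thresholds like these is "the value has stayed above X for at least N minutes". The sketch below evaluates the consumer lag thresholds that way, assuming samples arrive as (timestamp in seconds, value) pairs in ascending time order; the `Rule` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    severity: str
    threshold: float
    min_duration_s: int


# Consumer lag thresholds from the examples above, most severe first.
CONSUMER_LAG_RULES = [
    Rule("critical", 100_000, 2 * 60),
    Rule("warning", 10_000, 5 * 60),
]


def breach_duration_s(samples, threshold):
    """Seconds the most recent unbroken run of samples has been above threshold."""
    if not samples or samples[-1][1] <= threshold:
        return 0
    latest_ts = samples[-1][0]
    start_ts = latest_ts
    # Walk backwards until the run of breaching samples is interrupted.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        start_ts = ts
    return latest_ts - start_ts


def evaluate(samples, rules):
    """Return the first (most severe) severity whose duration check passes."""
    for rule in rules:
        if breach_duration_s(samples, rule.threshold) >= rule.min_duration_s:
            return rule.severity
    return None


# Hypothetical one-minute samples: lag crosses 10,000 at t=60s and stays there.
samples = [(0, 5000), (60, 12000), (120, 15000), (180, 20000),
           (240, 25000), (300, 30000), (360, 40000)]
result = evaluate(samples, CONSUMER_LAG_RULES)
```

Checking the critical rule first means a short but severe breach is not masked by the longer warning window.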

Multi-Channel Notifications

Ensure alerts reach the right people through the right channels

Email

Detailed alerts with context and resolution steps.

Best for: Info & Warning alerts

Slack

Team collaboration and quick acknowledgment.

Best for: All alert levels

Teams

Integration with Microsoft Teams workflows.

Best for: Enterprise environments

PagerDuty

Incident management and on-call escalation.

Best for: Critical alerts
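The "Best for" notes above can be captured as a small routing table. This is only a sketch: the channel identifiers are placeholders, and Teams is treated as an optional extra because the guidance for it is about environment rather than severity.

```python
# Assumed routing table derived from the "Best for" notes above.
CHANNELS_BY_SEVERITY = {
    "info": ["email", "slack"],
    "warning": ["email", "slack"],
    "critical": ["slack", "pagerduty"],
}


def channels_for(severity, extra_channels=()):
    """Return the channels an alert of the given severity should go to."""
    key = severity.strip().lower()
    if key not in CHANNELS_BY_SEVERITY:
        raise ValueError(f"unknown severity: {severity!r}")
    channels = list(CHANNELS_BY_SEVERITY[key])
    # Optional additions, e.g. "teams" in a Microsoft-centric environment.
    for channel in extra_channels:
        if channel not in channels:
            channels.append(channel)
    return channels
```

For example, `channels_for("Critical")` routes to Slack and PagerDuty, and `channels_for("warning", extra_channels=["teams"])` adds Teams to the default warning channels.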

Alert Configuration Examples

Practical examples you can adapt for your environment

Consumer Lag Alert

Configuration

metric: consumer_lag_sum

condition: > 10000

duration: 5 minutes

group_by: consumer_group, topic

Alert Details

Severity: Warning

Channels: Slack, Email

Message: Consumer group {{consumer_group}} has high lag on topic {{topic}}

Broker Down Alert

Configuration

metric: broker_online_status

condition: = 0

duration: 1 minute

group_by: broker_id

Alert Details

Severity: Critical

Channels: PagerDuty, Slack, Email

Message: Broker {{broker_id}} is offline - immediate attention required
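To show how configurations like the two above might be evaluated, here is a minimal sketch that checks a rule's condition and fills in the `{{label}}` placeholders in its message. The `AlertRule` type and helper names are hypothetical, and duration handling is left out to keep the example short.

```python
import operator
import re
from dataclasses import dataclass

# Comparison operators used in the `condition` field.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass
class AlertRule:
    metric: str
    op: str
    value: float
    severity: str
    channels: list
    message: str


def condition_met(rule, observed):
    """True if the observed metric value satisfies the rule's condition."""
    return OPS[rule.op](observed, rule.value)


def render(template, labels):
    """Replace each {{name}} placeholder with the matching label value."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(labels[m.group(1)]), template)


broker_down = AlertRule(
    metric="broker_online_status", op="=", value=0,
    severity="Critical", channels=["PagerDuty", "Slack", "Email"],
    message="Broker {{broker_id}} is offline - immediate attention required",
)

# Hypothetical observation: broker 3 reports status 0.
if condition_met(broker_down, 0):
    text = render(broker_down.message, {"broker_id": 3})
```

The `group_by` fields determine which labels are available to the message template, so a rule grouped by `broker_id` can only reference `{{broker_id}}`.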

Alert Escalation Policies

Ensure critical issues get the attention they need

Initial Alert (immediate)

Notify primary on-call engineer via Slack and email

First Escalation (+5 minutes)

If not acknowledged, page secondary engineer and notify team lead

Manager Escalation (+15 minutes)

Escalate to engineering manager and start incident response

Executive Escalation (+30 minutes)

Notify CTO/VP Engineering for major business impact
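The escalation timeline above can be modeled as a list of time offsets, with acknowledgment stopping any later steps. The following is a sketch under that assumption; the function name and the rule that an acknowledgment blocks a step scheduled for the same minute are illustrative choices.

```python
# (minutes after the alert fires, step name), taken from the timeline above.
ESCALATION_STEPS = [
    (0, "Initial Alert"),
    (5, "First Escalation"),
    (15, "Manager Escalation"),
    (30, "Executive Escalation"),
]


def steps_due(minutes_since_alert, acknowledged_at=None):
    """Return the steps that should have fired by now.

    A step fires once its offset has elapsed, unless the alert was
    acknowledged at or before that offset.
    """
    due = []
    for offset, name in ESCALATION_STEPS:
        if offset > minutes_since_alert:
            break
        if acknowledged_at is not None and offset >= acknowledged_at:
            break
        due.append(name)
    return due
```

With this model, `steps_due(20)` returns the first three steps, while `steps_due(20, acknowledged_at=3)` returns only the initial alert because the acknowledgment arrived before the five-minute mark.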

Escalation Best Practices

• Different escalation times for different severity levels
• Clear acknowledgment requirements to stop escalation
• Automated incident creation for critical alerts
• Regular review and testing of escalation policies

Alert Testing & Maintenance

Keep your alerting system effective over time

Regular Testing

Validate that your alerts work correctly and reach the right people.

Monthly test of critical alerts

Verify notification delivery

Test escalation policies

Update contact information

Continuous Improvement

Regularly review and adjust alerts based on operational feedback.

Analyze false positive rates

Review missed incidents

Adjust thresholds based on data

Document learnings from incidents
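One simple way to analyze false positive rates is to tag each fired alert during review as actionable or not, then compute the share of non-actionable firings per alert. The record format below is an assumption for the example.

```python
from collections import Counter


def false_positive_rates(fired_alerts):
    """Fraction of firings per alert name that were marked non-actionable."""
    total = Counter()
    noise = Counter()
    for alert in fired_alerts:
        total[alert["name"]] += 1
        if not alert["actionable"]:
            noise[alert["name"]] += 1
    return {name: noise[name] / total[name] for name in total}


# Hypothetical review log for one week of alerts.
review = [
    {"name": "High Lag", "actionable": True},
    {"name": "High Lag", "actionable": False},
    {"name": "High CPU", "actionable": False},
    {"name": "Broker Down", "actionable": True},
]
rates = false_positive_rates(review)
```

Alerts with a consistently high rate are the first candidates for a higher threshold, a longer duration check, or an additional condition.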

Implement Intelligent Kafka Alerting

KLogic provides pre-configured alert templates and intelligent thresholds that adapt to your Kafka environment automatically.
