KLogic

Setting Up Kafka Alerts

Learn how to configure effective Kafka alerting that catches real issues without overwhelming your team with false positives and noise.

Essential Kafka Alert Categories

The critical alerts every Kafka deployment should have

Broker Health Alerts

Monitor broker availability, resource usage, and performance degradation.

Broker Down: Critical
High CPU: Warning
Disk Full: Critical

Consumer Lag Alerts

Track consumer performance and detect processing bottlenecks.

High Lag: Warning
Consumer Down: Critical
Lag Trend: Info

Performance Alerts

Monitor throughput, latency, and system performance metrics.

High Latency: Warning
Low Throughput: Info
Error Rate: Critical

Setting Effective Thresholds

Guidelines for setting alert thresholds that catch real issues

Threshold Best Practices

Use Historical Baselines

Set thresholds based on historical performance data, not arbitrary numbers

Account for Business Patterns

Consider daily, weekly, and seasonal variations in your thresholds

Use Multiple Conditions

Combine multiple conditions to reduce false positives

Include Duration Checks

Require conditions to persist for a minimum duration before alerting
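
As a rough illustration of the first and third practices, the sketch below derives a threshold from historical samples and combines two conditions before alerting. The function names, the 99th-percentile choice, the 1.2 headroom factor, and the produce/consume rate comparison are all assumptions made for this example, not KLogic settings.

```python
from statistics import quantiles


def baseline_threshold(history, percentile=99, headroom=1.2):
    """Derive a threshold from history: the chosen percentile times a headroom factor.

    `percentile` and `headroom` are illustrative defaults, not recommendations.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cut_points = quantiles(history, n=100, method="inclusive")
    return cut_points[percentile - 1] * headroom


def should_alert(lag, lag_threshold, consume_rate, produce_rate):
    """Combine conditions: lag is high AND consumers are slower than producers."""
    return lag > lag_threshold and consume_rate < produce_rate


# Hypothetical lag samples collected over a past period.
history = list(range(1, 102))
threshold = baseline_threshold(history)
print(should_alert(lag=150, lag_threshold=threshold,
                   consume_rate=900, produce_rate=1000))
```

Requiring both conditions means a brief lag spike that consumers are already working through does not fire, which is one way to cut false positives.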

Example Thresholds

Consumer Lag

Warning: Lag > 10,000 messages for 5+ minutes
Critical: Lag > 100,000 messages for 2+ minutes

Broker CPU

Warning: CPU > 80% for 10+ minutes
Critical: CPU > 95% for 5+ minutes

Disk Usage

Warning: Disk > 85% for 15+ minutes
Critical: Disk > 95% for 5+ minutes
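
One way to read thresholds like these is "the value stayed above the limit for every reading in the window". The sketch below applies that reading to the consumer lag example, assuming one reading per minute; the `Level` class and `evaluate` function are hypothetical names for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    severity: str
    threshold: float
    min_minutes: int


# Consumer lag levels from the example thresholds, most severe first.
LAG_LEVELS = [
    Level("critical", 100_000, 2),
    Level("warning", 10_000, 5),
]


def evaluate(samples, levels):
    """Return the most severe level whose threshold was exceeded for the
    whole of its duration window, or None.

    `samples` holds one reading per minute, oldest first (an assumption).
    """
    for level in levels:
        window = samples[-level.min_minutes:]
        # Not enough readings yet means the duration requirement is unmet.
        if len(window) == level.min_minutes and all(
            value > level.threshold for value in window
        ):
            return level.severity
    return None


print(evaluate([5_000, 12_000, 12_000, 12_000, 12_000, 12_000], LAG_LEVELS))
```

The same function would work for the CPU and disk examples by passing a different list of levels.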

Multi-Channel Notifications

Ensure alerts reach the right people through the right channels

Email

Detailed alerts with context and resolution steps.

Best for: Info & Warning alerts

Slack

Team collaboration and quick acknowledgment.

Best for: All alert levels

Teams

Integration with Microsoft Teams workflows.

Best for: Enterprise environments

PagerDuty

Incident management and on-call escalation.

Best for: Critical alerts
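
The "Best for" guidance above can be expressed as a simple severity-to-channel routing table. This is a minimal sketch: the channel identifiers, the `channels_for` function, and the `use_teams` flag are assumptions for illustration, and a real setup may also send critical alerts to email.

```python
# Routing derived from the "Best for" notes: Slack for all levels,
# email for info and warning, PagerDuty for critical.
CHANNELS_BY_SEVERITY = {
    "info": ["email", "slack"],
    "warning": ["email", "slack"],
    "critical": ["pagerduty", "slack"],
}


def channels_for(severity, use_teams=False):
    """Return the channels for a severity level.

    `use_teams` is a hypothetical switch for environments that rely on
    Microsoft Teams; it adds Teams alongside the other channels.
    """
    try:
        channels = list(CHANNELS_BY_SEVERITY[severity.lower()])
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
    if use_teams:
        channels.append("teams")
    return channels


print(channels_for("Critical"))
```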

Alert Configuration Examples

Practical examples you can adapt for your environment

Consumer Lag Alert

Configuration

metric: consumer_lag_sum
condition: > 10000
duration: 5 minutes
group_by: consumer_group, topic

Alert Details

Severity: Warning
Channels: Slack, Email
Message: Consumer group {{consumer_group}} has high lag on topic {{topic}}
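
To make the effect of `group_by` and the `{{...}}` placeholders concrete, here is a small sketch that sums per-partition lag for each (consumer_group, topic) pair and renders the message for groups over the threshold. The input row shape and the function names are assumptions, not KLogic's actual evaluation engine, and the 5-minute duration check is left out for brevity.

```python
import re
from collections import defaultdict

TEMPLATE = "Consumer group {{consumer_group}} has high lag on topic {{topic}}"


def sum_lag_by_group(partition_lags):
    """Sum lag per (consumer_group, topic), mirroring the group_by setting."""
    totals = defaultdict(int)
    for row in partition_lags:
        totals[(row["consumer_group"], row["topic"])] += row["lag"]
    return dict(totals)


def render(template, labels):
    """Replace {{name}} placeholders; unknown names are left untouched."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(labels.get(match.group(1), match.group(0))),
        template,
    )


def firing_messages(partition_lags, threshold=10_000):
    """Return one rendered message per group whose summed lag exceeds the threshold."""
    totals = sum_lag_by_group(partition_lags)
    return [
        render(TEMPLATE, {"consumer_group": group, "topic": topic})
        for (group, topic), total in sorted(totals.items())
        if total > threshold
    ]


# Hypothetical per-partition readings.
rows = [
    {"consumer_group": "orders-svc", "topic": "orders", "lag": 6_000},
    {"consumer_group": "orders-svc", "topic": "orders", "lag": 5_000},
    {"consumer_group": "billing", "topic": "invoices", "lag": 200},
]
print(firing_messages(rows))
```

Grouping matters here because neither partition of the hypothetical `orders` topic exceeds 10,000 on its own; only the summed lag does.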

Broker Down Alert

Configuration

metric: broker_online_status
condition: = 0
duration: 1 minute
group_by: broker_id

Alert Details

Severity: Critical
Channels: PagerDuty, Slack, Email
Message: Broker {{broker_id}} is offline - immediate attention required

Alert Escalation Policies

Ensure critical issues get the attention they need

1. Initial Alert (immediate)

Notify primary on-call engineer via Slack and email

2. First Escalation (+5 minutes)

If not acknowledged, page secondary engineer and notify team lead

3. Manager Escalation (+15 minutes)

Escalate to engineering manager and start incident response

4. Executive Escalation (+30 minutes)

Notify CTO/VP Engineering for major business impact
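
The four escalation steps can be modeled as a schedule of offsets. The sketch below assumes the offsets are measured from the initial alert (the text does not say whether they are cumulative) and that acknowledgment stops any step that has not yet fired; `steps_due` is a hypothetical helper name.

```python
# (minutes after the initial alert, step name)
ESCALATION_STEPS = [
    (0, "Initial Alert"),
    (5, "First Escalation"),
    (15, "Manager Escalation"),
    (30, "Executive Escalation"),
]


def steps_due(minutes_since_alert, acknowledged_at=None):
    """Return the names of the steps that should have fired so far.

    `acknowledged_at` is the minute the alert was acknowledged, or None
    if nobody has acknowledged it yet.
    """
    due = []
    for offset, name in ESCALATION_STEPS:
        if offset > minutes_since_alert:
            break
        # The initial alert always fires; later steps fire only if the
        # alert was still unacknowledged when their offset was reached.
        if offset == 0 or acknowledged_at is None or acknowledged_at > offset:
            due.append(name)
    return due


print(steps_due(16))
print(steps_due(40, acknowledged_at=3))
```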

Escalation Best Practices

  • Different escalation times for different severity levels
  • Clear acknowledgment requirements to stop escalation
  • Automated incident creation for critical alerts
  • Regular review and testing of escalation policies

Alert Testing & Maintenance

Keep your alerting system effective over time

Regular Testing

Validate that your alerts work correctly and reach the right people.

Monthly test of critical alerts
Verify notification delivery
Test escalation policies
Update contact information

Continuous Improvement

Regularly review and adjust alerts based on operational feedback.

Analyze false positive rates
Review missed incidents
Adjust thresholds based on data
Document learnings from incidents
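
For the false positive review, a simple starting point is the share of fired alerts per rule that turned out not to need action. The sketch below assumes each fired alert was tagged afterwards with an `actionable` flag; that field and the function name are hypothetical.

```python
from collections import Counter


def false_positive_rates(fired_alerts):
    """Return, per alert name, the fraction of firings marked not actionable."""
    fired = Counter()
    not_actionable = Counter()
    for alert in fired_alerts:
        fired[alert["name"]] += 1
        if not alert["actionable"]:
            not_actionable[alert["name"]] += 1
    return {name: not_actionable[name] / count for name, count in fired.items()}


# Hypothetical review data from one period.
review = [
    {"name": "High Lag", "actionable": False},
    {"name": "High Lag", "actionable": False},
    {"name": "High Lag", "actionable": True},
    {"name": "High Lag", "actionable": False},
    {"name": "Broker Down", "actionable": True},
]
print(false_positive_rates(review))
```

A rule with a persistently high rate is a candidate for a higher threshold, a longer duration, or an extra condition.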

Implement Intelligent Kafka Alerting

KLogic provides pre-configured alert templates and intelligent thresholds that adapt to your Kafka environment automatically.
