
AWS MSK CloudWatch Monitoring

Master AWS MSK monitoring with CloudWatch. Learn which metrics matter, how to set up effective alarms, and build dashboards that give you complete visibility into your managed Kafka clusters.

Published: January 10, 2026 • 15 min read • AWS MSK Monitoring

AWS MSK CloudWatch Metrics Overview

AWS MSK automatically publishes Kafka metrics to CloudWatch, giving you visibility into broker health, topic performance, and cluster operations. Understanding which metrics to monitor is crucial for maintaining a healthy managed Kafka deployment.

MSK Monitoring Levels

DEFAULT: Basic cluster-level metrics (free)

PER_BROKER: Broker-level metrics with more detail

PER_TOPIC_PER_BROKER: Adds per-topic metrics for each broker (recommended for production)

PER_TOPIC_PER_PARTITION: Partition-level granularity (highest cost)
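Changing the monitoring level on an existing cluster can be sketched as follows. This is a minimal example, not a definitive procedure: `CLUSTER_ARN` and `CLUSTER_VERSION` are hypothetical placeholders you would fill in yourself, and the function only prints the `aws kafka update-monitoring` command so it can be reviewed before running.

```shell
#!/bin/sh
# Sketch: validate an MSK monitoring level and print the matching update command.
# CLUSTER_ARN and CLUSTER_VERSION are hypothetical placeholders, not real values.

is_valid_level() {
  case "$1" in
    DEFAULT|PER_BROKER|PER_TOPIC_PER_BROKER|PER_TOPIC_PER_PARTITION) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the command instead of executing it (a dry run by design).
build_update_monitoring_cmd() {
  level="$1"
  if ! is_valid_level "$level"; then
    echo "unknown monitoring level: $level" >&2
    return 1
  fi
  printf 'aws kafka update-monitoring --cluster-arn "%s" --current-version "%s" --enhanced-monitoring %s\n' \
    "${CLUSTER_ARN:-<cluster-arn>}" "${CLUSTER_VERSION:-<current-version>}" "$level"
}

build_update_monitoring_cmd PER_TOPIC_PER_BROKER
```

The current cluster version that the command expects can usually be read with `aws kafka describe-cluster --cluster-arn <cluster-arn> --query 'ClusterInfo.CurrentVersion'`.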

Essential AWS MSK Metrics

Broker Health Metrics

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| ActiveControllerCount | Number of active controllers (should be 1 per cluster) | != 1 |
| OfflinePartitionsCount | Partitions without an active leader | > 0 |
| UnderReplicatedPartitions | Partitions whose in-sync replica count is below the replication factor | > 0 |
| UnderMinIsrPartitionCount | Partitions whose in-sync replica count is below min.insync.replicas | > 0 |

Throughput Metrics

| Metric | Description | What to Watch |
| --- | --- | --- |
| BytesInPerSec | Bytes received per second per broker | Approaching broker throughput limits |
| BytesOutPerSec | Bytes sent per second per broker | Consumer throughput patterns |
| MessagesInPerSec | Messages received per second | Traffic patterns and anomalies |
| ProduceRequestsPerSec | Producer request rate | Request patterns |

Resource Utilization Metrics

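| Metric | Description | Alert Threshold |
| --- | --- | --- |
| CpuUser | User CPU utilization percentage | > 70% |
| CpuSystem | System (kernel) CPU utilization percentage | > 30% |
| MemoryUsed | Memory in use by the broker, reported in bytes | > 85% of total broker memory |
| KafkaDataLogsDiskUsed | Percentage of disk space used for Kafka data logs | > 80% |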
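Because MemoryUsed is reported in bytes, a percentage threshold needs a derived value. The sketch below assumes total broker memory can be approximated as the sum of the MemoryUsed, MemoryFree, MemoryBuffered, and MemoryCached metrics; that approximation and the helper name `mem_used_pct` are illustrative assumptions rather than an AWS-defined formula.

```shell
#!/bin/sh
# Sketch: turn byte-valued memory metrics into a utilization percentage.
# Arguments (all in bytes): used free buffered cached
# Assumption: used + free + buffered + cached approximates total broker memory.

mem_used_pct() {
  awk -v used="$1" -v free="$2" -v buffered="$3" -v cached="$4" 'BEGIN {
    total = used + free + buffered + cached
    if (total <= 0) { print "0.0"; exit }
    printf "%.1f\n", used * 100 / total
  }'
}

# Example with made-up sample values that add up to 8 GB.
mem_used_pct 3400000000 600000000 400000000 3600000000
```

Kafka leans heavily on the page cache, so a high cached figure alongside moderate MemoryUsed is normal and is not by itself a sign of memory pressure.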

Consumer Lag Metrics

| Metric | Description | What to Watch |
| --- | --- | --- |
| SumOffsetLag | Total lag across all partitions for a consumer group | Growing lag over time |
| MaxOffsetLag | Maximum lag for any single partition | Hot partitions |
| EstimatedTimeLag | Estimated time behind in seconds | SLA compliance |
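"Growing lag over time" can be checked mechanically once you have a series of SumOffsetLag samples, for example from `aws cloudwatch get-metric-statistics`. The helper below is a minimal sketch: the name `is_lag_growing` and the rule (every sample strictly larger than the previous one) are assumptions chosen for illustration, and the sample values are made up.

```shell
#!/bin/sh
# Sketch: detect steadily growing consumer lag from ordered samples.
# Succeeds only if there are at least two samples and each one is
# strictly larger than the one before it. Fractions are truncated.

is_lag_growing() {
  [ "$#" -ge 2 ] || return 1
  prev="${1%.*}"
  shift
  for sample in "$@"; do
    cur="${sample%.*}"
    [ "$cur" -gt "$prev" ] || return 1
    prev="$cur"
  done
  return 0
}

# Made-up samples, oldest first.
if is_lag_growing 1200 1850 2400 3900; then
  echo "lag is growing"
fi
```

A strict rule like this is noisy on bursty topics; comparing the first and last sample, or requiring growth over a longer window, may suit some workloads better.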

Setting Up CloudWatch Alarms for MSK

Critical Alarms (Page Immediately)

# Offline Partitions Alarm
# Replace my-cluster with your MSK cluster name; without --dimensions
# the alarm does not match any published metric.
aws cloudwatch put-metric-alarm \
  --alarm-name "MSK-OfflinePartitions" \
  --metric-name OfflinePartitionsCount \
  --namespace "AWS/Kafka" \
  --dimensions "Name=Cluster Name,Value=my-cluster" \
  --statistic Maximum \
  --period 60 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:region:account:critical-alerts

These alarms indicate immediate data availability issues that require urgent attention.

Warning Alarms (Investigate Soon)

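# High CPU Alarm
# CpuUser is reported per broker, so create one alarm per broker ID.
# Replace my-cluster and the broker ID with your own values.
aws cloudwatch put-metric-alarm \
  --alarm-name "MSK-HighCPU" \
  --metric-name CpuUser \
  --namespace "AWS/Kafka" \
  --dimensions "Name=Cluster Name,Value=my-cluster" "Name=Broker ID,Value=1" \
  --statistic Average \
  --period 300 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:region:account:warning-alerts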

Disk Space Alarm

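# Disk Space Warning
# KafkaDataLogsDiskUsed is reported per broker, so create one alarm per broker ID.
# Replace my-cluster and the broker ID with your own values.
aws cloudwatch put-metric-alarm \
  --alarm-name "MSK-DiskSpace" \
  --metric-name KafkaDataLogsDiskUsed \
  --namespace "AWS/Kafka" \
  --dimensions "Name=Cluster Name,Value=my-cluster" "Name=Broker ID,Value=1" \
  --statistic Maximum \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:region:account:warning-alerts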

MSK can expand broker storage automatically, but only if you configure storage auto scaling for the cluster. Even then, monitor disk usage trends and their budget implications.
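Since KafkaDataLogsDiskUsed is published per broker, a cluster typically needs one disk alarm for each broker. The loop below is a sketch under stated assumptions: `CLUSTER_NAME`, `BROKER_COUNT`, and `SNS_TOPIC_ARN` are hypothetical placeholders, broker IDs are assumed to run from 1 to `BROKER_COUNT`, and the commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: print one disk-usage alarm command per broker (dry run).
# All three values below are hypothetical placeholders.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
BROKER_COUNT="${BROKER_COUNT:-3}"
SNS_TOPIC_ARN="${SNS_TOPIC_ARN:-arn:aws:sns:region:account:warning-alerts}"

print_disk_alarm_cmds() {
  broker=1
  # Assumption: broker IDs are sequential, starting at 1.
  while [ "$broker" -le "$BROKER_COUNT" ]; do
    printf 'aws cloudwatch put-metric-alarm'
    printf ' --alarm-name "MSK-DiskSpace-broker-%s"' "$broker"
    printf ' --namespace "AWS/Kafka" --metric-name KafkaDataLogsDiskUsed'
    printf ' --dimensions "Name=Cluster Name,Value=%s" "Name=Broker ID,Value=%s"' \
      "$CLUSTER_NAME" "$broker"
    printf ' --statistic Maximum --period 300 --threshold 80'
    printf ' --comparison-operator GreaterThanThreshold --evaluation-periods 2'
    printf ' --alarm-actions "%s"\n' "$SNS_TOPIC_ARN"
    broker=$((broker + 1))
  done
}

print_disk_alarm_cmds
```

After checking the output, the printed lines can be saved to a file and run with `sh`, or the `printf` calls can be swapped for a direct `aws` invocation.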

Building an MSK CloudWatch Dashboard

Create a comprehensive CloudWatch dashboard to visualize your MSK cluster health at a glance.

Recommended Dashboard Widgets

Cluster Health

  • ActiveControllerCount (Number widget)
  • OfflinePartitionsCount (Number widget)
  • UnderReplicatedPartitions (Time series)
  • ZooKeeperSessionState (Number widget)

Throughput

  • BytesInPerSec by broker (Time series)
  • BytesOutPerSec by broker (Time series)
  • MessagesInPerSec (Time series)
  • NetworkRxPackets/NetworkTxPackets

Resources

  • CpuUser by broker (Time series)
  • MemoryUsed (Time series)
  • KafkaDataLogsDiskUsed (Time series)
  • NetworkProcessorAvgIdlePercent

Consumer Groups

  • SumOffsetLag by group (Time series)
  • EstimatedTimeLag (Time series)
  • FetchMessageConversionsPerSec
  • OffsetLag per topic and partition
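A dashboard is defined by a JSON body, which a small script can generate. The sketch below builds two of the Cluster Health number widgets listed above; `CLUSTER_NAME`, `AWS_REGION`, and the helper names `widget` and `dashboard_body` are hypothetical, and the layout (two 12-column widgets side by side) is just one reasonable choice.

```shell
#!/bin/sh
# Sketch: generate a CloudWatch dashboard body with two cluster-level widgets.
# CLUSTER_NAME and AWS_REGION are hypothetical placeholders.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
AWS_REGION="${AWS_REGION:-us-east-1}"

# widget METRIC VIEW X
#   METRIC: metric name in the AWS/Kafka namespace
#   VIEW:   singleValue (number widget) or timeSeries
#   X:      horizontal grid offset (the dashboard grid is 24 columns wide)
widget() {
  printf '{"type":"metric","x":%s,"y":0,"width":12,"height":6,' "$3"
  printf '"properties":{"title":"%s","view":"%s","stat":"Maximum","period":60,' "$1" "$2"
  printf '"region":"%s","metrics":[["AWS/Kafka","%s","Cluster Name","%s"]]}}' \
    "$AWS_REGION" "$1" "$CLUSTER_NAME"
}

dashboard_body() {
  printf '{"widgets":['
  widget ActiveControllerCount singleValue 0
  printf ','
  widget OfflinePartitionsCount singleValue 12
  printf ']}\n'
}

dashboard_body
```

The generated body could then be published with `aws cloudwatch put-dashboard --dashboard-name msk-health --dashboard-body "$(dashboard_body)"`, where `msk-health` is an arbitrary example name. Per-broker widgets would also need a `Broker ID` dimension in each metric entry.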

CloudWatch Limitations

While CloudWatch provides essential metrics, it has limitations for comprehensive Kafka monitoring:

  • Metrics are delayed by 1-5 minutes
  • Per-partition visibility requires the PER_TOPIC_PER_PARTITION monitoring level
  • No native topic/message inspection capabilities
  • Additional costs for enhanced monitoring levels

Beyond CloudWatch: Enhanced MSK Monitoring

Open Monitoring with Prometheus

MSK supports open monitoring with Prometheus through the JMX Exporter and Node Exporter, providing access to hundreds of additional Kafka metrics not available in CloudWatch.

Enable Open Monitoring in your MSK cluster configuration to expose Prometheus endpoints.
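Once Open Monitoring is enabled, Prometheus needs a list of broker endpoints to scrape. The sketch below generates a file-based service discovery document, assuming the JMX Exporter listens on port 11001 and the Node Exporter on port 11002 on each broker. The broker hostnames and the helper names are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch: build a Prometheus file_sd targets document for MSK brokers.
# Assumed ports: 11001 for the JMX Exporter, 11002 for the Node Exporter.

# targets_for_port PORT HOST...  ->  "host1:PORT","host2:PORT"
targets_for_port() {
  port="$1"
  shift
  sep=""
  for host in "$@"; do
    printf '%s"%s:%s"' "$sep" "$host" "$port"
    sep=","
  done
}

# prometheus_targets HOST...  ->  JSON array with one entry per exporter
prometheus_targets() {
  printf '[{"labels":{"job":"jmx"},"targets":[%s]},' "$(targets_for_port 11001 "$@")"
  printf '{"labels":{"job":"node"},"targets":[%s]}]\n' "$(targets_for_port 11002 "$@")"
}

# Hypothetical broker hostnames; use your cluster's broker DNS names instead.
prometheus_targets b-1.example.internal b-2.example.internal
```

Writing the output to a file such as `targets.json` and pointing a `file_sd_configs` entry at it lets Prometheus pick up broker changes without a restart.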

MSK Connect Monitoring

If using MSK Connect, monitor connector-specific metrics for data pipeline health.

Track connector status, task counts, and error rates in addition to broker metrics.

Complete AWS MSK Monitoring with KLogic

KLogic provides comprehensive AWS MSK monitoring that goes beyond CloudWatch, offering real-time visibility, intelligent alerting, and operational insights specifically designed for managed Kafka.

  • Native AWS MSK integration
  • Real-time consumer lag tracking
  • Topic and partition-level visibility
  • AI-powered anomaly detection
  • Cost optimization insights
  • Unified dashboard for all clusters