AWS MSK CloudWatch Monitoring
Master AWS MSK monitoring with CloudWatch. Learn which metrics matter, how to set up effective alarms, and build dashboards that give you complete visibility into your managed Kafka clusters.
AWS MSK CloudWatch Metrics Overview
AWS MSK automatically publishes Kafka metrics to CloudWatch, giving you visibility into broker health, topic performance, and cluster operations. Understanding which metrics to monitor is crucial for maintaining a healthy managed Kafka deployment.
MSK Monitoring Levels
DEFAULT: Basic cluster-level metrics (free)
PER_BROKER: Broker-level metrics with more detail
PER_TOPIC_PER_BROKER: Per-topic metrics for each broker (recommended for production)
PER_TOPIC_PER_PARTITION: Partition-level granularity (highest cost)
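The monitoring level can be changed on a running cluster. The following is a minimal sketch rather than a definitive procedure: `is_valid_level` and `set_monitoring_level` are hypothetical helper names, and the cluster ARN is whatever your own cluster uses. It relies on the `aws kafka describe-cluster` and `aws kafka update-monitoring` commands.

```shell
# Return 0 if $1 is one of the four monitoring levels listed above.
is_valid_level() {
  case "$1" in
    DEFAULT|PER_BROKER|PER_TOPIC_PER_BROKER|PER_TOPIC_PER_PARTITION) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: switch a cluster to a new monitoring level.
# $1 = cluster ARN, $2 = target level
set_monitoring_level() {
  cluster_arn="$1"; level="$2"
  is_valid_level "$level" || { echo "unknown level: $level" >&2; return 1; }
  # update-monitoring requires the cluster's current version string.
  current_version=$(aws kafka describe-cluster \
    --cluster-arn "$cluster_arn" \
    --query 'ClusterInfo.CurrentVersion' --output text) || return 1
  aws kafka update-monitoring \
    --cluster-arn "$cluster_arn" \
    --current-version "$current_version" \
    --enhanced-monitoring "$level"
}
```

Calling `set_monitoring_level "$CLUSTER_ARN" PER_TOPIC_PER_BROKER` would then apply the production recommendation above. Validating the level first avoids a round trip to AWS for a simple typo.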
Essential AWS MSK Metrics
Broker Health Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| ActiveControllerCount | Number of active controllers (should be 1) | != 1 |
| OfflinePartitionsCount | Partitions without an active leader | > 0 |
| UnderReplicatedPartitions | Partitions with fewer in-sync replicas than their replication factor | > 0 |
| UnderMinIsrPartitionCount | Partitions with fewer in-sync replicas than min.insync.replicas | > 0 |
Throughput Metrics
| Metric | Description | What to Watch |
|---|---|---|
| BytesInPerSec | Bytes received per second per broker | Approach to broker limits |
| BytesOutPerSec | Bytes sent per second per broker | Consumer throughput patterns |
| MessagesInPerSec | Messages received per second | Traffic patterns and anomalies |
| ProduceRequestsPerSec | Producer request rate | Request patterns |
Resource Utilization Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| CpuUser | User CPU utilization percentage | > 70% |
| CpuSystem | System CPU utilization percentage | > 30% |
| MemoryUsed | Memory used by the broker (reported in bytes) | > 85% of broker memory |
| KafkaDataLogsDiskUsed | Percentage of disk space used for Kafka data logs | > 80% |
Consumer Lag Metrics
| Metric | Description | What to Watch |
|---|---|---|
| SumOffsetLag | Total lag across all partitions for a consumer group | Growing lag over time |
| MaxOffsetLag | Maximum lag for any single partition | Hot partitions |
| EstimatedTimeLag | Estimated time behind in seconds | SLA compliance |
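"Growing lag over time" can be checked mechanically from a series of samples. The sketch below is one possible interpretation, not an official definition: it treats lag as growing when the samples never decrease and the last is higher than the first. `lag_is_growing` and `fetch_sum_offset_lag` are hypothetical helpers, and the dimension names `Cluster Name`, `Consumer Group`, and `Topic` are assumptions worth confirming with `aws cloudwatch list-metrics --namespace AWS/Kafka`.

```shell
# Return 0 if the samples (oldest first) never decrease and end higher
# than they started. This definition of "growing" is an assumption.
lag_is_growing() {
  [ "$#" -ge 2 ] || return 1
  first="$1"; prev="$1"
  shift
  for sample in "$@"; do
    # awk compares numerically, so decimal values such as 1200.0 work.
    awk -v a="$prev" -v b="$sample" 'BEGIN { exit !(b >= a) }' || return 1
    prev="$sample"
  done
  awk -v a="$first" -v b="$prev" 'BEGIN { exit !(b > a) }'
}

# Hypothetical helper: print the last 30 minutes of SumOffsetLag,
# oldest sample first. Uses GNU date syntax (-d).
# $1 = cluster name, $2 = consumer group, $3 = topic
fetch_sum_offset_lag() {
  aws cloudwatch get-metric-statistics \
    --namespace "AWS/Kafka" \
    --metric-name SumOffsetLag \
    --dimensions Name="Cluster Name",Value="$1" \
                 Name="Consumer Group",Value="$2" \
                 Name="Topic",Value="$3" \
    --start-time "$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 --statistics Maximum \
    --query 'sort_by(Datapoints,&Timestamp)[].Maximum' --output text
}
```

A check such as `lag_is_growing $(fetch_sum_offset_lag my-cluster my-group my-topic) && echo "lag is growing"` could run on a schedule. The cluster, group, and topic names there are placeholders.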
Setting Up CloudWatch Alarms for MSK
Critical Alarms (Page Immediately)
# Offline Partitions Alarm
# my-msk-cluster is a placeholder; without the dimension the alarm matches no metric.
aws cloudwatch put-metric-alarm \
  --alarm-name "MSK-OfflinePartitions" \
  --metric-name OfflinePartitionsCount \
  --namespace "AWS/Kafka" \
  --dimensions Name="Cluster Name",Value=my-msk-cluster \
  --statistic Maximum \
  --period 60 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:region:account:critical-alerts

Alarms in this tier indicate immediate data availability issues that require urgent attention.
Warning Alarms (Investigate Soon)
# High CPU Alarm
# CpuUser is reported per broker; my-msk-cluster and broker 1 are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "MSK-HighCPU" \
  --metric-name CpuUser \
  --namespace "AWS/Kafka" \
  --dimensions Name="Cluster Name",Value=my-msk-cluster Name="Broker ID",Value=1 \
  --statistic Average \
  --period 300 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:region:account:warning-alerts

Disk Space Alarm
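# Disk Space Warning
# KafkaDataLogsDiskUsed is reported per broker; my-msk-cluster and broker 1 are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "MSK-DiskSpace" \
  --metric-name KafkaDataLogsDiskUsed \
  --namespace "AWS/Kafka" \
  --dimensions Name="Cluster Name",Value=my-msk-cluster Name="Broker ID",Value=1 \
  --statistic Maximum \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:region:account:warning-alerts

MSK can expand storage automatically when storage auto scaling is enabled, but it is still important to monitor usage trends and their cost implications.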
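Because CpuUser and KafkaDataLogsDiskUsed are published per broker, a cluster typically needs one alarm per broker rather than one per cluster. The loop below is a rough sketch of that pattern for the CPU alarm. `broker_alarm_name` and `create_cpu_alarms` are hypothetical helpers, and the sketch assumes broker IDs run from 1 to the broker count, which is worth checking against your cluster.

```shell
# Build the alarm name for one broker, e.g. "MSK-HighCPU-broker-2".
broker_alarm_name() {
  printf '%s-broker-%s\n' "$1" "$2"
}

# Hypothetical helper: create one CpuUser alarm per broker.
# $1 = cluster name, $2 = broker count, $3 = SNS topic ARN
create_cpu_alarms() {
  cluster_name="$1"; broker_count="$2"; topic_arn="$3"
  broker_id=1   # assumption: broker IDs are 1..broker_count
  while [ "$broker_id" -le "$broker_count" ]; do
    aws cloudwatch put-metric-alarm \
      --alarm-name "$(broker_alarm_name MSK-HighCPU "$broker_id")" \
      --metric-name CpuUser \
      --namespace "AWS/Kafka" \
      --dimensions Name="Cluster Name",Value="$cluster_name" \
                   Name="Broker ID",Value="$broker_id" \
      --statistic Average \
      --period 300 \
      --threshold 70 \
      --comparison-operator GreaterThanThreshold \
      --evaluation-periods 3 \
      --alarm-actions "$topic_arn" || return 1
    broker_id=$((broker_id + 1))
  done
}
```

For a three-broker cluster, `create_cpu_alarms my-msk-cluster 3 "$SNS_TOPIC_ARN"` would create `MSK-HighCPU-broker-1` through `MSK-HighCPU-broker-3`. The same loop can be adapted for the disk alarm by swapping the metric name, statistic, and threshold.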
Building an MSK CloudWatch Dashboard
Create a comprehensive CloudWatch dashboard to visualize your MSK cluster health at a glance.
Recommended Dashboard Widgets
Cluster Health
- ActiveControllerCount (Number widget)
- OfflinePartitionsCount (Number widget)
- UnderReplicatedPartitions (Time series)
- ZooKeeperSessionState (Number widget)
Throughput
- BytesInPerSec by broker (Time series)
- BytesOutPerSec by broker (Time series)
- MessagesInPerSec (Time series)
- NetworkRxPackets/NetworkTxPackets
Resources
- CpuUser by broker (Time series)
- MemoryUsed (Time series)
- KafkaDataLogsDiskUsed (Time series)
- NetworkProcessorAvgIdlePercent
Consumer Groups
- SumOffsetLag by group (Time series)
- EstimatedTimeLag (Time series)
- FetchMessageConversionsPerSec
- ConsumerLag per topic
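A dashboard like this can be created from the CLI by passing a JSON body to `aws cloudwatch put-dashboard`. The sketch below builds only the two cluster-level Number widgets from the Cluster Health group. `widget_json` and `dashboard_body` are hypothetical helpers, and the cluster name, region, and dashboard name are placeholders. Per-broker metrics would also need a `Broker ID` dimension in each metric entry.

```shell
# Emit one metric widget as JSON.
# $1 = view (singleValue|timeSeries), $2 = metric name, $3 = statistic,
# $4 = x position, $5 = y position, $6 = cluster name, $7 = region
widget_json() {
  cat <<EOF
{"type":"metric","x":$4,"y":$5,"width":12,"height":6,
 "properties":{"view":"$1","title":"$2","stat":"$3","period":60,
  "region":"$7","metrics":[["AWS/Kafka","$2","Cluster Name","$6"]]}}
EOF
}

# Emit a dashboard body with two cluster-level health widgets.
# $1 = cluster name, $2 = region
dashboard_body() {
  printf '{"widgets":[%s,%s]}\n' \
    "$(widget_json singleValue ActiveControllerCount Maximum 0 0 "$1" "$2")" \
    "$(widget_json singleValue OfflinePartitionsCount Maximum 12 0 "$1" "$2")"
}
```

The body can then be published with `aws cloudwatch put-dashboard --dashboard-name MSK-Overview --dashboard-body "$(dashboard_body my-msk-cluster us-east-1)"`. Generating the JSON from a function keeps the widget layout in version control instead of in the console.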
CloudWatch Limitations
While CloudWatch provides essential metrics, it has limitations for comprehensive Kafka monitoring:
- Metrics can be delayed by 1-5 minutes
- Per-partition visibility is available only at the highest monitoring level
- No native topic/message inspection capabilities
- Additional costs for enhanced monitoring levels
Beyond CloudWatch: Enhanced MSK Monitoring
Open Monitoring with Prometheus
MSK supports open monitoring with Prometheus through the JMX Exporter and Node Exporter, providing access to hundreds of additional Kafka metrics not available in CloudWatch.
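Once open monitoring is enabled on the cluster, Prometheus needs a list of broker endpoints to scrape. The sketch below generates a file-based service discovery targets file, assuming the ports MSK documents for open monitoring: 11001 for the JMX Exporter and 11002 for the Node Exporter. `msk_targets_json` is a hypothetical helper and the broker hostnames are placeholders. Real hostnames can be obtained with `aws kafka list-nodes`.

```shell
# Emit Prometheus file_sd targets for MSK open monitoring.
# Arguments: one or more broker hostnames.
msk_targets_json() {
  jmx=""; node=""
  for host in "$@"; do
    # Append each host as a quoted "host:port" entry, comma separated.
    jmx="${jmx:+$jmx,}\"$host:11001\""     # JMX Exporter port
    node="${node:+$node,}\"$host:11002\""  # Node Exporter port
  done
  printf '[{"labels":{"job":"jmx"},"targets":[%s]},' "$jmx"
  printf '{"labels":{"job":"node"},"targets":[%s]}]\n' "$node"
}
```

Running `msk_targets_json b-1.example b-2.example > targets.json` writes a file that a `file_sd_configs` entry in the Prometheus configuration can reference. The Prometheus host must also be allowed to reach those ports through the cluster's security group.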
MSK Connect Monitoring
If using MSK Connect, monitor connector-specific metrics for data pipeline health.
Complete AWS MSK Monitoring with KLogic
KLogic provides comprehensive AWS MSK monitoring that goes beyond CloudWatch, offering real-time visibility, intelligent alerting, and operational insights specifically designed for managed Kafka.