
Reducing Kafka Infrastructure Costs

Kafka clusters are expensive to run at scale, and most organizations waste 20–40% of their Kafka storage and compute spend. This guide shows exactly where to look and how to reduce costs without compromising reliability.

Published: January 6, 2025 • 10 min read • Cost Management

Where Kafka Costs Accumulate

Before optimizing, understand the cost structure. Kafka costs are dominated by four factors:

45% Storage (EBS/SSD): Retention policies that are too generous keep terabytes of rarely-accessed data on expensive block storage.

30% Compute (EC2/VMs): Over-provisioned brokers running at 20% CPU utilization to preserve peak-capacity headroom.

15% Network egress: Cross-AZ replication traffic and consumer fetch requests that cross availability zone boundaries.

10% Operational overhead: Cluster management tooling, Schema Registry, Kafka Connect, and the monitoring stack.

Finding and Removing Unused Topics

The average mature Kafka cluster has 15–25% of its topics receiving no new messages. Many more have zero active consumers. These ghost topics still consume storage through replication and inflate partition counts on your brokers.

Identifying Stale Topics

# Topics with no produce activity in 30 days
# (Query against ClickHouse metrics store)
SELECT
  topic,
  max(ts) AS last_produce_time,
  dateDiff('day', max(ts), now()) AS days_inactive,
  sum(size_bytes) AS total_size_bytes
FROM kafka_topic_metrics
GROUP BY topic
HAVING days_inactive > 30
ORDER BY total_size_bytes DESC;

# Topics with no active consumer groups
# (in --describe output the topic is column 2; skip header and warning lines)
kafka-consumer-groups.sh --bootstrap-server broker:9092 --list | \
  while read -r group; do
    kafka-consumer-groups.sh --bootstrap-server broker:9092 \
      --describe --group "$group" 2>/dev/null
  done | awk 'NF >= 9 && $2 != "TOPIC" {print $2}' | sort -u > consumed_topics.txt

kafka-topics.sh --bootstrap-server broker:9092 --list | \
  sort > all_topics.txt

comm -23 all_topics.txt consumed_topics.txt  # topics with no consumers

Safe Deletion Process

1. Tag the topic as deprecated in your internal catalog or documentation.
2. Set retention.ms to 1 hour for 7 days. Data now expires almost immediately, so any unknown consumer that still depends on the topic will surface as a missing-data complaint.
3. After 7 days with no complaints, delete the topic.
4. Archive a metadata record of the deletion with the last known schema.
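Step 2 above can be sketched as a single config change. The broker address and the topic name orders.legacy are placeholders, not values from this article:

```shell
# Temporary 1-hour retention, expressed in milliseconds.
ONE_HOUR_MS=$((60 * 60 * 1000))
echo "retention.ms=${ONE_HOUR_MS}"   # retention.ms=3600000

# Apply it for the 7-day observation window (requires a reachable broker):
# kafka-configs.sh --bootstrap-server broker:9092 \
#   --entity-type topics --entity-name orders.legacy \
#   --alter --add-config "retention.ms=${ONE_HOUR_MS}"
```

Record the original retention.ms value first so you can restore it if a legitimate consumer turns up during the window.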

Retention Policy Tuning

Default retention of 7 days is almost always the wrong setting. High-throughput topics may only need 24 hours for consumer recovery, while audit and compliance topics may legitimately need 90 days. Setting them all to 7 days overspends on the former and violates requirements on the latter.

Retention Decision Matrix

Topic Type                     | Recommended Retention  | Reason
Clickstream / analytics events | 24–48 hours            | Downstream sink to data warehouse within hours
Application log events         | 3 days                 | Indexed by Elasticsearch within minutes
Business domain events         | 7 days                 | Allows consumer replay after incidents
Financial transactions         | 30 days                | Regulatory audit window
Compacted state topics         | Unlimited (compaction) | Retain latest value per key indefinitely
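Since retention.ms takes milliseconds, it is worth deriving the matrix values once rather than hand-typing ten-digit numbers:

```shell
# Milliseconds per day, then the values used in the retention matrix.
DAY_MS=$((24 * 60 * 60 * 1000))
echo "1 day   = $((1 * DAY_MS)) ms"    # 86400000   (clickstream)
echo "3 days  = $((3 * DAY_MS)) ms"    # 259200000  (log events)
echo "7 days  = $((7 * DAY_MS)) ms"    # 604800000  (domain events)
echo "30 days = $((30 * DAY_MS)) ms"   # 2592000000 (financial transactions)
```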

Apply Retention Changes

# Reduce retention on high-throughput clickstream topic
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics \
  --entity-name clickstream.events \
  --alter \
  --add-config retention.ms=86400000  # 24 hours

# Set size-based retention cap (whichever limit is hit first applies)
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics \
  --entity-name clickstream.events \
  --alter \
  --add-config retention.bytes=107374182400  # 100 GB per partition

# Verify configuration
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics \
  --entity-name clickstream.events \
  --describe

Compression: The Highest ROI Optimization

Enabling compression on JSON-heavy topics typically reduces storage by 60–80% and network bandwidth by 40–60%. It is the single highest-impact optimization most clusters are not using.
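The claim is easy to sanity-check locally with gzip as a stand-in. The event shape below is invented for the demo; real ratios depend on your payloads:

```shell
# 1,000 structurally similar JSON events: repeated keys compress very well.
for i in $(seq 1 1000); do
  printf '{"event":"page_view","user_id":%d,"url":"/products/%d","ts":1700000000}\n' "$i" "$i"
done > /tmp/events.json

RAW=$(wc -c < /tmp/events.json)
GZ=$(gzip -c /tmp/events.json | wc -c)
echo "raw=${RAW} bytes, gzip=${GZ} bytes"
```

Kafka compresses whole producer batches rather than single records, so real-world ratios on batched JSON are often better than a per-file test like this suggests.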

Compression Algorithm Comparison

Algorithm | Ratio (JSON) | CPU Cost | Best For
lz4       | 3–4×         | Very low | High-throughput, latency-sensitive topics
snappy    | 3–4×         | Low      | General-purpose, Hadoop ecosystem compat
gzip      | 5–7×         | Medium   | Low-throughput topics where ratio matters most
zstd      | 6–8×         | Medium   | Best overall ratio/speed tradeoff (Kafka 2.1+)
// Enable lz4 compression on the producer (batch-level, Java client)
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);       // 64 KB batches
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);           // wait up to 10 ms to fill a batch

# Or set at the topic level (broker re-compresses if needed)
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics \
  --entity-name events.clickstream \
  --alter \
  --add-config compression.type=lz4

Right-Sizing Broker Resources

CPU Right-Sizing

Kafka brokers are mostly I/O bound. A broker running at 20% CPU utilization at peak is a strong signal for right-sizing. Profile over a full business cycle (at least 2 weeks) before making changes.

Guideline: Target 60–70% CPU at peak traffic. If you are under 30%, consider downsizing or consolidating to fewer, larger brokers.
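As a worked example of that guideline (the cluster size and utilization numbers here are hypothetical):

```shell
# 12 brokers peaking at 20% CPU, targeting 65% CPU at peak after consolidation.
BROKERS=12; PEAK_CPU=20; TARGET_CPU=65
# Ceiling division: total peak load divided by the per-broker target.
NEEDED=$(( (BROKERS * PEAK_CPU + TARGET_CPU - 1) / TARGET_CPU ))
echo "${NEEDED} brokers of the same instance size"   # 4 brokers
```

In practice, never consolidate below your replication factor, and leave enough headroom to absorb a broker failure at peak traffic.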

Tiered Storage

Kafka 3.6+ supports tiered storage, offloading old log segments to object storage (S3, GCS) at a fraction of the cost of EBS. This is the highest-leverage storage optimization available today.

# server.properties for tiered storage (Kafka 3.6+)
remote.log.storage.system.enable=true
remote.log.manager.task.interval.ms=30000

# Topic-level tiered storage config:
# keep 1 day on local disk, 30 days total (the rest in object storage)
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics \
  --entity-name events.clickstream \
  --alter \
  --add-config remote.storage.enable=true,local.retention.ms=86400000,retention.ms=2592000000
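With a 1-day local / 30-day total split like the one above, the share of retained data that leaves expensive block storage is easy to estimate:

```shell
# 30 days total retention, 1 day kept on local disk.
TOTAL_DAYS=30; LOCAL_DAYS=1
OFFLOADED_PCT=$(( (TOTAL_DAYS - LOCAL_DAYS) * 100 / TOTAL_DAYS ))
echo "~${OFFLOADED_PCT}% of retained bytes live in object storage"   # ~96%
```

The dollar savings then depend on your EBS-to-object-storage price gap, which varies by provider and region.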

Key Takeaways

Audit topics for inactivity quarterly — 15–25% of topics in mature clusters have no active producers.
Align retention.ms with actual consumer recovery windows, not a blanket 7-day default.
Enable lz4 or zstd compression on JSON-heavy topics — expect 60–80% storage reduction.
Tiered storage (Kafka 3.6+) cuts storage costs by 60–90% for topics with long retention requirements.
Profile CPU utilization over a full business cycle before right-sizing; avoid optimizing on daily snapshots.
Cross-AZ replication traffic is often invisible — place consumers in the same AZ as partition leaders when latency permits.
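The last point can be addressed with follower fetching (KIP-392, Kafka 2.4+), which lets a consumer read from a replica in its own AZ instead of always fetching from the leader. A config sketch, assuming AZ names are used as rack IDs:

```properties
# Broker (server.properties): allow fetching from the nearest replica
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
broker.rack=us-east-1a

# Consumer: declare the client's zone so the broker can match it to a replica
client.rack=us-east-1a
```

Follower fetching trades a small amount of end-to-end latency (followers can lag the leader slightly) for the elimination of cross-AZ fetch charges.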

Find Cost Waste in Your Kafka Cluster

KLogic identifies unused topics, topics with suboptimal retention, and uncompressed high-throughput topics automatically — with estimated cost savings for each recommendation.

Unused topic identification
Retention policy recommendations
Compression savings estimates
Storage trend forecasting
Request a Demo