Reducing Kafka Infrastructure Costs
Kafka clusters are expensive to run at scale. Most organizations have 20–40% of their storage and compute going to waste. This guide shows exactly where to look and how to reduce costs without compromising reliability.
Where Kafka Costs Accumulate
Before optimizing, understand the cost structure. Kafka costs are dominated by four factors:
- Storage: retention policies that are too generous keep terabytes of rarely accessed data on expensive block storage.
- Compute: over-provisioned brokers running at 20% CPU utilization to preserve peak-capacity headroom.
- Network: cross-AZ replication traffic and consumer fetch requests that cross availability zone boundaries.
- Operational overhead: cluster management tooling, Schema Registry, Kafka Connect, and the monitoring stack.
Finding and Removing Unused Topics
The average mature Kafka cluster has 15–25% of its topics receiving no new messages. Many more have zero active consumers. These ghost topics still consume storage through replication and inflate partition counts on your brokers.
Identifying Stale Topics
```sql
-- Topics with no produce activity in 30 days
-- (query against a ClickHouse metrics store)
SELECT
    topic,
    max(ts) AS last_produce_time,
    dateDiff('day', max(ts), now()) AS days_inactive,
    sum(size_bytes) AS total_size_bytes
FROM kafka_topic_metrics
GROUP BY topic
HAVING days_inactive > 30
ORDER BY total_size_bytes DESC;
```

```shell
# Topics with no active consumer groups.
# Column 2 of --describe output is the topic name; skip the header row.
kafka-consumer-groups.sh --bootstrap-server broker:9092 --list | \
while read -r group; do
  kafka-consumer-groups.sh --bootstrap-server broker:9092 \
    --describe --group "$group" 2>/dev/null
done | awk '$2 != "TOPIC" && NF {print $2}' | sort -u > consumed_topics.txt

kafka-topics.sh --bootstrap-server broker:9092 --list | \
  sort > all_topics.txt

comm -23 all_topics.txt consumed_topics.txt   # topics with no consumers
```
Safe Deletion Process
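Once a topic is confirmed stale, a cautious approach is to shrink its retention first (cheap and reversible), wait out a quiet retention cycle, and only then delete. The sketch below uses a dry-run guard so nothing touches a live cluster by accident; the topic and broker names are placeholders.

```shell
# Dry-run by default: prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" -eq 1 ]; then echo "+ $*"; else "$@"; fi; }

TOPIC=stale.topic.name    # placeholder
BROKER=broker:9092        # placeholder

# Step 1: shrink retention so data ages out quickly (easy to revert)
run kafka-configs.sh --bootstrap-server "$BROKER" \
  --entity-type topics --entity-name "$TOPIC" \
  --alter --add-config retention.ms=3600000

# Step 2: wait at least one full retention cycle and watch for
# unexpected producer or consumer errors before proceeding.

# Step 3: delete (brokers must have delete.topic.enable=true)
run kafka-topics.sh --bootstrap-server "$BROKER" --delete --topic "$TOPIC"
```

Set `DRY_RUN=0` only after the printed commands have been reviewed.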
Retention Policy Tuning
The default retention of 7 days is almost never the right setting for every topic. High-throughput topics may need only 24 hours to cover consumer recovery, while audit and compliance topics may legitimately need 90 days. A single blanket 7-day policy overspends on the former and underserves the latter.
Retention Decision Matrix
| Topic Type | Recommended Retention | Reason |
|---|---|---|
| Clickstream / analytics events | 24–48 hours | Downstream sink to data warehouse within hours |
| Application log events | 3 days | Indexed by Elasticsearch within minutes |
| Business domain events | 7 days | Allows consumer replay after incidents |
| Financial transactions | 30 days | Regulatory audit window |
| Compacted state topics | Unlimited (compaction) | Retain latest value per key indefinitely |
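The millisecond values behind these windows are easy to mistype. A small shell-arithmetic sketch (the helper names are illustrative, not part of any Kafka tooling) converts human-readable windows into `retention.ms` values:

```shell
# Convert human retention windows into retention.ms values.
# Helper names are illustrative only.
hours_to_ms() { echo $(( $1 * 60 * 60 * 1000 )); }
days_to_ms()  { echo $(( $1 * 24 * 60 * 60 * 1000 )); }

echo "clickstream, 24h:  $(hours_to_ms 24)"   # 86400000
echo "app logs, 3d:      $(days_to_ms 3)"     # 259200000
echo "domain events, 7d: $(days_to_ms 7)"     # 604800000
echo "financial, 30d:    $(days_to_ms 30)"    # 2592000000
```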
Apply Retention Changes
```shell
# Reduce retention on high-throughput clickstream topic
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics \
  --entity-name clickstream.events \
  --alter \
  --add-config retention.ms=86400000   # 24 hours

# Set size-based retention cap (whichever limit is hit first applies)
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics \
  --entity-name clickstream.events \
  --alter \
  --add-config retention.bytes=107374182400   # 100 GB per partition

# Verify configuration
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics \
  --entity-name clickstream.events \
  --describe
```
Compression: The Highest ROI Optimization
Enabling compression on JSON-heavy topics typically reduces storage by 60–80% and network bandwidth by 40–60%. It is the single highest-impact optimization most clusters are not using.
Compression Algorithm Comparison
| Algorithm | Ratio (JSON) | CPU Cost | Best For |
|---|---|---|---|
| lz4 | 3–4× | Very low | High-throughput, latency-sensitive topics |
| snappy | 3–4× | Low | General-purpose, Hadoop ecosystem compat |
| gzip | 5–7× | Medium | Low-throughput topics where ratio matters most |
| zstd | 6–8× | Medium | Best overall ratio/speed tradeoff (Kafka 2.1+) |
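Translating a ratio from this table into a savings estimate is simple arithmetic. In the sketch below, the 500 GB/day throughput is an invented example figure, not a measurement:

```shell
# Rough daily storage saving from lz4's ~4x ratio on JSON.
# The 500 GB/day figure is a made-up example, not a measurement.
raw_gb_per_day=500
ratio=4
compressed=$(( raw_gb_per_day / ratio ))
saved=$(( raw_gb_per_day - compressed ))
echo "compressed: ${compressed} GB/day, saved: ${saved} GB/day"   # 125 / 375
```

Remember that replication multiplies the benefit: with replication factor 3, the on-disk saving is three times the per-copy figure.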
```java
// Enable lz4 compression on producer (batch-level)
Properties props = new Properties();
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);   // 64 KB batches
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);       // wait 10 ms to fill batch
```

```shell
# Or set at the topic level (broker re-compresses if needed)
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics \
  --entity-name events.clickstream \
  --alter \
  --add-config compression.type=lz4
```
Right-Sizing Broker Resources
CPU Right-Sizing
Kafka brokers are mostly I/O bound. A broker running at 20% CPU utilization at peak is a strong signal for right-sizing. Profile over a full business cycle (at least 2 weeks) before making changes.
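A quick sanity check before downsizing is to project the observed peak onto the smaller instance. The numbers below (a 60% safety ceiling, 16 to 8 vCPUs) are illustrative assumptions, not recommendations:

```shell
# If peak CPU is 20% on 16 vCPUs, halving to 8 vCPUs roughly doubles it.
# The 60% ceiling is an assumed safety margin.
peak_pct=20
vcpus=16
target_vcpus=8
ceiling=60
projected=$(( peak_pct * vcpus / target_vcpus ))
echo "projected peak on ${target_vcpus} vCPUs: ${projected}%"
if [ "$projected" -le "$ceiling" ]; then
  echo "candidate for downsizing"
fi
```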
Tiered Storage
Kafka 3.6+ supports tiered storage, offloading old log segments to object storage (S3, GCS) at a fraction of the cost of EBS. This is the highest-leverage storage optimization available today.
```properties
# server.properties for tiered storage (Kafka 3.6+)
remote.log.storage.system.enable=true
remote.log.manager.task.interval.ms=30000
```

```shell
# Topic-level tiered storage config:
#   local.retention.ms=86400000  -> keep 1 day on local disk
#   retention.ms=2592000000      -> keep 30 days total (remainder in object storage)
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics \
  --entity-name events.clickstream \
  --alter \
  --add-config remote.storage.enable=true,local.retention.ms=86400000,retention.ms=2592000000
```
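To see why tiering pays off, compare 30 days of retention held entirely on block storage against 1 day local plus 29 days in object storage. The prices below are assumed approximate list prices (EBS gp3 ~$0.08/GB-month, S3 standard ~$0.023/GB-month); substitute your own bill:

```shell
# Illustrative monthly cost for 10 TB of retained data.
# Tiered = 1 of 30 days on EBS, 29 of 30 days in S3. Prices are assumptions.
gb=10000
all_ebs=$(awk -v g="$gb" 'BEGIN { printf "%.0f", g * 0.08 }')
tiered=$(awk -v g="$gb" 'BEGIN { printf "%.0f", (g/30)*0.08 + (g*29/30)*0.023 }')
echo "all-EBS: \$${all_ebs}/month, tiered: \$${tiered}/month"
```

Under these assumptions the tiered layout costs roughly a third of the all-EBS layout; the exact ratio depends on your providers and retention split.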
Key Takeaways
- Delete ghost topics: 15–25% of topics in a mature cluster typically receive no new messages.
- Tune retention per topic type instead of relying on the 7-day default.
- Enable compression (lz4 or zstd) on JSON-heavy topics for 60–80% storage savings.
- Right-size brokers only after profiling a full business cycle.
- Adopt tiered storage (Kafka 3.6+) to move old segments to cheap object storage.
Find Cost Waste in Your Kafka Cluster
KLogic identifies unused topics, topics with suboptimal retention, and uncompressed high-throughput topics automatically — with estimated cost savings for each recommendation.