
Kafka Connect Best Practices for Production

Kafka Connect is deceptively simple to start and notoriously difficult to operate at scale. This guide covers the connector lifecycle, task parallelism, error handling, and the monitoring setup that keeps pipelines healthy in production.

Published: January 15, 2025 • 15 min read • Kafka Connect

Connector Configuration Fundamentals

A well-configured connector prevents the majority of production incidents. Start with conservative settings and tune upward based on observed throughput and lag.

JDBC Source Connector — Production Template

{
  "name": "postgres-orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/orders",
    "connection.user": "${file:/opt/kafka/secrets.properties:db.user}",
    "connection.password": "${file:/opt/kafka/secrets.properties:db.pass}",

    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "table.whitelist": "orders,order_items",

    "topic.prefix": "postgres.",
    "poll.interval.ms": "5000",
    "batch.max.rows": "1000",

    "tasks.max": "4",
    "transforms": "addTimestamp",
    "transforms.addTimestamp.type": "org.apache.kafka.connect.transforms.InsertField$Value",
    "transforms.addTimestamp.timestamp.field": "ingest_ts",

    "errors.tolerance": "all",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true",
    "errors.deadletterqueue.topic.name": "dlq.postgres-orders-source",
    "errors.deadletterqueue.context.headers.enable": "true"
  }
}
Credentials stored in an external secrets file, never inline.
DLQ configured so bad records are captured rather than blocking the pipeline.
Transform adds ingest timestamp for end-to-end latency tracking.
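The two practices above (externalized secrets, a configured DLQ) are easy to enforce before deployment with a small pre-flight check. A minimal sketch, using only the standard library; the function name and lint rules are ours, not part of Kafka Connect:

```python
import json

# Hypothetical pre-deployment lint for connector config payloads.
# Flags inline credentials and errors.tolerance=all without a DLQ,
# mirroring the template above.
def lint_connector_config(config_json: str) -> list[str]:
    cfg = json.loads(config_json)["config"]
    warnings = []
    for key in ("connection.user", "connection.password"):
        value = cfg.get(key, "")
        if value and not value.startswith("${"):
            warnings.append(f"{key} is inline; use a config provider placeholder")
    if cfg.get("errors.tolerance") == "all" and not cfg.get("errors.deadletterqueue.topic.name"):
        warnings.append("errors.tolerance=all without a DLQ silently drops bad records")
    return warnings

good = '{"name": "x", "config": {"connection.password": "${file:/opt/kafka/secrets.properties:db.pass}", "errors.tolerance": "all", "errors.deadletterqueue.topic.name": "dlq.x"}}'
bad = '{"name": "y", "config": {"connection.password": "hunter2", "errors.tolerance": "all"}}'
print(lint_connector_config(good))  # []
print(lint_connector_config(bad))
```

Wiring a check like this into CI keeps misconfigured connectors from ever reaching the Connect REST API.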

S3 Sink Connector — Production Template

{
  "name": "s3-events-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "8",
    "topics": "events.clickstream,events.pageviews",

    "s3.region": "us-east-1",
    "s3.bucket.name": "data-lake-prod",
    "s3.part.size": "67108864",

    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "parquet.codec": "snappy",

    "flush.size": "50000",
    "rotate.interval.ms": "600000",
    "rotate.schedule.interval.ms": "3600000",
    "locale": "en_US",
    "timezone": "UTC",

    "timestamp.extractor": "RecordField",
    "timestamp.field": "event_ts",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",

    "schema.compatibility": "FULL",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dlq.s3-events-sink"
  }
}
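Under the time-based partitioning configured above, each record's event_ts determines its object key prefix in S3. A sketch of that mapping (the helper is ours; it mirrors the 'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH format with the configured UTC timezone):

```python
from datetime import datetime, timezone

# Illustrative helper, not part of the connector: maps an epoch-millis
# event timestamp to the S3 prefix produced by the path.format above.
def partition_path(event_ts_ms: int) -> str:
    dt = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc)
    return dt.strftime("year=%Y/month=%m/day=%d/hour=%H")

# 2025-01-15 10:00:00 UTC
print(partition_path(1736935200000))  # year=2025/month=01/day=15/hour=10
```

Hive-style `key=value` prefixes like this let downstream query engines prune partitions by time range.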

Task Parallelism and Scaling

Tasks are the unit of parallelism in Kafka Connect. Getting the task count right is critical: too few starves throughput, too many overwhelms the worker JVM and the downstream system.

Source Connector Task Sizing

For JDBC source connectors, the maximum useful task count equals the number of tables being polled, since each table is assigned to at most one task. Most Debezium connectors run exactly one task per connector instance, because CDC requires a single ordered stream of changes.

Rule of thumb: tasks.max = min(table_count, worker_cpu_cores / 2)
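The rule of thumb can be written out as a tiny helper (the function name is ours):

```python
# Sketch of the sizing rule above: parallelism is capped both by the
# number of tables (one table per task at most) and by worker CPU headroom.
def jdbc_source_tasks_max(table_count: int, worker_cpu_cores: int) -> int:
    return min(table_count, max(1, worker_cpu_cores // 2))

print(jdbc_source_tasks_max(2, 8))   # 2 -- capped by table count
print(jdbc_source_tasks_max(20, 8))  # 4 -- capped by CPU headroom
```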

Sink Connector Task Sizing

For sink connectors, tasks.max should match the topic partition count consumed. A task assigned more partitions than it can drain will cause lag to accumulate.

# Rebalance tasks after changing task count
# (PUT replaces the entire config, so submit the full config, not just the changed keys)
curl -X PUT http://connect:8083/connectors/s3-events-sink/config \
  -H "Content-Type: application/json" \
  -d '{"tasks.max": "16", ...}'

# Check task assignment
curl http://connect:8083/connectors/s3-events-sink/tasks | jq .

Worker Configuration Tuning

# connect-distributed.properties
group.id=connect-cluster-prod
bootstrap.servers=broker-1:9092,broker-2:9092,broker-3:9092

# Internal topic settings
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-statuses

# Performance
offset.flush.interval.ms=10000
offset.flush.timeout.ms=5000

# JVM tuning (set in KAFKA_OPTS)
# -Xms4g -Xmx4g -XX:+UseG1GC
# -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35

Error Handling and Dead Letter Queues

The default error behavior — stopping the connector on the first bad record — is rarely appropriate in production. A DLQ-based strategy keeps pipelines flowing while preserving every failed record for inspection.

errors.tolerance=none (Default)

Connector stops on the first transformation or serialization error. Guarantees no data loss but causes pipeline downtime for a single bad record. Suitable only for critical financial pipelines where every record must succeed.

errors.tolerance=all (Recommended)

Failed records are routed to the DLQ topic with error context in headers. Pipeline continues uninterrupted. Set up a separate consumer to monitor and reprocess DLQ messages after fixing the root cause.
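With context headers enabled, Connect attaches failure details to each DLQ record as headers prefixed `__connect.errors.` (source topic, connector name, failure stage, exception message, among others). A sketch of a triage step that summarizes them; the header keys are the ones Connect writes, while the summarizer function itself is our own:

```python
# Summarize the error-context headers Connect attaches to DLQ records
# when errors.deadletterqueue.context.headers.enable=true.
def summarize_dlq_headers(headers: dict[str, bytes]) -> dict[str, str]:
    wanted = (
        "__connect.errors.topic",
        "__connect.errors.connector.name",
        "__connect.errors.stage",
        "__connect.errors.exception.message",
    )
    return {k.removeprefix("__connect.errors."): headers[k].decode()
            for k in wanted if k in headers}

sample = {
    "__connect.errors.topic": b"postgres.orders",
    "__connect.errors.connector.name": b"postgres-orders-source",
    "__connect.errors.stage": b"VALUE_CONVERTER",
    "__connect.errors.exception.message": b"Unknown magic byte!",
}
print(summarize_dlq_headers(sample))
```

Grouping DLQ records by failure stage and exception message usually points straight at the root cause (a converter mismatch versus a bad transform, for example).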

Retry Configuration for Transient Failures

# For sink connectors writing to external systems.
# Retriable errors (e.g. network timeouts) are retried automatically;
# non-retriable errors (e.g. schema mismatches) go to the DLQ immediately.
{
  "errors.retry.timeout": "300000",
  "errors.retry.delay.max.ms": "60000",

  "errors.deadletterqueue.topic.name": "dlq.my-connector",
  "errors.deadletterqueue.topic.replication.factor": "3",
  "errors.deadletterqueue.context.headers.enable": "true",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true"
}
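Given these settings, the retry schedule is an exponential backoff capped at errors.retry.delay.max.ms until errors.retry.timeout is exhausted. A rough sketch of the resulting window, assuming a 300 ms initial delay and omitting the jitter the real implementation applies:

```python
# Approximate retry schedule implied by the settings above: delay doubles
# from an assumed 300 ms initial value, capped at delay_max_ms, and retries
# stop once the cumulative delay would exceed timeout_ms.
def retry_delays_ms(timeout_ms: int, delay_max_ms: int, initial_ms: int = 300) -> list[int]:
    delays, elapsed, delay = [], 0, initial_ms
    while elapsed + delay <= timeout_ms:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * 2, delay_max_ms)
    return delays

ds = retry_delays_ms(timeout_ms=300_000, delay_max_ms=60_000)
print(len(ds), sum(ds))  # number of retries, total backoff spent
```

With a 5-minute timeout and a 60-second cap, a transient outage gets roughly a dozen retries before the record is routed to the DLQ.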

Monitoring Kafka Connect in Production

Critical JMX Metrics

Metric | Alert Threshold
kafka.connect:type=source-task-metrics,connector=*,task=* source-record-poll-rate | Below expected baseline for 5 min
kafka.connect:type=connector-metrics status | Not RUNNING
kafka.connect:type=task-error-metrics total-errors-logged | Rate > 0 sustained for 5 min
kafka.connect:type=connector-task-metrics offset-commit-success-percentage | < 100%

REST API Health Checks

# Check all connector statuses
curl -s "http://connect:8083/connectors?expand=status" | \
  jq 'to_entries[] | {
    connector: .key,
    state: .value.status.connector.state,
    tasks: [.value.status.tasks[].state]
  }'

# Example output:
# { "connector": "postgres-orders-source", "state": "RUNNING", "tasks": ["RUNNING", "RUNNING", "RUNNING", "RUNNING"] }
# { "connector": "s3-events-sink",         "state": "RUNNING", "tasks": ["RUNNING", "FAILED", "RUNNING"] }
#                                                                                      ^--- alert on this

# Restart a single failed task
curl -X POST \
  http://connect:8083/connectors/s3-events-sink/tasks/1/restart
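This check is easy to automate. Given the payload returned by GET /connectors?expand=status, a stdlib-only sketch (ours) that lists the tasks needing a restart:

```python
# Find non-RUNNING tasks in the payload returned by
# GET /connectors?expand=status. Pure parsing; the HTTP fetch is omitted.
def failed_tasks(status_payload: dict) -> list[tuple[str, int]]:
    out = []
    for name, entry in status_payload.items():
        for task in entry["status"]["tasks"]:
            if task["state"] != "RUNNING":
                out.append((name, task["id"]))
    return out

payload = {
    "s3-events-sink": {"status": {
        "connector": {"state": "RUNNING"},
        "tasks": [{"id": 0, "state": "RUNNING"}, {"id": 1, "state": "FAILED"}],
    }},
}
print(failed_tasks(payload))  # [('s3-events-sink', 1)]
```

Each (connector, task_id) pair maps directly to a POST /connectors/{name}/tasks/{id}/restart call.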

Managing the Connector Lifecycle

Safe In-Place Configuration Updates

When updating a connector's configuration, don't delete and recreate it under a new name: offsets are keyed by connector name, so a renamed connector starts from scratch and replays all data from the beginning. Use PUT to update the config in place; the change is applied atomically and offsets are preserved:

# Safe update — offsets are preserved
curl -X PUT http://connect:8083/connectors/my-connector/config \
  -H "Content-Type: application/json" \
  -d @updated-config.json

# Only use DELETE when you intend to decommission the connector entirely.
# (Since Kafka 3.6 / KIP-875, offsets can be reset explicitly with
#  DELETE /connectors/{name}/offsets while the connector is stopped.)
# curl -X DELETE http://connect:8083/connectors/my-connector

Pause / Resume for Maintenance

# Pause without losing offset position
curl -X PUT http://connect:8083/connectors/my-connector/pause

# Resume after maintenance window
curl -X PUT http://connect:8083/connectors/my-connector/resume

Key Takeaways

Always configure DLQs with errors.tolerance=all in production — stopping on one bad record is rarely the right tradeoff.
Set tasks.max based on partition count for sinks and table count for JDBC sources.
Use PUT /connectors/{name}/config to update configuration — DELETE loses offset checkpoints.
Monitor connector state via the REST API on a regular schedule and alert on any non-RUNNING task.
Store credentials in external secrets files, never inline in connector configuration JSON.
Tune worker JVM heap size (4–8 GB typical) and use G1GC to avoid GC-induced lag spikes.

Monitor All Your Kafka Connect Clusters

KLogic gives you a unified view of every connector and task across all your Kafka Connect clusters — with automatic alerting when tasks fail and DLQ message inspection built in.

Request a Demo