Kafka Connect Best Practices for Production
Kafka Connect is deceptively simple to start and notoriously difficult to operate at scale. This guide covers the connector lifecycle, task parallelism, error handling, and the monitoring setup that keeps pipelines healthy in production.
Connector Configuration Fundamentals
A well-configured connector prevents the majority of production incidents. Start with conservative settings and tune upward based on observed throughput and lag.
JDBC Source Connector — Production Template
{
"name": "postgres-orders-source",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:postgresql://db:5432/orders",
"connection.user": "${file:/opt/kafka/secrets.properties:db.user}",
"connection.password": "${file:/opt/kafka/secrets.properties:db.pass}",
"mode": "timestamp+incrementing",
"timestamp.column.name": "updated_at",
"incrementing.column.name": "id",
"table.whitelist": "orders,order_items",
"topic.prefix": "postgres.",
"poll.interval.ms": "5000",
"batch.max.rows": "1000",
"tasks.max": "4",
"transforms": "addTimestamp",
"transforms.addTimestamp.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.addTimestamp.timestamp.field": "ingest_ts",
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.deadletterqueue.topic.name": "dlq.postgres-orders-source",
"errors.deadletterqueue.context.headers.enable": "true"
}
}

S3 Sink Connector — Production Template
{
"name": "s3-events-sink",
"config": {
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"tasks.max": "8",
"topics": "events.clickstream,events.pageviews",
"s3.region": "us-east-1",
"s3.bucket.name": "data-lake-prod",
"s3.part.size": "67108864",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
"parquet.codec": "snappy",
"flush.size": "50000",
"rotate.interval.ms": "600000",
"rotate.schedule.interval.ms": "3600000",
"locale": "en_US",
"timezone": "UTC",
"timestamp.extractor": "RecordField",
"timestamp.field": "event_ts",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"schema.compatibility": "FULL",
"errors.tolerance": "all",
"errors.deadletterqueue.topic.name": "dlq.s3-events-sink"
}
}

Task Parallelism and Scaling
Tasks are the unit of parallelism in Kafka Connect. Getting the task count right is critical: too few starves throughput, too many overwhelms the worker JVM and the downstream system.
Source Connector Task Sizing
For JDBC source connectors, the maximum useful task count equals the number of tables being polled: each table is assigned to at most one task, so extra tasks sit idle. Debezium connectors generally run a single task per connector regardless of tasks.max, because CDC requires reading one ordered change stream per source database.
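The table-count ceiling can be sanity-checked with a small sketch (a hypothetical helper, not part of Kafka Connect itself):

```python
# Rough sanity check for JDBC source tasks.max: the connector cannot
# usefully run more tasks than there are tables to poll.
def useful_jdbc_source_tasks(tables: list[str], tasks_max: int) -> int:
    """Tasks beyond the table count sit idle."""
    return min(tasks_max, len(tables))

# With the two tables from the template above, tasks.max=4 still yields
# only 2 working tasks.
print(useful_jdbc_source_tasks(["orders", "order_items"], 4))  # 2
```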
Sink Connector Task Sizing
For sink connectors, set tasks.max to at most the total partition count of the consumed topics; Connect cannot give a task less than one partition, so any tasks beyond that count sit idle. A task assigned more partitions than it can drain will accumulate lag, so scale tasks.max up toward the partition count as lag grows.
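The partition-count ceiling is easiest to see with a round-robin sketch (an illustration only; the real assignment is done by the consumer group's partition assignor):

```python
# Illustration (not Connect's actual assignor): spreading topic partitions
# across sink tasks round-robin shows why tasks.max beyond the total
# partition count buys nothing - the extra tasks receive no partitions.
def assign_partitions(num_partitions: int, tasks_max: int) -> list[list[int]]:
    tasks = [[] for _ in range(tasks_max)]
    for p in range(num_partitions):
        tasks[p % tasks_max].append(p)
    return tasks

# 12 partitions across 8 tasks: four tasks drain 2 partitions, four drain 1.
print(assign_partitions(12, 8))
# 12 partitions across 16 tasks: four tasks would sit idle.
idle = sum(1 for t in assign_partitions(12, 16) if not t)
print(idle)  # 4
```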
# Rebalance tasks after changing task count.
# Note: PUT replaces the ENTIRE connector config, so send the full
# config body with the new tasks.max, not just the changed key.
curl -X PUT http://connect:8083/connectors/s3-events-sink/config \
  -H "Content-Type: application/json" \
  -d '{"tasks.max": "16", ...}'
# Check task assignment
curl http://connect:8083/connectors/s3-events-sink/tasks | jq .

Worker Configuration Tuning
# connect-distributed.properties
group.id=connect-cluster-prod
bootstrap.servers=broker-1:9092,broker-2:9092,broker-3:9092

# Internal topic settings
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-statuses

# Performance
offset.flush.interval.ms=10000
offset.flush.timeout.ms=5000

# JVM tuning (set in KAFKA_OPTS)
# -Xms4g -Xmx4g -XX:+UseG1GC
# -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
Error Handling and Dead Letter Queues
The default error behavior — stopping the connector on the first bad record — is rarely appropriate in production. A DLQ-based strategy keeps pipelines flowing while preserving every failed record for inspection.
errors.tolerance=none (Default)
Connector stops on the first transformation or serialization error. Guarantees no data loss but causes pipeline downtime for a single bad record. Suitable only for critical financial pipelines where every record must succeed.
errors.tolerance=all (Recommended)
Failed records are routed to the DLQ topic with error context in headers. Pipeline continues uninterrupted. Set up a separate consumer to monitor and reprocess DLQ messages after fixing the root cause.
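When context headers are enabled, each DLQ record carries its error context in headers prefixed `__connect.errors.` (introduced by KIP-298, e.g. `__connect.errors.topic`, `__connect.errors.exception.class.name`). A minimal sketch of decoding them from a consumed record's header list, shaped like the `(key, bytes)` tuples most Kafka client libraries expose:

```python
# Sketch: decode the error-context headers Connect attaches to DLQ records
# when errors.deadletterqueue.context.headers.enable=true.
# Keys are prefixed "__connect.errors." (per KIP-298); values are bytes.
PREFIX = "__connect.errors."

def dlq_error_context(headers: list[tuple[str, bytes]]) -> dict[str, str]:
    """Return only the Connect error-context headers, decoded to strings."""
    return {
        k[len(PREFIX):]: v.decode("utf-8", errors="replace")
        for k, v in headers
        if k.startswith(PREFIX)
    }

ctx = dlq_error_context([
    ("__connect.errors.topic", b"events.clickstream"),
    ("__connect.errors.exception.class.name",
     b"org.apache.kafka.connect.errors.DataException"),
    ("trace-id", b"abc123"),  # unrelated header is ignored
])
print(ctx["topic"])  # events.clickstream
```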
Retry Configuration for Transient Failures
# For sink connectors writing to external systems.
# Retriable errors (network blips, timeouts) are retried automatically
# until errors.retry.timeout elapses; non-retriable errors (e.g. schema
# mismatches) go straight to the DLQ.
{
"errors.retry.timeout": "300000",
"errors.retry.delay.max.ms": "60000",
"errors.deadletterqueue.topic.name": "dlq.my-connector",
"errors.deadletterqueue.topic.replication.factor": "3",
"errors.deadletterqueue.context.headers.enable": "true",
"errors.log.enable": "true",
"errors.log.include.messages": "true"
}

Monitoring Kafka Connect in Production
Critical JMX Metrics
| Metric | Alert Threshold |
|---|---|
| kafka.connect:type=connector-task-metrics,connector=*,task=* source-record-poll-rate | Below expected baseline for 5 min |
| kafka.connect:type=connector-metrics status | Not RUNNING |
| kafka.connect:type=task-error-metrics total-errors-logged | Rate > 0 sustained for 5 min |
| kafka.connect:type=sink-task-metrics offset-commit-success-percentage | < 100% |
REST API Health Checks
# Check all connector statuses
curl -s http://connect:8083/connectors?expand=status | \
jq 'to_entries[] | {
connector: .key,
state: .value.status.connector.state,
tasks: [.value.status.tasks[].state]
}'
# Example output:
# { "connector": "postgres-orders-source", "state": "RUNNING", "tasks": ["RUNNING", "RUNNING", "RUNNING", "RUNNING"] }
# { "connector": "s3-events-sink", "state": "RUNNING", "tasks": ["RUNNING", "FAILED", "RUNNING"] }
# ^--- alert on this
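The same check is easy to automate. A sketch that takes the JSON returned by `GET /connectors?expand=status` (the same shape as the jq example above) and flags anything not healthy; the fetch itself is left to your HTTP client of choice:

```python
# Sketch: flag unhealthy connectors/tasks from the parsed JSON returned by
# GET /connectors?expand=status. PAUSED is treated as healthy here; adjust
# the set to taste.
HEALTHY = {"RUNNING", "PAUSED"}

def unhealthy(status_json: dict) -> list[str]:
    problems = []
    for name, entry in status_json.items():
        st = entry["status"]
        if st["connector"]["state"] not in HEALTHY:
            problems.append(f"{name}: connector {st['connector']['state']}")
        for task in st["tasks"]:
            if task["state"] not in HEALTHY:
                problems.append(f"{name}: task {task['id']} {task['state']}")
    return problems

example = {
    "s3-events-sink": {"status": {
        "connector": {"state": "RUNNING"},
        "tasks": [{"id": 0, "state": "RUNNING"}, {"id": 1, "state": "FAILED"}],
    }}
}
print(unhealthy(example))  # ['s3-events-sink: task 1 FAILED']
```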
# Restart a single failed task
curl -X POST \
http://connect:8083/connectors/s3-events-sink/tasks/1/restart

Managing the Connector Lifecycle
Blue-Green Connector Deployments
When updating a connector configuration, never delete and recreate it: recreating under a new name starts from scratch and replays all data from the beginning, while recreating under the same name can silently resume from stale offsets left behind by the old instance. Instead use PUT to update the config atomically:
# Safe update — offsets are preserved
curl -X PUT http://connect:8083/connectors/my-connector/config \
  -H "Content-Type: application/json" \
  -d @updated-config.json

# Only use DELETE when you truly intend to reset offsets
# curl -X DELETE http://connect:8083/connectors/my-connector
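Because PUT replaces the whole configuration rather than patching it, the safe pattern is read-modify-write: GET the current config, change only the keys you need, and PUT the merged result back. A sketch of the merge step (the HTTP calls themselves are indicated, not executed here):

```python
# Sketch: a partial PUT body would silently drop every setting it omits,
# so overlay your changes on the current config before writing it back.
def merged_config(current: dict, changes: dict) -> dict:
    """Overlay the changed keys on the connector's current config."""
    updated = dict(current)
    updated.update(changes)
    return updated

# current would come from: GET /connectors/s3-events-sink/config
current = {"connector.class": "io.confluent.connect.s3.S3SinkConnector",
           "tasks.max": "8", "flush.size": "50000"}
new = merged_config(current, {"tasks.max": "16"})
print(new["tasks.max"], new["flush.size"])  # 16 50000
# Then: curl -X PUT .../connectors/s3-events-sink/config -d '<new as JSON>'
```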
Pause / Resume for Maintenance
# Pause without losing offset position curl -X PUT http://connect:8083/connectors/my-connector/pause # Resume after maintenance window curl -X PUT http://connect:8083/connectors/my-connector/resume
Key Takeaways
Monitor All Your Kafka Connect Clusters
KLogic gives you a unified view of every connector and task across all your Kafka Connect clusters — with automatic alerting when tasks fail and DLQ message inspection built in.
Request a Demo