A settlement event is published to Kafka. The producer receives an acknowledgment. The partition leader crashes and fails over before the event replicates. The event disappears. Three days later, reconciliation discovers that 200 USD in payments never settled. This failure mode is real.
Message queues are core infrastructure in payment systems. They decouple settlement submission from settlement execution. A client publishes a settlement event. A consumer processes it asynchronously. If the consumer crashes, the queue holds the event until recovery.
This design works only if the queue actually holds the event. Many teams discover the hard way that message queues do not guarantee delivery by default.
Durability Configuration
Message queues exist on a spectrum from ephemeral to durable. On one end is a pure in-memory queue. Events are stored only in RAM. If the process crashes, all events are lost. No persistence to disk. In-memory queues are used for high-throughput, low-value events where loss is acceptable. Settlement events should never use in-memory queues.
On the other end is a database-backed queue. Each event is persisted to a durable database before the producer receives acknowledgment. This is maximally safe but slower. Event throughput is limited by database write latency.
In the middle are Kafka and RabbitMQ. Both offer configurable durability.
Kafka is a distributed log. With acks=1 (the default in producer clients before Kafka 3.0), the producer publishes an event and the partition leader acknowledges as soon as it has appended the event to its local log, which may still be only in the page cache. The event has not been replicated. If the leader crashes before the followers copy it, the event is lost. This configuration is fast and appropriate for telemetry. For settlement, it is unacceptable.
Kafka supports acks=all (or acks=-1). The producer publishes an event, and the leader waits until all in-sync replicas have confirmed the write before acknowledging. Combined with a topic-level min.insync.replicas of at least 2, this ensures the event survives any single broker crash.
acks=all adds latency but is mandatory for settlement. The producer must wait for the slowest in-sync replica to fetch and persist each batch. With a replication factor of 3, the followers replicate concurrently, so this costs roughly one additional replication round trip, gated by the slower of the two followers; on a network with 5 ms hops, expect several extra milliseconds per batch. For settlement events, this overhead is acceptable.
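The durability settings above can be collected into a producer configuration. A minimal sketch using the standard Kafka producer property names; the retry and timeout values are illustrative, and min.insync.replicas must be set separately on the topic:

```python
# Durable settlement-producer configuration, keyed by the standard
# Kafka producer property names. Timeout values are illustrative.
durable_producer_config = {
    "acks": "all",                    # leader acks only after all in-sync replicas confirm
    "enable.idempotence": "true",     # broker drops duplicates caused by producer retries
    "retries": "2147483647",          # keep retrying transient broker errors
    "delivery.timeout.ms": "120000",  # overall budget before a send is reported failed
}
# Note: the topic itself needs min.insync.replicas >= 2, otherwise
# acks=all can degrade to a single-replica acknowledgment.
```

Enabling idempotence alongside acks=all matters because the durable configuration retries aggressively, and retries without broker-side deduplication reintroduce duplicates at the log level.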
RabbitMQ offers durable queues and persistent messages. A durable queue is declared with durable=true and survives a broker restart. Persistent messages are published with delivery_mode=2. When both are set, messages are written to disk. RabbitMQ also supports publisher confirms: when enabled, the broker acknowledges each message only after it has been persisted, which is what actually tells the producer the event is safe.
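The three settings can be sketched with the pika client. A hedged sketch only, not executed here; the queue name and connection parameters are placeholders:

```python
def publish_settlement_durably(event_body: bytes) -> None:
    """Publish a settlement event with all three RabbitMQ durability settings.

    Sketch assuming the pika client library; "settlements" and localhost
    are placeholder names for illustration.
    """
    import pika  # third-party RabbitMQ client

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.confirm_delivery()  # publisher confirms: broker acks only after persistence
    ch.queue_declare(queue="settlements", durable=True)  # queue survives restart
    ch.basic_publish(
        exchange="",
        routing_key="settlements",
        body=event_body,
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
    )
    conn.close()
```

With confirm_delivery() enabled, a publish that the broker cannot persist surfaces as an error to the producer instead of being silently dropped.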
Deduplication and Failure Handling
Both Kafka and RabbitMQ, when configured durably, provide at-least-once delivery. An event published to the queue will be delivered to the consumer at least one time. It may be delivered more than one time.
This is fundamentally different from exactly-once delivery. Exactly-once delivery is impossible in a distributed system with failures; systems that advertise exactly-once semantics build it from at-least-once delivery plus deduplication. The consumer must therefore handle at-least-once semantics and deduplicate.
In a payment settlement system, at-least-once delivery combined with lack of deduplication causes duplicate settlements. A settlement event is published (Transfer 100 USD from A to B). The consumer receives it, executes the transfer, and updates the database. Before the consumer commits, the process crashes. The queue still holds the event. On recovery, the consumer receives the same event again. It executes the transfer again. Now 200 USD has been transferred.
This is prevented only by idempotent settlement processing combined with deduplication keys. Each settlement event must carry a unique idempotency key. The settlement processor must store the outcome of processing each key, durably and in the same transaction as the transfer itself. If the same key is received twice, the processor returns the cached outcome without executing a second transfer.
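The key-based deduplication can be sketched as follows. A minimal in-memory sketch; a production processor would keep the outcomes in a durable table written in the same database transaction as the transfer:

```python
import threading

class SettlementProcessor:
    """Executes each settlement at most once per idempotency key."""

    def __init__(self):
        self._outcomes = {}            # idempotency key -> cached outcome
        self._lock = threading.Lock()  # serialize check-then-execute

    def process(self, key, transfer):
        with self._lock:
            if key in self._outcomes:       # duplicate delivery: return the
                return self._outcomes[key]  # cached outcome, move no money
            outcome = transfer()            # the real transfer runs exactly once
            self._outcomes[key] = outcome   # record before acking the event
            return outcome

# Duplicate delivery of the same event executes the transfer only once.
executed = []
processor = SettlementProcessor()
processor.process("evt-42", lambda: executed.append(100) or "settled")
processor.process("evt-42", lambda: executed.append(100) or "settled")
assert executed == [100]  # second delivery was deduplicated
```

The lock matters: without it, two concurrent deliveries of the same event could both pass the duplicate check before either records an outcome.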
Dead letter queues handle failed settlement events. If a settlement event fails to process, it is not acknowledged to the main queue. Instead, it is published to a dead letter queue. Separate processing logic examines DLQ events, determines why they failed, and either fixes them or escalates them for manual review.
Kafka has no broker-level dead letter queue. Instead, the consumer catches failures and publishes the failed event to a separate Kafka topic. A monitoring system watches that topic and alerts on elevated failure rates.
RabbitMQ natively supports dead letter queues. When a message is nacked, it can be automatically routed to a dead letter exchange. This exchange then routes the message to a dead letter queue.
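Whether the broker routes the message or the consumer republishes it, the consumer-side pattern reduces to the same loop. A minimal sketch over in-memory deques, assuming the handler raises on a failed settlement; real code would ack and nack against the broker instead:

```python
from collections import deque

def consume_with_dlq(main_queue, dlq, handler):
    """Process events; route failures to a dead letter queue with the reason."""
    while main_queue:
        event = main_queue.popleft()
        try:
            handler(event)  # success: the ack path
        except Exception as exc:  # failure: the nack path
            dlq.append({"event": event, "error": str(exc)})  # keep why it failed

main, dead = deque([{"id": 1, "amount": 100}, {"id": 2, "amount": -5}]), deque()

def settle(event):
    if event["amount"] <= 0:
        raise ValueError("non-positive amount")

consume_with_dlq(main, dead, settle)
assert len(dead) == 1 and dead[0]["event"]["id"] == 2
```

Recording the failure reason alongside the event is what makes the later triage step (fix or escalate) possible without replaying the failure.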
Circuit breakers detect when the queue is unavailable and fail fast. The producer attempts to publish. If it times out, the circuit breaker increments a failure counter. After a threshold is reached (for example, 5 consecutive timeouts), the circuit breaker opens. Further attempts immediately fail without attempting the network call. If the queue is down, blocking on network calls is worse than failing immediately. The client should receive an error and retry at a higher level.
As the queue recovers, the circuit breaker enters a half-open state. One probe request is sent. If it succeeds, the circuit breaker closes and normal operation resumes.
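The closed / open / half-open cycle described above fits in a small class. A minimal sketch; the threshold and cooldown values are illustrative, and time.monotonic() stands in for whatever clock the system uses:

```python
import time

class CircuitBreaker:
    """Fail fast when the queue is down: closed -> open -> half-open -> closed."""

    def __init__(self, threshold=5, cooldown_s=30.0):
        self.threshold = threshold    # consecutive failures before opening
        self.cooldown_s = cooldown_s  # how long to fail fast before probing
        self.failures = 0
        self.opened_at = None         # None means the circuit is closed

    def call(self, publish):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let this one probe through
        try:
            result = publish()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            raise
        self.failures = 0             # normal call or probe succeeded: close
        self.opened_at = None
        return result

# Five consecutive timeouts open the circuit; the sixth call fails fast.
attempts = []
def timing_out_publish():
    attempts.append(1)
    raise TimeoutError("queue unreachable")

breaker = CircuitBreaker(threshold=5, cooldown_s=3600)
for _ in range(5):
    try:
        breaker.call(timing_out_publish)
    except TimeoutError:
        pass
try:
    breaker.call(timing_out_publish)
except RuntimeError:
    pass
assert len(attempts) == 5  # the sixth attempt never reached the network
```

Raising immediately when open is the point: the caller gets a fast error to retry at a higher level rather than a thread blocked on a dead broker.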
Message queue reliability is table stakes. Durability, deduplication, and monitoring form the foundation. Missing any of these is a recipe for silent payment loss.