The framework provides two annotations, @RetryPolicy and @DeadLetterQueue, to handle failures gracefully in Change Stream handlers. Together, they ensure that transient errors are retried automatically and permanent failures are captured for later investigation instead of being silently lost.
How Retry Works
When @RetryPolicy is placed on a handler class, the framework catches exceptions thrown by the handler and retries the invocation up to maxAttempts times with increasing delays between attempts.
A simple example:
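As a rough illustration of the behavior described above, the retry mechanism can be pictured as a plain loop around the handler. This is a sketch, not the framework's implementation: the class RetryLoopSketch and its run method are hypothetical, and it assumes that maxAttempts counts the initial invocation as attempt 1.

```java
/** Illustrative retry loop; NOT the framework's actual implementation. */
final class RetryLoopSketch {

    /**
     * Runs the handler until it succeeds or maxAttempts invocations have failed.
     * Assumption: maxAttempts includes the initial invocation (attempt 1).
     *
     * @return the 1-based attempt number on which the handler succeeded
     * @throws RuntimeException the last failure, once all attempts are used up
     */
    static int run(Runnable handler, int maxAttempts, long initialDelayMs, double multiplier) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMs = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.run();
                return attempt; // success: report which attempt worked
            } catch (RuntimeException e) {
                last = e; // remember the failure and fall through to the wait
            }
            if (attempt < maxAttempts) {
                sleep(delayMs);
                delayMs = (long) (delayMs * multiplier); // wait longer before the next retry
            }
        }
        throw last; // every attempt failed
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```

A handler that fails twice and then succeeds returns 3 from run; a handler that always fails causes the last exception to be rethrown after maxAttempts invocations.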
Exponential Backoff
The delay between retries is computed using exponential backoff with optional jitter. With initialDelay = "500ms", multiplier = 2.0, maxDelay = "30s", and jitter = false, the delays are:
| Attempt | Delay |
|---|---|
| 1 (initial) | — |
| 2 (1st retry) | 500ms |
| 3 (2nd retry) | 1s |
| 4 | 2s |
| 5 | 4s |
| 6 | 8s |
| 7 | 16s |
| 8+ | 30s (capped) |
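The values in the table are consistent with the formula delay = min(initialDelay × multiplier^(retry − 1), maxDelay), where retry is 1 for the first retry. The sketch below reproduces them; the class and method names are hypothetical, delays are expressed in milliseconds rather than duration strings, and jitter is left out because the way the framework applies it is not described here.

```java
/** Reproduces the delay table above; jitter is intentionally not modeled. */
final class BackoffSketch {

    /**
     * Delay in milliseconds before the given 1-based attempt.
     * Attempt 1 is the initial invocation and has no delay.
     */
    static long delayMillis(int attempt, long initialDelayMs, double multiplier, long maxDelayMs) {
        if (attempt <= 1) {
            return 0L; // the initial invocation runs immediately
        }
        // The first retry (attempt 2) waits initialDelay; each later retry multiplies it.
        double raw = initialDelayMs * Math.pow(multiplier, attempt - 2);
        // Cap at maxDelay; this also handles overflow to infinity for large attempts.
        return (long) Math.min(raw, (double) maxDelayMs);
    }
}
```

For example, BackoffSketch.delayMillis(7, 500, 2.0, 30_000) gives 16000, and any attempt from 8 onward gives 30000.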
Exception Filtering
Retry Only Specific Exceptions
Use retryOn to limit retries to specific exception types. Any other exception causes immediate failure.
Exclude Specific Exceptions
Use noRetryOn to skip retry for specific exceptions. This overrides retryOn: if an exception matches both, it is not retried.
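The precedence between the two filters can be sketched as a small decision function. The names below are hypothetical, and two details are assumptions rather than documented behavior: that an empty retryOn list means "retry everything", and that matching includes subclasses of the listed types.

```java
import java.util.List;

/** Illustrative retry filter; matching rules are assumptions, see comments. */
final class RetryFilterSketch {

    static boolean shouldRetry(Throwable error,
                               List<Class<? extends Throwable>> retryOn,
                               List<Class<? extends Throwable>> noRetryOn) {
        // noRetryOn is checked first, so it wins when an exception matches both lists.
        for (Class<? extends Throwable> excluded : noRetryOn) {
            if (excluded.isInstance(error)) { // assumption: subclasses also match
                return false;
            }
        }
        // Assumption: with no retryOn filter configured, every exception is retried.
        if (retryOn.isEmpty()) {
            return true;
        }
        for (Class<? extends Throwable> included : retryOn) {
            if (included.isInstance(error)) {
                return true;
            }
        }
        return false; // not listed in retryOn: fail immediately
    }
}
```

With retryOn = IOException and noRetryOn = FileNotFoundException, a plain IOException is retried, while a FileNotFoundException (which matches both) is not.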
Tracking Retry Attempts
Use ChangeStreamContext.getAttemptNumber() to know which attempt is currently running. This is 1-based: the initial invocation is attempt 1, the first retry is 2, and so on.
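One way to use the attempt number is to label log lines so retries are easy to spot. The sketch below does not depend on the framework: AttemptContext is a hypothetical stand-in that exposes only getAttemptNumber(), and the message format is an arbitrary choice.

```java
/** Turns a 1-based attempt number into a log-friendly label. */
final class AttemptLoggingSketch {

    /** Hypothetical stand-in for the part of ChangeStreamContext used here. */
    interface AttemptContext {
        int getAttemptNumber();
    }

    static String describe(AttemptContext ctx) {
        int attempt = ctx.getAttemptNumber();
        if (attempt == 1) {
            return "initial attempt";
        }
        // Attempt 2 is the first retry, attempt 3 the second, and so on.
        return "retry " + (attempt - 1) + " (attempt " + attempt + ")";
    }
}
```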
How the Dead Letter Queue Works
When @DeadLetterQueue is present on a handler class, events that fail (after all retries are exhausted, if @RetryPolicy is also present) are persisted to a dedicated MongoDB collection instead of being silently lost. The stream continues processing subsequent events without blocking.
A simple example:
Failed events are stored in the _fw_dlq collection with a 30-day TTL, including the original document and full stack trace.
Automatic DLQ Routing
When all retries are exhausted (or on first failure if no @RetryPolicy is present), the event is automatically sent to the DLQ. No extra code is needed; just add the annotation.
Manual DLQ Routing
You can manually send an event to the DLQ from any handler using ctx.sendToDlq(reason). This is useful when you detect a business-level error that shouldn’t be retried.
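A minimal sketch of that pattern follows. It is self-contained rather than wired to the framework: DlqContext is a hypothetical stand-in exposing only sendToDlq(reason), and the order document with its customerId field is invented for illustration.

```java
import java.util.Map;

/** Routes an invalid event to the DLQ instead of throwing and triggering retries. */
final class ManualDlqSketch {

    /** Hypothetical stand-in for the part of ChangeStreamContext used here. */
    interface DlqContext {
        void sendToDlq(String reason);
    }

    /** Returns true if the order was processed, false if it was sent to the DLQ. */
    static boolean handleOrder(Map<String, Object> order, DlqContext ctx) {
        // A missing required field is a business-level error: retrying cannot fix it.
        if (!order.containsKey("customerId")) {
            ctx.sendToDlq("Missing required field: customerId");
            return false;
        }
        // ... normal processing would go here ...
        return true;
    }
}
```

Returning normally after sendToDlq, rather than throwing, keeps the event out of the retry path.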
sendToDlq() requires a DlqStore bean to be available. The MongoDB implementation (MongoDlqStore) is auto-configured when @DeadLetterQueue is present on any stream.
DLQ Document Schema
Each failed event is stored as a MongoDB document in the DLQ collection:
DLQ document structure
| Field | Description |
|---|---|
| streamName | Name of the originating @ChangeStream |
| operationType | MongoDB operation (INSERT, UPDATE, DELETE, etc.) |
| documentKey | _id of the source document |
| fullDocument | Original document (if includeOriginalDocument = true) |
| error.type | Exception class name |
| error.message | Exception message |
| error.stackTrace | Full stack trace (if includeStackTrace = true) |
| attempts | Total number of processing attempts (including retries) |
| status | Event status (PENDING) |
| expiresAt | Computed from createdAt + ttlDays (null if ttlDays = 0) |
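The expiresAt rule in the last row can be written out as a small function. This is a sketch of the arithmetic only; the class and method names are hypothetical, and the use of java.time.Instant for createdAt is an assumption about the stored type.

```java
import java.time.Duration;
import java.time.Instant;

/** Computes expiresAt as described in the schema table. */
final class DlqExpirySketch {

    /** Returns createdAt + ttlDays, or null when ttlDays is 0 (entry never expires). */
    static Instant expiresAt(Instant createdAt, int ttlDays) {
        if (ttlDays == 0) {
            return null;
        }
        return createdAt.plus(Duration.ofDays(ttlDays));
    }
}
```

An entry created at 2024-01-01T00:00:00Z with ttlDays = 30 gets an expiresAt of 2024-01-31T00:00:00Z.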
Retry + DLQ Together
@DeadLetterQueue works with or without @RetryPolicy:
| Configuration | Behavior |
|---|---|
| @RetryPolicy + @DeadLetterQueue | Event is retried up to maxAttempts times. If all retries fail, the event is sent to the DLQ. |
| @DeadLetterQueue only | Event is sent to the DLQ immediately after the first failure. |
| @RetryPolicy only | Event is retried, but if all retries fail, the event is lost. |
| Neither | Event is lost on first failure. |
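The same four combinations can be expressed as a decision function, which may be handy when reasoning about a stream's failure mode in tests or reviews. The enum and method names are hypothetical and simply mirror the table rows.

```java
/** Encodes the configuration table above as a lookup. */
final class FailureOutcomeSketch {

    enum Outcome { RETRY_THEN_DLQ, DLQ_IMMEDIATELY, RETRY_THEN_LOST, LOST_IMMEDIATELY }

    /** What happens to an event whose handler keeps failing. */
    static Outcome onPersistentFailure(boolean hasRetryPolicy, boolean hasDeadLetterQueue) {
        if (hasRetryPolicy) {
            // Retries run first; the DLQ only decides what happens once they are exhausted.
            return hasDeadLetterQueue ? Outcome.RETRY_THEN_DLQ : Outcome.RETRY_THEN_LOST;
        }
        return hasDeadLetterQueue ? Outcome.DLQ_IMMEDIATELY : Outcome.LOST_IMMEDIATELY;
    }
}
```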
Complete Example
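The following sketch shows how the two annotations might be combined on one handler class. To compile on its own it declares stand-in versions of @RetryPolicy and @DeadLetterQueue: the attribute names are the ones used on this page, but their types and default values are assumptions, and the handler class name is invented. The handler body and @ChangeStream are omitted because their signatures are not described here. Check the reference pages for the real declarations.

```java
import java.io.IOException;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

final class CompleteExampleSketch {

    // Stand-in declaration: attribute types and defaults are assumptions.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface RetryPolicy {
        int maxAttempts() default 3;
        String initialDelay() default "500ms";
        double multiplier() default 2.0;
        String maxDelay() default "30s";
        boolean jitter() default true;
        Class<? extends Throwable>[] retryOn() default {};
        Class<? extends Throwable>[] noRetryOn() default {};
    }

    // Stand-in declaration: attribute types and defaults are assumptions.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface DeadLetterQueue {
        int ttlDays() default 30;
        boolean includeOriginalDocument() default true;
        boolean includeStackTrace() default true;
    }

    // Retry transient I/O failures up to 5 attempts, never retry validation errors,
    // and keep anything that still fails in the DLQ for 30 days.
    @RetryPolicy(
            maxAttempts = 5,
            initialDelay = "500ms",
            multiplier = 2.0,
            maxDelay = "30s",
            jitter = true,
            retryOn = {IOException.class},
            noRetryOn = {IllegalArgumentException.class})
    @DeadLetterQueue(ttlDays = 30, includeOriginalDocument = true, includeStackTrace = true)
    static final class OrderStreamHandler {
        // Handler method omitted: its signature depends on the framework.
    }
}
```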
Best Practices
- Always pair @RetryPolicy with @DeadLetterQueue for critical streams. Retry handles transient errors; the DLQ captures permanent failures.
- Keep maxAttempts low (3–5) for synchronous operations. Use the DLQ for reprocessing rather than excessive retries that block the stream.
- Use retryOn to narrow scope when your handler calls external services: only retry on transient exceptions (timeouts, connection errors), not on validation errors.
- Leave jitter = true in production to prevent synchronized retry storms when a shared dependency recovers.
- Monitor getAttemptNumber() in your handlers to add context to logs and metrics on retries.
- Use sendToDlq(reason) for business validation failures that you know retrying won’t fix (e.g., invalid data, missing required fields).
- Set ttlDays based on your investigation SLA. 30 days is a good default. Use ttlDays = 0 (no expiry) for regulatory or audit-critical streams.
- Disable includeOriginalDocument for large documents if storage is a concern. You can always look up the document by documentKey.
- Monitor the _fw_dlq collection. Growing DLQ entries indicate a handler issue that needs attention.
See Also
- @RetryPolicy Reference: All attributes, defaults, and YAML configuration for retry
- @DeadLetterQueue Reference: All attributes, DLQ Storage SPI, and YAML configuration
- @Checkpoint: Resume token persistence for reliable stream recovery
- ChangeStreamContext: Runtime context including sendToDlq(), attempt number, and more

