@Checkpoint - FlowWarden

The @Checkpoint annotation enables automatic persistence of MongoDB Change Stream resume tokens with at-least-once delivery guarantees. FlowWarden tracks two independent tokens — lastSeenToken (advances on every event received, even ones rejected by @Filter) and lastProcessedToken (advances only on confirmed handler success) — and combines them in a 3-level resume cascade on restart.

Basic Usage

@ChangeStream(name = "order-watcher", collection = "orders")
@Checkpoint
public class OrderStreamHandler {

    @OnInsert
    void handle(ChangeStreamContext<Order> ctx) {
        System.out.println(ctx.summary());
    }
}

With the default configuration, a checkpoint is saved after every event and the stream resumes from the last processed token on restart.

Attributes

Attribute	Type	Default	Description
`saveEveryN`	`int`	`1`	Persist `lastProcessedToken` every N successful handler invocations. Must be ≥ 1 — rejected at startup otherwise.
`saveIntervalSeconds`	`int`	`5`	Heartbeat timer interval in seconds — advances only `lastSeenToken` on each tick using the most recent received event’s token. Set to `0` to disable (also disables the resume cascade level-2 safety net).
`startPosition`	`StartPosition`	`RESUME`	Where to start consuming when the stream starts. With `RESUME`, the dual-token cascade decides the actual position.
`onHistoryLost`	`OnHistoryLost`	`FAIL`	Strategy when both persisted tokens are unusable (cascade level 3).
`resumeStrategy`	`ResumeStrategy`	`PROCESSED_FIRST`	Which of the two persisted tokens the cascade tries first. `PROCESSED_FIRST` preserves strict at-least-once; `SEEN_FIRST` trades in-flight re-delivery for fast restart on low-volume / filter-heavy streams.

With the default MongoDB-backed store, checkpoints are kept in the _fw_checkpoints collection of your MongoDB database. The collection name is not configurable.

Setting saveIntervalSeconds = 0 together with saveEveryN > 1 emits a startup warning: the combination disables the cascade level-2 safety net entirely, so a crash mid-handler can force the stream to fall back to onHistoryLost.
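The two startup checks on these attributes can be modelled in a few lines. This is a minimal sketch, not FlowWarden's code: the class name `CheckpointSettingsCheck`, the `validate` method, the warning text, and the use of `IllegalArgumentException` as the failure type are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of the startup checks; hypothetical names, not FlowWarden API. */
final class CheckpointSettingsCheck {

    /**
     * Returns the warnings a configuration would produce, or throws if it is invalid.
     * IllegalArgumentException is a stand-in for the framework's startup failure.
     */
    static List<String> validate(int saveEveryN, int saveIntervalSeconds) {
        if (saveEveryN < 1) {
            throw new IllegalArgumentException("saveEveryN must be >= 1, got " + saveEveryN);
        }
        List<String> warnings = new ArrayList<>();
        // Heartbeat off plus a sparse counter leaves no level-2 safety net.
        if (saveIntervalSeconds == 0 && saveEveryN > 1) {
            warnings.add("heartbeat disabled with saveEveryN > 1: no cascade level-2 safety net");
        }
        return warnings;
    }

    public static void main(String[] args) {
        System.out.println(validate(1, 5));   // default configuration, no warnings expected
        System.out.println(validate(10, 0));  // the combination that triggers the warning
    }
}
```

Returning warnings as a list rather than logging them keeps the sketch free of any logging dependency.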

`saveEveryN`

Controls how often lastProcessedToken is persisted, based on successful handler invocations. Setting this to a higher value reduces write pressure on MongoDB at the cost of potentially replaying more events on crash recovery (cascade level 1).

// Save lastProcessedToken every 10 successful events
@Checkpoint(saveEveryN = 10)

saveEveryN < 1 is rejected at startup with a BeanCreationException. Use saveIntervalSeconds = 0 if you want to disable the heartbeat — there is no way to disable the per-event counter.
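To make the trade-off concrete, the counter can be modelled as below. This is a sketch under stated assumptions: `EveryNCheckpointer` and its methods are hypothetical names, and an in-memory list stands in for the checkpoint store.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative per-event counter; hypothetical names, not FlowWarden API. */
final class EveryNCheckpointer {
    private final int saveEveryN;
    private final List<String> persisted = new ArrayList<>(); // stands in for the checkpoint store
    private int successesSinceSave = 0;

    EveryNCheckpointer(int saveEveryN) {
        if (saveEveryN < 1) throw new IllegalArgumentException("saveEveryN must be >= 1");
        this.saveEveryN = saveEveryN;
    }

    /** Call after the handler returns successfully for the event carrying this token. */
    void onHandlerSuccess(String resumeToken) {
        successesSinceSave++;
        if (successesSinceSave == saveEveryN) {
            persisted.add(resumeToken); // persist lastProcessedToken
            successesSinceSave = 0;
        }
    }

    /** Handled events that would be replayed if the process crashed right now. */
    int unsavedSuccesses() { return successesSinceSave; }

    List<String> persistedTokens() { return persisted; }

    public static void main(String[] args) {
        EveryNCheckpointer cp = new EveryNCheckpointer(10);
        for (int i = 1; i <= 25; i++) cp.onHandlerSuccess("token-" + i);
        System.out.println(cp.persistedTokens());  // tokens saved at events 10 and 20
        System.out.println(cp.unsavedSuccesses()); // 5 handled events not yet checkpointed
    }
}
```

In this model, a crash replays at most N - 1 already-handled events in addition to whatever was in flight, which is why idempotent handlers matter more as saveEveryN grows.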

`saveIntervalSeconds`

A periodic timer that persists only lastSeenToken at the specified interval, even when no events are arriving and even for events rejected by @Filter. This is the cascade level-2 safety net — it keeps the resume position fresh on idle streams or streams where saveEveryN rarely triggers.

// Heartbeat every 30 seconds — advances only lastSeenToken
@Checkpoint(saveIntervalSeconds = 30)

// Disable the heartbeat (also disables cascade level 2)
@Checkpoint(saveIntervalSeconds = 0)

saveEveryN and saveIntervalSeconds are independent and complementary. The timer targets the lastSeenToken field on each tick; the counter targets the lastProcessedToken field after handler success. They never race or overwrite each other — they write different fields of the same document.
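The "different fields of the same document" point can be illustrated with an in-memory model. The sketch below assumes nothing about FlowWarden's internals: `DualTokenCheckpoint` and its methods are hypothetical, and two `AtomicReference` fields stand in for the two document fields.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative checkpoint document with two independently written fields. */
final class DualTokenCheckpoint {
    private final AtomicReference<String> lastSeenToken = new AtomicReference<>();
    private final AtomicReference<String> lastProcessedToken = new AtomicReference<>();
    private volatile String mostRecentReceived; // in-memory only until the next tick

    /** Called for every event received from the stream, including filtered ones. */
    void onEventReceived(String token) { mostRecentReceived = token; }

    /** Heartbeat path: writes only lastSeenToken. */
    void heartbeatTick() {
        String token = mostRecentReceived;
        if (token != null) lastSeenToken.set(token);
    }

    /** Counter path: writes only lastProcessedToken. */
    void saveProcessed(String token) { lastProcessedToken.set(token); }

    String seen() { return lastSeenToken.get(); }
    String processed() { return lastProcessedToken.get(); }

    public static void main(String[] args) {
        DualTokenCheckpoint doc = new DualTokenCheckpoint();
        doc.onEventReceived("t1");
        doc.saveProcessed("t1");   // handler succeeded on t1
        doc.onEventReceived("t2"); // t2 received but rejected by a filter
        doc.heartbeatTick();       // persists the seen token without touching the processed one
        System.out.println("seen=" + doc.seen() + " processed=" + doc.processed());
    }
}
```

A real timer could drive `heartbeatTick()` through `ScheduledExecutorService.scheduleAtFixedRate`. Because each path writes its own field, neither needs to read or lock the other's value.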

`startPosition`

Determines where the stream begins consuming each time it starts: from the persisted checkpoint (RESUME) or from the current moment (LATEST).

Value	Description
`RESUME`	Resume from the last persisted checkpoint (event-based or heartbeat). If no checkpoint exists (first-ever start, or `saveIntervalSeconds = 0` with no events processed yet), starts from now. (default)
`LATEST`	Ignore any existing checkpoint and start from now. Useful to skip a backlog after a long outage.

Resume cascade

On restart with startPosition = RESUME, FlowWarden applies a 3-level cascade to decide where to resume the Change Stream. Which token sits at each level depends on the resumeStrategy:

PROCESSED_FIRST (default)

Level	Token used	Behavior
1	`lastProcessedToken`	Strict at-least-once. Events in flight at crash time are re-delivered.
2	`lastSeenToken`	Fallback when `lastProcessedToken` has aged out of the oplog. In-flight events at crash time are not re-delivered, but the stream avoids a `ChangeStreamHistoryLost`. Emits a `WARN` log and increments `flowwarden.stream.resume.fallback_to_seen`.
3	(none)	Both tokens unusable: apply the `onHistoryLost` strategy. Increments `flowwarden.stream.resume.history_lost`.

SEEN_FIRST

Level	Token used	Behavior
1	`lastSeenToken`	Fast restart from the heartbeat-fresh token. Events in flight at crash time may be skipped (whatever was past `lastProcessedToken` but before `lastSeenToken`).
2	`lastProcessedToken`	Fallback when `lastSeenToken` has aged out of the oplog (rare, since the heartbeat keeps it fresh). At-least-once delivery is preserved for this fallback. Emits a `WARN` log and increments `flowwarden.stream.resume.fallback_to_processed`.
3	(none)	Both tokens unusable: apply the `onHistoryLost` strategy. Increments `flowwarden.stream.resume.history_lost`.

The cascade only runs when startPosition = RESUME (the default). With startPosition = LATEST, both persisted tokens are ignored and the stream starts from the current moment.
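The cascade order can be expressed as a small decision function. The following is a sketch, not FlowWarden's implementation: `ResumeCascade`, `Decision`, and the `usable` predicate (a stand-in for "this token is still inside the oplog window") are hypothetical, and level 0 is an invented marker for the documented "no checkpoint exists, start from now" case. It uses records, so it assumes Java 16 or later.

```java
import java.util.Optional;
import java.util.function.Predicate;

/** Illustrative model of the 3-level resume cascade; hypothetical names throughout. */
final class ResumeCascade {
    enum ResumeStrategy { PROCESSED_FIRST, SEEN_FIRST }

    /** Cascade level reached and the token to resume from (empty at levels 0 and 3). */
    record Decision(int level, Optional<String> token) {}

    static Decision resolve(ResumeStrategy strategy, String lastProcessedToken,
                            String lastSeenToken, Predicate<String> usable) {
        if (lastProcessedToken == null && lastSeenToken == null) {
            return new Decision(0, Optional.empty()); // no checkpoint yet: start from now
        }
        boolean processedFirst = strategy == ResumeStrategy.PROCESSED_FIRST;
        String first = processedFirst ? lastProcessedToken : lastSeenToken;
        String second = processedFirst ? lastSeenToken : lastProcessedToken;
        if (first != null && usable.test(first)) {
            return new Decision(1, Optional.of(first));
        }
        if (second != null && usable.test(second)) {
            return new Decision(2, Optional.of(second)); // fallback: WARN log and metric
        }
        return new Decision(3, Optional.empty()); // hand over to onHistoryLost
    }

    public static void main(String[] args) {
        // Pretend only the seen token is still inside the oplog window.
        Predicate<String> usable = token -> token.equals("seen-42");
        System.out.println(resolve(ResumeStrategy.PROCESSED_FIRST, "processed-17", "seen-42", usable)); // level 2
        System.out.println(resolve(ResumeStrategy.SEEN_FIRST, "processed-17", "seen-42", usable));      // level 1
    }
}
```

The same pair of tokens lands on a different level depending on the strategy, which is the whole effect of resumeStrategy: it changes the order in which the tokens are tried, not which tokens are available.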

`onHistoryLost`

Determines the recovery strategy at cascade level 3, when neither lastProcessedToken nor lastSeenToken can resume the stream. Reaching level 3 means the downtime exceeded the oplog window and the heartbeat timer was either disabled or its lastSeenToken has also aged out, which is genuinely abnormal.

Strategy	Events in gap	Behavior
`FAIL`	Unknown	Stream refuses to start. Throws `HistoryLostException`. Operator must intervene. (default)
`RESUME_FROM_OPLOG_START`	Partially lost	Resumes from the oldest available oplog entry. Falls back to `RESUME_FROM_NOW` if the oplog is inaccessible.
`RESUME_FROM_NOW`	All lost	Starts from the current moment. The entire gap is skipped — no replay.

// Default: fail loudly — operator must decide
@Checkpoint(onHistoryLost = OnHistoryLost.FAIL)

// Resume from oldest available oplog entry (partial recovery)
@Checkpoint(onHistoryLost = OnHistoryLost.RESUME_FROM_OPLOG_START)

// Skip gap entirely — for streams where missing events is acceptable
@Checkpoint(onHistoryLost = OnHistoryLost.RESUME_FROM_NOW)

FAIL is the default because silent data loss is worse than a stream that refuses to start. Only use RESUME_FROM_NOW or RESUME_FROM_OPLOG_START for streams where missing events is acceptable (e.g., cache invalidation, non-critical notifications).
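The three strategies, including the documented fallback from RESUME_FROM_OPLOG_START to RESUME_FROM_NOW, reduce to one switch. This sketch mirrors the documented enum values, but `HistoryLostPolicy`, `StartPoint`, and the `oplogAccessible` flag are hypothetical, and `IllegalStateException` stands in for the framework's HistoryLostException. It assumes Java 14 or later for switch expressions.

```java
/** Illustrative level-3 decision; hypothetical names, not FlowWarden's implementation. */
final class HistoryLostPolicy {
    enum OnHistoryLost { FAIL, RESUME_FROM_OPLOG_START, RESUME_FROM_NOW }
    enum StartPoint { OLDEST_OPLOG_ENTRY, NOW }

    static StartPoint apply(OnHistoryLost strategy, boolean oplogAccessible) {
        return switch (strategy) {
            // Refuse to start: the operator has to decide what to do about the gap.
            case FAIL -> throw new IllegalStateException("history lost: operator must intervene");
            // Partial recovery, degrading to "now" when the oplog cannot be read.
            case RESUME_FROM_OPLOG_START ->
                    oplogAccessible ? StartPoint.OLDEST_OPLOG_ENTRY : StartPoint.NOW;
            // Skip the whole gap.
            case RESUME_FROM_NOW -> StartPoint.NOW;
        };
    }

    public static void main(String[] args) {
        System.out.println(apply(OnHistoryLost.RESUME_FROM_OPLOG_START, true));
        System.out.println(apply(OnHistoryLost.RESUME_FROM_OPLOG_START, false));
        System.out.println(apply(OnHistoryLost.RESUME_FROM_NOW, true));
    }
}
```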

HistoryLostException

When onHistoryLost = FAIL, the framework throws a HistoryLostException with an actionable message:

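// Signature summary only: method bodies are omitted.
public class HistoryLostException extends RuntimeException {
    // Name of the stream that could not resume
    public String getStreamName();
    // Timestamp of the last persisted checkpoint
    public Instant getLastCheckpointTimestamp();
}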

The exception message includes the stream name, the last checkpoint timestamp, and suggestions for recovery strategies.
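The two accessors are enough to build an operator alert. The sketch below is self-contained, so it declares a stand-in exception with the same two getters. The constructor, the message format, and the `describe` helper are assumptions, and this page does not say where the real exception surfaces in an application.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative alert formatting around a stand-in for HistoryLostException. */
final class HistoryLostAlert {

    /** Stand-in exposing the two documented accessors; everything else is assumed. */
    static class HistoryLostException extends RuntimeException {
        private final String streamName;
        private final Instant lastCheckpointTimestamp;

        HistoryLostException(String streamName, Instant lastCheckpointTimestamp) {
            super("Stream '" + streamName + "' cannot resume; last checkpoint at " + lastCheckpointTimestamp);
            this.streamName = streamName;
            this.lastCheckpointTimestamp = lastCheckpointTimestamp;
        }

        public String getStreamName() { return streamName; }
        public Instant getLastCheckpointTimestamp() { return lastCheckpointTimestamp; }
    }

    /** Builds a one-line summary stating how old the last checkpoint is. */
    static String describe(HistoryLostException e, Instant now) {
        Duration age = Duration.between(e.getLastCheckpointTimestamp(), now);
        return e.getStreamName() + " lost history; last checkpoint " + age.toHours() + "h ago";
    }

    public static void main(String[] args) {
        HistoryLostException e = new HistoryLostException(
                "order-watcher", Instant.parse("2024-01-01T00:00:00Z"));
        System.out.println(describe(e, Instant.parse("2024-01-02T06:00:00Z")));
    }
}
```

Comparing the checkpoint age against the configured oplog window is usually the first diagnostic step, which is why the timestamp accessor is the more useful of the two.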

`resumeStrategy`

Picks which of the two persisted tokens the cascade tries first. The other token remains the cascade fallback before onHistoryLost, so the dual-token safety net applies under both modes.

Mode	Cascade order	In-flight events on crash	Restart cost
`PROCESSED_FIRST` (default)	`lastProcessedToken` → `lastSeenToken` → `onHistoryLost`	Re-delivered (strict at-least-once)	MongoDB scans from `lastProcessedToken`, which can be far behind the oplog head on low-volume or filter-heavy streams.
`SEEN_FIRST`	`lastSeenToken` → `lastProcessedToken` → `onHistoryLost`	May be skipped	MongoDB scans from `lastSeenToken`, kept fresh by the heartbeat — typically only a few seconds of events.

// Default — strict at-least-once
@Checkpoint(resumeStrategy = ResumeStrategy.PROCESSED_FIRST)

// Fast restart on low-volume streams — heartbeat-fresh seen token first
@Checkpoint(resumeStrategy = ResumeStrategy.SEEN_FIRST)

Pick SEEN_FIRST when the cost of a long oplog scan on restart outweighs the cost of skipping a few in-flight events — typically streams that emit one event per week, or streams where a @Filter rejects almost everything so lastProcessedToken rarely advances. For critical streams (billing, audit), keep PROCESSED_FIRST.

SEEN_FIRST still falls back to lastProcessedToken before escalating to onHistoryLost. Combining SEEN_FIRST with onHistoryLost = RESUME_FROM_NOW makes the stream start from the current moment when neither token is usable, but it does not skip the intermediate fallback: if lastProcessedToken is still valid, level 2 fires first, and RESUME_FROM_NOW applies only at level 3.

Comprehensive Example

@ChangeStream(name = "order-stream", collection = "orders", documentType = Order.class)
@Checkpoint(saveEveryN = 5, saveIntervalSeconds = 10, startPosition = StartPosition.LATEST)
public class OrderStreamHandler {

    @OnInsert
    void handle(ChangeStreamContext<Order> ctx) {
        Order order = ctx.getFullDocument(Order.class);
        log.info("New order: {}", order.getId());
    }
}

The same handler in reactive style, returning Mono<Void>:

@ChangeStream(name = "order-stream", collection = "orders", documentType = Order.class)
@Checkpoint(saveEveryN = 5, saveIntervalSeconds = 10, startPosition = StartPosition.LATEST)
public class OrderStreamHandler {

    @OnInsert
    Mono<Void> handle(ChangeStreamContext<Order> ctx) {
        Order order = ctx.getFullDocument(Order.class);
        log.info("New order: {}", order.getId());
        return Mono.empty();
    }
}

How It Works

FlowWarden uses a dual-token checkpoint model internally:

Token	Advances when	Persisted by
`lastSeenToken`	Any event is received from MongoDB (including events rejected by `@Filter`)	The heartbeat timer (`saveIntervalSeconds`)
`lastProcessedToken`	The handler returns successfully	The per-event counter (`saveEveryN`)

This separation enables at-least-once delivery guarantees: if the application crashes after receiving an event but before the handler succeeds, the stream resumes from lastProcessedToken on the next start (cascade level 1) and the in-flight event is re-delivered.
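A short simulation shows how the two tokens drift apart. This is an illustrative model only: `DualTokenSimulation` is a hypothetical class, integer event ids double as resume tokens, and a handler that returns false stands in for a crash mid-handler.

```java
import java.util.function.IntPredicate;

/** Illustrative simulation of lastSeenToken and lastProcessedToken; not FlowWarden code. */
final class DualTokenSimulation {
    int lastSeenToken = 0;      // 0 means "nothing received yet"
    int lastProcessedToken = 0; // 0 means "nothing processed yet"

    /** Feeds events in order and stops at the first handler failure (the simulated crash). */
    void run(int[] events, IntPredicate filter, IntPredicate handlerSucceeds) {
        for (int event : events) {
            lastSeenToken = event;                    // advances on every received event
            if (!filter.test(event)) continue;        // rejected by the filter: handler not invoked
            if (!handlerSucceeds.test(event)) return; // crash: processed token stays where it was
            lastProcessedToken = event;               // advances only on handler success
        }
    }

    public static void main(String[] args) {
        DualTokenSimulation sim = new DualTokenSimulation();
        // The filter accepts odd ids; the handler crashes on event 5.
        sim.run(new int[] {1, 2, 3, 4, 5, 6}, e -> e % 2 == 1, e -> e != 5);
        System.out.println("lastSeenToken=" + sim.lastSeenToken
                + " lastProcessedToken=" + sim.lastProcessedToken);
    }
}
```

After the simulated crash the seen token is 5 and the processed token is 3. Resuming after token 3 delivers events 4, 5 and 6 again, so event 5 gets a second attempt. Resuming after token 5 would skip it.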

The model covers three scenarios: crash mid-handler (cascade level 1), idle or filtered stream (heartbeat), and long outage (cascade level 2 fallback).

Interaction with @Pipeline

When a @Pipeline is present, it filters events server-side, creating a gap between the actual oplog position and the last event your handler processed. The lastSeenToken tracks the oplog position independently (via periodic sampling), so on restart, MongoDB resumes from the most recent position, not from the last processed event.

Without this mechanism, a restart would force MongoDB to re-scan the entire oplog from lastProcessedToken, potentially scanning tens of thousands of filtered-out events.

See the @Pipeline reference for details.

Storage

Checkpoints are persisted via the CheckpointStore SPI. The MongoDB-backed implementation (MongoCheckpointStore / ReactiveMongoCheckpointStore) is auto-configured by default; a Redis-backed alternative ships in flowwarden-redis. Custom backends (JDBC, in-memory, …) plug in by declaring a @Bean CheckpointStore.

Best Practices

Keep saveEveryN = 1 for critical streams (e.g., order processing, billing) to minimize event replay on crash.
Increase saveEveryN for high-throughput streams where replaying a few events is acceptable — this reduces checkpoint write pressure.
Do not disable saveIntervalSeconds unless you have a specific reason. The heartbeat is the cascade level-2 safety net and the only mechanism that advances lastSeenToken on idle or filter-heavy streams.
Use startPosition = LATEST only for streams where historical events are irrelevant (e.g., cache invalidation).
Alert on flowwarden.stream.resume.fallback_to_seen — under PROCESSED_FIRST, a recurring level-2 fallback signals that lastProcessedToken ages out faster than the handler can drain its backlog, typically a sign of a stuck or retry-pathological handler.
Alert on flowwarden.stream.resume.fallback_to_processed — under SEEN_FIRST, this fires only when the heartbeat-fresh lastSeenToken has aged out, which is unusual and worth investigating (heartbeat timer disabled? saveIntervalSeconds too long? outage longer than the oplog window?).
Alert on flowwarden.stream.resume.history_lost — reaching level 3 means downtime overran the oplog window with no usable fallback; investigate operationally rather than silencing it.

MongoDB must be configured as a Replica Set for Change Streams (and therefore checkpointing) to work. This applies to development environments as well. Testcontainers automatically provisions a single-node Replica Set for testing.

See Also

Checkpoint & Resume Guide: Step-by-step guide to configuring checkpoint and resume behavior
@ChangeStream: The main annotation for declaring Change Stream handlers
@Pipeline: Server-side filtering and its interaction with dual-token checkpointing
@RetryPolicy: Configure retry behavior for failed event processing