Replacing Axon Sagas with Stateful Event Handlers: What We Built Instead

A follow-up to Migrating from Axon Framework 4 to 5: What We Learned. The original migration post mentioned “the saga-to-stateful-handler refactoring was the biggest change but also the most valuable.” This post explains what that refactoring actually looks like — and why we didn’t reach for EventScheduler or DeadlineManager along the way.

The Question That Started This Post

After the migration write-up went up, a reader asked a sharp question:

Did you have to use event scheduler instead of a deadline manager for stateful event handling, since the scope for deadlines is only sagas and aggregates?

It’s a great question — and the answer is “neither.” We replaced both. That’s the surprising part of the redesign worth writing about: the assumption baked into the question (you need some Axon scheduling primitive) is the assumption we let go of. Once sagas were off the table, the things sagas had been hiding — coordination of multiple events, waiting until a future moment, and orchestrating long-running multi-step workflows — separated cleanly into three unrelated concerns, each better served by a tool we already had.

This post walks through what replaced sagas, what replaced deadlines, where workflow orchestration ended up, and the one production gotcha we hit on the way.

Why Sagas Had to Go

The motivation for the framework upgrade itself isn’t this post’s subject — that’s covered in the Axon 4 → 5 migration write-up. What this section is about is the saga-shaped choice we made during that migration.

Axon 5 kept sagas reachable through a legacy-support path. So strictly, “we had to redesign” overstates it — we could have lifted the existing sagas across more or less unchanged. What we couldn’t do was treat that as the forward direction. The framework’s signal was clear: stateful event handlers were where new work was meant to go, and sagas were on a path of being supported but no longer invested in.

That nudge mattered for a second reason. Our service runs a fully reactive Spring stack, and the v4 saga implementation we’d been carrying never composed cleanly with reactive event handling. Blocking entry points kept leaking back into reactive chains; we’d worked around it, but every saga touched in v4 was a small drag on the codebase.

Migration time turned that drag into a decision. With the framework signaling away from sagas anyway, and reactive composition already a known pain, we treated the redesign as an opportunity rather than a burden. Once we started rewriting the saga use cases by hand, we noticed they’d been hiding three unrelated shapes inside one abstraction — the split the rest of this post is about.

The Coordination Half: Stateful Event Handlers

The first thing sagas were doing for us was coordination: track an in-flight multi-step process across several events, decide when it’s complete, react to that completion. We replaced this with an ordinary event handler that reads and writes a row in a Postgres table.

The table is generic — one row per in-flight job, regardless of which module owns the job:

SQL
CREATE TABLE job_state (
    id              TEXT PRIMARY KEY,
    job_type        TEXT NOT NULL,
    correlation_id  TEXT NOT NULL,
    owner_id        TEXT NOT NULL,
    total_items     INT  NOT NULL,
    completed_items INT  NOT NULL DEFAULT 0,
    failed_items    INT  NOT NULL DEFAULT 0,
    status          TEXT NOT NULL,
    metadata        JSONB,
    created_at      TIMESTAMPTZ NOT NULL,
    updated_at      TIMESTAMPTZ NOT NULL,
    completed_at    TIMESTAMPTZ,
    version         BIGINT NOT NULL DEFAULT 0,
    UNIQUE (job_type, correlation_id)
);

A small reactive service wraps the table — createJob(...), markItemCompleted(...), markItemFailed(...), getJob(...), deleteJob(...). Then any module’s event handler that needs saga-like behavior just uses it.
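
To make the shape of that API concrete, here is a minimal sketch of the five operations as a plain, synchronous, in-memory class. It is not the reactive Postgres-backed service itself: the class name InMemoryJobStateSketch, the Key and JobState records, and the return types are all assumptions made for illustration, and only a subset of the table's columns is modeled.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the Postgres-backed service (illustration only).
class InMemoryJobStateSketch {

    // Mirrors the table's UNIQUE (job_type, correlation_id) constraint.
    record Key(String jobType, String correlationId) {}

    // Subset of the job_state columns, as an immutable snapshot.
    record JobState(String ownerId, int totalItems, int completedItems, int failedItems) {}

    private final ConcurrentHashMap<Key, JobState> rows = new ConcurrentHashMap<>();

    // Create-if-absent: a re-delivered "requested" event gets the existing row back.
    JobState createJob(String jobType, String correlationId, String ownerId, int totalItems) {
        JobState fresh = new JobState(ownerId, totalItems, 0, 0);
        JobState existing = rows.putIfAbsent(new Key(jobType, correlationId), fresh);
        return existing != null ? existing : fresh;
    }

    // computeIfPresent applies the increment atomically per key,
    // standing in for a single UPDATE statement.
    Optional<JobState> markItemCompleted(String jobType, String correlationId) {
        return Optional.ofNullable(rows.computeIfPresent(new Key(jobType, correlationId),
            (k, j) -> new JobState(j.ownerId(), j.totalItems(),
                                   j.completedItems() + 1, j.failedItems())));
    }

    Optional<JobState> markItemFailed(String jobType, String correlationId) {
        return Optional.ofNullable(rows.computeIfPresent(new Key(jobType, correlationId),
            (k, j) -> new JobState(j.ownerId(), j.totalItems(),
                                   j.completedItems(), j.failedItems() + 1)));
    }

    Optional<JobState> getJob(String jobType, String correlationId) {
        return Optional.ofNullable(rows.get(new Key(jobType, correlationId)));
    }

    // Returns true if a row was actually removed.
    boolean deleteJob(String jobType, String correlationId) {
        return rows.remove(new Key(jobType, correlationId)) != null;
    }

    public static void main(String[] args) {
        InMemoryJobStateSketch jobs = new InMemoryJobStateSketch();
        jobs.createJob("ORDER_IMPORT", "imp-1", "alice", 3);
        jobs.markItemCompleted("ORDER_IMPORT", "imp-1");
        // A second createJob for the same key leaves the recorded progress untouched.
        System.out.println(jobs.createJob("ORDER_IMPORT", "imp-1", "alice", 3));
    }
}
```

The real service would return Mono types and persist to job_state; the sketch only shows the contract each call is expected to honor.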

A worked example — coordinating a multi-batch import job:

Java
@Component
@RequiredArgsConstructor
public class OrderImportEventHandler {

    private static final String JOB_TYPE = "ORDER_IMPORT";

    private final ImportWorkerService importWorker;
    private final JobStateService jobStateService;

    @EventHandler
    public Mono<Void> on(OrderImportRequested event) {
        return jobStateService
            .createJob(JOB_TYPE, event.importId(), event.requestedBy(), event.batchCount())
            .doOnSuccess(job -> importWorker.processAsync(event.importId()))
            .then();
    }

    @EventHandler
    public Mono<Void> on(BatchProcessed event) {
        return jobStateService
            .markItemCompleted(JOB_TYPE, event.importId())
            .then();
    }

    @EventHandler
    public Mono<Void> on(OrderImportCompleted event) {
        return jobStateService.deleteJob(JOB_TYPE, event.importId());
    }
}

What sagas were doing implicitly — “find me the saga instance for this importId, mutate its state, decide if it should end” — becomes three plain method calls. The correlation that sagas hide behind @SagaEventHandler(associationProperty = "importId") is now an explicit argument. More verbose; also more debuggable.

A few properties that fall out of this design:

  • Plain Spring beans. No saga lifecycle, no @StartSaga / @EndSaga, no associated lifecycle quirks. The handler is testable with the same Spock fixtures as any other reactive service.
  • DB-native introspection. SELECT * FROM job_state WHERE status = 'IN_PROGRESS' AND created_at < now() - interval '1 hour' finds stuck jobs without any framework-specific tooling.
  • Replayable in the obvious way. The job state is a side-effect projection of events, just like any other read model. Wipe the table, replay the event store, you get the same state back.
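  • Idempotent creation, atomic updates. createJob checks for an existing row before inserting, so re-delivering OrderImportRequested doesn’t create a second job. markItemCompleted is a single SQL statement, so concurrent deliveries can’t lose an update. Note that a bare counter increment is atomic rather than idempotent: a re-delivered BatchProcessed would be counted again unless duplicates are filtered before the call.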

The Deadline Half: Scheduled Queries

The second thing sagas were doing for us was waiting — “in N hours, if X hasn’t happened, do Y.” In v4 you’d express that with DeadlineManager from inside the saga. With sagas gone, where does that go?

Two observations made this easy:

  1. In our use cases, the “deadline” never actually depended on saga in-memory state. It always depended on a database column: when does this job time out, when does this scheduled action become due, when did this resource become eligible for the next state. The state is in the database already.
  2. Delivering a deadline at the exact moment is rarely the actual requirement. “Some time after T, no later than T + N minutes” is almost always good enough.

Given those, the right primitive isn’t an event scheduler — it’s a periodic job that asks the database what’s due:

Java
@Component
@RequiredArgsConstructor
public class StuckImportSweeper {

    private final ImportRepository imports;
    private final CommandGateway commandGateway;

    @Scheduled(cron = "${app.import.sweeper.cron}")
    @SchedulerLock(name = "import_stuck_sweeper")
    public void sweep() {
        imports.findStuckImports()
            .flatMap(stuck -> commandGateway
                .send(new TimeoutImportCommand(stuck.id()))
                // Swallow per-import failures so one failing command
                // doesn't abort the rest of the sweep.
                .onErrorResume(e -> Mono.empty()))
            .then()
            .block(); // @Scheduled doesn't await reactive return values
    }
}

That’s it. ShedLock keeps it singleton across pods. The query decides what’s due. The command does the work. There is no saga to wake up, no deadline to register, no scheduled-event upcaster to maintain across migrations.

This pattern absorbed every former-deadline use case we had: timing out stuck operations, transitioning resources between states once a window has passed, sending periodic notifications to subscribers. None of them needed sub-minute precision; all of them needed durability across pod restarts. A scheduled DB query gives you both.
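
The post doesn't show what findStuckImports does, so the following is only a guess at its shape: an in-memory filter that treats a row as stuck when it is still IN_PROGRESS and older than a timeout. The ImportRow record, the findStuck signature, and the use of created_at as the reference column are assumptions; a real implementation would express the same predicate in SQL.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory version of the "what's due" query (illustration only).
class StuckImportQuerySketch {

    record ImportRow(String id, String status, Instant createdAt) {}

    // Equivalent in spirit to:
    //   WHERE status = 'IN_PROGRESS' AND created_at < :now - :timeout
    static List<ImportRow> findStuck(List<ImportRow> rows, Instant now, Duration timeout) {
        Instant cutoff = now.minus(timeout);
        return rows.stream()
            .filter(r -> "IN_PROGRESS".equals(r.status()))
            .filter(r -> r.createdAt().isBefore(cutoff))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-01T12:00:00Z");
        List<ImportRow> rows = List.of(
            new ImportRow("a", "IN_PROGRESS", now.minus(Duration.ofMinutes(90))),
            new ImportRow("b", "IN_PROGRESS", now.minus(Duration.ofMinutes(30))),
            new ImportRow("c", "COMPLETED", now.minus(Duration.ofHours(3))));
        // Only "a" is both in progress and older than the one-hour timeout.
        System.out.println(findStuck(rows, now, Duration.ofHours(1)));
    }
}
```

Passing the current time in as a parameter keeps the predicate testable without waiting. It also makes the precision trade-off explicit: an item that becomes due at T is picked up by the first sweep after T, so the delay is bounded by roughly one sweep interval plus the sweep's own run time.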

When the Workload Is a Workflow

Some of what lived in our v4 sagas wasn’t really saga-shaped at all — it was workflow-shaped. Long-running, multi-step, with retries, parallel branches, and timers measured in minutes-to-hours. Forcing those into a saga, or later into a JobState row, was always awkward. The strain showed up wherever we wanted explicit retry policies per step, fan-out / fan-in across heterogeneous activities, or durable mid-flight pauses that survive process restarts.

For those, we reach for Temporal. Not as a saga replacement — as the right tool for a different shape of problem. Temporal owns workflow state and replay; we own the activities, which delegate to Axon command dispatch whenever state changes need to be event-sourced. The boundary is clean: Temporal orchestrates, Axon records facts, neither reaches into the other’s state.

The full decision tree we ended up with:

| Shape of the problem | Tool |
| --- | --- |
| Coordinate N events for one logical job; no time dimension | JobStateService row |
| Wait until time T, then act on whatever state the DB holds | @Scheduled + DB query (StuckImportSweeper above) |
| Multi-step orchestration with retries, parallel branches, durable timers | Temporal workflow |

This post is about the first two — the saga-shaped use cases. The Temporal half is a separate story; the only reason it’s on this page at all is that it draws the boundary of where stateful event handlers stop being the right answer. If a workflow needs explicit retry policies and durable timers per step, you do not want to rebuild that on top of @EventHandler and a Postgres table.

The Production Gotcha: Atomic Increments Beat Optimistic Locking

The “obvious” implementation of markItemCompleted is:

Java
public Mono<JobState> markItemCompleted(String jobType, String correlationId) {
    return repo.findByJobTypeAndCorrelationId(jobType, correlationId)
        .flatMap(job -> {
            job.setCompletedItems(job.getCompletedItems() + 1);
            return repo.save(job);
        });
}

This is read-modify-write, and it works exactly as long as events arrive one at a time. The first time you have parallel handlers for events with the same correlation ID — say, fan-out completion events from a batch worker — Spring Data’s optimistic version check rejects every concurrent write but one. The losers retry, the throughput collapses, and the failure mode is silent unless you’re watching OptimisticLockingFailureException counts.
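
The conflict can be reproduced without a database. The sketch below is a simplified model of a version-checked save, not Spring Data's implementation: the OptimisticLockSketch class, its Row record, and StaleVersionException are hypothetical names, and the compare-and-set on an in-memory reference stands in for the version predicate on the UPDATE.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of a version-checked save (illustration only, not Spring Data itself).
class OptimisticLockSketch {

    record Row(int completedItems, long version) {}

    static class StaleVersionException extends RuntimeException {
        StaleVersionException(long version) {
            super("row changed since version " + version + " was read");
        }
    }

    private final AtomicReference<Row> stored = new AtomicReference<>(new Row(0, 0));

    Row load() {
        return stored.get();
    }

    // Read-modify-write: succeeds only if nobody saved since `read` was loaded.
    // Comparing the snapshot reference stands in for the SQL "WHERE version = ?" check,
    // so `read` must be the exact object returned by load().
    void saveIncrement(Row read) {
        Row next = new Row(read.completedItems() + 1, read.version() + 1);
        if (!stored.compareAndSet(read, next)) {
            throw new StaleVersionException(read.version());
        }
    }

    public static void main(String[] args) {
        OptimisticLockSketch table = new OptimisticLockSketch();
        Row seenByA = table.load();   // both handlers read version 0
        Row seenByB = table.load();
        table.saveIncrement(seenByA); // first writer wins
        try {
            table.saveIncrement(seenByB); // second writer holds a stale snapshot
        } catch (StaleVersionException e) {
            table.saveIncrement(table.load()); // the loser has to re-read and retry
        }
        System.out.println(table.load());
    }
}
```

With two writers the cost is one extra round trip. With N handlers finishing at the same moment, each round lets exactly one through and sends the rest back to re-read, which is where the throughput goes.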

The fix is to push the increment into the database:

Java
@Modifying
@Query("""
    UPDATE job_state
    SET completed_items = completed_items + 1,
        version         = version + 1,
        updated_at      = NOW(),
        status = CASE
            WHEN completed_items + 1 + failed_items >= total_items
                  AND failed_items > 0
                THEN 'COMPLETED_WITH_ERRORS'
            WHEN completed_items + 1 + failed_items >= total_items
                THEN 'COMPLETED'
            ELSE status
        END,
        completed_at = CASE
            WHEN completed_items + 1 + failed_items >= total_items THEN NOW()
            ELSE completed_at
        END
    WHERE job_type = :jobType AND correlation_id = :correlationId
    """)
Mono<Long> atomicIncrementCompleted(String jobType, String correlationId);

A single UPDATE does the increment, the version bump, and the status transition — atomically, in one round-trip, with no read-modify-write window. The trade-off: the completion logic now lives in two places (the SQL CASE and any in-memory checks). That’s fine for state transitions this simple; for richer transitions you’d push the whole thing into a stored procedure or accept the optimistic-locking cost and retry.

This is the kind of detail that the v4 saga abstraction would have hidden from us. Writing the SQL by hand is not obviously a regression, though: the contention pattern was always there; the framework was just wrapping it in a retry loop.
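
One way to live with the completion logic existing in two places is to keep the in-memory copy as a small pure function and pin it with unit tests against the same cases the SQL CASE handles. The sketch below is an assumption about how that copy might look; the class and method names are hypothetical, and the status strings are the ones used in the UPDATE statement.

```java
// Hypothetical in-memory mirror of the SQL CASE expression (illustration only).
class CompletionRuleSketch {

    // Arguments are the row's values *before* the increment, matching how
    // PostgreSQL evaluates SET expressions against the old row.
    static String statusAfterCompletedIncrement(
            int completedItems, int failedItems, int totalItems, String currentStatus) {
        boolean allItemsAccountedFor = completedItems + 1 + failedItems >= totalItems;
        if (allItemsAccountedFor && failedItems > 0) {
            return "COMPLETED_WITH_ERRORS";
        }
        if (allItemsAccountedFor) {
            return "COMPLETED";
        }
        return currentStatus;
    }

    public static void main(String[] args) {
        System.out.println(statusAfterCompletedIncrement(1, 0, 3, "IN_PROGRESS"));
        System.out.println(statusAfterCompletedIncrement(2, 0, 3, "IN_PROGRESS"));
        System.out.println(statusAfterCompletedIncrement(1, 1, 3, "IN_PROGRESS"));
    }
}
```

The branch order matters in both copies: the COMPLETED_WITH_ERRORS check has to come first, because the plain COMPLETED condition is also true whenever it is.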

How This Will Compare to Axon 5.2’s Saga Module

As of Axon 5.1, sagas are still on the 5.2 roadmap and have not shipped — the same is true of DeadlineManager, EventScheduler, and upcasters. So the comparison below is forward-looking, against what’s been signaled for 5.2 rather than what we can run today. The natural question, once 5.2 lands: do we tear out JobStateService and adopt the saga module?

Probably not for the use cases this post covers, and probably yes for ones we don’t have today.

What we’d give up by porting back to sagas:

  • Atomic SQL increments. Saga stores serialize the whole saga; high-contention coordinations would either retry or need careful sharding by saga identifier.
  • DB-native introspection. A saga’s state lives in the saga store, not in a table that operations dashboards can SELECT from.
  • Plain testability. Saga test fixtures are powerful but framework-specific; ours are just Spock against a service.
  • The clean split between coordination and waiting. Sagas re-couple them.

What sagas would give us back:

  • Less boilerplate. @SagaEventHandler(associationProperty = "importId") is more compact than jobStateService.markItemCompleted(JOB_TYPE, importId) repeated across handlers.
  • In-saga deadlines. scheduleDeadline(Duration.ofHours(24), "timeout") inside saga logic is a more natural expression of “react to this event, or react to this absence after a delay” than splitting it across a handler and a sweeper.
  • Saga-snapshot-based replay. For sagas that accumulate genuinely complex state (not just “have we seen all the batches yet”), automatic snapshotting is real ergonomic value.

In practice, almost none of our current workloads tip us toward sagas. The shape that would tempt us — a workflow long enough to need durable coordination, with retries and timers — already lives in Temporal, where the abstraction fits properly. The cases that match the saga sweet spot are narrower than they look: a process short-lived enough that spinning up a Temporal workflow is overkill, yet with a deadline that depends on accumulated in-memory state and not just a column we can sweep on a schedule. We don’t currently have one of those. For the workloads in this post, we’d keep what we have.

Back to the Original Question

It’s a sharp question, and the framework historically did couple these concerns — sagas owned both the multi-event coordination and the deadlines that fired against accumulated saga state. So expecting a one-for-one replacement (EventScheduler instead of DeadlineManager) is the natural reading.

What we found was that they aren’t one concern with two API surfaces; they’re three concerns sharing a single home in v4. We replaced all of them, with different tools depending on shape. Multi-event coordination became JobStateService rows. Time-based triggers on DB-resident state became Spring @Scheduled + DB queries. Anything that wanted a real workflow engine — retries, parallel branches, durable timers — went to Temporal. The Axon scheduling primitives sat in an awkward middle: too coupled to saga state to act as a generic scheduler, too thin to express real workflows.

Decoupling those concerns turned out to make each tool clearer about what it’s for. So the short answer to your question: we didn’t pick between event scheduler and deadline manager because the choice itself was the wrong frame.

Closing Thought

The Axon 4 → 5 migration pushed us to redesign every saga we had. Like most redesigns done under migration pressure, the work didn’t feel like progress while we were doing it. The result was a smaller surface area per problem: three purpose-fit tools (a Postgres-backed state service, a @Scheduled sweeper, and Temporal for genuine workflows), each covering one shape of what our sagas had been doing, with better operability and no framework-specific abstractions to teach new engineers.

If the saga module returns in 5.2 with a clean reactive story and inline deadline support, we’ll happily reach for it the next time we’re modeling a process whose deadline depends on accumulated in-memory state and that’s too short-lived to justify a workflow engine. For everything else, “the right tool for each shape” turned out to be the upgrade.

Aside on Axon 5.1. This post is about the saga half of the migration. Orthogonal to that story: 5.1 brought aggregate snapshotting back, added first-class Spring support to AxonTestFixture, and introduced JSpecify nullability annotations. Worth knowing about if you upgraded straight from a 5.0.x release and haven’t tracked the point releases — none of it changes the argument here, but the snapshot return is a real win for hot aggregates with long event histories.


Cover photo by Eric Prouzet on Unsplash.
