Saga Orchestration Beats Choreography by Default

Why I default to orchestrated sagas across distributed transactions, with TypeScript code, compensating actions, and two production incidents that set the rule.

Saturday afternoon at the combat-sports tournament platform I CTO’d in London. A live federation tournament was being broadcast publicly, and the standings page froze at 14:32 local time. The match-events topic on Kafka kept growing on the broker, but the standings-projector consumer group was rebalancing every thirty seconds or so. The federation’s tech contact pinged me directly. We had hundreds of microservices in production and Kafka was the async backbone we’d standardized on. Everything I thought I knew about choreography went on trial that afternoon.

I’ll get to the fix. First, the thing I actually want to argue. After running both patterns across hundreds of services on that federation platform, the branded-mobile-app pipeline at the creator-economy platform I worked at, and a couple of other production systems, my default for any business-critical multi-step distributed flow is orchestration. Choreography earns the work only when the flow is short, the steps are truly independent, and the compensation graph stays trivial. Most real flows don’t stay there for long.

Why 2PC isn’t on the table

My team reached for two-phase commit back when we were carving an order flow out of a monolith. Two weeks in, we threw it out. Coordinator was a single point of failure, blocking locks killed throughput, and when the coordinator died mid-transaction we had orphaned locks across three databases. Sagas trade atomicity for availability. You accept temporary inconsistency and design so the system moves forward to a correct state.
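
A minimal sketch of that trade, under my own assumptions: the `SagaStep` shape and `runSaga` are hypothetical names, and the runner is synchronous only to keep the example short. Each step commits on its own, so the system is briefly inconsistent, and a failure is answered with new compensating actions for the steps that actually finished, newest first.

```typescript
// Hypothetical step shape: every forward action carries its own compensation.
interface SagaStep {
  name: string;
  run: () => void;        // forward action; throws on failure
  compensate: () => void; // semantic undo, issued as a new action
}

type SagaResult =
  | { ok: true }
  | { ok: false; failedAt: string; compensated: string[] };

// Runs steps in order. On the first failure, compensates only the steps
// that completed, in reverse order. Synchronous for brevity.
function runSaga(steps: SagaStep[]): SagaResult {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step);
    } catch {
      const compensated: string[] = [];
      for (const done of completed.reverse()) {
        done.compensate();
        compensated.push(done.name);
      }
      return { ok: false, failedAt: step.name, compensated };
    }
  }
  return { ok: true };
}

// Usage: payment and inventory succeed, shipping fails.
const ledger: string[] = [];
const result = runSaga([
  {
    name: 'payment',
    run: () => { ledger.push('charged'); },
    compensate: () => { ledger.push('refunded'); },
  },
  {
    name: 'inventory',
    run: () => { ledger.push('reserved'); },
    compensate: () => { ledger.push('released'); },
  },
  {
    name: 'shipping',
    run: () => { throw new Error('carrier down'); },
    compensate: () => { ledger.push('cancelled'); },
  },
]);
```

Between 'charged' and 'refunded' the customer really has been charged for an order that will not ship. That window is the price of dropping the coordinator and its locks.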

Choreography in plain code

Choreography means no central brain. Each service listens for events and decides what to do next.

import { Kafka, logLevel } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'order-service',
  brokers: process.env.KAFKA_BROKERS!.split(','),
  logLevel: logLevel.INFO,
});

const producer = kafka.producer({ idempotent: true });
await producer.connect(); // connect once at startup, before the first send

export async function createOrder(input: CreateOrderDto) {
  const order = await orderRepository.save({ ...input, status: 'PENDING' });

  await producer.send({
    topic: 'order.created',
    messages: [
      {
        key: order.id,
        value: JSON.stringify({
          orderId: order.id,
          userId: order.userId,
          items: order.items,
          totalAmount: order.totalAmount,
          schemaVersion: 1,
        }),
        headers: { 'x-trace-id': input.traceId },
      },
    ],
  });

  return order;
}

Payment listens, charges, emits its own event. Inventory listens for payment.completed, reserves stock, confirms or asks for a refund. Standard fan-out, reads clean on a whiteboard.

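import { Kafka } from 'kafkajs';

// paymentGateway is the service's own gateway client, defined elsewhere.
const BROKERS = process.env.KAFKA_BROKERS!.split(',');
const kafka = new Kafka({ clientId: 'payment-service', brokers: BROKERS });
const producer = kafka.producer({ idempotent: true });
const consumer = kafka.consumer({
  groupId: 'payment-service',
  sessionTimeout: 30_000,
  maxWaitTimeInMs: 500,
});

await producer.connect();
await consumer.connect();
await consumer.subscribe({ topic: 'order.created', fromBeginning: false });

await consumer.run({
  autoCommit: false,
  eachMessage: async ({ message, topic, partition, heartbeat }) => {
    const event = JSON.parse(message.value!.toString());
    const idempotencyKey = `order-${event.orderId}`;

    // Only a failed charge becomes payment.failed. The publish sits outside
    // this try so a broker error after a successful charge can't be
    // misreported as a payment failure.
    let outcome: { topic: string; body: Record<string, unknown> };
    try {
      const payment = await paymentGateway.charge({
        userId: event.userId,
        amount: event.totalAmount,
        idempotencyKey,
      });
      outcome = {
        topic: 'payment.completed',
        body: { orderId: event.orderId, paymentId: payment.id },
      };
    } catch (err) {
      outcome = {
        topic: 'payment.failed',
        body: { orderId: event.orderId, reason: (err as Error).message },
      };
    }

    // If this throws, the offset is never committed, the message is
    // processed again, and the idempotency key absorbs the repeated charge.
    await producer.send({
      topic: outcome.topic,
      messages: [{ key: event.orderId, value: JSON.stringify(outcome.body) }],
    });

    await heartbeat();
    // Commit only after the outcome event is on the broker.
    await consumer.commitOffsets([
      { topic, partition, offset: (Number(message.offset) + 1).toString() },
    ]);
  },
});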

Commit-after-process. That detail matters in the war story coming up.
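
Commit-after-process means at-least-once delivery: if the process dies between the charge and the offset commit, the same message comes back. A small sketch of why the idempotency key carries that weight, assuming a gateway that honours keys. `FakeGateway` and `handleOrderCreated` are hypothetical stand-ins, not the real payment client.

```typescript
interface Charge {
  id: string;
  amount: number;
}

// Hypothetical in-memory stand-in for a payment gateway that honours
// idempotency keys: the same key always returns the first charge.
class FakeGateway {
  private readonly byKey = new Map<string, Charge>();
  chargesMade = 0;

  charge(amount: number, idempotencyKey: string): Charge {
    const existing = this.byKey.get(idempotencyKey);
    if (existing) return existing; // replayed message, no second charge
    this.chargesMade++;
    const charge: Charge = { id: `pay-${this.chargesMade}`, amount };
    this.byKey.set(idempotencyKey, charge);
    return charge;
  }
}

interface OrderCreated {
  orderId: string;
  totalAmount: number;
}

// Same key derivation as the consumer: one key per order.
function handleOrderCreated(gateway: FakeGateway, evt: OrderCreated): Charge {
  return gateway.charge(evt.totalAmount, `order-${evt.orderId}`);
}

// Simulated redelivery: the first attempt charged, the process died before
// the offset commit, and the same message is handed out again.
const gateway = new FakeGateway();
const orderEvent: OrderCreated = { orderId: 'o-1', totalAmount: 4200 };
const firstAttempt = handleOrderCreated(gateway, orderEvent);
const replayAttempt = handleOrderCreated(gateway, orderEvent);
```

Both attempts return the same charge and the gateway records a single charge. Without the key, the replay is a double charge.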

I’ve shipped choreography happily for notification fan-out and search-index updates on that federation platform. Three or four consumers, no shared state, each team owns its piece. Adding a channel was a new subscriber. Fine.

For an order flow with five services and a real compensation graph, choreography rots two ways. No single place answers “where is this order right now.” You piece it together from three topics and two DLQs. And services start subscribing across domain boundaries to figure out when to compensate. The event graph turns into spaghetti, and the cyclic dependencies are the exact thing microservices were supposed to remove.

Rebalance loop on tournament day

Back to that Saturday on the federation platform. The standings page was frozen on a live broadcast. First instinct was operational: run kubectl rollout restart deployment/standings-projector and hope the consumers rejoined cleanly. They did, then triggered another rebalance forty seconds later. We were doing the same dance the group was doing on its own.

Real fix came from pulling pod logs side by side. One pod out of six had a different max.poll.interval.ms. Five pods on 300s, the sixth on 60s. The sixth pod was a stale container image. Someone had pushed a config-touching fix without bumping the tag and the deployment had pulled :latest. That pod’s handler made a slow downstream call to a federation-rules service that occasionally took seventy seconds. Past its poll interval, kicked out of the group, rebalance for everyone, rinse and repeat. Cordoned the bad pod and the storm drained in about ninety seconds.
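
The side-by-side comparison is easy to automate. A rough sketch, assuming you can collect each pod's effective setting into a list. The `PodConfig` shape, the pod names, and both helper functions are hypothetical; the numbers are the ones from that afternoon.

```typescript
interface PodConfig {
  pod: string;               // hypothetical pod name
  maxPollIntervalMs: number; // effective max.poll.interval.ms in that pod
}

// Pods whose setting differs from the most common value in the group.
function findDrift(pods: PodConfig[]): PodConfig[] {
  const counts = new Map<number, number>();
  for (const p of pods) {
    counts.set(p.maxPollIntervalMs, (counts.get(p.maxPollIntervalMs) ?? 0) + 1);
  }
  let majority = -1;
  let best = 0;
  counts.forEach((count, value) => {
    if (count > best) {
      best = count;
      majority = value;
    }
  });
  return pods.filter(p => p.maxPollIntervalMs !== majority);
}

// Pods that the slowest expected handler call would push past the poll interval.
function atRisk(pods: PodConfig[], slowestCallMs: number): PodConfig[] {
  return pods.filter(p => slowestCallMs >= p.maxPollIntervalMs);
}

// Five pods on 300s, the sixth on 60s, and a downstream call that can take 70s.
const group: PodConfig[] = [1, 2, 3, 4, 5].map(n => ({
  pod: `standings-projector-${n}`,
  maxPollIntervalMs: 300_000,
}));
group.push({ pod: 'standings-projector-6', maxPollIntervalMs: 60_000 });

const drifted = findDrift(group);
const evictable = atRisk(group, 70_000);
```

Both checks single out the sixth pod. Either one, run at deploy time, would have flagged the stale image before the broadcast did.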

Patch went in over the weekend. Pin image SHAs, never tags, on anything that touches a Kafka consumer group. Commit offsets more frequently with smaller batches. Pull the slow call out of the hot consumer loop. Cost was twelve minutes of stale standings during a live broadcast. The commentators were less understanding than the federation.

The lesson that generalized: in choreography, your saga’s health is the sum of every consumer group’s invariants. One misconfigured pod is a saga-wide outage. There is no orchestrator log to ask.

Orchestration with explicit state

When the compensation graph matters, I reach for an orchestrator. State machine in code, state persisted in PostgreSQL, idempotency keys on every external call. Boring on purpose.

import { Injectable } from '@nestjs/common';

type SagaStatus =
  | 'STARTED'
  | 'PAYMENT_DONE'
  | 'INVENTORY_DONE'
  | 'SHIPPING_DONE'
  | 'COMPLETED'
  | 'COMPENSATING'
  | 'FAILED';

interface SagaState {
  orderId: string;
  status: SagaStatus;
  paymentId?: string;
  shipmentId?: string;
  failedStep?: SagaStatus;
  retryCount: number;
  version: number;
}

@Injectable()
export class OrderSagaOrchestrator {
  constructor(
    private readonly repo: SagaRepository,
    private readonly payments: PaymentClient,
    private readonly inventory: InventoryClient,
    private readonly shipping: ShippingClient,
  ) {}

  async advance(saga: SagaState): Promise<void> {
    try {
      switch (saga.status) {
        case 'STARTED': {
          const p = await this.payments.charge({
            orderId: saga.orderId,
            idempotencyKey: `saga-${saga.orderId}-payment`,
          });
          saga.paymentId = p.id;
          return this.transition(saga, 'PAYMENT_DONE');
        }
        case 'PAYMENT_DONE': {
          await this.inventory.reserve({
            orderId: saga.orderId,
            idempotencyKey: `saga-${saga.orderId}-inventory`,
          });
          return this.transition(saga, 'INVENTORY_DONE');
        }
        case 'INVENTORY_DONE': {
          const s = await this.shipping.arrange({
            orderId: saga.orderId,
            idempotencyKey: `saga-${saga.orderId}-shipping`,
          });
          saga.shipmentId = s.id;
          return this.transition(saga, 'SHIPPING_DONE');
        }
        case 'SHIPPING_DONE':
          return this.transition(saga, 'COMPLETED');
        case 'COMPENSATING':
          return this.compensate(saga);
      }
    } catch (err) {
      await this.onStepFailure(saga, err as Error);
    }
  }

  private async compensate(saga: SagaState) {
    const steps = [
      { when: !!saga.shipmentId, run: () => this.shipping.cancel({
          shipmentId: saga.shipmentId!,
          idempotencyKey: `saga-${saga.orderId}-cancel-shipping`,
        })},
      { when: saga.failedStep !== 'STARTED', run: () => this.inventory.release({
          orderId: saga.orderId,
          idempotencyKey: `saga-${saga.orderId}-release-inventory`,
        })},
      { when: !!saga.paymentId, run: () => this.payments.refund({
          paymentId: saga.paymentId!,
          idempotencyKey: `saga-${saga.orderId}-refund`,
        })},
    ];

    for (const step of steps) {
      if (!step.when) continue;
      await step.run();
    }
    await this.transition(saga, 'FAILED');
  }

  private async onStepFailure(saga: SagaState, err: Error) {
    if (saga.retryCount < 3) {
      saga.retryCount++;
      await new Promise(r => setTimeout(r, 2 ** saga.retryCount * 1000));
      return this.advance(saga);
    }
    saga.failedStep = saga.status;
    // transition() re-enters advance(), which dispatches COMPENSATING to
    // compensate(). Calling compensate() here as well would run every
    // compensation twice.
    await this.transition(saga, 'COMPENSATING');
  }

  private async transition(saga: SagaState, next: SagaStatus) {
    saga.status = next;
    await this.repo.updateWithVersion(saga);
    if (next !== 'COMPLETED' && next !== 'FAILED') {
      await this.advance(saga);
    }
  }
}

Three things to call out. Idempotency keys aren’t optional. The orchestrator can crash mid-step and restart, and every external call must absorb that. State lives in PostgreSQL with an optimistic-locking version column, because in-memory state is how you charge a customer twice. Compensation runs in reverse on the actual steps that completed, not a static list.
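
The version column is the part people skip, so here is a sketch of the check it buys you. This is an in-memory stand-in I made up for illustration, not the real repository: `InMemorySagaRepo` and `VersionedSaga` are hypothetical, and the SQL in the comment is only the rough shape of the equivalent statement.

```typescript
interface VersionedSaga {
  orderId: string;
  status: string;
  version: number;
}

// Hypothetical in-memory stand-in for the PostgreSQL-backed repository.
// The SQL equivalent would be roughly:
//   UPDATE sagas SET status = $1, version = version + 1
//   WHERE order_id = $2 AND version = $3
// with "0 rows updated" treated as a conflict.
class InMemorySagaRepo {
  private readonly rows = new Map<string, VersionedSaga>();

  insert(saga: VersionedSaga): void {
    this.rows.set(saga.orderId, { ...saga });
  }

  load(orderId: string): VersionedSaga {
    const row = this.rows.get(orderId);
    if (!row) throw new Error(`no saga for ${orderId}`);
    return { ...row }; // each caller works on its own copy
  }

  updateWithVersion(saga: VersionedSaga): void {
    const row = this.rows.get(saga.orderId);
    if (!row || row.version !== saga.version) {
      throw new Error(`stale saga ${saga.orderId} at version ${saga.version}`);
    }
    saga.version += 1; // caller's copy stays usable for the next transition
    this.rows.set(saga.orderId, { ...saga });
  }
}

// Two workers pick up the same saga, for example after a restart.
const repo = new InMemorySagaRepo();
repo.insert({ orderId: 'o-1', status: 'STARTED', version: 0 });

const workerA = repo.load('o-1');
const workerB = repo.load('o-1');

workerA.status = 'PAYMENT_DONE';
repo.updateWithVersion(workerA); // wins, row is now at version 1

let conflict = false;
try {
  workerB.status = 'PAYMENT_DONE';
  repo.updateWithVersion(workerB); // still holds version 0
} catch {
  conflict = true;
}
```

The second worker loses the write and has to reload before doing anything else. The idempotency key covers the external call; the version check covers the state transition.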

When orchestration is the only honest answer

The branded-mobile-app pipeline at the creator platform I worked at taught me this one. It shipped native iOS and Android submissions for thousands of creator-owned apps serving millions of end customers. Rails plus Python plus Fastlane plus GitHub Actions, Sidekiq orchestrator on top. It’d been in production six months and felt boring in the good way.

Wednesday morning, pending_apple_review started backing up. By lunch, around 270 customer builds were stuck “Waiting for Review” on App Store Connect, but our pipeline thought they were submitted clean. The Connect API was silently throttling our endpoint and returning 200 OK with a body that looked normal. The submissions were being dropped on their side. Customer support had 80+ tickets in by 2 p.m. Pacific.

First fix made it worse. We already had auto-retry on 5xx and we extended it to retry on “stuck” state too. The store started seeing what looked like duplicate submissions and a chunk of customers ended up with two competing review records and conflicting metadata.

Real fix was a circuit breaker around the submit step that verified state via a separate GET against the Connect resource rather than trusting the POST response. A one-shot reconciliation job, keyed on app_id + version + git_sha, deduped against the store’s source of truth. General rule that stuck: when the upstream is human-moderated (App Review, Play Review, billing disputes), don’t trust the response of a write. Read-after-write against the upstream. Cost was three days of slipped releases and a lot of unhappy creators.
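
The shape of both pieces, as a sketch rather than the code we ran. Everything here is hypothetical: `submitAndVerify`, `reconcile`, the `Submission` type, and the sample data. The write and the read are injected as plain functions, so nothing below claims to model the real store API.

```typescript
interface Submission {
  appId: string;
  version: string;
  gitSha: string;
}

type Verdict = 'confirmed' | 'unconfirmed' | 'rejected';

// Read-after-write: a 2xx from the write is only a claim. The verdict comes
// from a separate read of the upstream resource.
function submitAndVerify(write: () => { status: number }, readBack: () => boolean): Verdict {
  const res = write();
  if (res.status < 200 || res.status >= 300) return 'rejected';
  return readBack() ? 'confirmed' : 'unconfirmed';
}

// Dedupe key from the reconciliation job: app_id + version + git_sha.
const keyOf = (s: Submission): string => `${s.appId}:${s.version}:${s.gitSha}`;

interface Reconciliation {
  missingUpstream: string[];    // we think it was submitted, the store has nothing
  duplicatedUpstream: string[]; // the store holds more than one record
}

function reconcile(local: Submission[], upstream: Submission[]): Reconciliation {
  const upstreamCounts = new Map<string, number>();
  for (const s of upstream) {
    upstreamCounts.set(keyOf(s), (upstreamCounts.get(keyOf(s)) ?? 0) + 1);
  }
  const missingUpstream: string[] = [];
  const duplicatedUpstream: string[] = [];
  const seen = new Set<string>();
  for (const s of local) {
    const key = keyOf(s);
    if (seen.has(key)) continue;
    seen.add(key);
    const count = upstreamCounts.get(key) ?? 0;
    if (count === 0) missingUpstream.push(key);
    else if (count > 1) duplicatedUpstream.push(key);
  }
  return { missingUpstream, duplicatedUpstream };
}

// A silently throttled submit: the write says 200, the read finds nothing.
const verdict = submitAndVerify(() => ({ status: 200 }), () => false);

// Made-up sample data: one build dropped upstream, one submitted twice.
const ours: Submission[] = [
  { appId: 'a1', version: '1.0.0', gitSha: 'abc' },
  { appId: 'a2', version: '2.1.0', gitSha: 'def' },
  { appId: 'a3', version: '1.4.2', gitSha: '123' },
];
const theirs: Submission[] = [
  { appId: 'a1', version: '1.0.0', gitSha: 'abc' },
  { appId: 'a3', version: '1.4.2', gitSha: '123' },
  { appId: 'a3', version: '1.4.2', gitSha: '123' },
];
const report = reconcile(ours, theirs);
```

An 'unconfirmed' verdict goes to reconciliation, not to a blind retry. The blind retry is what produced the duplicate review records in the first place.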

You can do that as choreography. You shouldn’t. The compensation chain crosses both app stores, our billing, our database, and the mobile CX team’s manual cleanup. A central orchestrator with a state machine you can query is the only honest way to keep that flow auditable when an executive walks over and asks what happened to creator X.

My default, and when I switch

Orchestration by default for anything with five or more steps, real money, or a compensation chain that crosses a human-moderated boundary. The state machine pays its rent the first time you have to answer a support ticket. Choreography earns the work when the flow has three or four truly independent consumers, no coordinated rollback, and a new step is genuinely just a new subscriber. The cleanest production setup I’ve shipped is hybrid. Orchestration drives the core order flow, then emits a single order.completed event, and downstream side effects (notifications, search index, analytics) choreograph from there.
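
A toy version of that hybrid boundary, with loud assumptions: `TinyBus` is a hypothetical in-process stand-in for the topic, `runCoreFlow` stands in for the orchestrator, and compensation is left out to keep the sketch short.

```typescript
type Handler = (orderId: string) => void;

// Hypothetical in-process bus standing in for the Kafka topic.
class TinyBus {
  private readonly handlers = new Map<string, Handler[]>();

  subscribe(topic: string, handler: Handler): void {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }

  publish(topic: string, orderId: string): void {
    for (const handler of this.handlers.get(topic) ?? []) handler(orderId);
  }
}

// Orchestrated core: steps run in order under one owner. The single
// order.completed event is emitted only when every step succeeded.
function runCoreFlow(orderId: string, steps: Array<() => void>, bus: TinyBus): boolean {
  try {
    for (const step of steps) step();
  } catch {
    return false; // compensation would run here; omitted in this sketch
  }
  bus.publish('order.completed', orderId);
  return true;
}

// Choreographed edge: each side effect is just a subscriber.
const bus = new TinyBus();
const effects: string[] = [];
bus.subscribe('order.completed', id => { effects.push(`notify:${id}`); });
bus.subscribe('order.completed', id => { effects.push(`index:${id}`); });
bus.subscribe('order.completed', id => { effects.push(`analytics:${id}`); });

const completedFlow = runCoreFlow('o-1', [() => {}, () => {}, () => {}], bus);
const failedFlow = runCoreFlow(
  'o-2',
  [() => {}, () => { throw new Error('inventory short'); }],
  bus,
);
```

The failed order never reaches the subscribers, so none of them needs to know how to compensate. Adding a fourth side effect is one more subscribe call and no change to the core flow.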

Takeaways

Sagas trade atomicity for availability. Design for temporary inconsistency, not against it.
Idempotency keys on every external call. Crashes are routine, double-charges aren’t.
Compensations are forward transactions, not rollbacks. Some aren’t reversible at all.
Choreography is fine for three independent consumers. It rots at five with shared state.
Orchestration wins anywhere the upstream is human-moderated. Read-after-write against the source of truth.
Persist saga state with optimistic locking. In-memory state is how you charge a customer twice.

Thanks for reading. If you’ve got thoughts, send them my way.