Async Messaging with Kafka RabbitMQ and NATS

How I pick between Kafka, RabbitMQ, and NATS for async microservice communication, with production code, broker config, and two war stories from the Kafka backbone I ran for years.

Saturday afternoon. A live combat-sports tournament is being broadcast publicly. The federation and commentators are watching the standings page in real time. Around the third bout, our standings-projector consumer group at the combat-sports tournament platform I CTO’d starts rebalancing every 30 seconds. The match-events topic keeps growing on the broker. The leaderboard freezes at 14:32. Three PagerDuty pages in two minutes, plus a direct ping from the federation’s tech contact.

OK so that outage was the moment I stopped treating “we use Kafka” as an architectural answer. The backbone was fine. The way one consumer pod talked to it was not. A few brokers later, that’s still the part most teams get wrong.

I’ve shipped all three of these in production. Kafka was the standardized async backbone at the federation platform across hundreds of microservices. RabbitMQ I’ve used for routing-heavy notification flows. NATS I reach for when latency matters more than durability and I want a single binary on the cluster. Same job description, very different operational shapes.

What you’re actually picking

Before the feature matrix, decide what you’re optimizing for. Throughput, latency, durability, ordering, and operational budget pull in different directions and you don’t get all of them.

The order I run through in my head:

Do I need replay weeks later? Kafka or JetStream.
Do I need content-based routing with patterns? RabbitMQ.
Do I need sub-millisecond fan-out and minimal ops? NATS.
Is my team five people? Pick the one with the fewest knobs.

Everything else is a detail. The brokers don’t fail you. The misalignment between consumer code and broker semantics fails you.

Kafka, the distributed log

Kafka isn’t a queue. It’s a partitioned, append-only log with consumer groups on top. The mental model has to match or you’ll fight the broker forever.

import { Kafka, logLevel, Partitioners } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'orders-producer',
  brokers: process.env.KAFKA_BROKERS!.split(','),
  ssl: true,
  sasl: {
    mechanism: 'scram-sha-512',
    username: process.env.KAFKA_USER!,
    password: process.env.KAFKA_PASS!,
  },
  logLevel: logLevel.WARN,
});

const producer = kafka.producer({
  createPartitioner: Partitioners.DefaultPartitioner,
  idempotent: true,
  maxInFlightRequests: 5,
  transactionTimeout: 30_000,
});

await producer.connect();

export async function emitOrderCreated(order: Order) {
  await producer.send({
    topic: 'orders.v1',
    acks: -1,
    messages: [
      {
        key: order.userId,
        value: JSON.stringify({
          type: 'ORDER_CREATED',
          version: 1,
          orderId: order.id,
          userId: order.userId,
          items: order.items,
          ts: Date.now(),
        }),
        headers: { 'x-trace-id': order.traceId },
      },
    ],
  });
}

The key is the only thing standing between you and out-of-order processing for a given entity. Same user, same partition, ordered forever. Lose that and you’re debugging a haunted house.

On the consumer side, the dangerous knobs are max.poll.interval.ms, session.timeout.ms, and what you actually do inside eachMessage. Slow handlers blow past the poll interval, the group thinks the pod is dead, the coordinator rebalances, and now every pod is paused. That’s the rebalance loop from the opening.

import { Kafka } from 'kafkajs';

const consumer = kafka.consumer({
  groupId: 'standings-projector',
  sessionTimeout: 30_000,
  heartbeatInterval: 3_000,
  maxWaitTimeInMs: 500,
  maxBytesPerPartition: 1_048_576,
});

await consumer.connect();
await consumer.subscribe({ topic: 'match-events', fromBeginning: false });

await consumer.run({
  autoCommit: false,
  partitionsConsumedConcurrently: 4,
  eachMessage: async ({ topic, partition, message, heartbeat }) => {
    const evt = JSON.parse(message.value!.toString());

    try {
      await projectStandings(evt);
      await consumer.commitOffsets([
        { topic, partition, offset: (BigInt(message.offset) + 1n).toString() },
      ]);
    } catch (err) {
      await heartbeat();
      throw err;
    }
  },
});

War story, the Saturday rebalance loop

Setting. The combat-sports tournament platform I CTO’d, hundreds of microservices, Kafka as the async backbone. Live tournament broadcast. I was acting CTO and got pinged into the war room.

What went wrong. The standings-projector group rebalanced every 30 seconds. The topic backed up. The standings page froze for 12 minutes.

First wrong fix. kubectl rollout restart deployment/standings-projector. The pods rejoined cleanly. Then they rebalanced again 40 seconds later. We were doing the same dance the group was already doing on its own.

Real fix. Side-by-side pod logs. One pod out of six had a different max.poll.interval.ms. 300s on five of them, 60s on the sixth. Honestly, the sixth pod was running a stale image because someone had pushed a config-touching fix without bumping the tag and the deployment had pulled :latest. Its handler did a slow downstream call to a rules service that occasionally took 70s, past its poll interval, so it got kicked out repeatedly. Cordoned the pod, drained the storm in 90 seconds. Over the weekend we SHA-pinned every Kafka-touching deployment, committed offsets in smaller batches, and split the slow downstream call out of the hot consumer loop.

Cost. 12 minutes of stale standings during a live broadcast. The federation was understanding. Commentators were less so. Standing rule from that day, pin image SHAs, never tags, on anything that touches a consumer group. CI fails the deploy if any consumer manifest references :latest.

RabbitMQ, the smart broker

RabbitMQ is where the broker does the routing instead of the consumer. Exchanges, bindings, dead letters, message TTL, priority queues. If I need to route by content or pattern, this is the one I pick. Under 50K messages per second, the operational story is also kinder than Kafka.

import amqp from 'amqplib';

const conn = await amqp.connect(process.env.AMQP_URL!);
const ch = await conn.createConfirmChannel();

await ch.assertExchange('events', 'topic', { durable: true });
await ch.assertExchange('events.dlx', 'direct', { durable: true });

await ch.assertQueue('notifications.us', {
  durable: true,
  arguments: {
    'x-dead-letter-exchange': 'events.dlx',
    'x-dead-letter-routing-key': 'notifications.us.dead',
    'x-max-length': 500_000,
    'x-overflow': 'reject-publish',
    'x-message-ttl': 60_000,
  },
});
await ch.bindQueue('notifications.us', 'events', 'order.*.us');
await ch.bindQueue('notifications.us', 'events', 'payment.*.us');

await ch.prefetch(50);

await ch.consume('notifications.us', async (msg) => {
  if (!msg) return;
  try {
    const evt = JSON.parse(msg.content.toString());
    await sendNotification(evt);
    ch.ack(msg);
  } catch (err) {
    ch.nack(msg, false, false);
  }
});

Dead letter exchanges are non-negotiable. Every failed message has to land somewhere visible. The number of times I’ve seen “we’ll add a DLQ later” turn into “we lost five hours of writes” is too many. Set x-max-length and x-overflow: reject-publish too, or a slow consumer will eat your broker’s memory and start dropping connections.

NATS, the speed demon

NATS is what I reach for when latency is the product. Core NATS is fire-and-forget. JetStream adds durability and replay when you need it. The whole thing is one Go binary, which makes the Kubernetes story easy.

version: "3.9"
services:
  nats:
    image: nats:2.10-alpine
    command:
      - "-js"
      - "-sd"
      - "/data"
      - "-m"
      - "8222"
    ports:
      - "4222:4222"
      - "8222:8222"
    volumes:
      - nats-data:/data
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:8222/healthz"]
      interval: 5s
      timeout: 2s
      retries: 5

volumes:
  nats-data: {}

For a new project that needs Kafka-shaped semantics without Kafka-shaped operations, JetStream is a serious option. I wouldn’t migrate a healthy Kafka cluster to it. But starting fresh on a small team, I’d think about it hard.

Derived indexes need their own freshness metric

Brokers don’t fail in isolation. The systems they feed do.

Setting. Same federation platform. The rankings page was Elasticsearch fronting a rankings-indexer consumer reading off Kafka. PostgreSQL was the system of record. ES was the query layer.

What went wrong. A tournament finished Saturday night. The new champion’s ranking should have updated in minutes. Eight hours later the page still showed the old number one. The athlete noticed before we did and tweeted a screenshot tagging the federation.

First wrong fix. SSH into the indexer pod. Logs were quiet. Restart it. The indexer started reprojecting from a 12-hour-stale checkpoint, did the right thing for new events, didn’t backfill.

Real fix. One-shot reindex job, read all current rankings from Postgres, bulk-write into a new ES index, atomic alias swap. 25 minutes. Root cause of the original drift: the bulk-write client had silently entered a circuit-open state after a transient ES blip overnight. The breaker didn’t have a path back to half-open without restart. Patched it to try half-open every 60s.

Cost. 8 hours of stale rankings during a publicly visible competition. The federation called the following Monday. Lesson in one sentence, derived indexes need their own health metric, not just “is the consumer still consuming.” Measure freshness, not throughput.

How I pick now

Kafka when I need replay, multiple independent consumers on the same stream, and I already have the operational muscle. RabbitMQ when routing logic is the actual product and throughput is moderate. NATS when latency is the budget and I want the smallest cluster footprint I can get away with.

The broker is rarely the hard part. The hard parts are idempotency, consumer-group config drift, schema evolution, and treating derived state like a first-class citizen with its own SLOs.

Takeaways

Pick the broker that matches your hardest constraint, not the one with the best marketing page.
Pin consumer-pod images by SHA. :latest on a consumer group is a rebalance loop waiting for a Saturday.
Dead letter queues and queue-length limits go in on day one, not “after launch.”
At-least-once plus idempotent consumers beats exactly-once for almost every real workload.
Derived indexes have their own freshness SLO. Don’t trust consumer lag as a health signal.
Schema versioning belongs in the payload from message one.

Thanks for reading. If you’ve got thoughts, send them my way.