How I Think About CAP Trade-offs

CAP is not a label you stick on a database. It's a per-endpoint choice I make every week on Aurora, Elasticsearch, and real-time price feeds. Here's how I actually pick.

10:14 a.m. PT, a Tuesday at the creator-economy platform I worked at. The Datadog alert “AuroraReplicaLagMaximum > 60s for 2m” fires. Community feed reads go from a p99 of 120 ms to 8 seconds. I wasn’t on call that week, but the Slack thread had my name on it within minutes because I’d been floating across the Aurora work for Community. By the time I joined, replica lag was at 14 minutes and climbing.

That moment is what the CAP theorem actually feels like in production. You have a multi-terabyte Aurora writer, three reader replicas, a community feed that millions of people refresh on their phones, and 4 minutes to decide whether you serve stale data or no data.

I’m going to skip the part where I define C, A, and P. If you’re reading this you already know. What I want to share is how I actually choose between them when I’m picking a database or designing a read path.

P Is Not Optional

The framing “pick two of three” is misleading. Partitions happen. Network blips, AZ failovers, a noisy neighbor on a shared NIC, a long-running ANALYZE on a hot table starving WAL - all of these look like partitions to your read path. The real question is CP or AP, per use case.

Per use case is the part most teams get wrong. They pick a database, label it “we’re a CP shop because we run Postgres”, and ship every endpoint through the same pattern. That’s how you end up serving the home feed with the same consistency guarantees as checkout. Wrong granularity.

Picking CP When Money Is On The Line

On a real-time trading and charting platform I architected, I designed the data plane for ~10M concurrent live market-data requests. Retail and institutional investors watching tick-level prices. A stale tick is worse than no tick. If a chart shows last hour’s price and a trader acts on it, that’s a real loss for a real human.

So price writes were CP. The write path waited for quorum before acking the client, and any read that could not reach quorum returned an error rather than a stale value. The product surface handled the error - a “reconnecting” badge on the chart, no price displayed, no trade allowed.

Concretely on PostgreSQL, this is what the synchronous-replication config looks like when you actually want CP semantics on writes:

# postgresql.conf on the primary
wal_level = replica
max_wal_senders = 10
synchronous_commit = remote_apply
synchronous_standby_names = 'ANY 2 (replica_a, replica_b, replica_c)'
wal_sender_timeout = 5s

# pg_hba.conf - replicas connect with the replication role
host replication replicator 10.0.0.0/16 scram-sha-256

remote_apply is the strict one. The commit doesn’t return until WAL is applied on at least 2 of 3 named standbys. Latency goes up. Throughput goes down. That’s the cost of CP and you pay it on purpose.

Aurora is a different shape - the storage layer handles durability for you - but the trade-off is the same. Read off a reader endpoint and you can see stale data. For CP-flavored reads I route through the writer and accept the throughput hit.

Here’s the read helper I reach for when an endpoint really cannot tolerate stale data. It fans out across reader endpoints, gets a version stamp from each, and throws unless at least R replies agree on the newest version it saw:

import { Pool } from 'pg';

type Endpoint = { name: string; pool: Pool };

interface VersionedRead<T> {
  row: T;
  rowVersion: number;
}

async function quorumRead<T>(
  endpoints: Endpoint[],
  readQuorum: number,
  query: string,
  params: unknown[],
): Promise<VersionedRead<T>> {
  const attempts = endpoints.map(async (ep) => {
    const client = await ep.pool.connect();
    try {
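      const res = await client.query<{ row_version: number } & T>(query, params);
      // rows.length is safer than rowCount, which newer pg typings allow to be null
      if (res.rows.length === 0) {
        throw new Error(`endpoint ${ep.name} returned empty`);
      }
      // node-postgres returns bigint columns as strings, so coerce before comparing
      return { row: res.rows[0] as T, rowVersion: Number(res.rows[0].row_version) };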
    } finally {
      client.release();
    }
  });

  const settled = await Promise.allSettled(attempts);
  const ok = settled
    .filter((s): s is PromiseFulfilledResult<VersionedRead<T>> => s.status === 'fulfilled')
    .map((s) => s.value);

  if (ok.length < readQuorum) {
    throw new Error(`quorum not reached: got ${ok.length}, needed ${readQuorum}`);
  }

  const maxVersion = Math.max(...ok.map((r) => r.rowVersion));
  const agreeing = ok.filter((r) => r.rowVersion === maxVersion);

  if (agreeing.length < readQuorum) {
    throw new Error(`version mismatch: top version has ${agreeing.length} acks, needed ${readQuorum}`);
  }

  return agreeing[0];
}

W plus R greater than N is the actual lever. The acronym CAP is the headline. Quorum is the knob.
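
To make the lever concrete: with N replicas, a write acknowledged by W of them and a read that consults R of them must share at least one replica whenever W + R > N, so the read is guaranteed to touch a copy that has the latest acknowledged write. A minimal sketch of that arithmetic follows; the function names are mine, for illustration only.

```typescript
// Hypothetical helper: does every read quorum intersect every write quorum?
// n = replicas, w = acks required per write, r = replicas consulted per read.
function quorumsOverlap(n: number, w: number, r: number): boolean {
  const allIntegers = [n, w, r].every((x) => Number.isInteger(x));
  if (!allIntegers || n < 1 || w < 1 || r < 1 || w > n || r > n) {
    throw new RangeError(`invalid quorum config n=${n} w=${w} r=${r}`);
  }
  return w + r > n;
}

// Smallest number of replicas guaranteed to sit in both a read set and a write set.
function minOverlap(n: number, w: number, r: number): number {
  return Math.max(0, w + r - n);
}
```

With N = 3, W = 2, R = 2 the quorums always share at least one replica. With W = 1, R = 1 they may not share any, which is the AP-leaning setting where a read can miss the latest write.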

For the write side, the retry shape matters as much as the storage config. A write that times out is not a write that failed; it’s a write of unknown state. The handler has to treat it that way and stay idempotent under retry:

import { Injectable, Logger } from '@nestjs/common';
import CircuitBreaker from 'opossum';
import { PriceWriteRepo } from './price-write.repo';

@Injectable()
export class PriceTickService {
  private readonly log = new Logger(PriceTickService.name);
  private readonly breaker: CircuitBreaker<[PriceTick], void>;

  constructor(private readonly repo: PriceWriteRepo) {
    this.breaker = new CircuitBreaker(this.commit.bind(this), {
      timeout: 800,
      errorThresholdPercentage: 25,
      resetTimeout: 5_000,
      rollingCountTimeout: 10_000,
    });

    this.breaker.on('open', () => this.log.warn('price write breaker OPEN'));
    this.breaker.on('halfOpen', () => this.log.log('price write breaker HALF_OPEN'));
  }

  async ingest(tick: PriceTick): Promise<void> {
    try {
      await this.breaker.fire(tick);
    } catch (err) {
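      // refuse to ack the upstream feed - we'd rather drop a tick than store one we can't confirm
      // Nest's Logger.error takes the message first, then the stack trace
      const stack = err instanceof Error ? err.stack : String(err);
      this.log.error(`rejected tick ${tick.symbol}#${tick.seq}: breaker or quorum failed`, stack);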
      throw err;
    }
  }

  private async commit(tick: PriceTick): Promise<void> {
    const idempotencyKey = `${tick.symbol}:${tick.sourceTs}:${tick.seq}`;
    await this.repo.insertWithQuorum(tick, idempotencyKey, { writeQuorum: 2 });
  }
}

interface PriceTick {
  symbol: string;
  price: number;
  sourceTs: number;
  seq: number;
}

Breaker opens fast, idempotency key on the wire, the service refuses to ack rather than store something it can’t confirm. On a CP path, dropping is correct. Storing a maybe is not.
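
The idempotency key is what makes a retry after a timeout safe. Below is a minimal in-memory sketch of the contract I’d expect from something like insertWithQuorum. The store, its method names, and the return values are assumptions for illustration, not the real repo.

```typescript
interface Tick {
  symbol: string;
  price: number;
  sourceTs: number;
  seq: number;
}

type InsertOutcome = 'inserted' | 'duplicate';

// Hypothetical stand-in for the write repo: the first write for a key wins,
// and replays of the same key are acknowledged without writing again.
class IdempotentTickStore {
  private readonly byKey = new Map<string, Tick>();

  insert(tick: Tick): InsertOutcome {
    const key = `${tick.symbol}:${tick.sourceTs}:${tick.seq}`;
    if (this.byKey.has(key)) {
      return 'duplicate'; // a retry of a write that did land; safe to ack
    }
    this.byKey.set(key, tick);
    return 'inserted';
  }

  size(): number {
    return this.byKey.size;
  }
}
```

In PostgreSQL the same contract is usually a unique constraint on the key plus INSERT ... ON CONFLICT DO NOTHING, so a replayed tick becomes a no-op instead of a second row.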

Picking AP For Feeds And Search

The same week the Aurora alert fired, I’d happily have served slightly-stale posts. Stale by a few seconds is a non-event for a feed. Stale by 14 minutes is not, but that’s a separate problem. For feeds, search results, leaderboards, recommendation lists, rankings, I default to AP.

On the creator platform, the Community feed reads off Aurora readers. We accept that a brand-new post might take a second or two to show up everywhere. The product is designed around that. We don’t block the writer on cross-AZ replica apply.

The danger with AP is silent staleness. Which brings me to the war story I should have led with.

Elasticsearch Index Drift

This was at the combat-sports tournament platform I CTO’d in London. Rankings were one of the most-trafficked surfaces on the product. Public, federations watching, athletes watching. The rankings page read from Elasticsearch. PostgreSQL was the system of record, and a rankings-indexer consumer read off Kafka and projected ranking events into ES.

A federation tournament finished on a Saturday night. The new champion’s ranking should have updated in minutes. Eight hours later, the page still showed the old number one. The athlete in question had a verified account and noticed before we did. He tweeted a screenshot of the broken page and tagged the federation.

First wrong fix: SSH into the indexer pod, look at logs. Logs were quiet. Restarted the indexer. It cleared the offset and started reprojecting from a checkpoint 12 hours stale. New events flowed again. The old wrong rankings stayed in the index.

Real fix: full reindex from PostgreSQL using a one-shot job that bulk-wrote all current rankings to a new ES index, then atomic-aliased the read index to the new one. Took ~25 minutes. Root cause of the drift was almost dumb. The indexer’s bulk-write client had silently entered “circuit open” after a transient ES blip the night before, with no automated retry path back to closed. Patched it to attempt half-open every 60s.
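
The failure mode was a breaker that could open but never probe its way back. Here is a minimal sketch of a breaker state machine with a timed half-open probe. It is not the indexer’s actual client: the class name, the injected clock, and opening on a single failure are simplifying assumptions, and the 60s default mirrors the patch described above.

```typescript
type BreakerState = 'closed' | 'open' | 'halfOpen';

class ProbingBreaker {
  private state: BreakerState = 'closed';
  private openedAt = 0;
  private readonly probeIntervalMs: number;
  private readonly now: () => number;

  constructor(probeIntervalMs = 60_000, now: () => number = () => Date.now()) {
    this.probeIntervalMs = probeIntervalMs;
    this.now = now;
  }

  // Ask before each bulk write. While open, exactly one probe is let through
  // once probeIntervalMs has elapsed since the breaker opened.
  allowRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.probeIntervalMs) {
      this.state = 'halfOpen';
      return true;
    }
    return this.state === 'closed';
  }

  // A successful call (including a successful probe) closes the breaker.
  recordSuccess(): void {
    this.state = 'closed';
  }

  // Any failure (re)opens the breaker and restarts the probe timer.
  recordFailure(): void {
    this.state = 'open';
    this.openedAt = this.now();
  }

  current(): BreakerState {
    return this.state;
  }
}
```

The property that matters is that "open" is never a terminal state: as long as something keeps calling allowRequest, a probe goes out every interval.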

Cost: 8 hours of stale rankings during a publicly visible competition. One pissed-off athlete on Twitter. A call with the federation. Lesson, in one sentence - derived indexes need their own freshness metric, not just “is the consumer alive.”

That’s the rule I live by on every AP path now. Measure staleness as a first-class signal. If your dashboard shows “indexer is consuming Kafka offsets fine” but does not show “the median row in the read index was written N seconds ago”, you have an outage waiting to happen.
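
A sketch of what that signal could look like: sample documents from the read index, take each one’s write timestamp, and compare the median age against an SLO. The function names and the threshold parameter are assumptions for illustration, and where the timestamps come from (an indexed-at field on each document, say) is hypothetical.

```typescript
// Median age, in seconds, of a sample of write timestamps (epoch millis).
function medianAgeSeconds(writtenAtMs: number[], nowMs: number): number {
  if (writtenAtMs.length === 0) {
    // An empty sample should page someone, not read as "fresh".
    throw new Error('no samples to compute freshness from');
  }
  const ages = writtenAtMs
    .map((t) => Math.max(0, nowMs - t) / 1000)
    .sort((a, b) => a - b);
  const mid = Math.floor(ages.length / 2);
  return ages.length % 2 === 1 ? ages[mid] : (ages[mid - 1] + ages[mid]) / 2;
}

// True when the sampled median age exceeds the freshness SLO.
function breachesFreshnessSlo(writtenAtMs: number[], nowMs: number, sloSeconds: number): boolean {
  return medianAgeSeconds(writtenAtMs, nowMs) > sloSeconds;
}
```

Emitting the median age as a gauge and alerting on the SLO breach catches a live-but-idle indexer, which is the case consumer liveness checks miss.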

What I Actually Reach For

When I’m picking the consistency model for a new endpoint, I run a quick mental triage. Money, identity, inventory, anything where a customer would call support if it was wrong: CP. Stale-is-fine reads, feeds, search, derived views, social signals: AP, with a freshness SLO bolted on.
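
Written down as data, the triage might look like the sketch below. The endpoint classes, the policy shape, and the staleness budgets are hypothetical values for illustration, not numbers from any of the systems above.

```typescript
type EndpointClass = 'money' | 'identity' | 'inventory' | 'feed' | 'search' | 'rankings';

type Policy =
  | { mode: 'CP' } // error out rather than serve stale
  | { mode: 'AP'; maxStalenessMs: number }; // serve stale, but watch the budget

const POLICIES: Record<EndpointClass, Policy> = {
  money: { mode: 'CP' },
  identity: { mode: 'CP' },
  inventory: { mode: 'CP' },
  feed: { mode: 'AP', maxStalenessMs: 5_000 },
  search: { mode: 'AP', maxStalenessMs: 30_000 },
  rankings: { mode: 'AP', maxStalenessMs: 60_000 },
};

type Decision = 'serve' | 'serve-and-alert' | 'reject';

// observedLagMs: how far the replica or index being read trails the system of record.
function decide(endpoint: EndpointClass, observedLagMs: number): Decision {
  const policy = POLICIES[endpoint];
  if (policy.mode === 'CP') {
    return observedLagMs === 0 ? 'serve' : 'reject';
  }
  // AP stays available past the budget, but the freshness SLO fires.
  return observedLagMs <= policy.maxStalenessMs ? 'serve' : 'serve-and-alert';
}
```

The point of the table is the granularity: the policy hangs off the endpoint, not off the database.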

The Aurora reader incident I opened with was a partition in everything but name. A long-running ANALYZE was holding write-side locks and starving WAL. The on-call’s first move was to bump the reader instance class - reasonable, wrong root cause. The fix was to check pg_stat_activity on the writer and kill the ANALYZE; lag drained in ~6 minutes. Same week I shipped a small Ruby helper, db_safe_maintenance.rb, that refuses to run heavy maintenance between 06:00 and 22:00 UTC. The runbook now opens with a literal sentence: “Before touching reader scaling, check pg_stat_activity on the writer.” I’m the reason that sentence is in there.
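
The guard itself is tiny. The original helper is Ruby and is not reproduced here; this is a TypeScript sketch of the same idea with hypothetical function names. It assumes the blocked window is 06:00 inclusive to 22:00 exclusive, UTC.

```typescript
const BLOCK_START_HOUR_UTC = 6; // inclusive (assumed boundary)
const BLOCK_END_HOUR_UTC = 22; // exclusive (assumed boundary)

// Heavy maintenance is only allowed outside the blocked window.
function isMaintenanceAllowed(at: Date): boolean {
  const hour = at.getUTCHours();
  return hour < BLOCK_START_HOUR_UTC || hour >= BLOCK_END_HOUR_UTC;
}

// Call at the top of any maintenance script; throws instead of running in peak hours.
function assertMaintenanceAllowed(job: string, at: Date = new Date()): void {
  if (!isMaintenanceAllowed(at)) {
    throw new Error(
      `${job} refused: heavy maintenance is blocked 06:00-22:00 UTC (now ${at.toISOString()})`,
    );
  }
}
```

Failing loudly is deliberate: a maintenance job that silently skips itself is one more thing nobody notices until the table statistics are a month old.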

22 minutes of degraded experience for millions of customers. No data loss. That’s the shape of a CAP incident in real life. It’s not a clean choice between two letters. It’s a messy decision about which guarantee to relax for the next 4 minutes while you find root cause.

Takeaways

Partitions happen. The real choice is CP or AP, per endpoint, not per system.
Money, identity, inventory: CP. Feeds, search, rankings: AP.
W + R > N is the lever. CAP is the headline.
On Postgres, synchronous_commit = remote_apply is what CP actually costs you.
Derived indexes (ES, materialized views, denormalized caches) need a freshness metric of their own.
Before you scale readers, check what the writer is doing.

Thanks for reading. If you’ve got thoughts, send them my way.