Background Job Architecture in Rails

Sidekiq vs GoodJob from a creator-platform monolith perspective, idempotency that holds up to Apple's retry storms, and the queue metrics that actually predict outages.

It was a Wednesday afternoon at the creator-economy platform I worked at when our pending_apple_review Sidekiq queue started lying to us. A few hundred stuck builds, each one marked submitted: true in our DB. Over at App Store Connect, Apple had no record of any of them. Support tickets were piling in by 2 p.m. Pacific. One creator was already drafting a tweet.

I’d built most of that pipeline with two other engineers. It had been in production for six months and felt boring, in the good way. Then it wasn’t.

The thing that fixed it wasn’t a job backend choice. It was a unique index and a small change to where we acknowledged Apple’s webhook. That afternoon is why I stopped writing “Sidekiq vs GoodJob” articles like feature comparisons. The backend is the easy call. The patterns wrapped around it are where everything actually breaks.

Pick the backend in one minute

If you’ve got Redis in the stack and any meaningful throughput, use Sidekiq. I ran it on a Rails monolith pushing through a branded-mobile-app pipeline for thousands of creator-owned apps serving millions of end customers. Threads, fast pop off Redis lists, plenty of mileage.

If you’re running a smaller Rails service and Redis would just be one more thing to operate, GoodJob. Postgres LISTEN/NOTIFY for pickup, advisory locks for concurrency control, the same database you already babysit. I default to it on side products I CTO, including a logistics-sector hiring platform with a Rails ops surface bolted on.

# app/jobs/match_referral_job.rb
class MatchReferralJob < ApplicationJob
  include GoodJob::ActiveJobExtensions::Concurrency

  good_job_control_concurrency_with(
    perform_limit: 1,
    key: -> { "match-referral-#{arguments.first}" }
  )

  queue_as :matching

  def perform(referral_id)
    referral = Referral.find(referral_id)
    return if referral.matched_at.present?

    MatchingService.new(referral).run!
  end
end

perform_limit: 1 looks like a config knob. It’s actually backed by a Postgres advisory lock on the concurrency key, which keeps two workers from running the job for the same referral at the same time. In Sidekiq land you’d get that guarantee from a Redis-backed lock, and pay for it in Redis Sentinel babysitting time.
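To make the “one job per key at a time” idea concrete, here is a minimal in-process sketch of a try-style lock keyed by string. It is an analogy only: KeyedTryLock is a hypothetical class, it lives in one Ruby process, and it is not how GoodJob is implemented (GoodJob does this work in Postgres). What to do with a job that gets turned away (reschedule it, drop it) is a separate decision the sketch doesn’t make.

```ruby
require "set"

# Hypothetical in-process stand-in for a try-style lock keyed by string.
class KeyedTryLock
  def initialize
    @held  = Set.new
    @guard = Mutex.new
  end

  # Runs the block only if nobody else holds `key`.
  # Returns true if the block ran, false if the key was already taken.
  def with_key(key)
    acquired = @guard.synchronize { @held.add?(key) ? true : false }
    return false unless acquired

    begin
      yield
      true
    ensure
      # Always release, even if the block raised.
      @guard.synchronize { @held.delete(key) }
    end
  end
end

locks = KeyedTryLock.new
ran   = []

outer = locks.with_key("match-referral-42") do
  # A second worker arriving for the same referral is turned away...
  inner_same  = locks.with_key("match-referral-42") { ran << :duplicate }
  # ...while a different referral proceeds.
  inner_other = locks.with_key("match-referral-43") { ran << :other }
  ran << [inner_same, inner_other]
end

# Once the outer block finishes, the key is free again.
reacquired = locks.with_key("match-referral-42") { ran << :later }
```

The key is the unit of exclusion, not the job class: two referrals never block each other, two jobs for the same referral always do.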

Where I would not pick GoodJob: a hot Aurora cluster. Our writer at the creator platform was a multi-terabyte beast and the working set was already tight. Layering high-throughput job traffic on top would have been an own goal. There, Redis earns its operational cost cleanly.

Idempotency is the contract

Back to the Wednesday afternoon. The stuck pending_apple_review queue was a symptom. The disease was upstream of it.

Apple’s SubscriptionRenewal server-to-server notification has a 30-second deadline. Past that, Apple retries. Hard. Our webhook handler did receipt validation and the creator_subscriptions row write inline. Sometimes that took 31 seconds. Apple retried, the retry landed on a different worker, and our handler had no idempotency check, so it created a second subscription row. Across a few thousand customers spread over dozens of customer apps, every card got charged twice that month.

Worse, our first fix was a frontend patch. “Show only the latest subscription per customer.” Cosmetic only. Apple had already moved real money. The creator who’d been drafting a tweet went and posted it.

The real fix went in within a week:

# db/migrate/create_apple_renewal_notifications.rb
class CreateAppleRenewalNotifications < ActiveRecord::Migration[7.1]
  def change
    create_table :apple_renewal_notifications do |t|
      t.string   :original_transaction_id, null: false
      t.string   :notification_uuid,       null: false
      t.jsonb    :payload, null: false, default: {}
      t.datetime :processed_at
      t.timestamps
    end

    add_index :apple_renewal_notifications,
              [:original_transaction_id, :notification_uuid],
              unique: true,
              name: "idx_apple_renewal_notifs_dedup"
  end
end

# app/jobs/handle_apple_renewal_job.rb
class HandleAppleRenewalJob < ApplicationJob
  class Duplicate < StandardError; end

  queue_as :billing
  retry_on ActiveRecord::Deadlocked, wait: :polynomially_longer, attempts: 5
  discard_on Duplicate

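  def perform(raw_payload)
    tx_id = raw_payload.fetch("original_transaction_id")
    uuid  = raw_payload.fetch("notification_uuid")

    # Insert the dedup row and apply the renewal in ONE transaction.
    # If apply_renewal! fails (e.g. a deadlock), the dedup row rolls back
    # too, so the retry is not mistaken for a duplicate and discarded.
    # Assumes both models share the same database connection.
    CreatorSubscription.transaction do
      AppleRenewalNotification.create!(
        original_transaction_id: tx_id,
        notification_uuid:       uuid,
        payload:                 raw_payload,
        processed_at:            Time.current
      )
      apply_renewal!(raw_payload)
    end
  rescue ActiveRecord::RecordNotUnique
    # The unique index rejected the insert: this delivery was already handled.
    raise Duplicate
  end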
end

The endpoint enqueues and returns in under 200 ms, well inside Apple’s window. If Apple retries, the next job tries to insert the same row, hits the unique index, raises RecordNotUnique, and we discard. Dedup lives at the database, where two workers cannot disagree about whose insert wins.
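The “two workers cannot disagree” property is easy to see in a toy version. The sketch below swaps the database for an in-memory set guarded by a mutex; NotificationStore and handle are hypothetical names, and the mutex is only a stand-in for what the unique index does in Postgres. Five concurrent deliveries of the same notification go in, one renewal comes out.

```ruby
require "set"

# Hypothetical in-memory stand-in for a table with a unique index on
# [original_transaction_id, notification_uuid].
class NotificationStore
  class NotUnique < StandardError; end

  def initialize
    @keys  = Set.new
    @guard = Mutex.new
  end

  # Mirrors an INSERT against a unique index: exactly one caller wins per key.
  def insert!(tx_id, uuid)
    @guard.synchronize do
      raise NotUnique unless @keys.add?([tx_id, uuid])
    end
  end
end

store   = NotificationStore.new
applied = Queue.new # thread-safe log of renewals that were actually applied

handle = lambda do |payload|
  store.insert!(payload.fetch("original_transaction_id"),
                payload.fetch("notification_uuid"))
  applied << payload.fetch("notification_uuid") # stands in for apply_renewal!
  :applied
rescue NotificationStore::NotUnique
  :duplicate
end

# Simulate Apple retrying: the same payload delivered five times at once.
payload = { "original_transaction_id" => "tx-1", "notification_uuid" => "n-1" }
results = 5.times.map { Thread.new { handle.call(payload) } }.map(&:value)
```

Whichever thread gets there first wins; the other four see the conflict and walk away. No thread has to ask “has this been processed?” and then act on a stale answer.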

The general rule I now live by: never trust the response of a write to an upstream that retries on its own clock. Apple, Google Play, Stripe, anything human-moderated. Either you read after write against the upstream’s source of truth, or you put a unique constraint on your side and own dedup yourself.

Retries that survive contact

retry: 5, backoff: :exponential looks safe in a tutorial. In production it’s a tiny bomb. A transient downstream blip affecting 10,000 jobs sends 10,000 retries on roughly the same schedule. You just built the thundering herd you were paying queues to avoid.

Two small changes pay for themselves immediately.

class StripeSyncJob < ApplicationJob
  queue_as :billing

  retry_on Net::OpenTimeout, wait: ->(executions) {
    base = 2 ** executions
    base + rand(0..(base / 2.0))
  }, attempts: 6

  retry_on Stripe::RateLimitError, wait: 30.seconds, attempts: 10

  discard_on ActiveRecord::RecordNotFound
  discard_on Customer::Anonymized

  def perform(customer_id)
    customer = Customer.find(customer_id)
    StripeSync.new(customer).run!
  end
end

The jitter is one line of real difference. A pure exponential schedule keeps the herd synchronized: every job that failed at the same moment comes back at the same moment. Adding rand(0..(base / 2.0)) spreads each retry across a window half the size of the base delay. The second thing: discard_on is not a footnote. retry_on is “this might succeed if we try again.” discard_on is “this will never succeed, stop burning capacity.” They’re different decisions. Mixing them up is how you end up with a dead set the size of a small country.
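A quick way to convince yourself: compute the delay for 1,000 jobs that all failed on their fourth execution, with and without jitter. This is a standalone sketch of the same formula as the wait proc above; jittered_wait is a hypothetical helper, and the seeded rng is there only to make the run reproducible.

```ruby
# Same shape as the wait proc: exponential base plus up to 50% jitter.
# `rng` is injectable only so the sketch is reproducible.
def jittered_wait(executions, rng: Random.new)
  base = 2**executions
  base + rng.rand(0.0..(base / 2.0))
end

rng      = Random.new(42)
plain    = Array.new(1_000) { 2**4 }                       # no jitter
jittered = Array.new(1_000) { jittered_wait(4, rng: rng) } # with jitter

# How many distinct one-second slots do the retries land in?
plain_slots    = plain.uniq.size
jittered_slots = jittered.map(&:floor).uniq.size
```

Without jitter, all 1,000 retries land in the same second. With it, they spread across the 16 to 24 second window, and the downstream sees a ramp instead of a wall.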

Orchestration is usually a state machine

Once a quarter someone on a squad proposes a DAG framework. Define dependencies, watch the graph execute, get callbacks on complete. Sidekiq Pro Batches do a focused version. Temporal does the full version. I’ve shipped both.

Honest take: most “orchestration” needs are a bounded operation pretending to be a graph. If you’re drawing seven boxes and arrows, you probably want one row with a state column and one job per transition.

# app/models/apple_submission.rb
class AppleSubmission < ApplicationRecord
  enum :state, {
    pending: 0,
    validated: 10,
    binary_uploaded: 20,
    submitted: 30,
    in_review: 40,
    approved: 50
  }

  def advance!
    case state.to_sym
    when :pending         then ValidateMetadataJob.perform_later(id)
    when :validated       then UploadBinaryJob.perform_later(id)
    when :binary_uploaded then SubmitToAppleJob.perform_later(id)
    when :submitted       then PollAppleReviewJob.set(wait: 30.minutes).perform_later(id)
    end
  end
end

Each job loads the row, checks state, does its one thing, transitions, and calls advance!. Idempotency comes almost for free because every step is conditional on the current state: a duplicate job finds the row already past its step and exits. If a worker dies mid-step, the next try sees the same state and does the same work, so the step’s own side effect still has to be safe to repeat. No DAG library needed. No surprise callbacks. The “graph” is implicit in the enum.
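The state guard each job applies can be sketched in a few lines. Everything here is a hypothetical stand-in: Submission is a plain Struct rather than the ActiveRecord model, and upload_binary_step is an illustrative name, not the body of UploadBinaryJob. It is also single-threaded; in a real app the check and the transition would need a row lock or a conditional UPDATE so two workers can’t both pass the guard.

```ruby
# Hypothetical in-memory stand-in for an AppleSubmission row.
Submission = Struct.new(:state, :uploads)

# One step: it only acts when the row is in the state it expects, so a
# duplicate or late job becomes a no-op instead of a second upload.
def upload_binary_step(submission)
  return :skipped unless submission.state == :validated

  submission.uploads += 1              # the step's one side effect
  submission.state = :binary_uploaded  # transition only after it succeeds
  :advanced
end

sub    = Submission.new(:validated, 0)
first  = upload_binary_step(sub)
second = upload_binary_step(sub) # e.g. a double enqueue
```

The second call returns :skipped because the row has already moved on. The state column is doing the job an idempotency key would otherwise do.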

Reach for batches when you genuinely have N independent items and need exactly one callback when all of them finish. That’s what batches are for. Sequential work with shared context is not it.
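The shape of that guarantee is a countdown. Below is a minimal in-process sketch: CountdownBatch is a hypothetical class, not Sidekiq Pro’s API, and a real batch keeps its counter somewhere durable rather than in a Ruby object. It only shows the contract: N items report in, in any order, from any thread, and the callback fires once.

```ruby
# Hypothetical countdown: fires `on_complete` once, when the last item reports in.
class CountdownBatch
  def initialize(size, &on_complete)
    @remaining   = size
    @on_complete = on_complete
    @guard       = Mutex.new
  end

  def item_done!
    # Decrement and check under the lock so only one caller ever sees zero.
    fire = @guard.synchronize do
      @remaining -= 1
      @remaining.zero?
    end
    @on_complete.call if fire
  end
end

callbacks = Queue.new
batch     = CountdownBatch.new(20) { callbacks << :all_done }

# Twenty independent items finishing concurrently.
20.times.map { Thread.new { batch.item_done! } }.each(&:join)
```

Notice there is no ordering and no shared context between items. The moment item 7 needs the output of item 6, you’re back to a state machine.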

What I actually monitor

Per-class duration and per-class error rate are the easy half. Every dashboard has them. They tell you if a single job class is sick. They do not tell you the queue itself is dying.

The two signals I care about:

Queue depth growth rate. A queue sitting at 50K and steadily draining is fine. A queue at 5K growing 10% per minute is a fire. The alert lives on the derivative, not the absolute. Datadog query, one line of Grafana, however you cut it. The trick is the slope, not the level.
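If your metrics stack makes the derivative awkward, it is a few lines to compute yourself. A minimal sketch, assuming you already sample queue depth on a timer; depth_growth_per_minute and the 5% threshold are hypothetical, and the rate is the compounded per-minute growth between the first and last sample.

```ruby
# samples: [[Time, depth], ...] oldest first.
# Returns fractional growth per minute, compounded (0.10 means +10% per minute).
def depth_growth_per_minute(samples)
  (t0, d0), (t1, d1) = samples.first, samples.last
  minutes = (t1 - t0) / 60.0
  return 0.0 if minutes <= 0 || d0 <= 0

  (d1.to_f / d0)**(1.0 / minutes) - 1.0
end

ALERT_THRESHOLD = 0.05 # hypothetical: page at +5% per minute

t = Time.at(0)

# Deep but draining: 50K down to 45K over five minutes.
draining = depth_growth_per_minute([[t, 50_000], [t + 300, 45_000]])

# Shallow but growing about 10% per minute for five minutes.
growing  = depth_growth_per_minute([[t, 5_000], [t + 300, 8_053]])

draining_alert = draining > ALERT_THRESHOLD
growing_alert  = growing > ALERT_THRESHOLD
```

The 50K queue never pages. The 5K queue does, even though it is a tenth of the size.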

End-to-end latency per queue. Time from enqueue to start of processing. If billing’s p95 is normally 30 seconds and starts trending toward 5 minutes, you’ve got a worker shortage forming before anything’s actually backed up. Catching it there is the difference between a quiet runbook entry and a war room.
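The measurement itself is simple once each job records when it was enqueued and when a worker picked it up. A minimal sketch, assuming you can get those two timestamps per job; percentile and queue_latency_p95 are hypothetical helpers, and the percentile uses the nearest-rank method.

```ruby
# Nearest-rank percentile over a non-empty list of numbers.
def percentile(values, pct)
  sorted = values.sort
  rank   = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# jobs: [{ enqueued_at: Time, started_at: Time }, ...]
# Latency here is enqueue to start of processing, in seconds.
def queue_latency_p95(jobs)
  percentile(jobs.map { |j| j[:started_at] - j[:enqueued_at] }, 95)
end

t = Time.at(0)

# One hundred jobs that waited 1, 2, ... 100 seconds before starting.
jobs = (1..100).map { |n| { enqueued_at: t, started_at: t + n } }
p95  = queue_latency_p95(jobs)
```

Track that number per queue over a sliding window and alert on the trend against its own baseline, not on a fixed ceiling.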

Takeaways

Backend choice between Sidekiq and GoodJob is small. The patterns wrapped around either one are the whole game.
Every job that writes to an external system or to a DB needs an idempotency key. Enforce it at the database with a unique index, not in app code.
Add jitter to exponential backoff. One line of real difference.
Be explicit about discard_on. Don’t let your dead set become an archive.
Most orchestration is a state machine on a row, not a DAG.
Monitor queue depth growth rate and end-to-end latency, not just per-job duration.

Thanks for reading. If you’ve got thoughts, send them my way.