Sidekiq vs GoodJob from a creator-platform monolith perspective, idempotency that holds up to Apple's retry storms, and the queue metrics that actually predict outages.
It was a Wednesday afternoon at the creator-economy platform I worked at when our pending_apple_review Sidekiq queue started lying to us. A few hundred stuck builds, each one marked submitted: true in our DB. Over at App Store Connect, Apple had no record of any of them. By 2 p.m. Pacific the support tickets were piling in. One creator was already drafting a tweet.
I’d built most of that pipeline with two other engineers. It had been in production for six months and felt boring, in the good way. Then it wasn’t.
The thing that fixed it wasn’t a job backend choice. It was a unique index and a small change to where we acknowledged Apple’s webhook. That afternoon is why I stopped writing “Sidekiq vs GoodJob” articles as feature comparisons. The backend is the easy call. The patterns wrapped around it are where everything actually breaks.
If you’ve got Redis in the stack and any meaningful throughput, use Sidekiq. I ran it on a Rails monolith pushing through a branded-mobile-app pipeline for thousands of creator-owned apps serving millions of end customers. Threads, fast pop off Redis lists, plenty of mileage.
If you’re running a smaller Rails service and Redis would just be one more thing to operate, GoodJob. Postgres LISTEN/NOTIFY for pickup, advisory locks for concurrency control, the same database you already babysit. I default to it on side products I CTO, including a logistics-sector hiring platform with a Rails ops surface bolted on.
# app/jobs/match_referral_job.rb
class MatchReferralJob < ApplicationJob
  include GoodJob::ActiveJobExtensions::Concurrency
  good_job_control_concurrency_with(
    perform_limit: 1,
    key: -> { "match-referral-#{arguments.first}" }
  )
  queue_as :matching
  def perform(referral_id)
    referral = Referral.find(referral_id)
    return if referral.matched_at.present?
    MatchingService.new(referral).run!
  end
end
perform_limit: 1 looks like a config knob. It’s actually a Postgres advisory lock that keeps two workers from racing on the same row. That’s the kind of thing you pay Redis Sentinel babysitting time to avoid in Sidekiq land.
Where I would not pick GoodJob: a hot Aurora cluster. Our writer at the creator platform was a multi-terabyte beast and the working set was already tight. Layering high-throughput job traffic on top would have been an own goal. There, Redis earns its operational cost cleanly.
Back to the Wednesday afternoon. The stuck pending_apple_review queue was a symptom. The disease was upstream of it.
Apple’s SubscriptionRenewal server-to-server notification has a 30-second deadline. Past that, Apple retries. Hard. Our webhook handler did receipt validation and the creator_subscriptions row write inline. Sometimes that took 31 seconds. Apple retried, the retry landed on a different worker, and our handler had no idempotency check, so it created a second subscription row. Across a few thousand customers spread over dozens of customer apps, every card got charged twice that month.
Worse, our first fix was a frontend patch. “Show only the latest subscription per customer.” Visible only. Apple had already moved real money. The creator who’d been drafting a tweet went and posted it.
The real fix went in within a week:
# db/migrate/create_apple_renewal_notifications.rb
class CreateAppleRenewalNotifications < ActiveRecord::Migration[7.1]
  def change
    create_table :apple_renewal_notifications do |t|
      t.string :original_transaction_id, null: false
      t.string :notification_uuid, null: false
      t.jsonb :payload, null: false, default: {}
      t.datetime :processed_at
      t.timestamps
    end
    add_index :apple_renewal_notifications,
      [:original_transaction_id, :notification_uuid],
      unique: true,
      name: "idx_apple_renewal_notifs_dedup"
  end
end
# app/jobs/handle_apple_renewal_job.rb
class HandleAppleRenewalJob < ApplicationJob
  class Duplicate < StandardError; end
  queue_as :billing
  retry_on ActiveRecord::Deadlocked, wait: :polynomially_longer, attempts: 5
  discard_on Duplicate
  def perform(raw_payload)
    tx_id = raw_payload.fetch("original_transaction_id")
    uuid = raw_payload.fetch("notification_uuid")
    # Insert the dedup row and apply the renewal in one transaction. If
    # apply_renewal! fails (a deadlock, say), the dedup row rolls back too,
    # so the retry is not mistaken for a duplicate and discarded.
    CreatorSubscription.transaction do
      notification = AppleRenewalNotification.create!(
        original_transaction_id: tx_id,
        notification_uuid: uuid,
        payload: raw_payload
      )
      apply_renewal!(raw_payload)
      notification.update!(processed_at: Time.current)
    end
  rescue ActiveRecord::RecordNotUnique
    # Another delivery of this notification already claimed the row.
    raise Duplicate
  end
end
The endpoint enqueues, returns under 200 ms, well inside Apple’s window. If Apple retries, the next job tries to insert the same row, hits the unique index, raises RecordNotUnique, and we discard. Dedup lives at the database, where two workers cannot disagree about whose insert wins.
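To make the race concrete, here is a small plain-Ruby sketch with no Rails and no database. DedupStore is a hypothetical in-memory stand-in for the unique index (a Mutex-guarded Set plays the part Postgres plays in the real job), and handle_notification is an illustrative name, not code from the pipeline. It only shows the property the job leans on: when several deliveries of the same notification race, exactly one insert wins and the renewal is applied once.

```ruby
require "set"

# Hypothetical in-memory stand-in for the unique index on
# (original_transaction_id, notification_uuid). In production the
# database enforces this; the Mutex here only imitates that atomicity.
class DedupStore
  class NotUnique < StandardError; end

  def initialize
    @keys = Set.new
    @lock = Mutex.new
  end

  # Raises NotUnique if the key was already inserted, as a unique index would.
  def insert!(tx_id, uuid)
    @lock.synchronize do
      raise NotUnique unless @keys.add?([tx_id, uuid])
    end
  end
end

# Illustrative handler: claim the key first, apply the side effect
# only if the claim succeeds.
def handle_notification(store, payload, applied)
  store.insert!(payload.fetch("original_transaction_id"),
                payload.fetch("notification_uuid"))
  applied << payload.fetch("notification_uuid")
  :applied
rescue DedupStore::NotUnique
  :duplicate
end

store = DedupStore.new
applied = Queue.new # thread-safe record of renewals actually applied
payload = { "original_transaction_id" => "tx-1", "notification_uuid" => "n-1" }

# Simulate the same notification arriving five times at once.
results = 5.times.map do
  Thread.new { handle_notification(store, payload, applied) }
end.map(&:value)

puts results.tally.inspect
puts "renewals applied: #{applied.size}"
```

Whichever thread gets there first, the tally comes out as one :applied and four :duplicate. The in-memory lock is only there for the demo; across processes and hosts, only the database index can give that guarantee.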
The general rule I now live by: never trust the response of a write to an upstream that retries on its own clock. Apple, Google Play, Stripe, anything human-moderated. Either you read after write against the upstream’s source of truth, or you put a unique constraint on your side and own dedup yourself.
retry: 5, backoff: :exponential looks safe in a tutorial. In production it’s a tiny bomb. A transient downstream blip affecting 10,000 jobs sends 10,000 retries on roughly the same schedule. You just built the thundering herd you were paying queues to avoid.
Two small changes pay for themselves immediately.
class StripeSyncJob < ApplicationJob
  queue_as :billing
  retry_on Net::OpenTimeout, wait: ->(executions) {
    base = 2 ** executions
    base + rand(0..(base / 2.0))
  }, attempts: 6
  retry_on Stripe::RateLimitError, wait: 30.seconds, attempts: 10
  discard_on ActiveRecord::RecordNotFound
  discard_on Customer::Anonymized
  def perform(customer_id)
    customer = Customer.find(customer_id)
    StripeSync.new(customer).run!
  end
end
The jitter term is the real difference. A bare exponential schedule keeps the herd synchronized. Adding rand(0..(base / 2.0)) desynchronizes it. Active Job’s built-in :polynomially_longer applies a small jitter of its own through the jitter option on retry_on, but a custom wait lambda gets none, so the lambda has to supply it. The second thing: discard_on is not a footnote. retry_on is “this might succeed if we try again.” discard_on is “this will never succeed, stop burning capacity.” They’re different decisions. Mixing them up is how you end up with a dead set the size of a small country.
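A quick way to see the effect, as a plain-Ruby sketch outside Rails. jittered_wait is a hypothetical helper that mirrors the wait lambda above, and the job count, attempt number, and random seed are arbitrary choices for the illustration. It buckets 10,000 retries by the second in which they would fire and compares the busiest second with and without jitter.

```ruby
# Hypothetical helper mirroring the wait lambda above: exponential base
# plus up to 50% random jitter. `rng` is injectable so runs are repeatable.
def jittered_wait(executions, rng: Random.new)
  base = 2**executions
  base + rng.rand(0.0..(base / 2.0))
end

rng = Random.new(42) # fixed seed, chosen arbitrarily
executions = 5       # fifth attempt: base delay of 32 seconds

plain    = Array.new(10_000) { 2**executions }
jittered = Array.new(10_000) { jittered_wait(executions, rng: rng) }

# Bucket retries by the whole second in which they would fire.
plain_peak    = plain.tally.values.max
jittered_peak = jittered.map(&:floor).tally.values.max

puts "without jitter: #{plain_peak} retries land in the busiest second"
puts "with jitter:    #{jittered_peak} retries land in the busiest second"
```

Without jitter all 10,000 retries land in the same second. With it they spread across a 16-second window, so the busiest second carries a small fraction of the load.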
Once a quarter someone on a squad proposes a DAG framework. Define dependencies, watch the graph execute, get callbacks on complete. Sidekiq Pro Batches do a focused version. Temporal does the full version. I’ve shipped both.
Honest take: most “orchestration” needs are a bounded operation pretending to be a graph. If you’re drawing seven boxes and arrows, you probably want one row with a state column and one job per transition.
# app/models/apple_submission.rb
class AppleSubmission < ApplicationRecord
  enum :state, {
    pending: 0,
    validated: 10,
    binary_uploaded: 20,
    submitted: 30,
    in_review: 40,
    approved: 50
  }
  def advance!
    case state.to_sym
    when :pending then ValidateMetadataJob.perform_later(id)
    when :validated then UploadBinaryJob.perform_later(id)
    when :binary_uploaded then SubmitToAppleJob.perform_later(id)
    when :submitted then PollAppleReviewJob.set(wait: 30.minutes).perform_later(id)
    end
  end
end
Each job loads the row, checks state, does its one thing, transitions, and calls advance!. Idempotency comes almost for free because every step is conditional on the current state: a duplicate or late job sees a state it doesn’t expect and does nothing. If a worker dies mid-step, before the transition, the next try sees the same state and does the same work, so the work inside each step still has to be safe to repeat. No DAG library needed. No surprise callbacks. The “graph” is implicit in the enum.
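The shape of one such step, as a plain-Ruby sketch with no Rails involved. Submission is a Struct standing in for the AppleSubmission row, and STEPS and run_step are hypothetical names invented for the illustration. The only point is the guard: a step does its work only when the row is still in the state that step expects.

```ruby
# Plain-Ruby stand-in for the AppleSubmission row.
Submission = Struct.new(:id, :state, :log)

# Hypothetical transition table: the state each step expects,
# and the state it leaves behind.
STEPS = {
  validate_metadata: { from: :pending,         to: :validated },
  upload_binary:     { from: :validated,       to: :binary_uploaded },
  submit_to_apple:   { from: :binary_uploaded, to: :submitted }
}.freeze

# Runs one step only if the row is still in the state that step expects.
# A stale or duplicate delivery sees a different state and does nothing.
def run_step(submission, step)
  rule = STEPS.fetch(step)
  return :skipped unless submission.state == rule[:from]

  submission.log << step       # the step's one piece of work would go here
  submission.state = rule[:to] # transition only after the work succeeded
  :advanced
end

sub = Submission.new(1, :pending, [])

p run_step(sub, :validate_metadata) # => :advanced (first delivery does the work)
p run_step(sub, :validate_metadata) # => :skipped  (duplicate delivery is a no-op)
p run_step(sub, :upload_binary)     # => :advanced
p sub.state                         # => :binary_uploaded
```

Against a real database the check and the transition would need to be atomic, for example a row lock or an UPDATE conditioned on the current state. This sketch leaves that out.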
Reach for batches when you genuinely have N independent items and need exactly one callback when all of them finish. That’s what batches are for. Sequential work with shared context is not it.
Per-class duration and per-class error rate are the easy half. Every dashboard has them. They tell you if a single job class is sick. They do not tell you the queue itself is dying.
The two signals I care about:
Queue depth growth rate. A queue sitting at 50K and steadily draining is fine. A queue at 5K growing 10% per minute is a fire. The alert lives on the derivative, not the absolute. Datadog query, one line of Grafana, however you cut it. The trick is the slope, not the level.
End-to-end latency per queue. Time from enqueue to start of processing. If billing’s p95 is normally 30 seconds and starts trending toward 5 minutes, you’ve got a worker shortage forming before anything’s actually backed up. Catching it there is the difference between a quiet runbook entry and a war room.
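Both signals are a few lines of arithmetic once you have the samples. A minimal plain-Ruby sketch follows. The function names, the sample shapes, and the 5% alert threshold are all assumptions made for the illustration. In practice the depth samples would come from whatever already scrapes your queue sizes, and the timestamps from your job backend.

```ruby
# Percent change in queue depth per minute between two samples
# (a simple average rate). Each sample is [Time, depth].
# Returns nil when the rate is undefined.
def growth_rate_per_minute(older, newer)
  (t0, d0), (t1, d1) = older, newer
  minutes = (t1 - t0) / 60.0
  return nil if minutes <= 0 || d0.zero?

  ((d1 - d0) / d0.to_f) * 100.0 / minutes
end

# Nearest-rank p95 of enqueue-to-start latency, in seconds.
# Each job is a hash with :enqueued_at and :started_at Times.
def p95_pickup_latency(jobs)
  waits = jobs.map { |j| j[:started_at] - j[:enqueued_at] }.sort
  return nil if waits.empty?

  rank = (95 * waits.size + 99) / 100 # integer ceil(0.95 * n)
  waits[rank - 1]
end

now = Time.at(1_700_000_000) # arbitrary fixed timestamp for the example

draining = growth_rate_per_minute([now, 50_000], [now + 60, 49_000])
burning  = growth_rate_per_minute([now, 5_000],  [now + 60, 5_500])

# Twenty jobs that waited 1..20 seconds before a worker picked them up.
jobs = (1..20).map { |s| { enqueued_at: now, started_at: now + s } }

GROWTH_ALERT_PCT = 5.0 # hypothetical threshold; tune per queue
puts "50K draining: #{draining.round(1)}%/min, alert=#{draining > GROWTH_ALERT_PCT}"
puts "5K growing:   #{burning.round(1)}%/min, alert=#{burning > GROWTH_ALERT_PCT}"
puts "p95 pickup latency: #{p95_pickup_latency(jobs)}s"
```

The 50K queue never trips the alert because its slope is negative. The 5K queue trips it immediately, which is the whole argument for alerting on the derivative.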
discard_on. Don’t let your dead set become an archive.
Thanks for reading. If you’ve got thoughts, send them my way.