
Running FFmpeg at scale: queue, retry, and the audit trail

What FFmpeg-in-production actually demands — the queue patterns, retry semantics, and audit-trail design that get a single binary to behave like infrastructure.

By MpegFlow Engineering Team · May 5, 2026 · 11 min read
In this post
  1. The naive setup
  2. Retry semantics aren't trivial for FFmpeg
  3. What a video queue needs
  4. Workers: what to actually scale on
  5. The audit trail problem
  6. The hard parts
  7. A few things to avoid
  8. Where MpegFlow fits

A 4-hour encode job dies at minute 219. The exit code is 1. There's a half-written .mp4 on the worker's local disk, a job row in your database stuck in running, and a webhook that promised the customer their file would be ready in five minutes. Now what?

Almost every team that runs video infrastructure goes through some version of this. FFmpeg is an extraordinary piece of software — but it's a binary, not a service. It doesn't know about your queue, your idempotency model, or what "retry" means in your business. The engineering work isn't getting FFmpeg to encode video. It's getting FFmpeg to behave like infrastructure.

This post is about that operational layer. Not codecs, not bitrates, not ABR ladders. The plumbing. We've been building this for a couple years; here's what we wish someone had told us on day one.

#The naive setup

Every video team starts here:

ffmpeg -i input.mov -c:v libx264 -c:a aac -movflags faststart output.mp4

Wrap that in a Python script. Drop it on a VM. Trigger it from cron, or a job queue, or your existing application worker pool. Stand up a webhook to fire when the file lands in S3.
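
In code, that first version is barely a dozen lines. A minimal sketch of the wrapper, built around the command above:

```python
import subprocess

# The entire "pipeline", day one: run FFmpeg, raise on any nonzero exit.
def transcode(src: str, dst: str) -> None:
    subprocess.run(
        ["ffmpeg", "-i", src, "-c:v", "libx264",
         "-c:a", "aac", "-movflags", "faststart", dst],
        check=True,  # CalledProcessError on failure: the single try/except
    )
```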

This works. It's how almost every video product starts, and it's the right shape for the first 100 jobs a day. The problems creep in around the time you cross some of these thresholds:

  • Jobs longer than ~5 minutes — your queue runner's visibility timeout starts mattering
  • More than one worker box — output paths collide, network drives become bottlenecks
  • Customer-facing SLAs — "running" jobs are now a liability, not just a status
  • Diverse input formats — different codecs trigger different failure modes, your single try/except stops being enough
  • Encoder version drift — you upgrade FFmpeg on one box, output bytes change, customers notice

There's no exact line; it's a slow accumulation of "we should fix that someday." When someday arrives, the work is operational, not encoding. You're rebuilding a workflow engine.

#Retry semantics aren't trivial for FFmpeg

The first instinct is to wrap the FFmpeg call in a retry loop. That's dangerous for FFmpeg in ways it isn't for, say, an HTTP request.

A failed FFmpeg run usually leaves artifacts. A partial .mp4 half-written. Two-pass encoding intermediate files (ffmpeg2pass-0.log, etc.). Maybe a fragmented MP4 with valid moov atom but truncated mdat. If you retry naïvely, you're not retrying — you're competing with the previous run's debris.

There's also a category problem. FFmpeg failures fall into roughly three buckets, and they want different retry strategies:

| Bucket | Examples | Retry strategy |
| --- | --- | --- |
| Transient | Network blip pulling input from S3, OOM kill, worker preempted | Retry with backoff, same parameters |
| Deterministic | Unsupported codec in input, malformed container, missing required stream | Don't retry — surface to user, mark failed_input |
| Operational | Encoder version mismatch, missing font for subtitle burn-in, license expiry | Retry, but on a different worker pool |

The naïve "retry 3 times then dead-letter" treats all of these the same. You end up burning encoder hours retrying jobs that will never succeed, while genuinely transient failures hit the dead letter queue because the limit was too tight.

The principle: inspect FFmpeg's exit code and stderr before deciding whether to retry. Exit code 234 means something. Stderr containing Invalid data found when processing input means something. Build a small classifier that returns one of the three buckets above, and give each bucket its own policy.
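
A sketch of that classifier in Python; the stderr patterns below are illustrative assumptions, not an exhaustive taxonomy — grow them from your own production logs:

```python
import re

# Illustrative patterns only. "Invalid data found when processing input"
# and "moov atom not found" are real FFmpeg error strings; the transient
# patterns are assumed examples of network failures.
DETERMINISTIC = [
    r"Invalid data found when processing input",
    r"moov atom not found",
]
TRANSIENT = [
    r"Connection reset by peer",
    r"Operation timed out",
]

def classify(exit_code: int, stderr: str, oom_killed: bool = False) -> str:
    """Return 'transient', 'deterministic', or 'operational'."""
    if oom_killed:  # e.g. killed under a memory limit: retry on a bigger pool
        return "transient"
    if any(re.search(p, stderr) for p in DETERMINISTIC):
        return "deterministic"  # bad input: mark failed_input, don't retry
    if any(re.search(p, stderr) for p in TRANSIENT):
        return "transient"      # retry with backoff, same parameters
    return "operational"        # unknown: retry once on a different pool
```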

#What a video queue needs

Most generic job queues — RabbitMQ, SQS, Redis-based queues — work fine for video. But there are configuration choices that, if you get them wrong, you'll spend a quarter rediscovering.

Visibility timeout. Set it to at least expected_job_duration × 2, with a hard ceiling near your max acceptable hang time. If your visibility is 5 minutes and your average job runs 4, every long-tail job (slow input, network stall, hot-pool throttling) will get re-delivered to a second worker. You'll have two FFmpeg processes encoding the same job, racing to write to the same output path. We've seen this in production at multiple companies. It's brutal to debug because the symptoms are corrupt outputs, not job failures.
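
On SQS, the usual mitigation is a heartbeat that keeps extending visibility while the encode runs. A sketch, with illustrative interval and extension values:

```python
import threading
import boto3

sqs = boto3.client("sqs")

# Heartbeat sketch: keep pushing the message's visibility out while the
# encode is still running, so a long job is never re-delivered mid-flight.
def hold(queue_url: str, receipt_handle: str, done: threading.Event) -> None:
    while not done.wait(timeout=240):      # wake every 4 minutes until done
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=600,         # reset the clock: 10 more minutes
        )

# Usage: start hold() in a thread before launching FFmpeg; set the
# event in a finally block once the process exits.
```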

Priority lanes. Live and VOD don't share a queue. Live can't wait behind a 4-hour archive transcode; archive transcodes can't preempt live. Run separate queues. If your queue system supports priority natively (RabbitMQ does, SQS doesn't), use it carefully — priority queues can cause low-priority starvation if you don't cap the high-priority share.

Dead letter queues with re-enqueue. A DLQ that nobody looks at is a graveyard. The pattern that works: when a job lands in the DLQ, fire an alert, but also expose a one-click "reprocess" affordance. Most DLQ entries are wrong-version-of-FFmpeg or transient pool issues — re-running them after a deploy fixes 60% without investigation.

The metric to watch. Throughput tells you whether your fleet is busy. Time-in-queue tells you whether your customers are waiting. They're different. A fleet running at 95% throughput with a 30-minute time-in-queue is a fleet that's about to have an incident.

#Workers: what to actually scale on

The instinct is to scale on CPU. That's right for most encodes — libx264 is CPU-bound and parallelizes within a single job up to a point. But video workers have other dimensions you can't ignore.

Memory. FFmpeg's memory profile is wildly variable. A 1080p H.264 transcode runs comfortably in 2GB; a 4K HDR HEVC transcode with HDR10+ metadata pass-through can demand 16GB. Worse, certain input pathologies (broken timestamps, unusual GOP structures) can balloon memory by 5–10× without warning. Set hard memory limits per job; let the OS kill bloated processes; classify the resulting OOM as transient and retry on a higher-memory pool.
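
One way to enforce the hard cap without cgroup machinery is an rlimit set just before exec. A sketch; the 4GiB number is an illustrative default, and preexec_fn is POSIX-only:

```python
import resource
import subprocess

FOUR_GIB = 4 * 1024**3  # illustrative cap, not a recommendation

def run_capped(cmd: list[str]) -> subprocess.CompletedProcess:
    def cap() -> None:
        # Hard address-space limit: allocations beyond it fail, so
        # FFmpeg exits nonzero instead of taking the worker down.
        resource.setrlimit(resource.RLIMIT_AS, (FOUR_GIB, FOUR_GIB))
    return subprocess.run(cmd, capture_output=True, preexec_fn=cap)
```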

GPU. NVENC, QSV, and AMF are 5–20× faster than libx264 at similar quality. If your workload is throughput-bound and quality-tolerant (live encoding, archive ingest), GPU pools are the right answer. If your workload is quality-bound (premium VOD, broadcast), you're staying on CPU. Run separate pools — don't try to make the workers fungible.

Storage co-location. A 50GB input file streamed from S3 to a worker, transcoded locally, then uploaded back is bottlenecked on disk at every stage. Network-attached storage (EFS, GCS FUSE, etc.) sounds elegant but adds latency to every write. Local NVMe per worker, with a clear "stage in / encode / stage out" pattern, is dull but fast.
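
The pattern in sketch form; download() and upload() are hypothetical stand-ins for your object-store client, and /mnt/nvme is an assumed local mount:

```python
import subprocess
import tempfile
from pathlib import Path

def encode_job(input_key: str, output_key: str) -> None:
    # Everything happens on local NVMe; durable storage is touched
    # only at the stage boundaries.
    with tempfile.TemporaryDirectory(dir="/mnt/nvme") as scratch:
        src = Path(scratch) / "input.mov"
        dst = Path(scratch) / "output.mp4"
        download(input_key, src)   # hypothetical helper: stage in
        subprocess.run(
            ["ffmpeg", "-i", str(src), "-c:v", "libx264",
             "-c:a", "aac", "-movflags", "faststart", str(dst)],
            check=True,
        )
        upload(dst, output_key)    # hypothetical helper: stage out
```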

Spot economics. Spot/preemptible instances are 60–90% cheaper than on-demand. They're also a worse fit for video than for almost any other workload, because long jobs lose all their work when preempted. The pattern that works: short jobs on spot, long jobs on on-demand, with a budget-aware scheduler that doesn't put a 4-hour job on a 90-minute-average-life spot pool.
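
The scheduling decision itself is small. A sketch, where the 0.5 safety factor is an assumption rather than a measured threshold:

```python
def pool_for(estimated_runtime_s: float, spot_mean_life_s: float) -> str:
    # Only use spot when the job is comfortably shorter than the pool's
    # average instance lifetime; everything else goes on-demand.
    if estimated_runtime_s < 0.5 * spot_mean_life_s:
        return "spot"
    return "on_demand"
```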

#The audit trail problem

Here is the part most teams under-build, and the part where the operational debt accumulates fastest.

When a customer calls and says "this file looks wrong," what do you need to answer? At minimum:

  • What were the exact encoder parameters? Not the preset name — the actual command line, with every flag, including the FFmpeg version that ran it
  • What input did it process? Hash, size, format probe output
  • Which worker, which pool, which time? With enough fidelity to correlate against fleet metrics from that window
  • What was the full stderr? Compressed, but kept whole. Excerpts mislead
  • Did it succeed on first try, or after a retry? And if retried, why
  • What was emitted? Output file hash, container manifest if applicable, downstream webhook delivery status

Most teams log a subset of this and miss the rest. The miss is usually:

"We have the FFmpeg command but not the FFmpeg version, because we deployed a new container three weeks ago and the old jobs in the audit log were run on a different binary, and we can't tell which without correlating with deploy timestamps."

The fix: make the encoder binary identity a first-class field on every job event. Container hash or compiled FFmpeg version with a deterministic build, attached to the job record at submission time. When a customer later asks "why did this output change," you can answer.
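
Put together, the job record looks something like this sketch. Field names are illustrative, not a schema recommendation:

```python
from dataclasses import dataclass

@dataclass
class JobAuditRecord:
    job_id: str
    argv: list[str]             # the exact command line, every flag
    encoder_identity: str       # container hash or pinned FFmpeg version
    input_sha256: str
    input_probe: dict           # ffprobe output for the input
    worker_id: str
    pool: str
    started_at: str             # fine-grained enough to join fleet metrics
    attempt: int                # >1 means retried; store the reason too
    stderr_artifact_url: str    # the whole stderr, compressed
    output_sha256: str | None   # None until the job commits
    webhook_delivered: bool
```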

The second miss is stderr discarding. Stderr is FFmpeg's status channel — it's where the encode progress lives, where filter graph errors print, where bitstream warnings appear. Teams that strip it down to "last 100 lines" lose the warnings that would have predicted the failure. Compress and keep the whole thing. Storage is cheap; root-cause analysis is not.

#The hard parts

A few things that trip up almost every team.

Encoder version pinning across rolling deploys. You roll out a new FFmpeg version. Half your fleet has v6.0; half has v6.1. Customer X gets jobs randomly assigned to either. Outputs differ subtly. Customer X's QC system flags the difference and you spend a week chasing a non-bug. Solution: every job records the FFmpeg version (or container hash) it ran on. Customer-visible "encoder revision" becomes a real concept.

Output cleanup on cancel or timeout. A job is cancelled mid-encode. The worker has half a file. Now the job's status is cancelled but there's a stray partial output in S3 with no row pointing to it. Multiply by months. Garbage collection on output objects is non-optional. Treat outputs as ephemeral until they're committed by a successful job-completion event.

Partial-success ladders. ABR encodes produce N renditions. Rendition 5 fails (out of memory, say) while 1–4 succeed. Your job is now in a weird state: not fully successful, not entirely failed. Most teams retry the whole ladder. Better: structure the job so each rendition is its own task, with a manifest-emit task that depends on all of them. Now retrying is per-rendition, and your audit trail reflects what actually happened.
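
A sketch of that structure, where Task is a stand-in for whatever your workflow engine calls a node:

```python
from dataclasses import dataclass, field

@dataclass
class Task:                      # stand-in for your engine's task type
    name: str
    depends_on: list["Task"] = field(default_factory=list)

renditions = ["1080p", "720p", "480p", "360p", "240p"]
rungs = [Task(f"encode_{r}") for r in renditions]
manifest = Task("emit_manifest", depends_on=rungs)

# A failed rung retries alone; the manifest task waits for all five,
# and the audit trail records exactly which rendition failed and why.
```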

Quotas on the delivery side. You complete an encode. You need to purge the CDN cache for the customer's playlist URL. Fastly, Akamai, and Cloudflare all rate-limit purges. Bursty completion patterns can hit those limits. The failure mode here is the encode finishing successfully but the customer not seeing it for 30 minutes because the purge queue backed up. Treat delivery as part of the job, not a fire-and-forget side effect.

#A few things to avoid

Logging FFmpeg stderr line-by-line and shipping each line. This is a popular failure mode. FFmpeg emits stderr at high velocity; line-shipping to a centralized log system burns money and slows the worker. Buffer-and-flush per job; ship the whole thing as a single artifact.
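
Buffer-and-flush in sketch form, with gzip as the assumed compression:

```python
import gzip
import subprocess

def run_and_archive(cmd: list[str], artifact_path: str) -> int:
    # Capture stderr in one buffer, compress it, ship one artifact.
    # For very long jobs, stream to a file instead of holding it in memory.
    proc = subprocess.run(cmd, stderr=subprocess.PIPE)
    with gzip.open(artifact_path, "wb") as f:
        f.write(proc.stderr)    # the whole thing, not the last 100 lines
    return proc.returncode
```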

Treating FFmpeg as a black box. You must parse its output. The -progress flag emits a structured key-value stream that's far more reliable than tailing stderr. Use it.
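
The stream is key=value lines, one block per update, each block closed by a progress= line. A minimal reader sketch:

```python
import subprocess
from typing import Iterator

def progress_updates(cmd: list[str]) -> Iterator[dict[str, str]]:
    # Insert the flags right after the program name so they're parsed as
    # global options. -progress pipe:1 sends the structured stream to
    # stdout; -nostats silences the human-oriented stderr ticker.
    proc = subprocess.Popen(
        [cmd[0], "-progress", "pipe:1", "-nostats", *cmd[1:]],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # archive this in production, per above
        text=True,
    )
    update: dict[str, str] = {}
    for line in proc.stdout:
        key, _, value = line.strip().partition("=")
        update[key] = value
        if key == "progress":   # 'continue' or 'end' closes a block
            yield dict(update)
            update.clear()
    proc.wait()
```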

A single shared volume across all workers. Convenient until your fleet is 50 boxes and the volume becomes the bottleneck for everything. Local-first, sync to durable storage at stage boundaries.

Queue runners without ack semantics. A worker grabs a job, starts encoding, the worker process dies. Without ack-on-completion, the job is silently dropped. Not "marked failed" — dropped. The customer's file never appears. The audit log has nothing. Always ack on terminal state, never on grab.
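
With SQS semantics that looks like the sketch below, where run_job is a hypothetical helper that records a terminal status before returning:

```python
import boto3

sqs = boto3.client("sqs")

def work_once(queue_url: str) -> None:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    for msg in resp.get("Messages", []):
        # run_job (hypothetical) returns only after writing a terminal
        # status, succeeded or failed_*, to the job table.
        run_job(msg["Body"])
        # Ack happens here and only here. If the worker dies before
        # this line, the message reappears after the visibility timeout.
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"],
        )
```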

#Where MpegFlow fits

Most of this post is about what you'd build yourself. We built it because you shouldn't have to.

MpegFlow is a video pipeline engine where the queue, retry classifier, audit trail, and worker pool semantics are the product. You bring FFmpeg knowledge — the codec choices, the preset library, the QC rules. We run the orchestration around it. Same binary runs as SaaS or self-hosted, so you can validate locally and graduate to managed without rewriting your jobs.

If you've been on the wrong side of any of the failure modes in this post, the beta cohort is open. We're shipping the encoder MVP this quarter; you'll get an email when your slot can take traffic.

If you want to go deeper on the architectural choice underneath all of this — why we model the whole pipeline as a directed acyclic graph instead of an imperative job spec — that's the topic of our next post. If you're specifically running FFmpeg in Kubernetes and want the deployment-pattern progression (K8s Job per encode → worker Deployment → KEDA queue-depth autoscaling → operator pattern), FFmpeg in Kubernetes walks the four-rung ladder and shows where each one breaks.

The short version: if you're going to build the operational layer for FFmpeg, model it as a DAG. The audit trail, retry semantics, and partial-success handling all become natural consequences of the structure, instead of features you have to remember to add.

Topics
  • FFmpeg
  • Scale
  • Operations
  • Architecture

Related reading

  • Engineering blog
    FFmpeg presets that survive production
    What works, what bites, what to pin across encoder versions
  • Engineering blog
    FFmpeg in Kubernetes: pod, queue, operator
    The four patterns and where each one breaks
  • Engineering blog
    Broadcast workflow: build vs buy
    A decision framework for video teams