
Cost-aware spot-instance encoder pool

Production architecture for running video transcoding on AWS Spot, GCP Preemptible, and Azure Spot instances. Interruption-tolerant queue topology, fleet diversification, atomic upload semantics, and the cost math that makes self-hosted video pipelines beat per-minute pricing at scale.

By MpegFlow Engineering Team · For platform engineers and infrastructure leads optimizing video transcoding costs at scale · 8 min read · 1,601 words · May 9, 2026
In this architecture
  1. Use case in scope
  2. The cost math, made concrete
  3. Architecture overview
  4. Component walkthrough
  5. Coordinator + queue partitioning
  6. Spot fleet diversification
  7. Interruption handling
  8. Bin-packing: chunked encoding for long jobs
  9. On-demand baseline
  10. Failure modes and what they cost you
  11. Companion architectures
  12. What this pattern does not solve
  13. Closing

Spot instances are 60–90% cheaper than on-demand. They're also interruption-prone, with two-minute warnings and no guarantee of availability. For workloads that can't tolerate interruption — your transactional database, your synchronous HTTP path, your real-time decision systems — they're the wrong shape.

For video transcoding, they are exactly the right shape. Encodes are mostly idempotent. Mezzanine assets sit at rest in object storage. A worker that gets interrupted mid-encode loses its in-flight progress, but the job can be re-queued and run on the next available worker without correctness consequences. The economics are straightforward: per-minute pricing from managed services trends toward $0.015–$0.04 per output minute; running your own fleet on spot instances trends toward $0.002–$0.008 per output minute. At any meaningful volume, the gap is large enough to fund a small infrastructure team.

This document is the production architecture for that pattern. It builds on the Kubernetes + KEDA deployment reference and extends it with the spot-specific concerns: interruption handling, fleet diversification, bin-packing, and the on-demand baseline you keep for the work that can't tolerate retry latency.

#Use case in scope

You are running:

  • >1M output minutes / month sustained, where per-minute managed pricing has become uncomfortable
  • A workload that tolerates encode-time variance — most VOD pipelines, archive migration, batch transcoding for distribution; not low-latency live ingest
  • An engineering team comfortable operating Kubernetes and willing to own a fleet
  • A storage layer that sits outside the worker — S3, R2, GCS, MinIO — so an interrupted worker can be replaced without data loss

If your workload is live ingest, real-time packaging, or sub-minute time-to-first-frame requirements, spot is not your shape. Stay on on-demand or reserved.

#The cost math, made concrete

Take a workload of 5M output minutes per month — moderate VOD operator scale.

| Path | Per-minute cost | Monthly bill (5M output minutes) |
| --- | --- | --- |
| AWS MediaConvert (on-demand) | $0.0150 | $75,000 |
| Bitmovin (committed-volume tier) | $0.0090 | $45,000 |
| Self-hosted on-demand (m5.4xlarge) | $0.0055 | $27,500 |
| Self-hosted spot (mixed fleet, 80% spot) | $0.0019 | $9,500 |

The spot delta over managed services is ~$65K/month at this volume. That funds an infrastructure engineer with budget left over. At 50M minutes/month — large OTT or broadcaster scale — the delta is over half a million dollars per month.

These are illustrative numbers, not quotes; your real numbers depend on codec mix (HEVC is ~3× more expensive than H.264 to encode), GPU vs CPU choices, and how aggressive your bin-packing is. But the order of magnitude is right.
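If you want to sanity-check these figures against your own fleet, the arithmetic is short. A minimal sketch, assuming illustrative hourly prices and encode throughput (the numbers below are placeholders, not measurements behind the table above):

```python
# Back-of-the-envelope cost per output minute for a self-hosted encoder fleet.
# Every input here is an illustrative assumption -- substitute your own numbers.

def cost_per_output_minute(hourly_instance_price: float,
                           realtime_factor: float,
                           utilization: float = 0.85) -> float:
    """hourly_instance_price: $/hour for the instance (spot or on-demand)
    realtime_factor: output minutes encoded per wall-clock minute, summed across jobs
    utilization: fraction of wall-clock time the instance spends encoding
    """
    output_minutes_per_hour = 60 * realtime_factor * utilization
    return hourly_instance_price / output_minutes_per_hour

# Hypothetical 16-vCPU instance: ~$0.68/h on-demand vs ~$0.20/h spot,
# sustaining ~4 output minutes per wall-clock minute of H.264.
print(f"{cost_per_output_minute(0.68, 4.0):.4f}")  # ~0.0033 $/output minute (on-demand)
print(f"{cost_per_output_minute(0.20, 4.0):.4f}")  # ~0.0010 $/output minute (spot)
```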

#Architecture overview

flowchart LR
    subgraph "Control plane"
      C[Coordinator API]
      Q[Redis queue<br/>partition: spot-cpu-h264]
      Q2[Redis queue<br/>partition: ondemand-cpu-h264]
      O[MpegFlow Operator]
    end

    subgraph "On-demand baseline (10–20%)"
      OD1[on-demand pool<br/>m5.4xlarge]
    end

    subgraph "Spot fleet (80–90%)"
      S1[spot pool · AZ-a<br/>c5.4xlarge]
      S2[spot pool · AZ-b<br/>c5n.4xlarge]
      S3[spot pool · AZ-c<br/>m5n.4xlarge]
      S4[spot pool · AZ-d<br/>c6i.4xlarge]
    end

    subgraph "Storage (out of worker)"
      ST[(S3 / R2 / GCS<br/>mezzanine + outputs)]
    end

    C -->|enqueue| Q
    C -->|enqueue| Q2
    Q --> S1
    Q --> S2
    Q --> S3
    Q --> S4
    Q2 --> OD1
    S1 -.->|interruption| Q
    S2 -.->|interruption| Q
    S1 -->|read mezzanine| ST
    S1 -->|atomic upload| ST
    OD1 -->|read + write| ST
    O -.->|reconciles WorkerPool CRDs| S1
    O -.->|reconciles WorkerPool CRDs| OD1

The shape: a coordinator that dispatches work onto two classes of queue (spot and on-demand), a fleet of spot pools spread across instance types and availability zones, an on-demand baseline for work that can't wait for retry, and a storage layer that sits entirely outside the worker so interruptions don't lose data.

#Component walkthrough

#Coordinator + queue partitioning

The coordinator dispatches each job to one of two queues based on a priority flag in the job spec:

  • priority: cost → spot queue. Acceptable to retry. May take longer end-to-end if the spot fleet is interrupted heavily during the run.
  • priority: latency → on-demand queue. Bounded retry latency. More expensive per encode, but reliable.

Most VOD batch work flows to the spot queue. Customer-facing flows (an asset needed for a live show in 30 minutes) go to on-demand. The coordinator can promote a job from spot to on-demand if it's been retried more than N times — call this the "stuck job escalation" pattern.
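A minimal sketch of that routing rule, assuming a Redis list per partition and a simplified job dict; the queue names, job fields, and retry threshold here are placeholders, not the coordinator's real interface.

```python
import json
import redis

# Placeholder partition names and escalation threshold; tune for your fleet.
SPOT_QUEUE = "jobs:spot-cpu-h264"
ONDEMAND_QUEUE = "jobs:ondemand-cpu-h264"
MAX_SPOT_RETRIES = 3  # "N" in the stuck-job escalation rule

r = redis.Redis()

def dispatch(job: dict) -> str:
    """Route a job to the spot or on-demand partition.

    priority == "latency" -> on-demand (bounded retry latency)
    priority == "cost"    -> spot, unless it has already been retried more than
                             MAX_SPOT_RETRIES times (stuck-job escalation)
    """
    if job.get("priority") == "latency" or job.get("retries", 0) > MAX_SPOT_RETRIES:
        queue = ONDEMAND_QUEUE
    else:
        queue = SPOT_QUEUE
    r.lpush(queue, json.dumps(job))
    return queue

# Example: a batch VOD encode that tolerates retries lands on the spot partition.
dispatch({"asset_id": "abc123", "priority": "cost", "retries": 0})
```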

#Spot fleet diversification

The single most important pattern for spot economics: diversify across instance types and availability zones. AWS Spot interruptions are correlated by capacity pool (an instance type + AZ pair) — when c5.4xlarge in us-east-1a gets reclaimed, all c5.4xlarges in us-east-1a tend to go together. Spreading the fleet across {c5, c5n, m5n, c6i} × {AZ-a, AZ-b, AZ-c, AZ-d} gives up to sixteen distinct capacity pools to draw from, which reduces the probability of correlated interruption from "the entire fleet evaporates" to "one of sixteen pools gets reclaimed at a time."

The MpegFlow Operator reconciles each pool from a WorkerPool CRD that includes the instance-type + AZ scope. Adding a pool is a CRD apply; the operator handles the Deployment, ScaledObject, and IAM configuration.
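For illustration, here is one way to enumerate that matrix into per-pool specs. The field names below are a hypothetical shape for readability, not the operator's actual WorkerPool CRD schema.

```python
from itertools import product

# Illustrative diversification matrix: 4 instance types x 4 AZs = 16 capacity pools.
INSTANCE_TYPES = ["c5.4xlarge", "c5n.4xlarge", "m5n.4xlarge", "c6i.4xlarge"]
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]

pools = [
    {
        # Hypothetical field names -- map these onto your real WorkerPool CRD.
        "name": f"spot-{itype.split('.')[0]}-{az[-1]}",
        "capacityType": "spot",
        "instanceType": itype,
        "availabilityZone": az,
    }
    for itype, az in product(INSTANCE_TYPES, ZONES)
]

# A correlated reclaim hits one (instance type, AZ) pair at a time,
# i.e. at most 1/len(pools) of the fleet.
print(len(pools))  # 16
```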

#Interruption handling

AWS sends a two-minute interruption warning to instance metadata. Workers poll metadata every 5 seconds; on warning, the worker:

  1. Stops accepting new jobs from the queue.
  2. Marks the in-flight job as "interrupted" in the coordinator (which re-enqueues it).
  3. Drains stderr to the audit log so the partial run is recorded.
  4. Exits cleanly within 90 seconds, well inside the 2-minute window.

Because the worker has no persistent state and the storage layer sits outside (presigned-URL pattern, see strict-broker security), the only loss is the in-flight FFmpeg progress — which the next worker re-runs from scratch. There is no half-written output to clean up because the worker writes outputs to a temp prefix and only renames atomically on completion.
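A sketch of the warning-poll loop, using the EC2 instance metadata service (IMDSv2). The three callbacks are placeholders for however your worker stops pulling work, re-enqueues the in-flight job, and flushes logs.

```python
import time
import requests

IMDS = "http://169.254.169.254"

def imds_token() -> str:
    # IMDSv2 requires a session token before metadata can be read.
    return requests.put(
        f"{IMDS}/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
        timeout=2,
    ).text

def interruption_pending() -> bool:
    # /spot/instance-action returns 404 until AWS schedules a reclaim,
    # then 200 with a JSON body describing the action and time.
    resp = requests.get(
        f"{IMDS}/latest/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
        timeout=2,
    )
    return resp.status_code == 200

def watch_for_interruption(stop_accepting_jobs, requeue_inflight_job, drain_logs):
    # Poll every 5 seconds; on warning, run the drain sequence and exit cleanly.
    while True:
        if interruption_pending():
            stop_accepting_jobs()     # 1. no new work from the queue
            requeue_inflight_job()    # 2. coordinator marks the job "interrupted"
            drain_logs()              # 3. partial-run stderr into the audit log
            return                    # 4. exit well inside the 2-minute window
        time.sleep(5)
```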

#Bin-packing: chunked encoding for long jobs

Spot interruptions are stochastic. A 4-hour encode running on a single worker has a meaningful probability of being interrupted at least once during its run; if you re-run from scratch every time, the expected cost can exceed running the job on on-demand.

The fix: chunk long encodes. Split the input into 5-minute segments, encode each segment as an independent job, package the segments back together. If a single segment is interrupted, only that 5 minutes is lost, not the whole 4 hours. The coordinator handles the chunk DAG; the worker only ever sees segment-shaped work.

Chunk size is a tradeoff: smaller chunks mean less waste per interruption but more orchestration overhead and more output stitching. We've found chunk lengths of 3–10 minutes work for most VOD workloads on H.264 and HEVC.
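To see why chunking matters, here is the expected-cost model in miniature, under two simplifying assumptions: a constant per-hour interruption probability, and the pessimistic case where an interrupted attempt wastes the whole chunk. The 30%-per-hour rate is deliberately aggressive to make the effect visible.

```python
def expected_compute_hours(job_hours: float, chunk_hours: float,
                           hourly_interrupt_rate: float) -> float:
    # Pessimistic model: an interrupted attempt wastes the entire chunk and is
    # re-run from scratch; interruptions are independent from hour to hour.
    p_chunk_survives = (1 - hourly_interrupt_rate) ** chunk_hours
    n_chunks = job_hours / chunk_hours
    return n_chunks * chunk_hours / p_chunk_survives

# 4-hour encode under heavy interruption pressure (illustrative 30%/hour):
print(expected_compute_hours(4, 4, 0.30))       # ~16.7 h -- whole job re-run from scratch
print(expected_compute_hours(4, 5 / 60, 0.30))  # ~4.1 h  -- 5-minute chunks
```

With the hypothetical prices from the cost-math sketch earlier, 16.7 spot hours costs more than 4 on-demand hours; that inversion is the scenario the chunk DAG exists to avoid.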

#On-demand baseline

The on-demand pool exists for two reasons:

  • Stuck-job escalation: when a job has been retried on spot more than N times, the coordinator promotes it to on-demand to break the loop.
  • Latency-sensitive jobs: customer-facing work that has a deadline tighter than the spot retry window. Routed by the priority: latency flag.

Sizing the on-demand pool: 10–20% of total fleet capacity is the right starting point. Below 10%, the stuck-job escalation queue can back up; above 20%, you're not getting the full spot economics benefit. KEDA scales it on its own queue depth.

#Failure modes and what they cost you

| Failure | Frequency (typical) | Cost impact |
| --- | --- | --- |
| Single-instance interruption | 2–5% per hour per instance | One re-run of the in-flight chunk |
| Pool-wide interruption (one AZ + instance type) | A few times per day during high-demand periods | KEDA scales up the surviving pools; minutes of throughput dip |
| Region-wide spot capacity exhaustion | Rare, but real during major events | Coordinator promotes jobs to the on-demand baseline; cost spikes for the duration |
| All pools interrupted simultaneously | Vanishingly rare with proper diversification | Same as above; the on-demand baseline absorbs the work |

The architecture is designed so the worst realistic failure (region-wide spot exhaustion) degrades gracefully into on-demand pricing for the duration, not into outage. Your bill spikes; your customers don't notice.

#Companion architectures

This pattern complements rather than replaces the rest of the MpegFlow architecture set:

  • Kubernetes + KEDA deployment — the base K8s topology this builds on. Read first.
  • Strict-broker security — why workers can be ephemeral without a security cost.
  • Multi-region failover — how spot pools across regions extend this pattern for active-active deployments.
  • Broadcast-grade VOD transcoding — the full pipeline this slots into.

#What this pattern does not solve

Spot pools are a cost strategy, not a latency strategy. If your workload is real-time live encoding, sub-minute time-to-first-frame, or has hard deadlines that can't tolerate retry windows, the spot path is not for you — keep your fleet on on-demand or reserved.

This pattern also assumes a workload size where the operational cost of running the fleet (engineer time, monitoring, on-call) is lower than the cost savings. Below ~1M output minutes/month, managed services usually win on engineer-hours-per-dollar even if they lose on per-minute pricing. The decision framework lives in build vs buy in 2026.

#Closing

The architecture is two-layered: a spot-heavy fleet that does the bulk of the work cheaply, with an on-demand baseline that catches anything spot can't reliably finish. The MpegFlow Operator reconciles each pool from a CRD; the coordinator partitions queues by priority; workers handle interruptions cleanly because they have no persistent state.

For teams running >1M output minutes/month, this pattern is the largest single cost lever available — typically 5–10× cheaper per encoded minute than managed services. For teams below that volume, the operational cost usually outweighs the savings. Knowing where you sit on that curve is the actual point of this document.

If you'd like to walk through whether the math works for your workload specifically, the design partner program includes that analysis as part of onboarding.

Topics
  • reference architecture
  • Kubernetes
  • Spot instances
  • cost optimization
  • Autoscaling