This document describes a production deployment of MpegFlow for a Tier-1 broadcaster running primary VOD encoding at scale. It's the shape of deployment we recommend for teams whose business depends on getting transcoded video out the door reliably, predictably, and with full provenance.
It's a reference architecture — not the only deployment shape. Specific decisions (cloud provider, CDN, DRM partner) are interchangeable; the structural choices (DAG runtime, audit-trail-first, per-pool worker isolation) are the parts we'd push back on if you tried to skip them.
Use case in scope
A broadcaster or OTT operator is producing finished video that needs to land in a CDN as a multi-rendition ABR manifest, with full audit provenance, idempotent retries, and contractual SLA on completion time. Volume is in the band of 100K – 1M transcoded minutes per month, with mixed input from production (file-based ingest, with occasional broadcast-quality master files at 200+ GB).
Out of scope for this architecture:
- Live streaming (separate reference architecture, ships when live primitives ship in 2026 Q3)
- Real-time editing or interactive video
- Sub-1-second latency workloads
- Consumer UGC at YouTube scale
High-level deployment topology
graph TB
subgraph CP["MpegFlow control plane"]
API["REST API (Axum)<br/>:8080"]
GRPC["gRPC Coordinator (Tonic)<br/>:50051"]
WS["WebSocket /ws<br/>(live job events)"]
EVENT["EventBus<br/>(audit + WS + webhooks)"]
end
subgraph PROBE["Probe pool"]
P1["ffprobe worker"]
P2["ffprobe worker"]
end
subgraph ENCODE["Encode pool (KEDA-autoscaled)"]
E1["FFmpeg worker"]
E2["FFmpeg worker"]
EN["..."]
end
subgraph PKG["Package + emit pool"]
PK1["HLS/DASH packager"]
PK2["CDN purge"]
end
subgraph STORAGE["Object storage (S3-compatible)"]
IN["Input bucket<br/>(versioned, 30d retention)"]
INT["Intermediate bucket<br/>(7d lifecycle)"]
OUT["Output bucket<br/>(CDN-fronted)"]
end
DB[("PostgreSQL<br/>jobs · audit · webhooks")]
REDIS[("Redis<br/>queues · sessions")]
P1 -->|"gRPC"| GRPC
E1 -->|"gRPC"| GRPC
PK1 -->|"gRPC"| GRPC
GRPC --> EVENT
API --> EVENT
EVENT --> DB
API --> DB
GRPC --> REDIS
P1 -->|"presigned PUT/GET"| IN
E1 -->|"presigned PUT/GET"| INT
PK1 -->|"presigned PUT"| OUT
The DAG runtime in the control plane orchestrates per-stage scheduling. Each pool is a horizontally scalable group of workers with its own preemption and retry policy. Pools are not fungible — encode pool boxes are CPU-rich, package pool boxes are network-rich, probe boxes can be small.
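To make the orchestration concrete, here is a minimal plain-Python sketch of the three-stage workflow the diagram implies (probe, fan-out encode per rendition, package). The field names and structure are illustrative assumptions, not MpegFlow's job-definition schema.

```python
# Illustrative sketch of the three-stage DAG this topology runs.
# Field names are assumptions, not MpegFlow's job-definition schema.
ABR_LADDER = ["240p", "480p", "720p", "1080p", "2160p"]

workflow = {
    "stages": [
        {"name": "probe", "pool": "probe", "tasks": 1},
        # one encode task per rendition, all downstream of probe
        {"name": "encode", "pool": "encode", "tasks": len(ABR_LADDER),
         "depends_on": ["probe"]},
        # package waits for every rendition before assembling the manifest
        {"name": "package", "pool": "package", "tasks": 1,
         "depends_on": ["encode"]},
    ]
}
```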
Component-by-component
Control plane
What: the MpegFlow runtime, deployed as a stateless service against a Postgres backing store. Exposes the REST/gRPC API your code submits jobs to.
Where: typically deployed on the same cloud as your encoder pools (latency to workers matters for control), with the Postgres in a managed offering (RDS, Cloud SQL, etc.).
Sizing: for the volume band in scope, two control-plane instances behind a load balancer are sufficient. Not the bottleneck.
State: all job records, audit history, encoder version pinning, and webhook delivery state lives in Postgres. Treat it like any other production OLTP database — backups, point-in-time recovery, read replicas if needed for analytics.
Probe pool
Purpose: runs ffprobe against incoming inputs, extracts metadata (codec, resolution, bitrate, audio tracks, GOP structure, color metadata), validates inputs against your accepted-formats policy, classifies for routing.
Why a separate pool: probe is fast and cheap. Mixing it with encode workers wastes encode capacity on second-long jobs. Separate pool means you can over-provision probe capacity for spike absorption.
Sizing: roughly 1 probe worker per 10 active encode workers. Small boxes (2-core, 4 GB RAM) are sufficient.
Encode pool
Purpose: runs FFmpeg for each rendition in the ABR ladder. CPU-bound for libx264 / libx265; GPU acceleration is an option for workloads that can trade some quality-per-bit for throughput.
Why a separate pool: isolation. An OOM event in encoding shouldn't take down probes or packaging. Worker churn in encode is high (long jobs, occasional crashes) — the rest of the system shouldn't see it.
Sizing for 500K minutes/month:
- Assume 5-rendition Professional-tier ladder (240p, 480p, 720p, 1080p, 2160p)
- Aggregate encode minutes: 500K × 5 = 2.5M minutes/month
- Per worker (16-core): ~40K minutes/month sustained
- Required pool size: ~60-70 workers, with headroom for peaks
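The same arithmetic as a small script. The per-worker throughput and the roughly 10% headroom factor are the assumptions to replace with your own measurements.

```python
# Back-of-envelope pool sizing for the figures above.
source_minutes_per_month = 500_000        # transcoded source minutes
renditions = 5                            # 240p .. 2160p ladder
worker_minutes_per_month = 40_000         # sustained, per 16-core worker

encode_minutes = source_minutes_per_month * renditions      # 2,500,000
base_workers = encode_minutes / worker_minutes_per_month    # 62.5
with_headroom = base_workers * 1.10                         # ~69, within the 60-70 band

print(f"aggregate encode minutes/month:  {encode_minutes:,}")
print(f"workers at sustained load:       {base_workers:.0f}")
print(f"workers with ~10% peak headroom: {with_headroom:.0f}")
```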
Key configuration:
- Per-job hard memory limit (kill at 75% of host RAM)
- Encoder version pinned per worker pool (deterministic outputs)
- Local NVMe for stage-in/stage-out, not network storage
- Worker pool tied to a specific FFmpeg container image hash
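As a sketch of what that configuration might look like gathered in one place (the field names are assumptions for illustration, not MpegFlow's config schema):

```python
# Hypothetical encode-pool configuration expressed as plain Python.
# Field names are assumptions, not MpegFlow's config schema.
encode_pool = {
    "name": "encode-x264-v6_0",
    # deterministic outputs: the pool is tied to one FFmpeg image digest
    "ffmpeg_image": "registry.example.com/ffmpeg@sha256:<digest>",
    # per-job hard memory limit: kill the job at 75% of host RAM
    "job_memory_limit_fraction": 0.75,
    # stage-in/stage-out on local NVMe, not network storage
    "scratch_path": "/mnt/nvme/scratch",
    "retry": {"max_attempts": 3, "backoff_seconds": 60},
}
```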
Package + emit pool
Purpose: assembles renditions into HLS/DASH manifests and media fragments, applies DRM signing (when DRM ships), uploads to the output bucket, fires the CDN purge.
Why a separate pool: I/O bound, not CPU bound. Different scaling profile from encode. This is also where CDN-purge rate-limiting bites — you want it isolated so encode throughput isn't held back by CDN backpressure.
Sizing: typically 10% of encode pool size. Network-rich boxes.
Object storage
Three buckets, deliberate roles:
- Input bucket: customer drops finished masters here. Versioned (so an accidental overwrite doesn't lose work). Lifecycle policy: 30-day retention if not yet processed.
- Intermediate bucket: stage-out from encoders, stage-in to packagers. Aggressive lifecycle: delete after 7 days. Should be in the same region as encode workers.
- Output bucket: finished outputs. CDN-fronted. Long retention. Versioned.
S3 / R2 / GCS / on-prem MinIO are interchangeable. Network egress from the encode workers to storage is often a real cost line — co-locate.
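The 7-day rule on the intermediate bucket is plain S3 lifecycle configuration. A minimal boto3 sketch, assuming an S3-compatible endpoint and a placeholder bucket name:

```python
# Minimal boto3 sketch of the 7-day lifecycle rule on the intermediate
# bucket. Bucket name is a placeholder; pass endpoint_url for MinIO / R2.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="mpegflow-intermediate",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-intermediate-after-7-days",
                "Filter": {"Prefix": ""},                 # whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 7},
                # also reap multipart uploads left behind by interrupted stage-outs
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```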
CDN
Out of scope for this architecture in detail, but worth flagging: every CDN rate-limits its purge/invalidation API. Bursty completion patterns (e.g. a large catalog re-encode) can hit those limits even when the encode pool has plenty of capacity. The package pool's job-completion event should be the rate-limited boundary, not the encode pool.
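A minimal sketch of that boundary: a sliding-window limiter in front of the purge call, so bursty completions queue instead of tripping the CDN's limit. The per-minute figure and the cdn_purge callable are assumptions; wire in your CDN client and contractual limit.

```python
# Sliding-window rate limiter in front of the CDN purge call.
import time
from collections import deque

PURGES_PER_MINUTE = 100            # assumed limit; check your CDN contract
_recent: deque = deque()           # monotonic timestamps of recent purges

def purge_when_allowed(cdn_purge, path: str) -> None:
    """Block until a purge slot is free, then call the CDN client."""
    while True:
        now = time.monotonic()
        while _recent and now - _recent[0] > 60:
            _recent.popleft()                    # drop entries outside the window
        if len(_recent) < PURGES_PER_MINUTE:
            break
        time.sleep(60 - (now - _recent[0]))      # wait for the oldest to expire
    _recent.append(time.monotonic())
    cdn_purge(path)
```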
Capacity sizing — worked example
For a broadcaster running roughly 500K transcoded minutes/month:
| Component | Size | Rough monthly cost (cloud) |
|---|---|---|
| Control plane (2 instances + LB) | 2 × 4-core / 8 GB | $400 |
| Postgres (managed, multi-AZ) | medium tier with backups | $400 |
| Probe pool | 5–8 workers (2-core / 4 GB) | $300 |
| Encode pool | 60–70 workers (16-core / 32 GB) | $9,000 – 12,000 |
| Package pool | 6–8 workers (8-core / 32 GB, network-optimized) | $700 |
| Object storage | input + intermediate + output | $1,200 – 2,000 |
| Egress / CDN purge API | varies by traffic shape | $500 – 2,000 |
| Total cloud cost (rough) | | ~$13,000 – 17,000 / month |
Plus your team's operational time on top, plus MpegFlow license fees (TBD at GA — beta cohort runs free).
For comparison, the equivalent workload on AWS Elemental MediaConvert at a 5-rendition Professional ladder is roughly 500K × 5 × $0.015 per output minute ≈ $37,500/month in transcode alone (no storage, no CDN). The break-even math is in our build-vs-buy post.
Security posture
The non-negotiables for broadcast contractual workflows:
Network isolation
- Workers in private subnets. No public ingress.
- Control plane behind your VPN or via private link to your application.
- Object storage with bucket-level IAM scoped to specific worker IAM roles.
- No worker has access to any bucket it doesn't strictly need.
Encryption
- TLS 1.3 between every component pair
- Object storage encrypted at rest with customer-managed KMS keys (CMK) — pre-broadcast content typically requires this
- Encoder workers' local NVMe encrypted at rest (LUKS, EBS volume encryption, GCP CMEK, etc.)
Audit
- Every job records: input hash, ffprobe output, encoder version, full FFmpeg command, stage timestamps, retry history, output hashes
- Audit table is append-only — no UPDATE on beta_audit_events-equivalent tables
- Backed up nightly, retained for the duration of any contractual reporting period (often 7+ years for broadcast)
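For illustration, roughly what one such audit record carries, expressed as plain Python; the field names are assumptions rather than MpegFlow's actual schema, and angle brackets mark placeholders.

```python
# Illustrative shape of one job's audit record (assumed field names).
audit_record = {
    "job_id": "<job id>",
    "input_sha256": "<hash of the source master>",
    "ffprobe": {"codec": "prores", "resolution": "3840x2160", "audio_tracks": 8},
    "encoder_version": "ffmpeg 6.0 (image sha256:<digest>)",
    "ffmpeg_command": "<full command line, verbatim>",
    "stage_timestamps": {"probe": "<ts>", "encode": "<ts>", "package": "<ts>"},
    "retry_history": [],
    "output_sha256": {"240p": "<hash>", "1080p": "<hash>", "2160p": "<hash>"},
}
```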
Vulnerability disclosure
The MpegFlow security disclosure path is security@mpegflow.com (PGP key on /security once that ships). Encoder pool images should be scanned on every deploy via your existing tooling (Trivy, Snyk, etc.).
Compliance considerations
For Tier-1 broadcasters, the relevant frameworks tend to be:
- SOC 2 Type II: MpegFlow's audit window opens 2026 Q4. Until then, design partner deployments operate under bilateral NDA + DPA. SOC 2 customers should target 2026 Q1 GA or later for public-facing production deployments.
- GDPR (EU subjects): controlled by your input/output regions. MpegFlow itself is content-agnostic and processes only the metadata you submit (job parameters, audit fields). For EU subject data, deploy in EU regions; sub-processor list documented on /trust.
- MPA / TPN best practices: for pre-release content, an air-gap deployment of the self-hosted distribution is the right shape (separate reference architecture; planned for 2026 Q4 once self-hosted ships).
- Broadcast-specific contractual requirements: content-restriction, geographic-residency, watermarking, and forensic-audit requirements vary by contract. The architecture above supports each — talk to us during onboarding for specifics.
Operational runbook
The few things that hurt at scale and how to handle them:
Encoder version drift across rolling deploys
The pattern: you deploy encoder v6.1; existing jobs running on v6.0 finish; new jobs start hitting v6.1. Some bitstream parameters subtly change. QC catches it three weeks later. Solution: pin encoder version per job at submission, not per worker. The control plane records the FFmpeg version each job ran on. New deploys get a new pool tier; old jobs finish on the old tier; nothing crosses.
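A hedged sketch of what per-job pinning looks like at submission time; the payload shape below is an assumption for illustration, not MpegFlow's published API. The point is that the pin travels with the job, not the worker.

```python
# Hypothetical job-submission payload with the encoder pinned per job.
job_request = {
    "input_uri": "s3://mpegflow-input/masters/episode-042.mxf",
    "workflow": "abr-ladder-professional",
    # pinned at submission: this job stays on the v6.0 pool tier even if an
    # encode-v6.1 pool rolls out while the job is still queued
    "encoder": {"ffmpeg_version": "6.0", "pool_tier": "encode-x264-v6_0"},
}
# POST this to the control plane's REST API (:8080); the resulting audit
# record carries the FFmpeg version the job actually ran on.
```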
Output cleanup on cancel
When a job is cancelled mid-encode, partial outputs in the intermediate bucket are orphaned. Lifecycle policies catch most of them (7-day deletion); for stricter cleanup, the control plane fires a cleanup task on every terminal-state job event.
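A sketch of the stricter path, assuming intermediate objects are keyed by job ID under a jobs/ prefix (the bucket name and prefix layout are assumptions):

```python
# On a terminal job event, delete the job's prefix from the intermediate
# bucket instead of waiting out the 7-day lifecycle rule.
import boto3

def cleanup_intermediate(job_id: str, bucket: str = "mpegflow-intermediate") -> None:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=f"jobs/{job_id}/"):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:                                  # delete_objects rejects empty lists
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
```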
Partial-success ABR ladders
If 4 of 5 renditions succeed and the 5th OOMs repeatedly, the package stage is blocked. Two policies: (a) wait, retry the failing rendition on a higher-memory pool; (b) emit the manifest with the 4 successful renditions and flag the missing one for ops review. Default is (a) with a 3-retry cap, then escalate to (b). Configurable per workflow.
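A minimal sketch of that default policy, assuming a hypothetical higher-memory pool name:

```python
# Default partial-ladder policy: retry on a higher-memory pool up to the cap,
# then emit a partial manifest and flag the gap for ops review.
RETRY_CAP = 3

def resolve_partial_ladder(succeeded: list[str], failed: str, attempts: int) -> dict:
    if attempts < RETRY_CAP:
        return {"action": "retry", "rendition": failed, "pool": "encode-highmem"}
    return {
        "action": "emit_partial",
        "renditions": succeeded,                      # manifest ships with these
        "ops_flag": f"missing rendition: {failed}",
    }

# 2160p has OOMed three times: ship the 4-rendition manifest and page ops
print(resolve_partial_ladder(["240p", "480p", "720p", "1080p"], "2160p", attempts=3))
```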
Spot instance economics
Encode pool can run 50–70% on spot/preemptible instances if you're tolerant of job preemption. The trade-off: longer jobs become harder to put on spot (more chance of preemption mid-encode). Heuristic: jobs predicted < 90 minutes go on the spot pool first; longer jobs stay on on-demand.
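The heuristic as a one-line routing policy; the pool names and the 90-minute cutoff are the assumptions to tune.

```python
# Route jobs to spot or on-demand by predicted duration.
SPOT_CUTOFF_MINUTES = 90

def choose_pool(predicted_minutes: float) -> str:
    return "encode-spot" if predicted_minutes < SPOT_CUTOFF_MINUTES else "encode-ondemand"

print(choose_pool(42))    # encode-spot
print(choose_pool(135))   # encode-ondemand
```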
Scope and companion architectures
This document is specifically the VOD transcoding pipeline. Adjacent concerns each have their own answer:
- Global redundancy and failover → see multi-region failover for the geographic-redundancy pattern that wraps this single-region deployment.
- Multi-tenant isolation → see the strict-broker security architecture for the credential-isolation model that this deployment runs on top of.
- Live streaming origin → ships in 2026 Q3 alongside SRT/RTMP ingest primitives. Companion architecture lands then.
- DRM packaging → pairs with established partners (Vualto, EZDRM, etc.) via SPEKE today; native packaging on the GA roadmap.
- Player + analytics → out of scope by design. Pair with Bitmovin Player, Mux Data, JW Player, or your existing player stack — MpegFlow is the encoding/orchestration layer.
- Editorial / production management → integrates with Adobe Premiere Pro Production, Iconik, EditShare, IMS, etc. We're the engine; your DAM is the cockpit.
How to evaluate this architecture for your team
If you're at a Tier-1 broadcaster or OTT operator considering this shape:
- Map your current workflow to the components above. Where do you have a substitute (your own encoder pool, your own audit table)? Where is the substitute working well? Where is it the bottleneck?
- The biggest friction points we hear: per-rendition retry, encoder version pinning, audit-trail provenance. Score yours.
- The biggest blockers we hear: SOC 2 timing, DRM coverage, live streaming. We're honest about where we're not there yet — see the schedule above.
- If the architecture maps to where you'd want to be, apply to the design partner program — we work with 3–5 broadcast/OTT engineering teams ahead of GA.