This document describes a production deployment of MpegFlow for a Tier-1 broadcaster running primary VOD encoding at scale. It's the shape of deployment we recommend for teams whose business depends on getting transcoded video out the door reliably, predictably, and with full provenance.
It's a reference architecture — not the only deployment shape. Specific decisions (cloud provider, CDN, DRM partner) are interchangeable; the structural choices (DAG runtime, audit-trail-first, per-pool worker isolation) are the parts we'd push back on if you tried to skip them.
Use case in scope
A broadcaster or OTT operator is producing finished video that needs to land in a CDN as a multi-rendition ABR manifest, with full audit provenance, idempotent retries, and contractual SLA on completion time. Volume is in the band of 100K – 1M transcoded minutes per month, with mixed input from production (file-based ingest, with occasional broadcast-quality master files at 200+ GB).
Out of scope for this architecture:
- Live streaming (separate reference architecture, ships when live primitives ship in 2026 Q3)
- Real-time editing or interactive video
- Sub-1-second latency workloads
- Consumer UGC at YouTube scale
High-level deployment topology
graph TB
subgraph CP["MpegFlow control plane"]
API["REST API (Axum)<br/>:8080"]
GRPC["gRPC Coordinator (Tonic)<br/>:50051"]
WS["WebSocket /ws<br/>(live job events)"]
EVENT["EventBus<br/>(audit + WS + webhooks)"]
end
subgraph PROBE["Probe pool"]
P1["ffprobe worker"]
P2["ffprobe worker"]
end
subgraph ENCODE["Encode pool (KEDA-autoscaled)"]
E1["FFmpeg worker"]
E2["FFmpeg worker"]
EN["..."]
end
subgraph PKG["Package + emit pool"]
PK1["HLS/DASH packager"]
PK2["CDN purge"]
end
subgraph STORAGE["Object storage (S3-compatible)"]
IN["Input bucket<br/>(versioned, 30d retention)"]
INT["Intermediate bucket<br/>(7d lifecycle)"]
OUT["Output bucket<br/>(CDN-fronted)"]
end
DB[("PostgreSQL<br/>jobs · audit · webhooks")]
REDIS[("Redis<br/>queues · sessions")]
P1 -->|"gRPC"| GRPC
E1 -->|"gRPC"| GRPC
PK1 -->|"gRPC"| GRPC
GRPC --> EVENT
API --> EVENT
EVENT --> DB
API --> DB
GRPC --> REDIS
P1 -->|"presigned PUT/GET"| IN
E1 -->|"presigned PUT/GET"| INT
PK1 -->|"presigned PUT"| OUT
The DAG runtime in the control plane orchestrates per-stage scheduling. Each pool is a horizontally scalable group of workers with its own preemption and retry policy. Pools are not fungible — encode pool boxes are CPU-rich, package pool boxes are network-rich, probe boxes can be small.
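To make the orchestration concrete, here is a minimal plain-Python sketch of the three-stage workflow the diagram implies (probe, fan-out encode per rendition, package). The field names and structure are illustrative assumptions, not MpegFlow's job-definition schema.

```python
# Illustrative sketch of the three-stage DAG this topology runs.
# Field names are assumptions, not MpegFlow's job-definition schema.
ABR_LADDER = ["240p", "480p", "720p", "1080p", "2160p"]

workflow = {
    "stages": [
        {"name": "probe", "pool": "probe", "tasks": 1},
        # one encode task per rendition, all downstream of probe
        {"name": "encode", "pool": "encode", "tasks": len(ABR_LADDER),
         "depends_on": ["probe"]},
        # package waits for every rendition before assembling the manifest
        {"name": "package", "pool": "package", "tasks": 1,
         "depends_on": ["encode"]},
    ]
}
```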
Component-by-component
Control plane
What: the MpegFlow runtime, deployed as a stateless service against a Postgres backing store. Exposes the REST/gRPC API your code submits jobs to.
Where: typically deployed on the same cloud as your encoder pools (latency to workers matters for control), with the Postgres in a managed offering (RDS, Cloud SQL, etc.).
Sizing: for the volume band in scope, two control-plane instances behind a load balancer are sufficient. Not the bottleneck.
State: all job records, audit history, encoder version pinning, and webhook delivery state lives in Postgres. Treat it like any other production OLTP database — backups, point-in-time recovery, read replicas if needed for analytics.
Probe pool
Purpose: runs ffprobe against incoming inputs, extracts metadata (codec, resolution, bitrate, audio tracks, GOP structure, color metadata), validates inputs against your accepted-formats policy, classifies for routing.
Why a separate pool: probe is fast and cheap. Mixing it with encode workers wastes encode capacity on second-long jobs. Separate pool means you can over-provision probe capacity for spike absorption.
Sizing: roughly 1 probe worker per 10 active encode workers. Small boxes (2-core, 4 GB RAM) are sufficient.
Encode pool
Purpose: runs FFmpeg for each rendition in the ABR ladder. CPU-bound for libx264 / libx265; GPU acceleration is an option for workloads that can trade some quality-per-bit for throughput.
Why a separate pool: isolation. An OOM event in encoding shouldn't take down probes or packaging. Worker churn in encode is high (long jobs, occasional crashes) — the rest of the system shouldn't see it.
Sizing for 500K minutes/month:
- Assume 5-rendition Professional-tier ladder (240p, 480p, 720p, 1080p, 2160p)
- Aggregate encode minutes: 500K × 5 = 2.5M minutes/month
- Per worker (16-core): ~40K minutes/month sustained
- Required pool size: ~60-70 workers, with headroom for peaks
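The same arithmetic as a small script. The per-worker throughput and the roughly 10% headroom factor are the assumptions to replace with your own measurements.

```python
# Back-of-envelope pool sizing for the figures above.
source_minutes_per_month = 500_000        # transcoded source minutes
renditions = 5                            # 240p .. 2160p ladder
worker_minutes_per_month = 40_000         # sustained, per 16-core worker

encode_minutes = source_minutes_per_month * renditions      # 2,500,000
base_workers = encode_minutes / worker_minutes_per_month    # 62.5
with_headroom = base_workers * 1.10                         # ~69, within the 60-70 band

print(f"aggregate encode minutes/month:  {encode_minutes:,}")
print(f"workers at sustained load:       {base_workers:.0f}")
print(f"workers with ~10% peak headroom: {with_headroom:.0f}")
```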
Key configuration:
- Per-job hard memory limit (kill at 75% of host RAM)
- Encoder version pinned per worker pool (deterministic outputs)
- Local NVMe for stage-in/stage-out, not network storage
- Worker pool tied to a specific FFmpeg container image hash
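As a sketch of what that configuration might look like gathered in one place (the field names are assumptions for illustration, not MpegFlow's config schema):

```python
# Hypothetical encode-pool configuration expressed as plain Python.
# Field names are assumptions, not MpegFlow's config schema.
encode_pool = {
    "name": "encode-x264-v6_0",
    # deterministic outputs: the pool is tied to one FFmpeg image digest
    "ffmpeg_image": "registry.example.com/ffmpeg@sha256:<digest>",
    # per-job hard memory limit: kill the job at 75% of host RAM
    "job_memory_limit_fraction": 0.75,
    # stage-in/stage-out on local NVMe, not network storage
    "scratch_path": "/mnt/nvme/scratch",
    "retry": {"max_attempts": 3, "backoff_seconds": 60},
}
```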
Package + emit pool
Purpose: assembles renditions into HLS/DASH manifests and media fragments, applies DRM signing (when DRM ships), uploads to the output bucket, fires the CDN purge.
Why a separate pool: I/O bound, not CPU bound. Different scaling profile from encode. This is also where CDN-purge rate-limiting bites — you want it isolated so encode throughput isn't held back by CDN backpressure.
Sizing: typically 10% of encode pool size. Network-rich boxes.
Object storage
Three buckets, deliberate roles:
- Input bucket: customer drops finished masters here. Versioned (so an accidental overwrite doesn't lose work). Lifecycle policy: 30-day retention if not yet processed.
- Intermediate bucket: stage-out from encoders, stage-in to packagers. Aggressive lifecycle: delete after 7 days. Should be in the same region as encode workers.
- Output bucket: finished outputs. CDN-fronted. Long retention. Versioned.
S3 / R2 / GCS / on-prem MinIO are interchangeable. Network egress from the encode workers to storage is often a real cost line — co-locate.
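The 7-day rule on the intermediate bucket is plain S3 lifecycle configuration. A minimal boto3 sketch, assuming an S3-compatible endpoint and a placeholder bucket name:

```python
# Minimal boto3 sketch of the 7-day lifecycle rule on the intermediate
# bucket. Bucket name is a placeholder; pass endpoint_url for MinIO / R2.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="mpegflow-intermediate",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-intermediate-after-7-days",
                "Filter": {"Prefix": ""},                 # whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 7},
                # also reap multipart uploads left behind by interrupted stage-outs
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```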
CDN
Out of scope for this architecture in detail, but worth flagging: every CDN rate-limits its purge/invalidation API. Bursty completion patterns (e.g. a large catalog re-encode) can hit those limits even when the encode pool has plenty of capacity. The package pool's job-completion event should be the rate-limited boundary, not the encode pool.
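A minimal sketch of that boundary: a sliding-window limiter in front of the purge call, so bursty completions queue instead of tripping the CDN's limit. The per-minute figure and the cdn_purge callable are assumptions; wire in your CDN client and contractual limit.

```python
# Sliding-window rate limiter in front of the CDN purge call.
import time
from collections import deque

PURGES_PER_MINUTE = 100            # assumed limit; check your CDN contract
_recent: deque = deque()           # monotonic timestamps of recent purges

def purge_when_allowed(cdn_purge, path: str) -> None:
    """Block until a purge slot is free, then call the CDN client."""
    while True:
        now = time.monotonic()
        while _recent and now - _recent[0] > 60:
            _recent.popleft()                    # drop entries outside the window
        if len(_recent) < PURGES_PER_MINUTE:
            break
        time.sleep(60 - (now - _recent[0]))      # wait for the oldest to expire
    _recent.append(time.monotonic())
    cdn_purge(path)
```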
Capacity sizing — worked example
For a broadcaster running roughly 500K transcoded minutes/month:
| Component | Size | Rough monthly cost (cloud) |
|---|---|---|
| Control plane (2 instances + LB) | 2 × 4-core / 8 GB | $400 |
| Postgres (managed, multi-AZ) | medium tier with backups | $400 |
| Probe pool | 5–8 workers (2-core / 4 GB) | $300 |
| Encode pool | 60–70 workers (16-core / 32 GB) | $9,000 – 12,000 |
| Package pool | 6–8 workers (8-core / 32 GB, network-optimized) | $700 |
| Object storage | input + intermediate + output | $1,200 – 2,000 |
| Egress / CDN purge API | varies by traffic shape | $500 – 2,000 |
| Total cloud cost (rough) | | ~$13,000 – 17,000 / month |
Plus your team's operational time on top, plus MpegFlow license fees (TBD at GA — beta cohort runs free).
For comparison, the equivalent workload on AWS Elemental MediaConvert at a 5-rendition Professional ladder is roughly 500K × 5 × $0.015 per output minute ≈ $37,500/month in transcode alone (no storage, no CDN). The break-even math is in our build-vs-buy post.
Security posture
The non-negotiables for broadcast contractual workflows:
Network isolation
- Workers in private subnets. No public ingress.
- Control plane behind your VPN or via private link to your application.
- Object storage with bucket-level IAM scoped to specific worker IAM roles.
- No worker has access to any bucket it doesn't strictly need.
Encryption
- TLS 1.3 between every component pair
- Object storage encrypted at rest with customer-managed KMS keys (CMK) — pre-broadcast content typically requires this
- Encoder workers' local NVMe encrypted at rest (LUKS, EBS volume encryption, GCP CMEK, etc.)
Audit
- Every job records: input hash, ffprobe output, encoder version, full FFmpeg command, stage timestamps, retry history, output hashes
- Audit table is append-only — no UPDATE on beta_audit_events-equivalent tables
- Backed up nightly, retained for the duration of any contractual reporting period (often 7+ years for broadcast)
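For illustration, roughly what one such audit record carries, expressed as plain Python; the field names are assumptions rather than MpegFlow's actual schema, and angle brackets mark placeholders.

```python
# Illustrative shape of one job's audit record (assumed field names).
audit_record = {
    "job_id": "<job id>",
    "input_sha256": "<hash of the source master>",
    "ffprobe": {"codec": "prores", "resolution": "3840x2160", "audio_tracks": 8},
    "encoder_version": "ffmpeg 6.0 (image sha256:<digest>)",
    "ffmpeg_command": "<full command line, verbatim>",
    "stage_timestamps": {"probe": "<ts>", "encode": "<ts>", "package": "<ts>"},
    "retry_history": [],
    "output_sha256": {"240p": "<hash>", "1080p": "<hash>", "2160p": "<hash>"},
}
```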
Vulnerability disclosure
The MpegFlow security disclosure path is security@mpegflow.com (PGP key on /security once that ships). Encoder pool images should be scanned on every deploy via your existing tooling (Trivy, Snyk, etc.).
Compliance considerations
For Tier-1 broadcasters, the relevant frameworks tend to be:
- SOC 2 Type II: MpegFlow's audit window opens 2026 Q4. Until then, design partner deployments operate under bilateral NDA + DPA. SOC 2 customers should target 2026 Q1 GA or later for public-facing production deployments.
- GDPR (EU subjects): controlled by your input/output regions. MpegFlow itself is content-agnostic and processes only the metadata you submit (job parameters, audit fields). For EU subject data, deploy in EU regions; sub-processor list documented on /trust.
- MPA / TPN best practices: for pre-release content, an air-gap deployment of the self-hosted distribution is the right shape (separate reference architecture; planned for 2026 Q4 once self-hosted ships).
- Broadcast-specific contractual requirements: content-restriction, geographic-residency, watermarking, and forensic-audit requirements vary by contract. The architecture above supports each — talk to us during onboarding for specifics.
Operational runbook
The few things that hurt at scale and how to handle them:
Encoder version drift across rolling deploys
The pattern: you deploy encoder v6.1; existing jobs running on v6.0 finish; new jobs start hitting v6.1. Some bitstream parameters subtly change. QC catches it three weeks later. Solution: pin encoder version per job at submission, not per worker. The control plane records the FFmpeg version each job ran on. New deploys get a new pool tier; old jobs finish on the old tier; nothing crosses.
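A hedged sketch of what per-job pinning looks like at submission time; the payload shape below is an assumption for illustration, not MpegFlow's published API. The point is that the pin travels with the job, not the worker.

```python
# Hypothetical job-submission payload with the encoder pinned per job.
job_request = {
    "input_uri": "s3://mpegflow-input/masters/episode-042.mxf",
    "workflow": "abr-ladder-professional",
    # pinned at submission: this job stays on the v6.0 pool tier even if an
    # encode-v6.1 pool rolls out while the job is still queued
    "encoder": {"ffmpeg_version": "6.0", "pool_tier": "encode-x264-v6_0"},
}
# POST this to the control plane's REST API (:8080); the resulting audit
# record carries the FFmpeg version the job actually ran on.
```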
Output cleanup on cancel
When a job is cancelled mid-encode, partial outputs in the intermediate bucket are orphaned. Lifecycle policies catch most of them (7-day deletion); for stricter cleanup, the control plane fires a cleanup task on every terminal-state job event.
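A sketch of the stricter path, assuming intermediate objects are keyed by job ID under a jobs/ prefix (the bucket name and prefix layout are assumptions):

```python
# On a terminal job event, delete the job's prefix from the intermediate
# bucket instead of waiting out the 7-day lifecycle rule.
import boto3

def cleanup_intermediate(job_id: str, bucket: str = "mpegflow-intermediate") -> None:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=f"jobs/{job_id}/"):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:                                  # delete_objects rejects empty lists
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
```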
Partial-success ABR ladders
If 4 of 5 renditions succeed and the 5th OOMs repeatedly, the package stage is blocked. Two policies: (a) wait, retry the failing rendition on a higher-memory pool; (b) emit the manifest with the 4 successful renditions and flag the missing one for ops review. Default is (a) with a 3-retry cap, then escalate to (b). Configurable per workflow.
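A minimal sketch of that default policy, assuming a hypothetical higher-memory pool name:

```python
# Default partial-ladder policy: retry on a higher-memory pool up to the cap,
# then emit a partial manifest and flag the gap for ops review.
RETRY_CAP = 3

def resolve_partial_ladder(succeeded: list[str], failed: str, attempts: int) -> dict:
    if attempts < RETRY_CAP:
        return {"action": "retry", "rendition": failed, "pool": "encode-highmem"}
    return {
        "action": "emit_partial",
        "renditions": succeeded,                      # manifest ships with these
        "ops_flag": f"missing rendition: {failed}",
    }

# 2160p has OOMed three times: ship the 4-rendition manifest and page ops
print(resolve_partial_ladder(["240p", "480p", "720p", "1080p"], "2160p", attempts=3))
```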
Spot instance economics
Encode pool can run 50–70% on spot/preemptible instances if you're tolerant of job preemption. The trade-off: longer jobs become harder to put on spot (more chance of preemption mid-encode). Heuristic: jobs predicted < 90 minutes go on the spot pool first; longer jobs stay on on-demand.
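The heuristic as a one-line routing policy; the pool names and the 90-minute cutoff are the assumptions to tune.

```python
# Route jobs to spot or on-demand by predicted duration.
SPOT_CUTOFF_MINUTES = 90

def choose_pool(predicted_minutes: float) -> str:
    return "encode-spot" if predicted_minutes < SPOT_CUTOFF_MINUTES else "encode-ondemand"

print(choose_pool(42))    # encode-spot
print(choose_pool(135))   # encode-ondemand
```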
Scope and companion architectures
This document is specifically the VOD transcoding pipeline. Adjacent concerns each have their own answer:
- Global redundancy and failover → see multi-region failover for the geographic-redundancy pattern that wraps this single-region deployment.
- Multi-tenant isolation → see the strict-broker security architecture for the credential-isolation model that this deployment runs on top of.
- Live streaming origin → ships in 2026 Q3 alongside SRT/RTMP ingest primitives. Companion architecture lands then.
- DRM packaging → pairs with established partners (Vualto, EZDRM, etc.) via SPEKE today; native packaging on the GA roadmap.
- Player + analytics → out of scope by design. Pair with Bitmovin Player, Mux Data, JW Player, or your existing player stack — MpegFlow is the encoding/orchestration layer.
- Editorial / production management → integrates with Adobe Premiere Pro Production, Iconik, EditShare, IMS, etc. We're the engine; your DAM is the cockpit.
How to evaluate this architecture for your team
If you're at a Tier-1 broadcaster or OTT operator considering this shape:
- Map your current workflow to the components above. Where do you have a substitute (your own encoder pool, your own audit table)? Where is the substitute working well? Where is it the bottleneck?
- The biggest friction points we hear: per-rendition retry, encoder version pinning, audit-trail provenance. Score yours.
- The biggest blockers we hear: SOC 2 timing, DRM coverage, live streaming. We're honest about where we're not there yet — see the schedule above.
- If the architecture maps to where you'd want to be, apply to the design partner program — we work with 3–5 broadcast/OTT engineering teams ahead of GA.