If you're running MpegFlow at any meaningful scale, you're running it on Kubernetes. Not because Kubernetes is the right answer to every infrastructure problem — but because video transcoding's workload shape (variable throughput, CPU-bound work, periodic spikes, and a need for fleet-level coordination) maps naturally onto K8s + KEDA-style queue-driven autoscaling.
This document covers the production K8s deployment pattern. It assumes you have an existing K8s cluster with reasonable conventions (Helm, RBAC, NetworkPolicy, secrets management). If you're at the "I'll spin up a single VM and run FFmpeg" stage, our build-vs-buy post covers when K8s is justified.
Use case in scope
You are running:
- >100 transcoded minutes per hour sustained
- Variable load — peaks and quiets through the day or week
- Multi-tenant or multi-pool — different customers, different SLAs, different worker pool requirements
- Cost-sensitive — you can't afford to keep encode capacity running at peak load 24/7
You also have or are willing to set up:
- A managed K8s cluster (EKS, GKE, AKS) or self-hosted (you've done this before, you know what you're getting into)
- KEDA installed in the cluster
- A managed Postgres (RDS, Cloud SQL) — running stateful Postgres on K8s for production is hard; we don't recommend it
- Managed Redis (ElastiCache, Memorystore) — same reasoning
- Object storage (S3, GCS, R2, MinIO)
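To make the external wiring concrete, here is a hypothetical Helm values sketch. The key names and secret names below are illustrative only, not the chart's actual schema; check the chart's own values.yaml for the real keys.

```yaml
# Hypothetical values.yaml sketch: key names are illustrative, not the chart schema.
externalPostgres:
  host: <rds-endpoint>                        # RDS / Cloud SQL hostname
  port: 5432
  existingSecret: mpegflow-pg-credentials     # credentials managed outside the chart
externalRedis:
  host: <elasticache-endpoint>                # ElastiCache / Memorystore hostname
  port: 6379
objectStorage:
  endpoint: https://s3.us-east-1.amazonaws.com
  bucket: mpegflow-prod-assets                # placeholder bucket name
  existingSecret: mpegflow-s3-credentials
```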
High-level deployment topology
graph TB
BROWSER["Browser / CLI<br/>(customer)"]
subgraph CLUSTER["K8s Cluster"]
INGRESS["Ingress<br/>(NGINX / ALB / Cloud LB)"]
subgraph APITIER["API tier (HPA-scaled)"]
API1["mpegflow-api pod 1<br/>:8080 REST<br/>:50051 gRPC<br/>:9090 metrics"]
API2["mpegflow-api pod 2"]
APIN["..."]
end
subgraph SHARED["Shared worker pool (Helm-managed)"]
WS1["worker pod"]
WS2["worker pod"]
KEDA_S["KEDA ScaledObject<br/>scales 0..maxReplicas<br/>on queue depth"]
end
subgraph DEDICATED["Dedicated worker pools (Operator-managed)"]
WD1["worker pod (tenant A)"]
WD2["worker pod (tenant B)"]
KEDA_D["KEDA ScaledObject"]
end
OP["mpegflow-operator<br/>(leader election via Lease)"]
end
subgraph DATA["Managed services (cluster-external)"]
PG[("PostgreSQL<br/>(RDS / Cloud SQL)")]
REDIS[("Redis<br/>(ElastiCache / Memorystore)")]
S3[("S3 / MinIO")]
end
BROWSER --> INGRESS --> API1
INGRESS --> API2
WS1 -->|"gRPC"| API1
WD1 -->|"gRPC"| API2
WS1 -->|"presigned"| S3
WD1 -->|"presigned"| S3
OP -->|"reconcile (replicas)"| WD1
OP -->|"read-only"| PG
KEDA_S -.->|"poll queue depth"| REDIS
KEDA_D -.->|"poll queue depth"| REDIS
API1 --> PG
API1 --> REDIS
API1 --> S3
Component-by-component
API tier (mpegflow-api)
The API binary serves four concerns from a single process:
- REST API on :8080 (Axum framework) — customer-facing endpoint
- gRPC coordinator on :50051 (Tonic) — worker-facing endpoint
- WebSocket on /ws (Axum) — live job event streaming
- Metrics server on :9090 — Prometheus scrape endpoint
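For reference, a minimal Service sketch for that port layout; the name, labels, and selector are illustrative, and the Helm chart ships the canonical manifest.

```yaml
# Illustrative Service for the API tier; names and selectors are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: mpegflow-api
  labels:
    app: mpegflow-api
spec:
  selector:
    app: mpegflow-api
  ports:
    - { name: http,    port: 8080,  targetPort: 8080 }   # REST + /ws
    - { name: grpc,    port: 50051, targetPort: 50051 }  # worker-facing coordinator
    - { name: metrics, port: 9090,  targetPort: 9090 }   # Prometheus scrape
```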
Plus background services that run inside the API process:
- Stale Job Recovery — every 60s, requeues jobs whose worker died mid-encode
- Stale Worker Sweeper — every 60s, marks workers offline whose heartbeat lapsed
- Offline Worker Reaper — every 300s, removes offline workers after 2h grace
- Delayed Job Promoter — every 5s, moves backed-off retries from delayed set into pending
- Webhook Executor — every 5s, picks up pending webhook deliveries and POSTs them
Background services use SELECT … FOR UPDATE SKIP LOCKED semantics so they are safe to run on every replica — every API pod runs every service, but only one acquires each row of work at a time.
Deployment:
- Standard Deployment with HPA on CPU (autoscale 2 → 8 typically)
- Two replicas minimum for HA
- LivenessProbe: GET /health on :8080
- ReadinessProbe: GET /ready (also checks DB + Redis connectivity)
- Resources: 1 CPU / 1 GB RAM per pod is enough for moderate volume; scale up if the Webhook Executor is the bottleneck
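A sketch of the CPU-based HPA described above; the 2 to 8 replica range mirrors the typical values mentioned, while the 70% utilization target is an assumption you should tune.

```yaml
# HPA sketch for the API tier: 2..8 replicas on CPU. The 70% target is an assumption.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mpegflow-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mpegflow-api
  minReplicas: 2          # two replicas minimum for HA
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```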
Shared worker pool (Helm-managed)
For most teams, the simplest pattern: one shared encoder pool that any tenant's jobs land in. Deployed via Helm chart, autoscaled by KEDA based on Redis queue depth.
# Conceptual KEDA ScaledObject for shared worker pool
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: mpegflow-workers-shared
spec:
scaleTargetRef:
name: mpegflow-worker
minReplicaCount: 0 # scale to zero when no jobs
maxReplicaCount: 50 # cap at 50 workers
pollingInterval: 10 # check every 10s
cooldownPeriod: 300 # wait 5min before scaling down
triggers:
- type: redis
metadata:
address: redis.cluster.svc:6379
listName: mpegflow:queue:default
listLength: "5" # +1 worker per 5 queued jobs
Properties:
- Scales to zero when the queue is empty — pure savings during quiet periods
- Scales linearly with queue depth — 50 queued jobs → ~10 workers
- Cooldown of 5 minutes before scale-down — avoids thrashing on bursty workloads
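The ScaledObject's scale target is an ordinary worker Deployment. One detail worth sketching: a generous terminationGracePeriodSeconds lets an in-flight encode finish when KEDA scales the pool down. The image reference, resource sizes, and grace period below are assumptions.

```yaml
# Worker Deployment sketch (the scaleTargetRef above). Image, resources, and the
# grace period are assumptions; size the grace period to your longest encode.
# replicas is omitted because KEDA owns the replica count.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mpegflow-worker            # must match scaleTargetRef.name
spec:
  selector:
    matchLabels: { app: mpegflow-worker }
  template:
    metadata:
      labels: { app: mpegflow-worker }
    spec:
      terminationGracePeriodSeconds: 1800   # let an in-flight encode drain
      containers:
        - name: worker
          image: <registry>/mpegflow-worker:<tag>   # placeholder image reference
          resources:
            requests: { cpu: "4", memory: 8Gi }
            limits:   { cpu: "8", memory: 16Gi }
```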
Dedicated worker pools (Operator-managed)
For Pro / Enterprise tiers and any tenant that needs isolation guarantees: a dedicated worker pool tied to an organization. Created at runtime by API call, not by editing Helm values.
This is where the Operator comes in. The MpegFlow Operator watches the worker_pools table in PostgreSQL and reconciles K8s Deployments to match:
sequenceDiagram
participant API
participant DB as PostgreSQL
participant OP as Operator (leader)
participant K8S as K8s API
API->>DB: INSERT worker_pools<br/>(org_id, max_workers=10, status='active')
Note over OP: Reconciliation loop<br/>(every 30s)
OP->>DB: SELECT * FROM worker_pools
OP->>K8S: GET deployment mpegflow-pool-{id}
K8S-->>OP: Not found
OP->>K8S: CREATE Deployment<br/>(replicas=10, pool_id=...)
OP->>K8S: CREATE ScaledObject<br/>(KEDA, max=10)
Note over API,K8S: Later — pool paused
API->>DB: UPDATE worker_pools SET status='paused'
OP->>DB: Re-read state
OP->>K8S: PATCH ScaledObject<br/>(maxReplicas=0)
Note over K8S: Existing workers drain<br/>and exit; no new pods
Key features of the Operator:
- Leader election via K8s Lease object — only one Operator pod takes action at a time, prevents conflicts during rolling updates. 30-second lease.
- Read-only DB access — the Operator never writes to PostgreSQL. It only reconciles K8s state to match what the API has already recorded.
- Per-pool ScaledObject — each dedicated pool gets its own KEDA configuration, independent from shared workers.
- Per-pool NetworkPolicy — workers in pool A cannot reach workers in pool B. Enforced at K8s networking layer.
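Since workers only make outbound connections (gRPC to the coordinator, presigned URLs to object storage), per-pool isolation can be expressed as a default-deny ingress policy. A sketch of what the Operator-generated policy could look like; the label key and names are assumptions:

```yaml
# Illustrative per-pool NetworkPolicy: deny all ingress to pool-A workers.
# The label key and policy name are assumptions; the Operator generates the real one.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mpegflow-pool-tenant-a-isolation
spec:
  podSelector:
    matchLabels:
      mpegflow.io/pool: tenant-a
  policyTypes:
    - Ingress
  ingress: []     # no ingress rules: nothing, including other pools, can connect in
```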
Pool pause — instant + cost-saving
A subtle but useful feature: pools can be paused at two levels.
stateDiagram-v2
[*] --> active : Pool created
active --> paused : POST /pools/{id}/pause
paused --> active : POST /pools/{id}/resume
active --> [*] : Pool deleted
note right of active
Coordinator: assigns jobs normally
Operator: replicas = max_workers
KEDA: maxReplicaCount = max_workers
end note
note right of paused
Coordinator: refuses to assign
(instant — no DB query needed)
Operator: replicas = 0 (eventual)
KEDA: maxReplicaCount = 0
Jobs queue safely in Redis
end note
The two-level pause matters because:
- Coordinator pause is instant. The moment you hit POST /pools/{id}/pause, the coordinator stops handing jobs to that pool's workers. New job submissions queue up in Redis but don't get assigned. Customer sees no impact (jobs queue, then process when resumed).
- Operator scale-to-zero is eventual. Within ~30-60 seconds the Operator reconciles, sets replicas=0, and existing workers drain and exit. Compute cost drops to zero.
For broadcast operators with predictable schedules ("we don't transcode overnight"), pool pause via cron saves ~30-50% on encode bills compared to running pools always-on.
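A sketch of that cron pattern: a CronJob that calls the pause endpoint in the evening, paired with a matching resume job in the morning. The hostname, pool ID, token secret, and schedule are placeholders; the endpoint path is the one described above.

```yaml
# "Pause overnight" sketch. Hostname, pool ID, secret, and schedule are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mpegflow-pause-pool-overnight
spec:
  schedule: "0 23 * * *"                 # 23:00 daily; add a matching resume CronJob
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pause
              image: curlimages/curl     # pin a specific tag in production
              command: ["curl"]
              env:
                - name: MPEGFLOW_TOKEN
                  valueFrom:
                    secretKeyRef: { name: mpegflow-api-token, key: token }
              args:
                - "-fsS"
                - "-X"
                - "POST"
                - "-H"
                - "Authorization: Bearer $(MPEGFLOW_TOKEN)"
                - "https://<your-api-host>/pools/<pool-id>/pause"
```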
KEDA scaling strategies
The interesting part of K8s deployment for video infrastructure is the autoscaling. Three strategies, each with trade-offs:
Strategy 1: Pure queue-depth scaling
The default. KEDA polls the Redis queue length every 10s and adds roughly one worker per listLength queued jobs.
Pros: Simplest. Works well for bursty workloads. Cons: Slight cold-start latency — first job after scale-down waits for a worker to spin up (~30-60s on EKS, less on GKE).
Strategy 2: Pre-warmed minimum replicas
Set minReplicaCount: 2 instead of 0. Always keeps two workers running.
Pros: No cold-start penalty for low-volume workloads. Cons: ~$200-400/month per always-on worker (depending on instance type). Only worth it if cold-start matters.
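Relative to the shared-pool ScaledObject above, this strategy is a one-field change:

```yaml
# Strategy 2: keep two workers warm instead of scaling to zero.
spec:
  minReplicaCount: 2      # was 0; trades two always-on workers for zero cold start
  maxReplicaCount: 50
```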
Strategy 3: Time-based pre-scaling
Use cron-based ScaledObject triggers to pre-scale before known peak windows.
# Pre-scale to 10 workers at 8am ET weekdays for known morning encode burst
triggers:
- type: cron
metadata:
timezone: America/New_York
start: "55 7 * * 1-5" # 7:55am M-F
end: "0 18 * * 1-5" # 6:00pm M-F
desiredReplicas: "10"
- type: redis
metadata:
... # base scaling for off-hours
Pros: Best of both worlds for predictable workloads. Cons: Operationally heavier — you have to know your peaks. Most teams skip this.
For broadcast operators with daily catch-up patterns, Strategy 3 is usually worth it. For pure-VOD operators with random submission patterns, Strategy 1 is fine.
Permission model in production
The API server enforces RBAC across 5 roles:
| Permission | Viewer | Editor | Admin | Owner | SuperAdmin |
|---|---|---|---|---|---|
| Workflow Read | ✅ | ✅ | ✅ | ✅ | ✅ |
| Workflow Create/Update/Delete | ─ | ✅ | ✅ | ✅ | ✅ |
| Job Read | ✅ | ✅ | ✅ | ✅ | ✅ |
| Job Create/Cancel | ─ | ✅ | ✅ | ✅ | ✅ |
| Asset Read/Download | ✅ | ✅ | ✅ | ✅ | ✅ |
| Asset Upload/Delete | ─ | ✅ | ✅ | ✅ | ✅ |
| Webhook CRUD | ─ | ✅ | ✅ | ✅ | ✅ |
| Organization Read/Usage | ✅ | ✅ | ✅ | ✅ | ✅ |
| Organization Update | ─ | ─ | ✅ | ✅ | ✅ |
| Manage Members | ─ | ─ | ✅ | ✅ | ✅ |
| Manage Billing | ─ | ─ | ─ | ✅ | ✅ |
| Worker Read (fleet health) | ─ | ─ | ✅ | ✅ | ✅ |
| Pool Manage (pause/resume) | ─ | ─ | ✅ | ✅ | ✅ |
| Worker Manage (drain/evict) | ─ | ─ | ─ | ─ | ✅ |
| Platform Admin | ─ | ─ | ─ | ─ | ✅ |
Why individual workers are SuperAdmin-only: Worker drain/evict commands conflict with KEDA autoscaling. If an Owner could drain a worker mid-job, the KEDA scaler wouldn't know about it and might immediately provision a replacement, defeating the drain. Pool-level pause/resume is the user-controllable equivalent — and it works correctly with KEDA because it sets maxReplicaCount=0.
Observability
The metrics surface from a deployed cluster:
| Metric | What it tells you |
|---|---|
| mpegflow_jobs_total{status} | Job throughput by terminal state (completed / failed / cancelled) |
| mpegflow_jobs_in_flight{pool_id} | Real-time per-pool active jobs |
| mpegflow_queue_depth{pool_id} | Jobs waiting; KEDA's autoscaling input |
| mpegflow_worker_count{pool_id, status} | Fleet size — drives capacity planning |
| mpegflow_webhook_deliveries_total{status} | Outbound integration health |
| mpegflow_webhook_consecutive_failures{webhook_id} | Per-webhook circuit breaker state |
| mpegflow_event_bus_dispatch_duration_seconds | EventBus is on the hot path; watch p99 |
| mpegflow_grpc_requests_total{method} | Worker → coordinator traffic volume |
Standard Prometheus scrape on :9090 from each API pod. Grafana dashboards available as ConfigMap in the Helm chart.
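If you run the Prometheus Operator, a ServiceMonitor along these lines picks the endpoint up. It assumes a Service exposing the :9090 port under the name metrics, as in the Service sketch earlier; annotation-based scraping works just as well if that's your convention.

```yaml
# ServiceMonitor sketch (assumes the Prometheus Operator CRDs and a Service
# that exposes the :9090 port under the name "metrics").
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mpegflow-api
spec:
  selector:
    matchLabels:
      app: mpegflow-api    # label on the API Service; adjust to your chart's labels
  endpoints:
    - port: metrics
      interval: 30s
```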
Companion concerns and platform responsibility
This architecture covers the single-cluster K8s deployment of MpegFlow. Adjacent concerns each have their own answer:
- Cost optimization at scale → see the cost-aware spot-instance encoder pool architecture — extends this deployment with spot fleet diversification, interruption handling, and the on-demand baseline pattern for workloads above 1M output minutes per month.
- Pattern walkthrough by volume → for the four-pattern climb (K8s Job per encode → worker Deployment + queue → KEDA queue-depth autoscaling → operator pattern), the FFmpeg in Kubernetes blog post walks the decision tree end-to-end.
- Multi-region resilience → see the multi-region failover architecture — this single-cluster deployment is the foundation it builds on.
- Multi-tenant security model → see strict-broker security — this K8s deployment enforces the network-isolation and pod-security guarantees that strict-broker depends on.
- PostgreSQL HA → managed Postgres (RDS, Cloud SQL, Aiven) — running stateful HA on Kubernetes is a separate multi-week project that's not MpegFlow-specific. We recommend pairing with a managed offering.
- GPU-accelerated encoding → works on standard cloud GPU node groups (NVIDIA T4 and A10 production-tested). Provider-specific node-group setup; talk to us during onboarding for the exact patterns.
- Cluster federation for dedicated-cluster-per-customer Enterprise deployments → custom engagement; available on the Enterprise tier with named TAM.
- Kubernetes hardening baseline → standard CIS Benchmark + Pod Security Standards apply. Your platform team owns this layer; MpegFlow runs cleanly on top of any compliant cluster.
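For the Pod Security Standards piece, enforcement is just namespace labels. A minimal example follows; the namespace name is a placeholder, and whether you enforce baseline or restricted is your platform team's call.

```yaml
# Enforce a Pod Security Standard on the MpegFlow namespace via labels.
apiVersion: v1
kind: Namespace
metadata:
  name: mpegflow                                   # placeholder namespace name
  labels:
    pod-security.kubernetes.io/enforce: baseline   # or "restricted", per platform policy
    pod-security.kubernetes.io/warn: restricted
```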
How to evaluate this architecture for your team
If you're an SRE or platform engineer evaluating:
- Verify you have or are willing to operate: managed Postgres, managed Redis, KEDA, an ingress controller, and S3-compatible storage. Five components — none MpegFlow-specific, all standard for K8s shops.
- Calculate your steady-state worker pool size from your average job volume. Set maxReplicaCount to ~2× peak.
- For dedicated-tier customers, plan their pool with minReplicaCount according to their SLA — pure autoscale-from-zero is fine for free / starter; pre-warmed minimums are right for Enterprise.
- Wire your existing observability stack (Prometheus / Grafana / your APM) to the metrics endpoint. The metric set above is sufficient for capacity planning, alerting, and SLA reporting.
- Run the strict-broker security checklist — the K8s deployment is what enforces most of those network-isolation and pod-security guarantees.
If you're early on the K8s journey and this feels like a lot, that's right — running production K8s + KEDA is real work. The trade-off is the operational savings during quiet periods (scale to zero is genuinely free) and the elasticity for spike handling. For 100K+ minutes/month workloads, the math works.
If your team would benefit from a guided deployment of this shape, the design partner program is where we co-deploy with our first cohort.