
Kubernetes deployment with KEDA autoscaling

Production K8s topology for MpegFlow — API tier, shared workers via Helm, dedicated workers via Operator, KEDA queue-depth autoscaling, leader election, pool pause for cost savings.

By the MpegFlow Engineering Team · For SREs and platform engineers deploying MpegFlow on Kubernetes · 11 min read · 2,111 words · May 5, 2026
In this architecture
  1. Use case in scope
  2. High-level deployment topology
  3. Component-by-component
  4. API tier (mpegflow-api)
  5. Shared worker pool (Helm-managed)
  6. Dedicated worker pools (Operator-managed)
  7. Pool pause — instant + cost-saving
  8. KEDA scaling strategies
  9. Strategy 1: Pure queue-depth scaling
  10. Strategy 2: Pre-warmed minimum replicas
  11. Strategy 3: Time-based pre-scaling
  12. Permission model in production
  13. Observability
  14. Companion concerns and platform responsibility
  15. How to evaluate this architecture for your team

If you're running MpegFlow at any meaningful scale, you're running it on Kubernetes. Not because Kubernetes is the right answer to every infrastructure problem — but because video transcoding's workload shape (variable-throughput, CPU-bound, periodic spikes, requires fleet-level coordination) maps naturally onto K8s + KEDA-style queue-driven autoscaling.

This document covers the production K8s deployment pattern. It assumes you have an existing K8s cluster with reasonable conventions (Helm, RBAC, NetworkPolicy, secrets management). If you're at the "I'll spin up a single VM and run FFmpeg" stage, our build-vs-buy post covers when K8s is justified.

#Use case in scope

You are running:

  • >100 transcoded minutes per hour sustained
  • Variable load — peaks and quiets through the day or week
  • Multi-tenant or multi-pool — different customers, different SLAs, different worker pool requirements
  • Cost-sensitive — you can't afford to keep encode capacity running at peak load 24/7

You also have or are willing to set up:

  • A managed K8s cluster (EKS, GKE, AKS) or self-hosted (you've done this before, you know what you're getting into)
  • KEDA installed in the cluster
  • A managed Postgres (RDS, Cloud SQL) — running stateful Postgres on K8s for production is hard; we don't recommend it
  • Managed Redis (ElastiCache, Memorystore) — same reasoning
  • Object storage (S3, GCS, R2, MinIO)

#High-level deployment topology

graph TB
    BROWSER["Browser / CLI<br/>(customer)"]

    subgraph CLUSTER["K8s Cluster"]
        INGRESS["Ingress<br/>(NGINX / ALB / Cloud LB)"]

        subgraph APITIER["API tier (HPA-scaled)"]
            API1["mpegflow-api pod 1<br/>:8080 REST<br/>:50051 gRPC<br/>:9090 metrics"]
            API2["mpegflow-api pod 2"]
            APIN["..."]
        end

        subgraph SHARED["Shared worker pool (Helm-managed)"]
            WS1["worker pod"]
            WS2["worker pod"]
            KEDA_S["KEDA ScaledObject<br/>scales 0..maxReplicas<br/>on queue depth"]
        end

        subgraph DEDICATED["Dedicated worker pools (Operator-managed)"]
            WD1["worker pod (tenant A)"]
            WD2["worker pod (tenant B)"]
            KEDA_D["KEDA ScaledObject"]
        end

        OP["mpegflow-operator<br/>(leader election via Lease)"]
    end

    subgraph DATA["Managed services (cluster-external)"]
        PG[("PostgreSQL<br/>(RDS / Cloud SQL)")]
        REDIS[("Redis<br/>(ElastiCache / Memorystore)")]
        S3[("S3 / MinIO")]
    end

    BROWSER --> INGRESS --> API1
    INGRESS --> API2
    WS1 -->|"gRPC"| API1
    WD1 -->|"gRPC"| API2
    WS1 -->|"presigned"| S3
    WD1 -->|"presigned"| S3
    OP -->|"reconcile (replicas)"| WD1
    OP -->|"read-only"| PG
    KEDA_S -.->|"poll queue depth"| REDIS
    KEDA_D -.->|"poll queue depth"| REDIS
    API1 --> PG
    API1 --> REDIS
    API1 --> S3

#Component-by-component

#API tier (mpegflow-api)

The API binary serves four concerns from a single process:

  • REST API on :8080 (Axum framework) — customer-facing endpoint
  • gRPC coordinator on :50051 (Tonic) — worker-facing endpoint
  • WebSocket on /ws (Axum) — live job event streaming
  • Metrics server on :9090 — Prometheus scrape endpoint
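
To make the port layout concrete, here is a minimal Service sketch for the API tier. The selector label and port names are assumptions, not the Helm chart's actual values:

# Hypothetical Service exposing the API pods' three listening ports
apiVersion: v1
kind: Service
metadata:
  name: mpegflow-api
spec:
  selector:
    app.kubernetes.io/name: mpegflow-api   # assumed label; check your chart values
  ports:
    - name: http        # REST on :8080 (WebSocket /ws upgrades on the same port)
      port: 8080
      targetPort: 8080
    - name: grpc        # worker-facing coordinator on :50051
      port: 50051
      targetPort: 50051
    - name: metrics     # Prometheus scrape endpoint on :9090
      port: 9090
      targetPort: 9090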

Plus background services that run inside the API process:

  • Stale Job Recovery — every 60s, requeues jobs whose worker died mid-encode
  • Stale Worker Sweeper — every 60s, marks workers offline whose heartbeat lapsed
  • Offline Worker Reaper — every 300s, removes offline workers after 2h grace
  • Delayed Job Promoter — every 5s, moves backed-off retries from delayed set into pending
  • Webhook Executor — every 5s, picks up pending webhook deliveries and POSTs them

Background services use SELECT … FOR UPDATE SKIP LOCKED semantics so they can run safely on every replica — every API pod runs every service, but only one acquires each row of work at a time.

Deployment (a manifest sketch follows the list):

  • Standard Deployment with HPA on CPU (autoscale 2 → 8 typically)
  • Two replicas minimum for HA
  • LivenessProbe: GET /health on :8080
  • ReadinessProbe: GET /ready (also checks DB + Redis connectivity)
  • Resources: 1 CPU / 1 GB RAM per pod is enough for moderate volume; scale up if Webhook Executor is the bottleneck
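
A sketch of how those bullets translate into manifest fields; the image tag and labels are placeholders rather than the chart's real values:

# Illustrative excerpt of the API Deployment (the Helm chart is the source of truth)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mpegflow-api
spec:
  replicas: 2                              # HA floor; an HPA on CPU scales 2 → 8
  selector:
    matchLabels:
      app.kubernetes.io/name: mpegflow-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: mpegflow-api
    spec:
      containers:
        - name: api
          image: mpegflow/api:latest       # placeholder tag
          livenessProbe:
            httpGet: { path: /health, port: 8080 }
          readinessProbe:
            httpGet: { path: /ready, port: 8080 }   # also verifies DB + Redis connectivity
          resources:
            requests: { cpu: "1", memory: 1Gi }     # enough for moderate volume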

#Shared worker pool (Helm-managed)

For most teams, the simplest pattern: one shared encoder pool that any tenant's jobs land in. Deployed via Helm chart, autoscaled by KEDA based on Redis queue depth.

# Conceptual KEDA ScaledObject for shared worker pool
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mpegflow-workers-shared
spec:
  scaleTargetRef:
    name: mpegflow-worker
  minReplicaCount: 0          # scale to zero when no jobs
  maxReplicaCount: 50          # cap at 50 workers
  pollingInterval: 10          # check every 10s
  cooldownPeriod: 300          # wait 5min before scaling down
  triggers:
    - type: redis
      metadata:
        address: redis.cluster.svc:6379
        listName: mpegflow:queue:default
        listLength: "5"        # +1 worker per 5 queued jobs

Properties:

  • Scales to zero when the queue is empty — pure savings during quiet periods
  • Scales linearly with queue depth — 50 queued jobs → ~10 workers
  • Cooldown of 5 minutes before scale-down — avoids thrashing on bursty workloads

#Dedicated worker pools (Operator-managed)

For Pro / Enterprise tiers and any tenant that needs isolation guarantees: a dedicated worker pool tied to an organization. Created at runtime by API call, not by editing Helm values.

This is where the Operator comes in. The MpegFlow Operator watches the worker_pools table in PostgreSQL and reconciles K8s Deployments to match:

sequenceDiagram
    participant API
    participant DB as PostgreSQL
    participant OP as Operator (leader)
    participant K8S as K8s API

    API->>DB: INSERT worker_pools<br/>(org_id, max_workers=10, status='active')
    Note over OP: Reconciliation loop<br/>(every 30s)
    OP->>DB: SELECT * FROM worker_pools
    OP->>K8S: GET deployment mpegflow-pool-{id}
    K8S-->>OP: Not found
    OP->>K8S: CREATE Deployment<br/>(replicas=10, pool_id=...)
    OP->>K8S: CREATE ScaledObject<br/>(KEDA, max=10)

    Note over API,K8S: Later — pool paused
    API->>DB: UPDATE worker_pools SET status='paused'
    OP->>DB: Re-read state
    OP->>K8S: PATCH ScaledObject<br/>(maxReplicas=0)
    Note over K8S: Existing workers drain<br/>and exit; no new pods

Key features of the Operator:

  • Leader election via K8s Lease object — only one Operator pod takes action at a time, prevents conflicts during rolling updates. 30-second lease.
  • Read-only DB access — the Operator never writes to PostgreSQL. It only reconciles K8s state to match what the API has already recorded.
  • Per-pool ScaledObject — each dedicated pool gets its own KEDA configuration, independent from shared workers.
  • Per-pool NetworkPolicy — workers in pool A cannot reach workers in pool B. Enforced at the K8s networking layer; a sketch follows this list.
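
A sketch of what that per-pool isolation can look like at the NetworkPolicy level. The mpegflow.io/pool-id label key is hypothetical; the Operator's real labels may differ:

# Illustrative per-pool isolation. Workers are pure clients (gRPC out to the API,
# presigned uploads out to S3), so denying all ingress to a pool's pods is enough
# to keep other pools' workers out.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mpegflow-pool-a-deny-ingress
spec:
  podSelector:
    matchLabels:
      mpegflow.io/pool-id: pool-a   # hypothetical label applied by the Operator
  policyTypes:
    - Ingress
  ingress: []                       # no allowed peers; cross-pool traffic is dropped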

#Pool pause — instant + cost-saving

A subtle but useful feature: pools can be paused at two levels.

stateDiagram-v2
    [*] --> active : Pool created

    active --> paused : POST /pools/{id}/pause
    paused --> active : POST /pools/{id}/resume
    active --> [*] : Pool deleted

    note right of active
        Coordinator: assigns jobs normally
        Operator: replicas = max_workers
        KEDA: maxReplicaCount = max_workers
    end note

    note right of paused
        Coordinator: refuses to assign
          (instant — no DB query needed)
        Operator: replicas = 0 (eventual)
        KEDA: maxReplicaCount = 0
        Jobs queue safely in Redis
    end note

The two-level pause matters because:

  1. Coordinator pause is instant. The moment you hit POST /pools/{id}/pause, the coordinator stops handing jobs to that pool's workers. New job submissions queue up in Redis but don't get assigned. Customer sees no impact (jobs queue, then process when resumed).

  2. Operator scale-to-zero is eventual. Within ~30-60 seconds the Operator reconciles, sets replicas=0, and existing workers drain and exit. Compute cost drops to zero.

For broadcast operators with predictable schedules ("we don't transcode overnight"), pool pause via cron saves ~30-50% on encode bills compared to keeping pools always-on.
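
A sketch of that cron-driven pause, assuming an API token stored in a Secret, a bearer-token auth scheme, and placeholder host and pool ID; the resume job is the mirror image (a morning schedule POSTing to /pools/{id}/resume):

# Hypothetical overnight pause for a dedicated pool
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pause-broadcast-pool-overnight
spec:
  schedule: "0 22 * * *"                   # 22:00 in the controller's timezone (typically UTC)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pause
              image: curlimages/curl:8.7.1
              env:
                - name: MPEGFLOW_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: mpegflow-api-token   # placeholder Secret with an Admin/Owner token
                      key: token
              command: ["/bin/sh", "-c"]
              args:
                - >
                  curl -fsS -X POST
                  -H "Authorization: Bearer $MPEGFLOW_TOKEN"
                  https://api.example.internal/pools/POOL_ID/pause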

#KEDA scaling strategies

The interesting part of K8s deployment for video infrastructure is the autoscaling. Three strategies, each with trade-offs:

#Strategy 1: Pure queue-depth scaling

The default. KEDA polls the Redis queue length every 10s and scales the worker Deployment in proportion to queue depth, targeting listLength queued jobs per worker.

Pros: Simplest. Works well for bursty workloads. Cons: Slight cold-start latency — first job after scale-down waits for a worker to spin up (~30-60s on EKS, less on GKE).

#Strategy 2: Pre-warmed minimum replicas

Set minReplicaCount: 2 instead of 0. Always keeps two workers running.

Pros: No cold-start penalty for low-volume workloads. Cons: ~$200-400/month per always-on worker (depending on instance type). Only worth it if cold-start matters.
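
If you go this route, the only fields that change relative to the shared-pool ScaledObject shown earlier are the replica bounds:

# Pre-warmed minimum: keep two workers running at all times
spec:
  minReplicaCount: 2
  maxReplicaCount: 50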

#Strategy 3: Time-based pre-scaling

Use cron-based ScaledObject triggers to pre-scale before known peak windows.

# Pre-scale to 10 workers at 8am ET weekdays for known morning encode burst
triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: "55 7 * * 1-5"      # 7:55am M-F
      end: "0 18 * * 1-5"        # 6:00pm M-F
      desiredReplicas: "10"
  - type: redis
    metadata:
      ...                          # base scaling for off-hours

Pros: Best of both worlds for predictable workloads. Cons: Operationally heavier — you have to know your peaks. Most teams skip this.

For broadcast operators with daily catch-up patterns, Strategy 3 is usually worth it. For pure-VOD operators with random submission patterns, Strategy 1 is fine.

#Permission model in production

The API server enforces RBAC across 5 roles:

                              Viewer  Editor  Admin  Owner  SuperAdmin
─────────────────────────────────────────────────────────────────────
Workflow Read                   ✅      ✅      ✅     ✅      ✅
Workflow Create/Update/Delete   ─       ✅      ✅     ✅      ✅
Job Read                        ✅      ✅      ✅     ✅      ✅
Job Create/Cancel               ─       ✅      ✅     ✅      ✅
Asset Read/Download             ✅      ✅      ✅     ✅      ✅
Asset Upload/Delete             ─       ✅      ✅     ✅      ✅
Webhook CRUD                    ─       ✅      ✅     ✅      ✅
Organization Read/Usage         ✅      ✅      ✅     ✅      ✅
Organization Update             ─       ─       ✅     ✅      ✅
Manage Members                  ─       ─       ✅     ✅      ✅
Manage Billing                  ─       ─       ─      ✅      ✅
Worker Read (fleet health)      ─       ─       ✅     ✅      ✅
Pool Manage (pause/resume)      ─       ─       ✅     ✅      ✅
Worker Manage (drain/evict)     ─       ─       ─      ─       ✅
Platform Admin                  ─       ─       ─      ─       ✅

Why individual workers are SuperAdmin-only: Worker drain/evict commands conflict with KEDA autoscaling. If an Owner could drain a worker mid-job, the KEDA scaler wouldn't know about it and might immediately try to provision a replacement, defeating the drain. Pool-level pause/resume is the user-controllable equivalent — and it works correctly with KEDA because it sets maxReplicaCount=0.

#Observability

The metrics surface from a deployed cluster:

Metric                                              What it tells you
──────────────────────────────────────────────────────────────────────────────────────
mpegflow_jobs_total{status}                         Job throughput by terminal state (completed / failed / cancelled)
mpegflow_jobs_in_flight{pool_id}                    Real-time per-pool active jobs
mpegflow_queue_depth{pool_id}                       Jobs waiting; KEDA's autoscaling input
mpegflow_worker_count{pool_id, status}              Fleet size — drives capacity planning
mpegflow_webhook_deliveries_total{status}           Outbound integration health
mpegflow_webhook_consecutive_failures{webhook_id}   Per-webhook circuit breaker state
mpegflow_event_bus_dispatch_duration_seconds        EventBus is on the hot path; watch p99
mpegflow_grpc_requests_total{method}                Worker→coordinator traffic volume

Standard Prometheus scrape on :9090 from each API pod. Grafana dashboards available as ConfigMap in the Helm chart.
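
If you run the Prometheus Operator, a ServiceMonitor along these lines picks the endpoint up automatically; the label selector and port name are assumptions about your chart values:

# Hypothetical ServiceMonitor for the API pods' :9090 metrics port
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mpegflow-api
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: mpegflow-api
  endpoints:
    - port: metrics        # the named :9090 port on the API Service
      interval: 30s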

#Companion concerns and platform responsibility

This architecture covers the single-cluster K8s deployment of MpegFlow. Adjacent concerns each have their own answer:

  • Cost optimization at scale → see the cost-aware spot-instance encoder pool architecture — extends this deployment with spot fleet diversification, interruption handling, and the on-demand baseline pattern for workloads above 1M output minutes per month.
  • Pattern walkthrough by volume → for the four-pattern climb (K8s Job per encode → worker Deployment + queue → KEDA queue-depth autoscaling → operator pattern), the FFmpeg in Kubernetes blog post walks the decision tree end-to-end.
  • Multi-region resilience → see the multi-region failover architecture — this single-cluster deployment is the foundation it builds on.
  • Multi-tenant security model → see strict-broker security — this K8s deployment enforces the network-isolation and pod-security guarantees that strict-broker depends on.
  • PostgreSQL HA → managed Postgres (RDS, Cloud SQL, Aiven) — running stateful HA on Kubernetes is a separate multi-week project that's not MpegFlow-specific. We recommend pairing with a managed offering.
  • GPU-accelerated encoding → works on standard cloud GPU node groups (NVIDIA T4 and A10 production-tested). Provider-specific node-group setup; talk to us during onboarding for the exact patterns.
  • Cluster federation for dedicated-cluster-per-customer Enterprise deployments → custom engagement; available on the Enterprise tier with named TAM.
  • Kubernetes hardening baseline → standard CIS Benchmark + Pod Security Standards apply. Your platform team owns this layer; MpegFlow runs cleanly on top of any compliant cluster.

#How to evaluate this architecture for your team

If you're an SRE or platform engineer evaluating:

  1. Verify you have or are willing to operate: managed Postgres, managed Redis, KEDA, an ingress controller, and S3-compatible storage. Five components — none MpegFlow-specific, all standard for K8s shops.
  2. Calculate your steady-state worker pool size from your average job volume. Set maxReplicaCount to ~2× peak.
  3. For dedicated tier customers, plan their pool with minReplicaCount according to their SLA — pure autoscale-from-zero is fine for free / starter; pre-warmed minimums are right for Enterprise.
  4. Wire your existing observability stack (Prometheus / Grafana / your APM) to the metrics endpoint. The metric set above is exhaustive enough for capacity planning, alerting, and SLA reporting.
  5. Run the strict-broker security checklist — the K8s deployment is what enforces most of those network-isolation and pod-security guarantees.

If you're early on the K8s journey and this feels like a lot, that's right — running production K8s + KEDA is real work. The payoff is the operational savings during quiet periods (scale to zero is genuinely free) and the elasticity for spike handling. For 100K+ minutes/month workloads, the math works.

If your team would benefit from a guided deployment of this shape, the design partner program is where we co-deploy with our first cohort.

Topics
  • reference architecture
  • Kubernetes
  • KEDA
  • Autoscaling
  • Operator
  • Helm
See also

Related architectures and reading

  • Architecture
    Cost-aware spot-instance pool
    Spot economics, interruption handling, the cost math
  • Engineering blog
    FFmpeg in Kubernetes: pod, queue, operator
    The four patterns and where each one breaks
  • Architecture
    DRM packaging pipeline
    Widevine, FairPlay, PlayReady via SPEKE — the protected-content path
Want to deploy this?

Apply to the design partner cohort.

We work directly with engineering teams deploying architectures like this one — free during beta, founder-direct, real influence on the roadmap.
