Live video is a different workload from VOD. The latency budget is fixed (typically 4-8 seconds glass-to-glass for low-latency, 15-30 seconds for standard); the encoder cannot hit pause; the contribution feed never reaches a clean stop. Every architectural decision for live trades latency against reliability, and the wrong tradeoff at any layer compounds at the player.
This document is the reference architecture for production-grade live video on MpegFlow. Live ships in 2026 Q3 — this is the design we're building toward, with the components, capacity sizing, and failure modes documented honestly.
Use cases in scope
You are running:
- Live broadcast or OTT events — sports, news, conferences, gaming, premium event programming
- Multi-bitrate ABR fanout — the same source encoded into 4-6 renditions for client adaptive playback
- End-to-end latency targets of 4-15 seconds depending on the use case (interactive ≤ 4s, broadcast ≤ 8s, on-demand-replay ≤ 15s)
- Contribution from production gear — broadcast cameras, OBS, encoder appliances (Elemental, Haivision, Osprey), or WebRTC contribution from clients
You also have or are willing to set up:
- A managed Kubernetes cluster with GPU node groups for live encoder pools (H.264 / HEVC live encoding without GPU is technically possible but not economical at scale)
- An origin / packager caching layer (we self-host nginx + Varnish or use managed Cloudflare R2 + Workers)
- A CDN with low-latency-streaming support (Cloudflare LL-HLS, Akamai Media Services Live, Fastly's streaming product)
Architecture overview
```mermaid
flowchart LR
subgraph "Contribution"
C1[Broadcaster<br/>SRT 1080p60]
C2[OBS streamer<br/>RTMP 720p30]
C3[WebRTC client<br/>VP8 480p30]
end
subgraph "Ingest layer"
ING[Ingest gateway<br/>SRT-listener<br/>RTMP-listener<br/>WHIP/WHEP]
end
subgraph "Live encoder pool (K8s + KEDA)"
LE1[Live encoder<br/>NVENC GPU<br/>1080p60 master]
LE2[Live transcoder<br/>NVENC GPU<br/>4-rendition ABR]
end
subgraph "Packaging"
PKG[LL-HLS packager<br/>CMAF + chunked transfer<br/>2-second segments<br/>500ms chunks]
end
subgraph "Origin"
ORG[Origin cache<br/>nginx + Varnish<br/>30-second window]
end
subgraph "Delivery"
CDN[CDN<br/>LL-HLS edge]
P[Player<br/>hls.js / Shaka]
end
C1 -->|SRT 1.4MB/s| ING
C2 -->|RTMP 800KB/s| ING
C3 -->|WHIP 300KB/s| ING
ING -->|raw frames| LE1
LE1 -->|H.264 master| LE2
LE2 -->|4 renditions| PKG
PKG -->|HLS manifest + segments| ORG
ORG -->|HTTP| CDN
CDN -->|LL-HLS| P
classDef live fill:#1a1a1c,stroke:#ff6b35,stroke-width:1.5px,color:#f5f5f5
classDef control fill:#0a0a0c,stroke:#71717a,stroke-width:1.2px,color:#a1a1aa
class C1,C2,C3,ING live
class LE1,LE2,PKG,ORG,CDN,P control
```
The shape: contribution flows in via three protocols (SRT for broadcast-grade, RTMP for legacy gear, WHIP/WebRTC for browser-based contribution); an ingest gateway normalizes them into raw frames; the live encoder pool produces a master plus ABR ladder; the LL-HLS packager emits CMAF chunked segments; the origin holds a rolling 30-second window; and the CDN edge serves players.
Latency math
End-to-end glass-to-glass latency is a sum, not a product. Each layer adds:
| Layer | Typical latency contribution | Optimization headroom |
|---|---|---|
| Camera + production switcher | 100-300ms | Hardware-dependent |
| Contribution encoder (camera → SRT/RTMP) | 200-500ms | tune zerolatency, bf=0, refs=1 |
| SRT/RTMP transit | 200-500ms (depends on geo) | Keep contribution geo-close to ingest |
| Ingest gateway processing | 50-150ms | Mostly fixed |
| Live encoder (raw → H.264 ABR) | 300-800ms | NVENC reduces vs CPU; preset matters |
| LL-HLS packager (CMAF chunks) | 500-1000ms (2 chunks of 500ms) | Smaller chunks = less latency, more overhead |
| Origin → CDN propagation | 200-400ms | Blocking playlist reloads + preload hints avoid per-chunk polling |
| CDN edge → player | 100-300ms | Geo + connection quality |
| Player buffering (jitter buffer) | 1-2 seconds | Aggressive buffer = lower latency, more rebuffer |
Summing the table: ~3-6 seconds for low-latency configs, ~6-10 seconds for safer configs. Below 3 seconds end-to-end requires WebRTC delivery; CMAF chunked HLS bottoms out around 3 seconds because of CDN propagation alone.
The architectural decision: pick a latency target up-front and budget every layer against it. The teams that don't fix the budget end up with 12-second latency by accident and spend a quarter trying to figure out why.
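The budget exercise is mechanical enough to script. A minimal sketch, assuming the per-layer ranges from the table above; the names and values are illustrative, not MpegFlow defaults:

```python
# Hypothetical glass-to-glass budget check. Layer ranges mirror the
# table above (milliseconds); adjust to your own measurements.
LAYERS_MS = {
    "camera_and_switcher":  (100, 300),
    "contribution_encoder": (200, 500),
    "contribution_transit": (200, 500),
    "ingest_gateway":       (50, 150),
    "live_encoder":         (300, 800),
    "ll_hls_packager":      (500, 1000),
    "origin_to_cdn":        (200, 400),
    "edge_to_player":       (100, 300),
    "player_buffer":        (1000, 2000),
}

def budget_report(target_ms: int) -> None:
    best = sum(lo for lo, _ in LAYERS_MS.values())
    worst = sum(hi for _, hi in LAYERS_MS.values())
    print(f"best {best/1000:.1f}s, worst {worst/1000:.1f}s, target {target_ms/1000:.1f}s")
    if target_ms < best:
        print("target is below this ladder's floor; consider WebRTC delivery")
    elif target_ms < worst:
        print("feasible only if every layer stays near its optimistic bound")
    else:
        print("target clears the worst case; the budget has slack")

budget_report(8000)  # the broadcast target (<= 8s) from the list above
```

Running this against the table's numbers lands at roughly 2.7-6.0 seconds, which is where the "pick a target up-front" discipline earns its keep: an 8-second target has slack, a 4-second target does not.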
Component walkthrough
Ingest gateway
The ingest gateway terminates contribution feeds and normalizes them into raw frames the encoder pool can consume.
SRT (Secure Reliable Transport) is the broadcast-grade contribution protocol. It runs over UDP with retransmission, encryption, and NAT traversal. Latency overhead is configurable (typically 120-300ms). For broadcast contribution, SRT is the standard.
RTMP (Real-Time Messaging Protocol) is the legacy protocol that most contribution gear (OBS, vMix, older encoder appliances) still defaults to. It runs over TCP, which means it's lossless but TCP retransmission stalls hurt latency. Plan for it; it's not going away.
WHIP/WHEP (WebRTC HTTP Ingestion/Egress Protocol) is the modern browser-contribution path. WebRTC's media stack handles encoding and transit; the ingest gateway accepts the WebRTC offer and bridges to the encoder pool. Sub-500ms contribution-to-ingest latency. The compromise: WebRTC clients renegotiate during connection events, so the ingest layer needs reconnect handling.
The gateway runs as a Deployment with HPA on incoming-stream count. Each pod handles roughly 50-100 concurrent ingests depending on protocol mix. Multi-protocol ingest pods are simpler operationally than per-protocol pods, but per-protocol pods scale more cleanly.
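As a concrete anchor for the SRT path, here is a minimal sketch of a gateway wrapper that terminates one SRT contribution with ffmpeg. It assumes an ffmpeg build with libsrt; the addresses and port are made up, and it remuxes (`-c copy`) rather than decoding to raw frames, to keep the sketch short:

```python
# Hypothetical gateway wrapper: terminate one SRT contribution and
# forward a normalized MPEG-TS to the encoder pool. The production
# gateway decodes to raw frames as described above; this remuxes only.
import subprocess

def run_srt_ingest(listen_port: int, encoder_addr: str) -> subprocess.Popen:
    # ffmpeg's libsrt "latency" option is in microseconds;
    # 120000 us = 120 ms, the low end of the range quoted above.
    src = f"srt://0.0.0.0:{listen_port}?mode=listener&latency=120000"
    cmd = [
        "ffmpeg", "-hide_banner",
        "-i", src,
        "-c", "copy",             # remux only; no transcode at the gateway
        "-f", "mpegts", encoder_addr,
    ]
    return subprocess.Popen(cmd)

proc = run_srt_ingest(9000, "tcp://live-encoder-pool:10000")  # made-up address
proc.wait()
```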
Live encoder pool
The encoder pool runs FFmpeg or vendored encoder binaries with NVENC for H.264/HEVC live encoding. CPU-only live encoding is feasible for low-rendition single-stream cases (think conference recording at 720p single-bitrate); for production multi-bitrate ABR fanout at 1080p60 or 4K, GPU encoding is the only economical path.
A live encoder pod consumes one ingest stream and emits a master H.264 (or HEVC) feed plus the ABR ladder renditions. NVENC on NVIDIA T4 handles 4-6 renditions at 1080p60 simultaneously per GPU; A10 handles the same workload at 4K.
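A hedged sketch of what that encoder invocation can look like, assuming a recent ffmpeg with the p1-p7 NVENC presets; the bitrates, GOP length, and output addressing are illustrative choices, not MpegFlow defaults:

```python
# Build an illustrative 4-rendition ABR ladder command on h264_nvenc:
# split the decoded video once, scale per rendition, encode each output.
import subprocess

RENDITIONS = [  # (name, height, video bitrate) -- example ladder
    ("1080p", 1080, "6000k"),
    ("720p",   720, "3000k"),
    ("480p",   480, "1500k"),
    ("360p",   360,  "800k"),
]

def abr_ladder_cmd(src: str, out_tpl: str) -> list[str]:
    n = len(RENDITIONS)
    split = f"[0:v]split={n}" + "".join(f"[v{i}]" for i in range(n))
    scales = ";".join(
        f"[v{i}]scale=-2:{h}[v{i}o]" for i, (_, h, _) in enumerate(RENDITIONS)
    )
    cmd = ["ffmpeg", "-hide_banner", "-i", src,
           "-filter_complex", split + ";" + scales]
    for i, (name, _, vbr) in enumerate(RENDITIONS):
        cmd += [
            "-map", f"[v{i}o]", "-map", "0:a",
            "-c:v", "h264_nvenc", "-preset", "p4", "-tune", "ll",
            "-b:v", vbr, "-g", "120",   # 2 s GOP at 60 fps
            "-c:a", "aac", "-b:a", "128k",
            "-f", "mpegts", out_tpl.format(name=name),
        ]
    return cmd

# e.g. push each rendition to a made-up packager endpoint:
# subprocess.run(abr_ladder_cmd("tcp://127.0.0.1:10000",
#                               "srt://packager:7000?streamid={name}"))
```

The `-g 120` keyframe interval pins a 2-second GOP at 60 fps, so every rendition's segments cut on the same boundaries, which the packager depends on for rendition switching.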
The K8s primitive: Deployment of encoder pods with KEDA scaling on ingest-pool queue depth. Each pod is sized for one stream (1 ingest + N renditions). Scale to zero is feasible during off-hours; pre-warming pods 5 minutes before scheduled events is standard practice.
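The pre-warm rule reduces to a few lines. A sketch, assuming a scheduled-events feed exists and that the result is fed to KEDA as an external metric; the 5-minute window mirrors the practice above:

```python
# Desired encoder replicas = live ingests now + streams scheduled to
# start within the warm-up window. The event source is an assumption.
from datetime import datetime, timedelta, timezone

WARMUP = timedelta(minutes=5)

def desired_replicas(active_ingests: int,
                     scheduled_starts: list[datetime]) -> int:
    now = datetime.now(timezone.utc)
    upcoming = sum(1 for t in scheduled_starts if now <= t <= now + WARMUP)
    return active_ingests + upcoming  # expose to KEDA as an external metric
```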
The full K8s + KEDA topology covers the operator-pattern coordination that makes this work.
LL-HLS packager
The packager consumes the encoder's ABR ladder and produces CMAF chunked segments suitable for LL-HLS delivery. Two configuration choices dominate latency:
Segment length (typical: 2-4 seconds). Shorter segments reduce latency (player can start playback sooner) but increase manifest churn. 2 seconds is the practical low end for LL-HLS; below that, manifest update rates become the bottleneck.
Chunk length (typical: 500ms-1000ms). LL-HLS uses HTTP chunked transfer encoding to send segment chunks before the segment is complete. Shorter chunks = lower latency but more HTTP overhead. 500ms is a reasonable default for low-latency configs.
The packager publishes chunks to the origin as they complete, advertising them with EXT-X-PART entries in the manifest and an EXT-X-PRELOAD-HINT tag that signals the upcoming chunk. The origin must support blocking playlist reloads (EXT-X-SERVER-CONTROL with CAN-BLOCK-RELOAD=YES) for the latency math to work; the LL-HLS spec originally required HTTP/2 server push but dropped it in favor of blocking reloads, and without them players poll on every chunk and add 200-400ms of polling latency.
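For orientation, a toy emitter for the two tags just mentioned. Real packagers also emit EXT-X-SERVER-CONTROL, byte ranges, and rendition reports, and the URIs here are invented:

```python
# Emit EXT-X-PART tags for published chunks plus one EXT-X-PRELOAD-HINT
# for the next chunk, as a manifest fragment. Toy example only.
def part_tags(seg_index: int, parts_done: int, part_dur: float = 0.5) -> str:
    lines = [
        f'#EXT-X-PART:DURATION={part_dur},URI="seg{seg_index}.part{i}.m4s"'
        for i in range(parts_done)
    ]
    lines.append(
        f'#EXT-X-PRELOAD-HINT:TYPE=PART,URI="seg{seg_index}.part{parts_done}.m4s"'
    )
    return "\n".join(lines)

print(part_tags(seg_index=42, parts_done=3))  # 3 published chunks, 1 hinted
```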
Origin cache
The origin maintains a rolling window of segments (typically 30 seconds) for player join-ahead and rewind. We run nginx with the cache and slice modules tuned for short TTLs, with Varnish in front for object-level caching.
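The rolling window itself is a small amount of logic. A minimal sketch of segment retention, with illustrative types:

```python
# Keep only segments whose end time falls inside the last 30 seconds,
# matching the rolling window described above.
from collections import deque

WINDOW_S = 30.0

class SegmentWindow:
    def __init__(self) -> None:
        self._segs = deque()  # entries: (uri, end_time_seconds)

    def add(self, uri: str, end_time: float) -> None:
        self._segs.append((uri, end_time))
        cutoff = end_time - WINDOW_S
        while self._segs and self._segs[0][1] < cutoff:
            self._segs.popleft()  # evict: fell out of the rewind window

    def manifest_uris(self) -> list[str]:
        return [uri for uri, _ in self._segs]
```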
The origin is also where contractual delivery hooks live: SCTE-35 marker injection for ad insertion, manifest manipulation for blackout policies, and DAI (dynamic ad insertion) handoff. These layer onto the origin via a manifest manipulator stage; we keep them out of the critical path so a manipulator failure doesn't kill the live stream.
CDN handoff
LL-HLS at the CDN edge requires part-aware caching and support for blocking playlist requests. Cloudflare's Stream product handles this well; Akamai Media Services Live and Fastly's streaming products also support LL-HLS. The configuration that matters: chunk-level cache keys (so each LL-HLS part is cached independently) and short TTLs (1-2 seconds for chunks, 30 seconds for completed segments).
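That TTL split fits on one screen. A sketch mirroring the values above; the URL shapes and the zero-TTL manifest rule are assumptions:

```python
# Pick a cache TTL by object class: manifests revalidate every time,
# LL-HLS parts live briefly, completed segments live for the window.
def cache_ttl_seconds(path: str) -> int:
    if path.endswith(".m3u8"):
        return 0            # manifests must always revalidate (assumption)
    if ".part" in path:
        return 2            # LL-HLS parts: cacheable but short-lived
    return 30               # completed CMAF segments
```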
Geographic distribution of edge nodes determines the floor on player-side latency. For sub-5-second end-to-end latency you need a CDN edge within ~200km of the player. This is why CDN choice matters for live more than for VOD — VOD tolerates a 500ms cold-cache fetch; live cannot.
Capacity sizing
For a single 1080p60 live stream with 4 ABR renditions (1080p, 720p, 480p, 360p):
- 1 GPU (T4 or A10) for the live encoder pool
- 1 packager pod (CPU only, ~2 vCPU)
- ~50 Mbps origin egress per stream (sum of all renditions; each viewer pulls only the rendition they're watching)
For a 4K HDR live stream with 5 ABR renditions:
- 1 A10 GPU for the live encoder
- 1 packager pod (CPU only, ~4 vCPU)
- ~100 Mbps origin egress per stream (sum of all renditions)
The dominant cost at audience scale is CDN egress, not encode. A 100K-viewer live event at 5 Mbps average per viewer = 500 Gbps peak CDN throughput. CDN pricing for live egress is typically $0.04-0.10/GB at that scale; the encode pool cost is rounding error against the CDN bill.
For multi-event scaling (e.g., 100 simultaneous live streams), the encode pool dominates: 100 GPUs at $0.50-1.50/hour = $50-150/hour. KEDA scaling makes this efficient for events that don't run 24/7.
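Both cost poles are worth scripting before an event. A sketch using the rates quoted above (estimates, not quotes):

```python
# Compare the two dominant costs: CDN egress for one large audience
# vs GPU-hours for many simultaneous streams (one GPU per stream).
def egress_cost_per_hour(viewers: int, avg_mbps: float,
                         usd_per_gb: float) -> float:
    gbit_per_hour = viewers * avg_mbps * 3600 / 1000   # Mb/s -> Gb/h
    gbyte_per_hour = gbit_per_hour / 8                 # Gb -> GB
    return gbyte_per_hour * usd_per_gb

def encode_cost_per_hour(streams: int, usd_per_gpu_hour: float) -> float:
    return streams * usd_per_gpu_hour

# 100K viewers at 5 Mbps, $0.05/GB: ~$11,250/hour of CDN egress.
print(f"egress  ${egress_cost_per_hour(100_000, 5, 0.05):,.0f}/h")
# 100 simultaneous streams at $1/GPU-hour: $100/hour of encode.
print(f"encode  ${encode_cost_per_hour(100, 1.0):,.0f}/h")
```

At audience scale the two differ by two orders of magnitude, which is the "encode is rounding error" claim in numbers.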
Failure modes and what they cost you
| Failure | Frequency | Customer-visible impact |
|---|---|---|
| Contribution loss (SRT/RTMP disconnect) | A few times per event on any given stream | 1-3 seconds of player rebuffer; auto-recovery |
| Live encoder restart (mid-event) | Rare, hours to days | 2-5 seconds of black frames; player resumes from latest segment |
| Packager pacing failure | During traffic spikes | Latency drift up by 1-3 seconds; recovers within 30 seconds |
| Origin cache invalidation lag | During config changes | Stale segments served briefly (1-2 segments); recovers automatically |
| CDN regional event | A few times per year per CDN | Multi-CDN failover takes 30-60 seconds; players may need to refresh |
| GPU node group unhealthy | Rare | Encoder pool re-schedules; new pod takes 30-60 seconds to ready; affected stream loses ~1 minute of content |
The architecture is designed to degrade gracefully: contribution loss → encoder holds the last frame → packager pads with the held frame → players see a brief still image, not a hard error. No layer is designed to drop and fail; every layer holds the line until upstream recovers.
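The hold-the-line behavior at the encoder/packager boundary reduces to a pacing loop. A conceptual sketch, with an invented emit callback and queue-based frame delivery:

```python
# Emit frames at the expected cadence; if the upstream stalls, re-emit
# the held frame so downstream sees a still image, not a gap.
import queue

FRAME_INTERVAL_S = 1 / 60  # 60 fps cadence

def pace_frames(frames: queue.Queue, emit) -> None:
    last = None
    while True:
        try:
            frame = frames.get(timeout=FRAME_INTERVAL_S)
            if frame is None:       # sentinel: contribution ended cleanly
                return
            last = frame
        except queue.Empty:
            pass                    # upstream stalled: fall through and pad
        if last is not None:
            emit(last)              # real frame, or the held still image
```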
Multi-region considerations
For events with global audiences or contractual region-specific requirements, the architecture extends per-region:
- Ingest gateway in the contribution-source region (low latency from broadcaster to ingest)
- Encoder pool replicated per delivery region (avoids transcontinental egress from a single encoder)
- Packager + origin per region (minimizes manifest staleness)
- CDN handles cross-region distribution natively
The trade-off: per-region encoder pools multiply GPU costs. A typical compromise: one primary encoder pool in the contribution region, with regional packager + origin caches that pull from the primary. This keeps the encoder cost single-region while still delivering low latency to global audiences.
The multi-region failover architecture covers the failover semantics for the VOD case; the live equivalent has the same shape with tighter timing constraints.
Companion architectures
- Kubernetes + KEDA deployment — the cluster topology this builds on (live encoder pools are the same shape as VOD encoder pools, just always-on)
- Strict-broker security — multi-tenant security model that applies to live ingest + encode pools the same way it applies to VOD
- Multi-region failover — failover semantics for the live case
- DRM packaging pipeline — how live + DRM combine when content protection is required
Scope and adjacent concerns
This pattern covers live ingest, encoding, packaging, and delivery for unprotected content. Adjacent concerns:
- DRM-protected live — pair with the DRM packaging architecture. The live packager additionally handles SPEKE key rotation per segment.
- Server-side ad insertion (SSAI) — manifest manipulation on the origin, driven by per-event SCTE-35 markers from the contribution feed. Out of scope here; pair with established providers (Yospace, Brightcove SSAI).
- Captions (live) — typically delivered as a separate WebVTT/CEA-608 stream from the contribution feed, transmuxed by the packager into the manifest. Out of scope here; standard pattern in our broadcast partners' deployments.
- Recording for VOD replay — write the master encoder output to durable storage in parallel with packaging; a minimal sketch follows this list. The recording becomes the mezzanine for VOD pipelines after the live event ends.
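For the recording bullet, one way to do the parallel archive write: append a second, copy-only output to the master encoder's ffmpeg command so the archive never competes with the live transcode. The paths and 6-second chunking are assumptions, and abr_ladder_cmd refers to the illustrative encoder sketch earlier in this document:

```python
# Copy-only archive output, appended to the encoder invocation. Uses
# ffmpeg's segment muxer with strftime-based filenames.
ARCHIVE_OUTPUT = [
    "-map", "0", "-c", "copy",
    "-f", "segment", "-segment_time", "6",   # 6 s archive chunks
    "-strftime", "1", "/mnt/archive/live-%Y%m%dT%H%M%S.ts",  # made-up path
]
# e.g.: cmd = abr_ladder_cmd(src, out_tpl) + ARCHIVE_OUTPUT
```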
Honest scope: where we are vs where we're going
Live encoding ships in MpegFlow's 2026 Q3 release. The architecture above is what we're building toward; today's beta is VOD-only. For teams running live infrastructure today, the practical advice is to pair with established live products (Wowza, AWS MediaLive, Cloudflare Stream Live) for the live path and migrate to MpegFlow's live when it ships — assuming the architecture above matches what your team needs.
The honest reason live ships later than VOD: VOD's failure modes are bounded (a job either succeeds or retries), live's are continuous (every second of the stream is a new opportunity for failure). Building VOD first lets us prove the operational layer (queues, retries, audit, multi-tenant security) before adding the latency budget that live demands.
If your team is evaluating live infrastructure now, the orchestration platform evaluation framework applies just as much to live vendors as to VOD vendors. Most of the seven questions get harder, not easier, in live.