A regional cloud outage at 3am Pacific takes down your encoder pool. Eleven minutes later your CDN starts serving stale manifests from edge caches. By minute thirty, customers in three time zones are seeing playback errors. You're paged, your incident channel is on fire, your runbook says "wait it out" because you never wired regional failover for the pipeline.
This document is for the team that doesn't want to be in that incident. It describes the shape of a video pipeline that survives a regional outage — for broadcasters, OTT platforms, or any operator whose business model includes "video keeps playing during incidents."
It's not the cheapest deployment shape. Multi-region adds roughly 50–80% cost overhead over single-region (see the worked capacity example below). Whether that overhead is the right trade-off depends on what an outage costs you (revenue impact, contractual SLA penalties, brand damage). For Tier-1 broadcasters, it's almost always worth it.
Use case in scope
You are a broadcaster, OTT platform, or major media operator with at least one of these properties:
- Contractual SLA on uptime — your enterprise customers have written guarantees of 99.9% or better, with financial penalties for breach
- Global delivery — viewers in multiple regions, where regional latency materially affects watch experience
- Regulatory data sovereignty — content for EU subjects must process in EU regions, etc.
- Brand-critical reliability — outages make Bloomberg / Variety / The Verge headlines, even if technically your customers' SLAs aren't breached
If none of these apply, this architecture is over-engineered for your situation. A single-region deployment with good observability is the right starting place.
Two patterns: active-active vs active-passive
The decision affects everything downstream. Pick deliberately.
Active-active
Both regions encode and emit simultaneously. CDN routes viewers to the nearer region. Outage = drop one region; the other absorbs the full load.
Pros: lowest viewer latency. Smallest failover blast radius (no warm-up). Capacity is tested daily because both regions are live.
Cons: double the encode cost. Twice the operational complexity. Requires output deduplication so the same job doesn't get encoded twice (unless you want that — sometimes you do, see "deliberate dual-encode" below).
Active-passive (warm standby)
One region is primary; one is warm-standby with replicated state but no encode load. Outage = failover triggers; standby starts encoding within minutes.
Pros: single-region encode cost (passive region's compute is minimal). Simpler operational model.
Cons: failover is a real event, not a routine operation. Depending on warm-up time, some window of in-flight work is lost. Untested failover paths fail.
Most teams should run active-passive as the baseline and active-active for hot critical paths (live streams, premium content). Mix and match.
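One way to express that mix is a per-content-class policy map that the submit path consults. A minimal sketch in Python; the class names, region names, and fields are illustrative, not MpegFlow API:

```python
# Hypothetical per-content-class failover policy. Class names, regions,
# and fields are illustrative -- adapt to however your submit path tags jobs.
FAILOVER_POLICY = {
    "live":        {"mode": "active-active",  "regions": ["us-east-1", "eu-west-1"]},
    "premium-vod": {"mode": "active-active",  "regions": ["us-east-1", "eu-west-1"]},
    "vod-library": {"mode": "active-passive", "primary": "us-east-1", "standby": "eu-west-1"},
}

def policy_for(content_class: str) -> dict:
    """Look up the failover posture for a job's content class,
    defaulting to the cheapest (active-passive) posture."""
    return FAILOVER_POLICY.get(content_class, FAILOVER_POLICY["vod-library"])
```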
High-level deployment topology — active-active
```mermaid
graph TB
GLOBAL["Global coordination layer<br/>(regional health · traffic shaping · drain)"]
subgraph A["Region A (e.g. us-east-1)"]
APIA["MpegFlow control plane"]
EA["Encode pool"]
SA[("Object store")]
DBA[("Postgres + Redis")]
end
subgraph B["Region B (e.g. eu-west-1)"]
APIB["MpegFlow control plane"]
EB["Encode pool"]
SB[("Object store")]
DBB[("Postgres + Redis")]
end
CDN["Multi-CDN routing<br/>(Cloudflare · Fastly · Akamai)"]
VIEWERS["Viewers"]
GLOBAL -->|"health probe"| APIA
GLOBAL -->|"health probe"| APIB
GLOBAL -->|"weights"| CDN
APIA --> EA --> SA
APIA --> DBA
APIB --> EB --> SB
APIB --> DBB
SA <-.async replicate.-> SB
DBA <-.async replicate.-> DBB
SA --> CDN
SB --> CDN
CDN --> VIEWERS
```
Component-by-component
Global coordination layer
The piece most teams skip and regret. Even in active-active, you need something that knows the global state — which regions are healthy, what traffic share each is taking, when to drain a region.
What it does:
- Polls each regional control plane health endpoint
- Aggregates regional saturation metrics (queue depth, encoder pool utilization)
- Publishes routing weights to the multi-CDN layer
- Triggers drain procedures during planned outages
What it doesn't do:
- It doesn't move jobs between regions in real time. That's a different (much harder) problem.
- It doesn't replace per-region monitoring. Each region needs full local observability.
Implementation: small, stateless, deployed in a third region (so it can outlive a primary outage). Could be your existing service-mesh control plane, an enterprise traffic manager (NS1, Akamai GTM), or a small internal service.
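A minimal sketch of that loop, assuming each regional control plane exposes a health endpoint and your multi-CDN layer accepts weight updates over HTTP — both endpoint shapes here are hypothetical:

```python
import time
import requests  # pip install requests

# Hypothetical endpoints -- substitute your regional control planes
# and your multi-CDN provider's weight-update API.
REGIONS = {
    "us-east-1": "https://mpegflow.us-east-1.example.com/healthz",
    "eu-west-1": "https://mpegflow.eu-west-1.example.com/healthz",
}
CDN_WEIGHTS_URL = "https://cdn-routing.example.com/v1/weights"

def probe(url: str) -> bool:
    """A region counts as healthy only if the probe returns 200 quickly."""
    try:
        return requests.get(url, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def coordination_loop(interval_s: int = 10) -> None:
    while True:
        healthy = [r for r, url in REGIONS.items() if probe(url)]
        if healthy:
            # Equal split across healthy regions; a failed region drops to 0.
            weights = {r: (100 // len(healthy) if r in healthy else 0) for r in REGIONS}
            requests.put(CDN_WEIGHTS_URL, json=weights, timeout=5)
        # If *no* region probes healthy, publish nothing: keeping the last
        # known-good routing beats pushing all-zero weights.
        time.sleep(interval_s)
```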
Per-region MpegFlow control planes
Two completely independent MpegFlow installations in two regions. Each is a full deployment of the broadcast-grade VOD transcoding shape — control plane, probe pool, encode pool, package pool, audit DB.
Critical: regions don't share state during normal operation. Each region has its own Postgres, its own queue, its own audit DB. The cross-region replication is async and eventually consistent. If you try to share Postgres across regions you'll spend the next year debugging quorum-related encoder failures.
Job routing — submit-time vs encode-time
When a customer submits a job, where does it go?
Submit-time routing (recommended):
- Customer's API call hits a regional endpoint
- Routing decision made by the global coordination layer based on weights
- Job lives entirely in that region
- If that region fails mid-job, the job fails (and gets re-submitted, optionally to a different region)
Encode-time routing (more complex):
- All jobs hit a global queue
- Workers in any region pull from the queue
- Each job tagged with its execution region for audit
- Cross-region coordination needed to prevent duplicate execution
For most teams, submit-time routing wins on simplicity. Encode-time routing is only worth it if you have very unbalanced regional capacity (e.g., GPU pool only in one region) and need cross-region scheduling.
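A sketch of submit-time routing with one-shot re-submission, assuming the submit path can read the coordination layer's current weights (the function names here are hypothetical):

```python
import random

def pick_region(weights: dict[str, int]) -> str:
    """Weighted random choice over regions with nonzero weight."""
    live = {r: w for r, w in weights.items() if w > 0}
    return random.choices(list(live), weights=list(live.values()), k=1)[0]

def submit_with_failover(job: dict, weights: dict[str, int], submit_fn) -> str:
    """Submit to the weighted-chosen region. On regional failure,
    re-submit once to a different region -- the job never spans regions;
    a failed region simply means a fresh submission elsewhere."""
    first = pick_region(weights)
    try:
        return submit_fn(region=first, job=job)
    except ConnectionError:
        fallback = pick_region({r: w for r, w in weights.items() if r != first})
        return submit_fn(region=fallback, job=job)
```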
Object storage replication
Multi-region storage is non-negotiable for failover. Three patterns, in order of how teams adopt them:
1. Same-region storage with cross-region replication for outputs:
- Inputs land in the customer's chosen region
- Outputs replicated async to the secondary region
- Failover scenario: secondary region serves stale outputs from the last successful replication
- RPO (recovery point objective): typically 5–60 seconds
- Cost: ~50% more than single-region storage
- Most cloud vendors support this natively (S3 Cross-Region Replication, GCS Multi-Regional, etc.)
2. Multi-region storage as primary:
- All writes go to a multi-region bucket (S3 Multi-Region Access Points, Cloudflare R2)
- More expensive but RPO is essentially zero
- Right answer for premium content with strict SLA on availability
3. Per-region object stores with explicit dual-write:
- Application writes outputs to both regions on every job completion
- Most operationally heavy, but gives the strongest consistency (see the dual-write sketch after this list)
- Right for regulatory or contractual reasons where you need auditable per-region state
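For pattern 3, the dual-write is explicit application code. A sketch with boto3, using placeholder bucket names; a production version needs retries and a reconciliation job for half-writes:

```python
import boto3  # pip install boto3

# One client per region; bucket names are placeholders.
OUTPUT_BUCKETS = {
    "us-east-1": "mpegflow-outputs-use1",
    "eu-west-1": "mpegflow-outputs-euw1",
}
clients = {r: boto3.client("s3", region_name=r) for r in OUTPUT_BUCKETS}

def dual_write_output(key: str, body: bytes) -> None:
    """Write the rendition to both regions on job completion.
    If either write fails, let the exception surface -- a silent
    half-write is exactly the inconsistency this pattern exists to avoid."""
    for region, bucket in OUTPUT_BUCKETS.items():
        clients[region].put_object(Bucket=bucket, Key=key, Body=body)
```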
Multi-CDN routing layer
A single CDN's regional outage cascades to your viewers. Multi-CDN provides:
- Health-based routing — automatic failover when a CDN's edge in your viewer's region is having issues
- Load distribution — split between CDNs based on cost / performance
- Geographic targeting — route China traffic via a domestic CDN, Europe via Cloudflare, US via Fastly, etc.
Common stacks:
- NS1 / DNSimple smart DNS + multiple CDN origins
- Cedexis / Citrix ADC (now NetScaler) intelligent routing
- Fastly + Cloudflare active-active with manual failover
The cost of multi-CDN is real (typically ~30% more than single-CDN at similar quality), but for Tier-1 operators the alternative is "single CDN regional outage = your service is down."
Manifest delivery and TTLs
The CDN cache TTL on your HLS / DASH manifests is one of the most impactful settings during a failover. Three patterns:
| TTL setting | Failover behavior | Cost |
|---|---|---|
| Long TTL (60+ seconds) | Stale manifests served during failover; viewers see playback errors | Lowest CDN bill |
| Short TTL (5–10 seconds) | Faster failover propagation; viewers see brief buffering | ~2–5x more origin requests |
| TTL=0 with origin shielding | Near-instant failover; high origin load | Most expensive but most resilient |
For premium / live content, short TTL with origin shielding is standard. For VOD where buffering is acceptable, longer TTL is fine.
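In practice this means setting different Cache-Control headers on manifests vs segments at upload time. A sketch with boto3 following the table's short-TTL pattern (bucket and TTL values are illustrative):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

def upload_hls_artifact(bucket: str, key: str, body: bytes) -> None:
    """Short TTL on manifests so failover propagates fast; long TTL on
    segments, which are immutable once written."""
    if key.endswith((".m3u8", ".mpd")):
        cache = "max-age=5"                      # short TTL per the table
    else:
        cache = "max-age=31536000, immutable"    # segments never change
    s3.put_object(Bucket=bucket, Key=key, Body=body, CacheControl=cache)
```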
Failover scenarios — what actually breaks
The four scenarios you'll experience over a multi-year operation, in rough frequency order:
Single-AZ failure within a region
Most common. A single availability zone goes down — typically network or power. Within a region, your MpegFlow deployment should already be multi-AZ — encode pool spread across AZs, control plane redundant. This isn't really "regional failover," just standard intra-region HA.
Detection: your regional health monitoring catches it. Action: none from the global layer; the region absorbs the loss internally. Customer impact: typically <5 minutes of degraded throughput.
Full regional outage
Rare but devastating. AWS/GCP/Azure regions do go down (full outages roughly once or twice per year per major provider). Your regional MpegFlow stops responding entirely.
Detection: global coordination layer's health probes start failing. Action: weight that region to 0 in the multi-CDN routing layer. Active-active partners absorb the traffic; active-passive standbys begin warming up. Customer impact: depends on the warm-up time of the partner region. For active-active, ~zero. For active-passive, 5–30 minutes.
Encoder pool starvation in one region
Half-failure: the control plane is up, but the encode pool can't keep up (a quota issue, spot capacity exhaustion, a GPU pool failure). The region appears healthy at the API level, but jobs queue without making progress.
Detection: queue-depth metrics. Per-region time-in-queue threshold breached. Action: drain the region for new submits; let existing jobs in flight finish; move new traffic elsewhere. Customer impact: for new submits, none (they get the partner region). For in-flight jobs, possibly delayed completion.
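A sketch of that detection logic, assuming your metrics backend exposes per-region p95 time-in-queue and your coordination layer exposes a drain hook (both are hypothetical):

```python
# Hypothetical metric reader and drain hook -- wire these to your
# actual metrics backend and coordination layer.
QUEUE_AGE_THRESHOLD_S = 600  # p95 time-in-queue before we call the region starved

def check_starvation(region: str, get_p95_queue_age_s, drain_region) -> None:
    """Control plane up + queue aging = starved, not healthy.
    Drain new submits; let in-flight jobs finish."""
    if get_p95_queue_age_s(region) > QUEUE_AGE_THRESHOLD_S:
        drain_region(region)  # e.g., set the region's weight to 0 for new submits
```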
Storage replication lag
The output bucket replication is async; during an outage the secondary region might be 30 seconds behind. Customers reading from the secondary see slightly stale outputs.
Detection: replication lag metrics on the storage layer. Action: if lag exceeds tolerance, fail loudly to the customer's API ("this region's outputs may be stale; consider waiting") rather than serve possibly-incorrect data. Customer impact: depends on how forgiving their workflow is.
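A sketch of the fail-loudly check, assuming a replication-lag metric is available (names are hypothetical):

```python
STALENESS_TOLERANCE_S = 60  # how far behind the secondary may lag

def output_response(key: str, region: str, get_replication_lag_s, fetch) -> dict:
    """Serve the output but warn explicitly when it may be stale,
    rather than silently returning possibly-outdated renditions."""
    resp = {"output": fetch(region, key)}
    lag = get_replication_lag_s(region)
    if lag > STALENESS_TOLERANCE_S:
        resp["warning"] = (
            f"region {region} replication is {lag:.0f}s behind; "
            "outputs may be stale -- consider retrying shortly"
        )
    return resp
```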
Capacity sizing — worked example
For a mid-size global broadcaster running 500K transcoded minutes/month across two regions in active-active:
| Component | Per region | Cross-region | Annual cost (rough) |
|---|---|---|---|
| Control plane × 2 + LB | 2 instances | 2 regions | $10K |
| Postgres managed × 2 | medium tier | 2 regions | $10K |
| Encode pool | 35–40 workers | 2 regions = 70–80 total | $200K – 280K |
| Package pool | 4–5 workers | 2 regions | $15K – 20K |
| Object storage | per region + cross-region replication | replication overhead | $50K – 80K |
| Multi-CDN | percentage of egress | n/a | varies wildly by traffic shape |
| Global coordination layer | small | 1 region (third) | $3K |
| Network egress (cross-region) | replication traffic | n/a | $20K – 50K |
| Total annual | | | ~$310K – 460K |
Compare to the single-region equivalent: roughly $180K – 250K annually. Multi-region overhead is ~50–80%. Whether it's worth it depends on what an outage costs you in revenue + SLA penalties + brand impact.
Compliance — when multi-region is forced by sovereignty
Several regulatory contexts effectively require per-region deployments rather than just enable failover:
- GDPR for EU viewers — content / metadata processing must occur in EU regions for EU subjects (Art. 44 onwards on data transfers). Cross-region replication of user metadata is restricted.
- China data residency — content for Chinese viewers can't legally be processed in US-based regions
- Russia data localization — limited applicability today but historically a requirement
- Sector-specific — broadcast contracts with regional licensing often forbid content from leaving certain geographies
In these cases the architecture isn't "multi-region for failover" — it's "regional islands with shared metadata." Failover might not even be possible across boundaries; you're running fundamentally separate stacks per legal jurisdiction.
Operational runbook
The five things that hurt during a real failover:
Failover untested in production
The most common failure mode. You have a documented failover procedure that you've never actually triggered, and when you need it at 3am the runbook is full of stale assumptions. Quarterly chaos drills where you intentionally drain a region during business hours are the only fix.
DNS propagation slowness
Even with low TTLs, smart-DNS failover takes time to propagate across the global DNS infrastructure. Some viewers may continue hitting the failed region for 30–120 seconds. Anycast-based routing (Cloudflare, AWS Global Accelerator) converges faster than DNS-based failover.
Postgres replication lag during failover
If your audit DB is async-replicated cross-region, audit records from the moments before failover might be lost. For most use cases this is acceptable; for regulatory audit trails it's not. Run synchronous cross-region replication for audit tables only, async for everything else.
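Postgres supports this per transaction rather than per table: with a synchronous standby configured via synchronous_standby_names, the transaction that writes audit rows can opt into synchronous commit while everything else keeps the async default. A sketch with psycopg2 (the table name is illustrative):

```python
import psycopg2  # pip install psycopg2-binary

def write_audit_row(dsn: str, job_id: str, event: str) -> None:
    """Audit writes wait for the cross-region standby to confirm;
    all other transactions keep the cluster default (async)."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # remote_apply: commit returns only once the standby has
            # applied the WAL, so the audit row survives a primary loss.
            cur.execute("SET LOCAL synchronous_commit TO 'remote_apply'")
            cur.execute(
                "INSERT INTO audit_log (job_id, event) VALUES (%s, %s)",
                (job_id, event),
            )
```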
Encoder pool warmup time
Even active-passive standbys take time to fully warm up. Spot instances need to provision; container images need to pull; configuration needs to validate. 5–15 minutes is typical from "trigger failover" to "target region at full capacity."
Cost surge during a failover
When traffic shifts from a failed region to a partner, that partner's spot capacity may be insufficient and on-demand fills the gap. Your bill spikes 2–3× during the incident. Make sure your AWS / GCP burst quotas are pre-approved before you need them.
Scope and adjacent concerns
This document covers regional failover within a single cloud provider — the most common and highest-leverage case. Adjacent concerns each have their own answer:
- Provider-wide degradation (multiple regions of one cloud vendor) → multi-cloud deployment, a separate scaling tier. Available for enterprise customers via custom engagement; not part of the standard reference architecture.
- Application-layer bugs → mitigated by canary deploys and per-region progressive rollout. Standard practice on the MpegFlow Operator (rolling updates with maxUnavailable: 1); covered in the Kubernetes deployment architecture.
- DDoS / coordinated attacks → CDN-level protection (Cloudflare, Akamai Bot Manager) is the right layer. MpegFlow's origin shielding helps, but the perimeter is your CDN's responsibility.
- Live streaming failover → different topology (synchronous, latency-anchored to ingest geography). Companion architecture ships alongside MpegFlow's live primitives in 2026 Q3.
- Single-tenant isolation → see strict-broker multi-tenant security for the per-tenant credential model that runs underneath this deployment.
How to evaluate this architecture for your team
If you're considering multi-region for video infrastructure:
- Calculate your actual outage cost. Revenue per minute during peak. Contractual SLA penalties. Brand impact weighting. If the math says <$10K/hour, multi-region is overkill.
- Map your regulatory constraints. Multi-region for failover and multi-region for sovereignty are different problems. Don't conflate them.
- Pick active-active vs active-passive deliberately. Active-active makes failover a routine non-event but roughly doubles encode cost; active-passive is simpler day-to-day but makes failover a real operation. Hybrid (active-active for premium, active-passive for VOD library) is what most large operators end up with.
- Test your failover. A documented runbook you've never executed is theatrical, not operational. Plan quarterly drills before you need them.
- Be honest about what you're not solving. Multi-region is incident insurance, not a guarantee of perfection. Application bugs ship globally; design for that too.
If the architecture maps to where your business needs to be, apply to the design partner program — multi-region deployments are exactly the workload we want to onboard ahead of GA, because the operational complexity is where MpegFlow's audit-first design pays for itself.