A regional cloud outage at 3am Pacific takes down your encoder pool. Eleven minutes later your CDN starts serving stale manifests from edge caches. By minute thirty, customers in three time zones are seeing playback errors. You're paged, your incident channel is on fire, your runbook says "wait it out" because you never wired regional failover for the pipeline.
This document is for the team that doesn't want to be in that incident. It describes the shape of a video pipeline that survives a regional outage — for broadcasters, OTT platforms, or any operator whose business model includes "video keeps playing during incidents."
It's not the cheapest deployment shape. Multi-region adds roughly 50–80% cost overhead over single-region (see the worked capacity example below). Whether that overhead is the right trade-off depends on what an outage costs you (revenue impact, contractual SLA penalties, brand damage). For Tier-1 broadcasters, it's almost always worth it.
Use case in scope
You are a broadcaster, OTT platform, or major media operator with at least one of these properties:
- Contractual SLA on uptime — your enterprise customers have written guarantees of 99.9% or better, with financial penalties for breach
- Global delivery — viewers in multiple regions, where regional latency materially affects watch experience
- Regulatory data sovereignty — content for EU subjects must process in EU regions, etc.
- Brand-critical reliability — outages make Bloomberg / Variety / The Verge headlines, even if technically your customers' SLAs aren't breached
If none of these apply, this architecture is over-engineered for your situation. A single-region deployment with good observability is the right starting place.
Two patterns: active-active vs active-passive
The decision affects everything downstream. Pick deliberately.
Active-active
Both regions encode and emit simultaneously. CDN routes viewers to the nearer region. Outage = drop one region; the other absorbs the full load.
Pros: lowest viewer latency. Smallest failover blast radius (no warm-up). Capacity is tested daily because both regions are live.
Cons: double the encode cost. Twice the operational complexity. Requires output deduplication so the same job doesn't get encoded twice (unless you want that — sometimes you do, see "deliberate dual-encode" below).
Active-passive (warm standby)
One region is primary; one is warm-standby with replicated state but no encode load. Outage = failover triggers; standby starts encoding within minutes.
Pros: single-region encode cost (passive region's compute is minimal). Simpler operational model.
Cons: failover is a real event, not a routine operation. Depending on warm-up time, some window of in-flight work is lost. Untested failover paths fail.
Most teams should run active-passive as the baseline and active-active for hot critical paths (live streams, premium content). Mix and match.
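One way to express that mix is a per-content-class policy map that the submit path consults. A minimal sketch in Python; the class names, region names, and fields are illustrative, not MpegFlow API:

```python
# Hypothetical per-content-class failover policy. Class names, regions,
# and fields are illustrative -- adapt to however your submit path tags jobs.
FAILOVER_POLICY = {
    "live":        {"mode": "active-active",  "regions": ["us-east-1", "eu-west-1"]},
    "premium-vod": {"mode": "active-active",  "regions": ["us-east-1", "eu-west-1"]},
    "vod-library": {"mode": "active-passive", "primary": "us-east-1", "standby": "eu-west-1"},
}

def policy_for(content_class: str) -> dict:
    """Look up the failover posture for a job's content class,
    defaulting to the cheapest (active-passive) posture."""
    return FAILOVER_POLICY.get(content_class, FAILOVER_POLICY["vod-library"])
```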
High-level deployment topology — active-active
```mermaid
graph TB
GLOBAL["Global coordination layer<br/>(regional health · traffic shaping · drain)"]
subgraph A["Region A (e.g. us-east-1)"]
APIA["MpegFlow control plane"]
EA["Encode pool"]
SA[("Object store")]
DBA[("Postgres + Redis")]
end
subgraph B["Region B (e.g. eu-west-1)"]
APIB["MpegFlow control plane"]
EB["Encode pool"]
SB[("Object store")]
DBB[("Postgres + Redis")]
end
CDN["Multi-CDN routing<br/>(Cloudflare · Fastly · Akamai)"]
VIEWERS["Viewers"]
GLOBAL -->|"health probe"| APIA
GLOBAL -->|"health probe"| APIB
GLOBAL -->|"weights"| CDN
APIA --> EA --> SA
APIA --> DBA
APIB --> EB --> SB
APIB --> DBB
SA <-.async replicate.-> SB
DBA <-.async replicate.-> DBB
SA --> CDN
SB --> CDN
CDN --> VIEWERS
```
Component-by-component
Global coordination layer
The piece most teams skip and regret. Even in active-active, you need something that knows the global state — which regions are healthy, what traffic share each is taking, when to drain a region.
What it does:
- Polls each regional control plane health endpoint
- Aggregates regional saturation metrics (queue depth, encoder pool utilization)
- Publishes routing weights to the multi-CDN layer
- Triggers drain procedures during planned outages
What it doesn't do:
- It doesn't move jobs between regions in real time. That's a different (much harder) problem.
- It doesn't replace per-region monitoring. Each region needs full local observability.
Implementation: small, stateless, deployed in a third region (so it can outlive a primary outage). Could be your existing service-mesh control plane, an enterprise traffic manager (NS1, Akamai GTM), or a small internal service.
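A minimal sketch of that loop, assuming each regional control plane exposes a health endpoint and your multi-CDN layer accepts weight updates over HTTP — both endpoint shapes here are hypothetical:

```python
import time
import requests  # pip install requests

# Hypothetical endpoints -- substitute your regional control planes
# and your multi-CDN provider's weight-update API.
REGIONS = {
    "us-east-1": "https://mpegflow.us-east-1.example.com/healthz",
    "eu-west-1": "https://mpegflow.eu-west-1.example.com/healthz",
}
CDN_WEIGHTS_URL = "https://cdn-routing.example.com/v1/weights"

def probe(url: str) -> bool:
    """A region counts as healthy only if the probe returns 200 quickly."""
    try:
        return requests.get(url, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def coordination_loop(interval_s: int = 10) -> None:
    while True:
        healthy = [r for r, url in REGIONS.items() if probe(url)]
        if healthy:
            # Equal split across healthy regions; a failed region drops to 0.
            weights = {r: (100 // len(healthy) if r in healthy else 0) for r in REGIONS}
            requests.put(CDN_WEIGHTS_URL, json=weights, timeout=5)
        # If *no* region probes healthy, publish nothing: keeping the last
        # known-good routing beats pushing all-zero weights.
        time.sleep(interval_s)
```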
Per-region MpegFlow control planes
Two completely independent MpegFlow installations in two regions. Each is a full deployment of the broadcast-grade VOD transcoding shape — control plane, probe pool, encode pool, package pool, audit DB.
Critical: regions don't share state during normal operation. Each region has its own Postgres, its own queue, its own audit DB. The cross-region replication is async and eventually consistent. If you try to share Postgres across regions you'll spend the next year debugging quorum-related encoder failures.
Job routing — submit-time vs encode-time
When a customer submits a job, where does it go?
Submit-time routing (recommended):
- Customer's API call hits a regional endpoint
- Routing decision made by the global coordination layer based on weights
- Job lives entirely in that region
- If that region fails mid-job, the job fails (and gets re-submitted, optionally to a different region)
Encode-time routing (more complex):
- All jobs hit a global queue
- Workers in any region pull from the queue
- Each job tagged with its execution region for audit
- Cross-region coordination needed to prevent duplicate execution
For most teams, submit-time routing wins on simplicity. Encode-time routing is only worth it if you have very unbalanced regional capacity (e.g., GPU pool only in one region) and need cross-region scheduling.
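A sketch of submit-time routing with one-shot re-submission, assuming the submit path can read the coordination layer's current weights (the function names here are hypothetical):

```python
import random

def pick_region(weights: dict[str, int]) -> str:
    """Weighted random choice over regions with nonzero weight."""
    live = {r: w for r, w in weights.items() if w > 0}
    return random.choices(list(live), weights=list(live.values()), k=1)[0]

def submit_with_failover(job: dict, weights: dict[str, int], submit_fn) -> str:
    """Submit to the weighted-chosen region. On regional failure,
    re-submit once to a different region -- the job never spans regions;
    a failed region simply means a fresh submission elsewhere."""
    first = pick_region(weights)
    try:
        return submit_fn(region=first, job=job)
    except ConnectionError:
        fallback = pick_region({r: w for r, w in weights.items() if r != first})
        return submit_fn(region=fallback, job=job)
```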
Object storage replication
Multi-region storage is non-negotiable for failover. Three patterns, in order of how teams adopt them:
1. Same-region storage with cross-region replication for outputs:
- Inputs land in the customer's chosen region
- Outputs replicated async to the secondary region
- Failover scenario: secondary region serves stale outputs from the last successful replication
- RPO (recovery point objective): typically 5–60 seconds
- Cost: ~50% more than single-region storage
- Most cloud vendors support this natively (S3 Cross-Region Replication, GCS Multi-Regional, etc.)
2. Multi-region storage as primary:
- All writes go to a multi-region bucket (S3 Multi-Region Access Points, Cloudflare R2)
- More expensive but RPO is essentially zero
- Right answer for premium content with strict SLA on availability
3. Per-region object stores with explicit dual-write:
- Application writes outputs to both regions on every job completion
- Most operationally heavy, but gives the strongest consistency (see the dual-write sketch after this list)
- Right for regulatory or contractual reasons where you need auditable per-region state
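For pattern 3, the dual-write is explicit application code. A sketch with boto3, using placeholder bucket names; a production version needs retries and a reconciliation job for half-writes:

```python
import boto3  # pip install boto3

# One client per region; bucket names are placeholders.
OUTPUT_BUCKETS = {
    "us-east-1": "mpegflow-outputs-use1",
    "eu-west-1": "mpegflow-outputs-euw1",
}
clients = {r: boto3.client("s3", region_name=r) for r in OUTPUT_BUCKETS}

def dual_write_output(key: str, body: bytes) -> None:
    """Write the rendition to both regions on job completion.
    If either write fails, let the exception surface -- a silent
    half-write is exactly the inconsistency this pattern exists to avoid."""
    for region, bucket in OUTPUT_BUCKETS.items():
        clients[region].put_object(Bucket=bucket, Key=key, Body=body)
```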
Multi-CDN routing layer
A single CDN's regional outage cascades to your viewers. Multi-CDN provides:
- Health-based routing — automatic failover when a CDN's edge in your viewer's region is having issues
- Load distribution — split between CDNs based on cost / performance
- Geographic targeting — route China traffic via a domestic CDN, Europe via Cloudflare, US via Fastly, etc.
Common stacks:
- NS1 / DNSimple smart DNS + multiple CDN origins
- Cedexis / Citrix ADC (now NetScaler) intelligent routing
- Fastly + Cloudflare active-active with manual failover
The cost of multi-CDN is real (typically ~30% more than single-CDN at similar quality), but for Tier-1 operators the alternative is "single CDN regional outage = your service is down."
Manifest delivery and TTLs
The CDN cache TTL on your HLS / DASH manifests is one of the most impactful settings during a failover. Three patterns:
| TTL setting | Failover behavior | Cost |
|---|---|---|
| Long TTL (60+ seconds) | Stale manifests served during failover; viewers see playback errors | Lowest CDN bill |
| Short TTL (5–10 seconds) | Faster failover propagation; viewers see brief buffering | ~2–5x more origin requests |
| TTL=0 with origin shielding | Near-instant failover; high origin load | Most expensive but most resilient |
For premium / live content, short TTL with origin shielding is standard. For VOD where buffering is acceptable, longer TTL is fine.
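In practice this means setting different Cache-Control headers on manifests vs segments at upload time. A sketch with boto3 following the table's short-TTL pattern (bucket and TTL values are illustrative):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

def upload_hls_artifact(bucket: str, key: str, body: bytes) -> None:
    """Short TTL on manifests so failover propagates fast; long TTL on
    segments, which are immutable once written."""
    if key.endswith((".m3u8", ".mpd")):
        cache = "max-age=5"                      # short TTL per the table
    else:
        cache = "max-age=31536000, immutable"    # segments never change
    s3.put_object(Bucket=bucket, Key=key, Body=body, CacheControl=cache)
```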
Failover scenarios — what actually breaks
The four scenarios you'll experience over a multi-year operation, in rough frequency order:
Single-AZ failure within a region
Most common. A single availability zone goes down — typically network or power. Within a region, your MpegFlow deployment should already be multi-AZ — encode pool spread across AZs, control plane redundant. This isn't really "regional failover," just standard intra-region HA.
Detection: your regional health monitoring catches it. Action: none from the global layer; the region absorbs the loss internally. Customer impact: typically <5 minutes of degraded throughput.
Full regional outage
Rare but devastating. AWS/GCP/Azure regions do go down (full outages roughly once or twice per year per major provider). Your regional MpegFlow stops responding entirely.
Detection: global coordination layer's health probes start failing. Action: weight that region to 0 in the multi-CDN routing layer. Active-active partners absorb the traffic; active-passive standbys begin warming up. Customer impact: depends on the warm-up time of the partner region. For active-active, ~zero. For active-passive, 5–30 minutes.
Encoder pool starvation in one region
Half-failure: the control plane is up, but the encode pool can't keep up (a quota issue, spot capacity exhaustion, a GPU pool failure). The region appears healthy at the API level, but jobs queue without making progress.
Detection: queue-depth metrics. Per-region time-in-queue threshold breached. Action: drain the region for new submits; let existing jobs in flight finish; move new traffic elsewhere. Customer impact: for new submits, none (they get the partner region). For in-flight jobs, possibly delayed completion.
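A sketch of that detection logic, assuming your metrics backend exposes per-region p95 time-in-queue and your coordination layer exposes a drain hook (both are hypothetical):

```python
# Hypothetical metric reader and drain hook -- wire these to your
# actual metrics backend and coordination layer.
QUEUE_AGE_THRESHOLD_S = 600  # p95 time-in-queue before we call the region starved

def check_starvation(region: str, get_p95_queue_age_s, drain_region) -> None:
    """Control plane up + queue aging = starved, not healthy.
    Drain new submits; let in-flight jobs finish."""
    if get_p95_queue_age_s(region) > QUEUE_AGE_THRESHOLD_S:
        drain_region(region)  # e.g., set the region's weight to 0 for new submits
```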
Storage replication lag
The output bucket replication is async; during an outage the secondary region might be 30 seconds behind. Customers reading from the secondary see slightly stale outputs.
Detection: replication lag metrics on the storage layer. Action: if lag exceeds tolerance, fail loudly to the customer's API ("this region's outputs may be stale; consider waiting") rather than serve possibly-incorrect data. Customer impact: depends on how forgiving their workflow is.
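A sketch of the fail-loudly check, assuming a replication-lag metric is available (names are hypothetical):

```python
STALENESS_TOLERANCE_S = 60  # how far behind the secondary may lag

def output_response(key: str, region: str, get_replication_lag_s, fetch) -> dict:
    """Serve the output but warn explicitly when it may be stale,
    rather than silently returning possibly-outdated renditions."""
    resp = {"output": fetch(region, key)}
    lag = get_replication_lag_s(region)
    if lag > STALENESS_TOLERANCE_S:
        resp["warning"] = (
            f"region {region} replication is {lag:.0f}s behind; "
            "outputs may be stale -- consider retrying shortly"
        )
    return resp
```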
Capacity sizing — worked example
For a mid-size global broadcaster running 500K transcoded minutes/month across two regions in active-active:
| Component | Per region | Cross-region | Annual cost (rough) |
|---|---|---|---|
| Control plane × 2 + LB | 2 instances | 2 regions | $10K |
| Postgres managed × 2 | medium tier | 2 regions | $10K |
| Encode pool | 35–40 workers | 2 regions = 70–80 total | $200K – 280K |
| Package pool | 4–5 workers | 2 regions | $15K – 20K |
| Object storage | per region + cross-region replication | replication overhead | $50K – 80K |
| Multi-CDN | percentage of egress | n/a | varies wildly by traffic shape |
| Global coordination layer | small | 1 region (third) | $3K |
| Network egress (cross-region) | replication traffic | n/a | $20K – 50K |
| Total annual | | | ~$310K – 460K |
Compare to the single-region equivalent: roughly $180K – 250K annually. Multi-region overhead is ~50–80%. Whether it's worth it depends on what an outage costs you in revenue + SLA penalties + brand impact.
Compliance — when multi-region is forced by sovereignty
Several regulatory contexts effectively require per-region deployments rather than just enable failover:
- GDPR for EU viewers — content / metadata processing must occur in EU regions for EU subjects (Art. 44 onwards on data transfers). Cross-region replication of user metadata is restricted.
- China data residency — content for Chinese viewers can't legally be processed in US-based regions
- Russia data localization — limited applicability today but historically a requirement
- Sector-specific — broadcast contracts with regional licensing often forbid content from leaving certain geographies
In these cases the architecture isn't "multi-region for failover" — it's "regional islands with shared metadata." Failover might not even be possible across boundaries; you're running fundamentally separate stacks per legal jurisdiction.
Operational runbook
The five things that hurt during a real failover:
Failover untested in production
The most common failure mode. You have a documented failover procedure that you've never actually triggered, and when you need it at 3am the runbook is full of stale assumptions. Quarterly chaos drills where you intentionally drain a region during business hours are the only fix.
DNS propagation slowness
Even with low TTLs, smart-DNS failover takes time to propagate across the global DNS infrastructure. Some viewers may continue hitting the failed region for 30–120 seconds. Anycast-based routing (Cloudflare, AWS Global Accelerator) converges faster than DNS-based failover.
Postgres replication lag during failover
If your audit DB is async-replicated cross-region, audit records from the moments before failover might be lost. For most use cases this is acceptable; for regulatory audit trails it's not. Run synchronous cross-region replication for audit tables only, async for everything else.
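Postgres supports this per transaction rather than per table: with a synchronous standby configured via synchronous_standby_names, the transaction that writes audit rows can opt into synchronous commit while everything else keeps the async default. A sketch with psycopg2 (the table name is illustrative):

```python
import psycopg2  # pip install psycopg2-binary

def write_audit_row(dsn: str, job_id: str, event: str) -> None:
    """Audit writes wait for the cross-region standby to confirm;
    all other transactions keep the cluster default (async)."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # remote_apply: commit returns only once the standby has
            # applied the WAL, so the audit row survives a primary loss.
            cur.execute("SET LOCAL synchronous_commit TO 'remote_apply'")
            cur.execute(
                "INSERT INTO audit_log (job_id, event) VALUES (%s, %s)",
                (job_id, event),
            )
```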
Encoder pool warmup time
Even active-passive standbys take time to fully warm up. Spot instances need to provision; container images need to pull; configuration needs to validate. 5–15 minutes is typical from "trigger failover" to "target region at full capacity."
Cost surge during a failover
When traffic shifts from a failed region to a partner, that partner's spot capacity may be insufficient and on-demand fills the gap. Your bill spikes 2–3× during the incident. Make sure your AWS / GCP burst quotas are pre-approved before you need them.
Scope and adjacent concerns
This document covers regional failover within a single cloud provider — the most common and highest-leverage case. Adjacent concerns each have their own answer:
- Provider-wide degradation (multiple regions of one cloud vendor) → multi-cloud deployment, a separate scaling tier. Available for enterprise customers via custom engagement; not part of the standard reference architecture.
- Application-layer bugs → mitigated by canary deploys and per-region progressive rollout. Standard practice on the MpegFlow Operator (rolling updates with maxUnavailable: 1); covered in the Kubernetes deployment architecture.
- DDoS / coordinated attacks → CDN-level protection (Cloudflare, Akamai Bot Manager) is the right layer. MpegFlow's origin shielding helps, but the perimeter is your CDN's responsibility.
- Live streaming failover → different topology (synchronous, latency-anchored to ingest geography). Companion architecture ships alongside MpegFlow's live primitives in 2026 Q3.
- Single-tenant isolation → see strict-broker multi-tenant security for the per-tenant credential model that runs underneath this deployment.
How to evaluate this architecture for your team
If you're considering multi-region for video infrastructure:
- Calculate your actual outage cost. Revenue per minute during peak. Contractual SLA penalties. Brand impact weighting. If the math says <$10K/hour, multi-region is overkill.
- Map your regulatory constraints. Multi-region for failover and multi-region for sovereignty are different problems. Don't conflate them.
- Pick active-active vs active-passive deliberately. Active-active makes failover a routine non-event but roughly doubles encode cost; active-passive is simpler day-to-day but makes failover a real operation. Hybrid (active-active for premium, active-passive for VOD library) is what most large operators end up with.
- Test your failover. A documented runbook you've never executed is theatrical, not operational. Plan quarterly drills before you need them.
- Be honest about what you're not solving. Multi-region is incident insurance, not a guarantee of perfection. Application bugs ship globally; design for that too.
If the architecture maps to where your business needs to be, apply to the design partner program — multi-region deployments are exactly the workload we want to onboard ahead of GA, because the operational complexity is where MpegFlow's audit-first design pays for itself.