A 2-petabyte archive of MPEG-2 broadcast masters from the 1990s–2010s, sitting on aging LTO tape and slowly degrading spinning disk. The mandate from above: "modernize the archive — we want H.264 / HEVC mezzanines, all of it, by Q4 next year." The budget is finite, the timeline is finite, and the original-format readers are themselves becoming obsolete.
This is one of the most common large-scale video infrastructure projects in modern media operations, and it's structurally different from primary VOD encoding. Different throughput pattern, different cost optimization profile, different failure semantics. This document covers how MpegFlow handles it.
Use case in scope
You have:
- A legacy archive of 100K – 10M+ video assets, 100 TB to 10+ PB total
- A modernization mandate: target format(s), retention requirements, success criteria
- A finite budget and a finite timeline (typically 6–24 months)
- Variable input quality and format coverage (broadcast masters, daily news rushes, sports replays, archival film transfers)
You don't have:
- Predictable real-time throughput requirements (this is bulk; latency doesn't matter)
- An expectation that every input will succeed (some tapes are damaged, some files are corrupt)
- Infinite cloud budget (cost-per-asset is a real constraint)
This architecture is wrong for a primary VOD pipeline (see broadcast-grade VOD transcoding for that). It's right for time-bounded, asset-volume-driven, cost-sensitive bulk migrations.
High-level deployment topology
graph TB
subgraph SCHED["Archive enumeration + scheduling (your code)"]
ENUM["Source enumerator<br/>(LTO indexer / NAS walker)"]
DEDUP["Asset deduplicator"]
PRI["Priority scheduler<br/>(P0 urgent / P1 target / P2 deferrable)"]
end
subgraph CP["MpegFlow control plane"]
API["REST + gRPC + WS"]
EVENT["EventBus<br/>(audit + webhooks)"]
DB[("PostgreSQL")]
end
subgraph STAGE["Stage-in pool"]
S1["LTO reader / NAS puller"]
S2["LTO reader / NAS puller"]
end
subgraph ENCODE["Encode pool (spot-heavy)"]
E1["FFmpeg worker (spot)"]
E2["FFmpeg worker (spot)"]
EX["FFmpeg worker (on-demand, fallback)"]
end
subgraph TIERS["Object storage — tiered"]
HOT["Hot: in-flight (S3 Standard)"]
WARM["Warm: completed 0–6mo (S3 IA)"]
COLD["Cold: completed 6mo+ (Glacier)"]
end
ENUM --> DEDUP --> PRI --> API
API --> EVENT --> DB
S1 --> HOT
HOT --> E1
E1 --> WARM
WARM -.lifecycle.-> COLD
The structural differences from VOD transcoding:
- An enumeration / scheduling layer in front of MpegFlow handles the "which 10 million assets do we transcode in what order" problem — this is its own engineering task, not something the encoder layer solves
- Encode pool is spot-heavy — bulk archive workloads are highly preemption-tolerant
- Stage-in is its own bottleneck — pulling from LTO is slow; the architecture has to overlap stage-in with encode
- Output storage tiers — completed mezzanines move to cold storage on a schedule
Component-by-component
Archive enumeration layer
This is your code, not MpegFlow's. A scanner that:
- Walks the source archive (LTO library API, NAS filesystem, legacy DAM database)
- Produces a normalized inventory: asset_id, location, original_format, estimated_duration, priority
- Deduplicates (large archives often have 5–15% duplicates from re-ingests over the years)
- Hands the inventory to the scheduler
For LTO, the enumeration pass alone can take weeks. The good news: it's parallelizable across drives.
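A minimal sketch of what the inventory record and the dedup pass can look like. The field names follow the list above; the size and fingerprint fields, and the cheap hashing heuristic, are illustrative additions rather than anything MpegFlow prescribes:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class InventoryRecord:
    asset_id: str
    location: str             # e.g. "lto://LIB1/A00042/3" or a NAS path
    original_format: str      # container / codec as reported by the source catalog
    estimated_duration: int   # seconds, per the catalog (often wrong in legacy catalogs)
    priority: str             # "P0" | "P1" | "P2"
    size_bytes: int
    fingerprint: str = ""     # filled in by the enumerator

def compute_fingerprint(head: bytes, size_bytes: int) -> str:
    # Cheap duplicate heuristic: hash of the first few MB plus the byte size.
    # Enough to catch byte-identical re-ingests without reading whole tapes.
    return hashlib.sha256(head + size_bytes.to_bytes(8, "big")).hexdigest()

def deduplicate(records: list[InventoryRecord]) -> list[InventoryRecord]:
    seen: set[str] = set()
    unique = []
    for rec in records:
        if rec.fingerprint in seen:
            continue          # keep the first copy, drop the re-ingest
        seen.add(rec.fingerprint)
        unique.append(rec)
    return unique
```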
Priority scheduler
Why a separate layer: not every asset has the same urgency. Common priority bands:
- P0 (urgent): active-rights content, current-season programs, contractually deadline-bound assets — these must complete first
- P1 (target): the bulk of the archive
- P2 (deferrable): rarely-accessed long-tail content; transcode if budget/time allows, drop otherwise
The scheduler feeds MpegFlow at a controlled rate, so the encode pool doesn't back up on a flood of P2 work while P0 sits in the queue.
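One way to wire up that controlled feed, as a sketch: `submit_job()` and `count_in_flight()` stand in for whatever wrapper you put around MpegFlow's REST API, and the cap and back-off interval are illustrative:

```python
import heapq
import time

PRIORITY_RANK = {"P0": 0, "P1": 1, "P2": 2}

class PriorityScheduler:
    """Drain the inventory strictly by priority band, but never keep more
    than max_in_flight jobs queued in the encode layer at once."""

    def __init__(self, max_in_flight: int = 500):
        self.max_in_flight = max_in_flight
        self._heap = []      # (priority rank, enqueue order, record)
        self._order = 0

    def add(self, record):
        heapq.heappush(self._heap, (PRIORITY_RANK[record.priority], self._order, record))
        self._order += 1

    def run(self, submit_job, count_in_flight):
        # submit_job(record) and count_in_flight() are the integration points
        # with the control plane; the names here are placeholders.
        while self._heap:
            if count_in_flight() >= self.max_in_flight:
                time.sleep(30)   # back off and let the encode pool drain
                continue
            _, _, record = heapq.heappop(self._heap)
            submit_job(record)
```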
Stage-in pool
The bottleneck most teams underestimate.
Reading from LTO is sequential and slow — typical sustained read is 200–400 MB/s per drive, with seek penalties when jumping between assets on the same tape. A 200 GB master can take 8–15 minutes just to land on local NVMe.
Architecture pattern that works:
- Stage-in workers read several assets in parallel, batched by tape (avoid seek thrash)
- Asset lands on local NVMe of a stage-in worker
- Pushed to S3 Standard ("hot" tier) for handoff to encode pool
- MpegFlow job kicked off referencing the staged S3 object
- After encode completes, S3 hot copy is deleted (intermediate state, not durable)
For pure-disk archives (NAS, on-prem ZFS pool), this layer is simpler and faster — but the same architectural shape applies.
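A sketch of one stage-in worker's batch loop. The S3 calls are standard boto3; `read_from_tape()` and `submit_mpegflow_job()` are placeholders for your tape-library integration and control-plane client, not real MpegFlow APIs, and the bucket name is illustrative:

```python
import boto3

s3 = boto3.client("s3")
HOT_BUCKET = "archive-staging-hot"   # illustrative hot-tier bucket

def stage_in_batch(tape_id: str, assets, read_from_tape, submit_mpegflow_job):
    """Process every asset on one tape before moving on, to avoid seek thrash."""
    for asset in assets:
        local_path = read_from_tape(tape_id, asset.asset_id)   # lands on local NVMe
        key = f"staged/{asset.asset_id}"
        s3.upload_file(local_path, HOT_BUCKET, key)            # hot-tier handoff
        submit_mpegflow_job(
            asset_id=asset.asset_id,
            source=f"s3://{HOT_BUCKET}/{key}",
        )
        # The hot copy is intermediate state only; a completion webhook (or a
        # periodic sweeper) deletes it once the encode succeeds:
        #   s3.delete_object(Bucket=HOT_BUCKET, Key=key)
```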
Sizing: stage-in throughput, not encode capacity, often determines the migration timeline. The raw average looks easy: 2 PB over 12 months is only ~65 MB/s sustained. In practice, effective LTO throughput is a fraction of nominal (tape mounts, seeks, verification passes, retries, operator hours), and you need headroom to catch up after drive failures and stalled weeks. Plan for several hundred MB/s of effective stage-in capacity: typically 4–8 LTO drives running in parallel, or a well-tuned 10 Gbps NAS link.
Encode pool — spot-first
Unlike primary VOD, archive encode is highly preemption-tolerant. A spot/preemptible instance loses its job? Re-queue. Cost over time is dominated by hardware-hours; the workload itself is non-urgent.
Heuristic:
- 80–90% of capacity on spot/preemptible
- 10–20% on-demand for "long tail" jobs that have been preempted multiple times (pin them on stable instances after N preemptions)
- Mix CPU and GPU workers — old broadcast formats are often easier to GPU-transcode than premium-VOD outputs (lower quality bar)
Sizing for 2 PB / 12-month migration:
- Roughly 10M minutes of input (depends heavily on the archive's average bitrate)
- Single-rendition mezzanine output (no ABR ladder for archive — that's a downstream concern)
- Per-worker (16-core CPU): ~50K minutes/month if running near-constantly
- Required pool size: roughly 20 average concurrent workers, with 2x burst capacity (~40) for catch-up windows; the arithmetic is worked through in the sketch below
- On spot at 60–80% discount: cost roughly $5K–$10K/month for the encode pool
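The arithmetic behind that pool size, spelled out so you can swap in your own estimates (the bitrate note in the comment is an assumption, not a measurement):

```python
total_input_minutes = 10_000_000      # 2 PB works out to roughly this at ~25 Mbps average
project_months = 12
minutes_per_worker_month = 50_000     # one 16-core worker running near-constantly

avg_workers = total_input_minutes / project_months / minutes_per_worker_month
peak_workers = 2 * avg_workers        # burst headroom for catch-up windows

print(f"average concurrent workers: {avg_workers:.0f}")   # ~17; round up, plan ~20
print(f"peak concurrent workers:    {peak_workers:.0f}")   # ~33; round up, plan ~40
```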
Output storage tiering
Mezzanines that just finished encoding are useful for QC and immediate access (1–4 weeks). After that, they're long-tail.
Tiering policy:
| Age | Tier | Cost / GB / month | Access SLA |
|---|---|---|---|
| 0–30 days | S3 Standard / GCS Standard | ~$0.023 | Immediate |
| 31–180 days | S3 IA / GCS Nearline | ~$0.013 | Immediate (with retrieval fee) |
| 180+ days | S3 Glacier Instant / GCS Coldline | ~$0.004 | Immediate (with retrieval fee) |
| Optional: very long tail | Glacier Deep Archive / GCS Archive | ~$0.001 | 12-hour retrieval |
For 2 PB of finished mezzanines, the cost difference between "keep all on S3 Standard" and "tier appropriately" is roughly $45K/month vs $7K/month. Tiering matters.
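The tiering policy maps directly onto an object-storage lifecycle rule. A sketch for S3 with boto3, where the bucket name, prefix, and rule ID are illustrative and the storage classes are S3's standard identifiers:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="archive-mezzanines",          # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "mezzanine-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "mezzanines/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER_IR"},
                    # Optional very-long-tail tier:
                    # {"Days": 730, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```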
Capacity sizing — worked example
For a 2 PB archive migrated over 12 months:
| Component | Sizing | Rough monthly cost (cloud) |
|---|---|---|
| Enumeration / scheduling (your code) | 1–2 small services + a Postgres | $300 |
| MpegFlow control plane | 2 instances + LB | $400 |
| Postgres (managed) | medium tier | $400 |
| Stage-in pool (LTO + NAS read) | 4–8 drives + staging hosts | varies wildly by source |
| Encode pool (spot-heavy) | ~20 avg / 40 peak workers | $5,000 – 10,000 |
| Object storage hot tier | ~200 TB working set | $4,500 |
| Object storage warm tier | ~600 TB completed last 6mo | $7,800 |
| Object storage cold tier | growing toward 2 PB completed | $4,000 (start) → $8,000 (end of project) |
| Egress | minimal — outputs stay in cloud archive | $200 – 500 |
| Total cloud cost (rough) | | $22,000 – 32,000 / month |
Over 12 months: roughly $300K – $400K in cloud costs for a full 2 PB migration. The same workload on per-minute-priced managed transcoding is in the $1.5M–$3M band, depending on tier and rendition count.
What breaks at scale
The failure modes specific to multi-petabyte migrations:
Damaged tapes / corrupt inputs
Roughly 0.5–3% of legacy archive content has some corruption. Stage-in workers must classify by failure mode:
- Read error on tape: retry with different drive, escalate to operator if persistent
- Partial file read: keep what was retrieved, mark `partial_recovery` in the audit trail, hand to operator for review
- File reads but transcode fails: classify the FFmpeg failure (same taxonomy as our scale post); don't burn encode hours retrying genuinely broken files
You will never get to 100% migration. Plan the scheduler around it: 97% completion is success.
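A sketch of that classification step; the disposition names and thresholds are illustrative, not MpegFlow's built-in taxonomy:

```python
from enum import Enum

class Disposition(Enum):
    RETRY_OTHER_DRIVE = "retry_other_drive"
    PARTIAL_RECOVERY = "partial_recovery"     # keep what was read, flag for operator review
    TRANSCODE_FAILED = "transcode_failed"     # classify the FFmpeg error, don't blind-retry
    ESCALATE = "escalate_to_operator"

def classify_failure(tape_read_ok: bool, bytes_read: int, expected_bytes: int,
                     transcode_error: str | None, read_attempts: int) -> Disposition:
    if not tape_read_ok:
        # Read error on tape: retry on a different drive, escalate if persistent.
        return Disposition.RETRY_OTHER_DRIVE if read_attempts < 3 else Disposition.ESCALATE
    if bytes_read < expected_bytes:
        # Partial file read: keep it, record partial_recovery in the audit trail.
        return Disposition.PARTIAL_RECOVERY
    if transcode_error is not None:
        # File reads fine but the transcode fails: classify, don't blind-retry.
        return Disposition.TRANSCODE_FAILED
    raise ValueError("classify_failure called on an asset that did not fail")
```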
LTO drive availability
Drives are physical, finite, and sometimes break during long projects. The stage-in scheduler needs to model drive availability and rebalance assets across remaining drives if one dies. For the drive pool sized above, allocate budget for at least 2 drive replacements during the project.
Catalog inconsistency
The archive's metadata catalog is often wrong. Asset 123 says it's a 10-minute clip; the file is 90 minutes. The duration field in your DAM disagrees with the actual file by 30%. This is normal and almost universal. The probe stage of the MpegFlow pipeline catches it and re-records the truth — but downstream systems that trust the original metadata will be wrong about half your assets.
The fix: emit a "metadata-correction" event for every asset where probe disagrees with catalog, and feed it back into the source DAM.
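A sketch of that correction check, assuming the probe result carries the measured duration and `emit_event()` is whatever webhook or DAM integration you run these events through (both are placeholders):

```python
def check_catalog_against_probe(asset, probed_duration_s: float, emit_event,
                                tolerance: float = 0.05):
    """Emit a metadata-correction event when the probed duration disagrees with
    the source catalog by more than `tolerance` (5% by default, illustrative)."""
    catalog_duration_s = asset.estimated_duration
    if catalog_duration_s <= 0:
        drift = float("inf")
    else:
        drift = abs(probed_duration_s - catalog_duration_s) / catalog_duration_s
    if drift > tolerance:
        emit_event({
            "type": "metadata_correction",
            "asset_id": asset.asset_id,
            "field": "duration",
            "catalog_value": catalog_duration_s,
            "probed_value": probed_duration_s,
        })
```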
Spot preemption death spirals
Long jobs (>4 hours) on spot instances that get preempted twice in a row can eat their entire spot savings in retry overhead. The scheduler should track per-job preemption count and pin retries to the on-demand pool after 2 preemptions.
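A sketch of the requeue rule, assuming the scheduler keeps per-job preemption counts in its own state and `requeue()` can target either pool (both names are illustrative):

```python
MAX_SPOT_PREEMPTIONS = 2

def handle_preemption(job, preemption_counts: dict, requeue):
    """On a spot preemption, requeue the job, but pin it to the on-demand
    pool once it has burned its spot budget."""
    preemption_counts[job.id] = preemption_counts.get(job.id, 0) + 1
    pool = "on-demand" if preemption_counts[job.id] >= MAX_SPOT_PREEMPTIONS else "spot"
    requeue(job, pool=pool)
```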
Cost runaway via data transfer
A subtle one: watch which direction the bytes flow and who meters them. Ingress into the major clouds is free, but the source side usually isn't: if the archive sits in another cloud or a colocation facility that meters outbound transfer, 2 PB × $0.09/GB = $180,000 in transfer charges before a single frame is encoded. The same math applies in reverse if finished mezzanines ever have to come back out of the cloud to an on-prem archive. Either way, data transfer is often the single largest line item in the project.
Mitigation: put the migration on a dedicated link to a nearby cloud region (Direct Connect, ExpressRoute, dedicated interconnect), which removes the public-internet bottleneck and cuts per-GB transfer pricing, or run the encode on-prem next to the data. The numbers force one or the other for archives at this scale.
Operational rhythm for a 12-month migration
Real cadence we've seen work for projects this size:
Month 1 — enumeration
- Stand up enumeration workers, scan the entire archive, build the asset inventory
- Probe a 1–2% sample to characterize the input distribution
- Refine the priority bands based on what you actually find
Month 2 — encoder calibration
- Tune presets against the sample for quality vs. throughput trade-offs
- Pilot with 1,000 assets end-to-end. Verify outputs pass downstream QC
- Lock encoder version and presets
Months 3–11 — bulk migration
- Constant operation. Daily standups while the project is running.
- Weekly review of: completion rate, cost trend, error rate, deferred-bucket size
- Monthly review of catalog corrections (emit corrections back to source DAM)
Month 12 — long tail + cleanup
- The last 5% always takes longer than the first 95% — damaged inputs, weird formats, contractual edge cases
- Final audit reconciliation: every asset accounted for, even if as "skipped: corrupted_input" or "skipped: out_of_scope"
- Decommission stage-in workers, reduce encode pool to maintenance levels
Compliance considerations
Most archive content has stricter compliance than primary VOD because it includes:
- Pre-release footage that was never aired
- Outtakes, interviews, b-roll with NDA implications
- Unreleased / shelved productions
The relevant patterns:
- Encryption at rest end-to-end (CMK / customer-managed keys)
- Air-gapped variants for sensitive subsets (run a separate self-hosted MpegFlow cluster for these)
- Detailed audit trail — every asset's transcoding event, including who initiated, which encoder version, what output hash. Some contracts require 7+ years retention of these records.
- Watermarking — for content that requires perceptual or forensic watermarking. Pluggable in MpegFlow's pipeline; pair with a partner like NexGuard or Verimatrix.
For pre-release content the architecture above runs in a dedicated, network-isolated VPC. For the broader bulk migration of already-aired content, the standard architecture is fine.
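For the encryption-at-rest point, the upload is the natural enforcement point. A sketch for S3 with a customer-managed KMS key; the paths, bucket, and key ARN are placeholders:

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "/nvme/out/asset_123_mezz.mov",           # finished mezzanine on the worker
    "archive-mezzanines",                     # illustrative bucket name
    "mezzanines/asset_123.mov",
    ExtraArgs={
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    },
)
```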
How this architecture differs from "just FFmpeg with a queue"
You can absolutely run a 2 PB migration with a Python script wrapping FFmpeg and SQS. Some teams have. The places it hurts at scale:
- No audit trail per asset — when ops asks "why is asset 4,532,012 still failing?" you have to grep stderr files. With MpegFlow, the audit table answers in one query.
- No encoder version pinning — you upgrade FFmpeg mid-project and half your archive is encoded with v6.0, half with v6.1. QC might catch the differences; it might not.
- Single-stage thinking — each asset goes through probe → encode → emit, but with a bare script the stages aren't independently retryable. Failure in emit means re-encoding from scratch.
- No DAG, no parallelism guarantee — script-based bulk processing tends to serialize where it shouldn't.
These aren't dealbreakers; they're tax. Over a 12-month project, the tax compounds.
How to evaluate this architecture for your team
If you're planning an archive migration in the petabyte band:
- Estimate your input distribution (formats, durations, sizes)
- Estimate your stage-in throughput ceiling (LTO drive count, NAS bandwidth, network)
- Pick your target output format(s) and run the per-asset cost math (a quick helper follows this list)
- Pick your storage tiering policy
- Decide your acceptance threshold for non-recoverable assets (97%, 99%, 99.5% — be honest, 100% isn't a thing for legacy archives)
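For the per-asset cost math, a back-of-the-envelope helper. The defaults echo the worked example above; the one-million asset count is an assumption (2 PB at roughly 2 GB per asset), so replace all three inputs with your own estimates:

```python
def per_asset_cost(monthly_cloud_cost: float = 27_000,   # midpoint of the estimate above
                   project_months: int = 12,
                   asset_count: int = 1_000_000) -> float:
    """Total project cloud spend divided across migrated assets."""
    return monthly_cloud_cost * project_months / asset_count

print(f"${per_asset_cost():.2f} per asset")   # ~$0.32 per asset at these defaults
```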
If the architecture maps to where you'd want to be, apply to the design partner program — archive migration is one of the highest-value workloads we'd onboard ahead of GA.