A 2-petabyte archive of MPEG-2 broadcast masters from the 1990s–2010s, sitting on aging LTO tape and slowly degrading spinning disk. The mandate from above: "modernize the archive — we want H.264 / HEVC mezzanines, all of it, by Q4 next year." The budget is finite, the timeline is finite, and the original-format readers are themselves becoming obsolete.
This is one of the most common large-scale video infrastructure projects in modern media operations, and it's structurally different from primary VOD encoding. Different throughput pattern, different cost optimization profile, different failure semantics. This document covers how MpegFlow handles it.
Use case in scope
You have:
- A legacy archive of 100K – 10M+ video assets, 100 TB to 10+ PB total
- A modernization mandate: target format(s), retention requirements, success criteria
- A finite budget and a finite timeline (typically 6–24 months)
- Variable input quality and format coverage (broadcast masters, daily news rushes, sports replays, archival film transfers)
You don't have:
- Predictable real-time throughput requirements (this is bulk; latency doesn't matter)
- An expectation that every input will succeed (some tapes are damaged, some files are corrupt)
- Infinite cloud budget (cost-per-asset is a real constraint)
This architecture is wrong for a primary VOD pipeline (see broadcast-grade VOD transcoding for that). It's right for time-bounded, asset-volume-driven, cost-sensitive bulk migrations.
High-level deployment topology
graph TB
subgraph SCHED["Archive enumeration + scheduling (your code)"]
ENUM["Source enumerator<br/>(LTO indexer / NAS walker)"]
DEDUP["Asset deduplicator"]
PRI["Priority scheduler<br/>(P0 urgent / P1 target / P2 deferrable)"]
end
subgraph CP["MpegFlow control plane"]
API["REST + gRPC + WS"]
EVENT["EventBus<br/>(audit + webhooks)"]
DB[("PostgreSQL")]
end
subgraph STAGE["Stage-in pool"]
S1["LTO reader / NAS puller"]
S2["LTO reader / NAS puller"]
end
subgraph ENCODE["Encode pool (spot-heavy)"]
E1["FFmpeg worker (spot)"]
E2["FFmpeg worker (spot)"]
EX["FFmpeg worker (on-demand, fallback)"]
end
subgraph TIERS["Object storage — tiered"]
HOT["Hot: in-flight (S3 Standard)"]
WARM["Warm: completed 0–6mo (S3 IA)"]
COLD["Cold: completed 6mo+ (Glacier)"]
end
ENUM --> DEDUP --> PRI --> API
API --> EVENT --> DB
S1 --> HOT
HOT --> E1
E1 --> WARM
WARM -.lifecycle.-> COLD
The structural differences from VOD transcoding:
- An enumeration / scheduling layer in front of MpegFlow handles the "which 10 million assets do we transcode in what order" problem — this is its own engineering task, not something the encoder layer solves
- Encode pool is spot-heavy — bulk archive workloads are highly preemption-tolerant
- Stage-in is its own bottleneck — pulling from LTO is slow; the architecture has to overlap stage-in with encode
- Output storage tiers — completed mezzanines move to cold storage on a schedule
Component-by-component
Archive enumeration layer
This is your code, not MpegFlow's. A scanner that:
- Walks the source archive (LTO library API, NAS filesystem, legacy DAM database)
- Produces a normalized inventory: asset_id, location, original_format, estimated_duration, priority
- Deduplicates (large archives often have 5–15% duplicates from re-ingests over the years)
- Hands the inventory to the scheduler
For LTO, the enumeration pass alone can take weeks. The good news: it's parallelizable across drives.
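A minimal sketch of what the inventory record and the dedup pass can look like. The field names follow the list above; the size and fingerprint fields, and the cheap hashing heuristic, are illustrative additions rather than anything MpegFlow prescribes:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class InventoryRecord:
    asset_id: str
    location: str             # e.g. "lto://LIB1/A00042/3" or a NAS path
    original_format: str      # container / codec as reported by the source catalog
    estimated_duration: int   # seconds, per the catalog (often wrong in legacy catalogs)
    priority: str             # "P0" | "P1" | "P2"
    size_bytes: int
    fingerprint: str = ""     # filled in by the enumerator

def compute_fingerprint(head: bytes, size_bytes: int) -> str:
    # Cheap duplicate heuristic: hash of the first few MB plus the byte size.
    # Enough to catch byte-identical re-ingests without reading whole tapes.
    return hashlib.sha256(head + size_bytes.to_bytes(8, "big")).hexdigest()

def deduplicate(records: list[InventoryRecord]) -> list[InventoryRecord]:
    seen: set[str] = set()
    unique = []
    for rec in records:
        if rec.fingerprint in seen:
            continue          # keep the first copy, drop the re-ingest
        seen.add(rec.fingerprint)
        unique.append(rec)
    return unique
```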
Priority scheduler
Why a separate layer: not every asset has the same urgency. Common priority bands:
- P0 (urgent): active-rights content, current-season programs, contractually deadline-bound assets — these must complete first
- P1 (target): the bulk of the archive
- P2 (deferrable): rarely-accessed long-tail content; transcode if budget/time allows, drop otherwise
The scheduler feeds MpegFlow at a controlled rate, so the encode pool doesn't back up on a flood of P2 work while P0 sits in the queue.
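One way to wire up that controlled feed, as a sketch: `submit_job()` and `count_in_flight()` stand in for whatever wrapper you put around MpegFlow's REST API, and the cap and back-off interval are illustrative:

```python
import heapq
import time

PRIORITY_RANK = {"P0": 0, "P1": 1, "P2": 2}

class PriorityScheduler:
    """Drain the inventory strictly by priority band, but never keep more
    than max_in_flight jobs queued in the encode layer at once."""

    def __init__(self, max_in_flight: int = 500):
        self.max_in_flight = max_in_flight
        self._heap = []      # (priority rank, enqueue order, record)
        self._order = 0

    def add(self, record):
        heapq.heappush(self._heap, (PRIORITY_RANK[record.priority], self._order, record))
        self._order += 1

    def run(self, submit_job, count_in_flight):
        # submit_job(record) and count_in_flight() are the integration points
        # with the control plane; the names here are placeholders.
        while self._heap:
            if count_in_flight() >= self.max_in_flight:
                time.sleep(30)   # back off and let the encode pool drain
                continue
            _, _, record = heapq.heappop(self._heap)
            submit_job(record)
```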
Stage-in pool
The bottleneck most teams underestimate.
Reading from LTO is sequential and slow — typical sustained read is 200–400 MB/s per drive, with seek penalties when jumping between assets on the same tape. A 200 GB master can take 8–15 minutes just to land on local NVMe.
Architecture pattern that works:
- Stage-in workers read several assets in parallel, batched by tape (avoid seek thrash)
- Asset lands on local NVMe of a stage-in worker
- Pushed to S3 Standard ("hot" tier) for handoff to encode pool
- MpegFlow job kicked off referencing the staged S3 object
- After encode completes, S3 hot copy is deleted (intermediate state, not durable)
For pure-disk archives (NAS, on-prem ZFS pool), this layer is simpler and faster — but the same architectural shape applies.
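A sketch of one stage-in worker's batch loop. The S3 calls are standard boto3; `read_from_tape()` and `submit_mpegflow_job()` are placeholders for your tape-library integration and control-plane client, not real MpegFlow APIs, and the bucket name is illustrative:

```python
import boto3

s3 = boto3.client("s3")
HOT_BUCKET = "archive-staging-hot"   # illustrative hot-tier bucket

def stage_in_batch(tape_id: str, assets, read_from_tape, submit_mpegflow_job):
    """Process every asset on one tape before moving on, to avoid seek thrash."""
    for asset in assets:
        local_path = read_from_tape(tape_id, asset.asset_id)   # lands on local NVMe
        key = f"staged/{asset.asset_id}"
        s3.upload_file(local_path, HOT_BUCKET, key)            # hot-tier handoff
        submit_mpegflow_job(
            asset_id=asset.asset_id,
            source=f"s3://{HOT_BUCKET}/{key}",
        )
        # The hot copy is intermediate state only; a completion webhook (or a
        # periodic sweeper) deletes it once the encode succeeds:
        #   s3.delete_object(Bucket=HOT_BUCKET, Key=key)
```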
Sizing: stage-in throughput, not encode capacity, often determines the migration timeline. The raw average looks easy: 2 PB over 12 months is only ~65 MB/s sustained. In practice, effective LTO throughput is a fraction of nominal (tape mounts, seeks, verification passes, retries, operator hours), and you need headroom to catch up after drive failures and stalled weeks. Plan for several hundred MB/s of effective stage-in capacity: typically 4–8 LTO drives running in parallel, or a well-tuned 10 Gbps NAS link.
Encode pool — spot-first
Unlike primary VOD, archive encode is highly preemption-tolerant. A spot/preemptible instance loses its job? Re-queue. Cost over time is dominated by hardware-hours; the workload itself is non-urgent.
Heuristic:
- 80–90% of capacity on spot/preemptible
- 10–20% on-demand for "long tail" jobs that have been preempted multiple times (pin them on stable instances after N preemptions)
- Mix CPU and GPU workers — old broadcast formats are often easier to GPU-transcode than premium-VOD outputs (lower quality bar)
Sizing for 2 PB / 12-month migration:
- Roughly 10M minutes of input (depends heavily on the archive's average bitrate)
- Single-rendition mezzanine output (no ABR ladder for archive — that's a downstream concern)
- Per-worker (16-core CPU): ~50K minutes/month if running near-constantly
- Required pool size: roughly 20 average concurrent workers, with 2x burst capacity (~40) for catch-up windows; the arithmetic is worked through in the sketch below
- On spot at 60–80% discount: cost roughly $5K–$10K/month for the encode pool
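The arithmetic behind that pool size, spelled out so you can swap in your own estimates (the bitrate note in the comment is an assumption, not a measurement):

```python
total_input_minutes = 10_000_000      # 2 PB works out to roughly this at ~25 Mbps average
project_months = 12
minutes_per_worker_month = 50_000     # one 16-core worker running near-constantly

avg_workers = total_input_minutes / project_months / minutes_per_worker_month
peak_workers = 2 * avg_workers        # burst headroom for catch-up windows

print(f"average concurrent workers: {avg_workers:.0f}")   # ~17; round up, plan ~20
print(f"peak concurrent workers:    {peak_workers:.0f}")   # ~33; round up, plan ~40
```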
Output storage tiering
Mezzanines that just finished encoding are useful for QC and immediate access (1–4 weeks). After that, they're long-tail.
Tiering policy:
| Age | Tier | Cost / GB / month | Access SLA |
|---|---|---|---|
| 0–30 days | S3 Standard / GCS Standard | ~$0.023 | Immediate |
| 31–180 days | S3 IA / GCS Nearline | ~$0.013 | Immediate (with retrieval fee) |
| 180+ days | S3 Glacier Instant / GCS Coldline | ~$0.004 | Immediate (with retrieval fee) |
| Optional: very long tail | Glacier Deep Archive / GCS Archive | ~$0.001 | 12-hour retrieval |
For 2 PB of finished mezzanines, the cost difference between "keep all on S3 Standard" and "tier appropriately" is roughly $45K/month vs $7K/month. Tiering matters.
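The tiering policy maps directly onto an object-storage lifecycle rule. A sketch for S3 with boto3, where the bucket name, prefix, and rule ID are illustrative and the storage classes are S3's standard identifiers:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="archive-mezzanines",          # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "mezzanine-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "mezzanines/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER_IR"},
                    # Optional very-long-tail tier:
                    # {"Days": 730, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```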
Capacity sizing — worked example
For a 2 PB archive migrated over 12 months:
| Component | Sizing | Rough monthly cost (cloud) |
|---|---|---|
| Enumeration / scheduling (your code) | 1–2 small services + a Postgres | $300 |
| MpegFlow control plane | 2 instances + LB | $400 |
| Postgres (managed) | medium tier | $400 |
| Stage-in pool (LTO + NAS read) | 4–8 drives + staging hosts | varies wildly by source |
| Encode pool (spot-heavy) | ~20 avg / 40 peak workers | $5,000 – 10,000 |
| Object storage hot tier | ~200 TB working set | $4,500 |
| Object storage warm tier | ~600 TB completed last 6mo | $7,800 |
| Object storage cold tier | growing toward 2 PB completed | $4,000 (start) → $8,000 (end of project) |
| Egress | minimal — outputs stay in cloud archive | $200 – 500 |
| Total cloud cost (rough) | | $22,000 – 32,000 / month |
Over 12 months: roughly $300K – $400K in cloud costs for a full 2 PB migration. The same workload on per-minute-priced managed transcoding is in the $1.5M–$3M band, depending on tier and rendition count.
What breaks at scale
The failure modes specific to multi-petabyte migrations:
Damaged tapes / corrupt inputs
Roughly 0.5–3% of legacy archive content has some corruption. Stage-in workers must classify by failure mode:
- Read error on tape: retry with different drive, escalate to operator if persistent
- Partial file read: keep what was retrieved, mark `partial_recovery` in the audit trail, hand to operator for review
- File reads but transcode fails: classify the FFmpeg failure (same taxonomy as our scale post); don't burn encode hours retrying genuinely broken files
You will never get to 100% migration. Plan the scheduler around it: 97% completion is success.
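A sketch of that classification step; the disposition names and thresholds are illustrative, not MpegFlow's built-in taxonomy:

```python
from enum import Enum

class Disposition(Enum):
    RETRY_OTHER_DRIVE = "retry_other_drive"
    PARTIAL_RECOVERY = "partial_recovery"     # keep what was read, flag for operator review
    TRANSCODE_FAILED = "transcode_failed"     # classify the FFmpeg error, don't blind-retry
    ESCALATE = "escalate_to_operator"

def classify_failure(tape_read_ok: bool, bytes_read: int, expected_bytes: int,
                     transcode_error: str | None, read_attempts: int) -> Disposition:
    if not tape_read_ok:
        # Read error on tape: retry on a different drive, escalate if persistent.
        return Disposition.RETRY_OTHER_DRIVE if read_attempts < 3 else Disposition.ESCALATE
    if bytes_read < expected_bytes:
        # Partial file read: keep it, record partial_recovery in the audit trail.
        return Disposition.PARTIAL_RECOVERY
    if transcode_error is not None:
        # File reads fine but the transcode fails: classify, don't blind-retry.
        return Disposition.TRANSCODE_FAILED
    raise ValueError("classify_failure called on an asset that did not fail")
```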
LTO drive availability
Drives are physical, finite, and sometimes break during long projects. The stage-in scheduler needs to model drive availability and rebalance assets across remaining drives if one dies. For the drive pool sized above, allocate budget for at least 2 drive replacements during the project.
Catalog inconsistency
The archive's metadata catalog is often wrong. Asset 123 says it's a 10-minute clip; the file is 90 minutes. The duration field in your DAM disagrees with the actual file by 30%. This is normal and almost universal. The probe stage of the MpegFlow pipeline catches it and re-records the truth — but downstream systems that trust the original metadata will be wrong about half your assets.
The fix: emit a "metadata-correction" event for every asset where probe disagrees with catalog, and feed it back into the source DAM.
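A sketch of that correction check, assuming the probe result carries the measured duration and `emit_event()` is whatever webhook or DAM integration you run these events through (both are placeholders):

```python
def check_catalog_against_probe(asset, probed_duration_s: float, emit_event,
                                tolerance: float = 0.05):
    """Emit a metadata-correction event when the probed duration disagrees with
    the source catalog by more than `tolerance` (5% by default, illustrative)."""
    catalog_duration_s = asset.estimated_duration
    if catalog_duration_s <= 0:
        drift = float("inf")
    else:
        drift = abs(probed_duration_s - catalog_duration_s) / catalog_duration_s
    if drift > tolerance:
        emit_event({
            "type": "metadata_correction",
            "asset_id": asset.asset_id,
            "field": "duration",
            "catalog_value": catalog_duration_s,
            "probed_value": probed_duration_s,
        })
```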
Spot preemption death spirals
Long jobs (>4 hours) on spot instances that get preempted twice in a row can eat their entire spot savings in retry overhead. The scheduler should track per-job preemption count and pin retries to the on-demand pool after 2 preemptions.
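A sketch of the requeue rule, assuming the scheduler keeps per-job preemption counts in its own state and `requeue()` can target either pool (both names are illustrative):

```python
MAX_SPOT_PREEMPTIONS = 2

def handle_preemption(job, preemption_counts: dict, requeue):
    """On a spot preemption, requeue the job, but pin it to the on-demand
    pool once it has burned its spot budget."""
    preemption_counts[job.id] = preemption_counts.get(job.id, 0) + 1
    pool = "on-demand" if preemption_counts[job.id] >= MAX_SPOT_PREEMPTIONS else "spot"
    requeue(job, pool=pool)
```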
Cost runaway via data transfer
A subtle one: watch which direction the bytes flow and who meters them. Ingress into the major clouds is free, but the source side usually isn't: if the archive sits in another cloud or a colocation facility that meters outbound transfer, 2 PB × $0.09/GB = $180,000 in transfer charges before a single frame is encoded. The same math applies in reverse if finished mezzanines ever have to come back out of the cloud to an on-prem archive. Either way, data transfer is often the single largest line item in the project.
Mitigation: put the migration on a dedicated link to a nearby cloud region (Direct Connect, ExpressRoute, dedicated interconnect), which removes the public-internet bottleneck and cuts per-GB transfer pricing, or run the encode on-prem next to the data. The numbers force one or the other for archives at this scale.
Operational rhythm for a 12-month migration
Real cadence we've seen work for projects this size:
Month 1 — enumeration
- Stand up enumeration workers, scan the entire archive, build the asset inventory
- Probe a 1–2% sample to characterize the input distribution
- Refine the priority bands based on what you actually find
Month 2 — encoder calibration
- Tune presets against the sample for quality vs. throughput trade-offs
- Pilot with 1,000 assets end-to-end. Verify outputs pass downstream QC
- Lock encoder version and presets
Months 3–11 — bulk migration
- Constant operation. Daily standups while the project is running.
- Weekly review of: completion rate, cost trend, error rate, deferred-bucket size
- Monthly review of catalog corrections (emit corrections back to source DAM)
Month 12 — long tail + cleanup
- The last 5% always takes longer than the first 95% — damaged inputs, weird formats, contractual edge cases
- Final audit reconciliation: every asset accounted for, even if as "skipped: corrupted_input" or "skipped: out_of_scope"
- Decommission stage-in workers, reduce encode pool to maintenance levels
Compliance considerations
Most archive content has stricter compliance than primary VOD because it includes:
- Pre-release footage that was never aired
- Outtakes, interviews, b-roll with NDA implications
- Unreleased / shelved productions
The relevant patterns:
- Encryption at rest end-to-end (CMK / customer-managed keys)
- Air-gapped variants for sensitive subsets (run a separate self-hosted MpegFlow cluster for these)
- Detailed audit trail — every asset's transcoding event, including who initiated, which encoder version, what output hash. Some contracts require 7+ years retention of these records.
- Watermarking — for content that requires perceptual or forensic watermarking. Pluggable in MpegFlow's pipeline; pair with a partner like NexGuard or Verimatrix.
For pre-release content the architecture above runs in a dedicated, network-isolated VPC. For the broader bulk migration of already-aired content, the standard architecture is fine.
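For the encryption-at-rest point, the upload is the natural enforcement point. A sketch for S3 with a customer-managed KMS key; the paths, bucket, and key ARN are placeholders:

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "/nvme/out/asset_123_mezz.mov",           # finished mezzanine on the worker
    "archive-mezzanines",                     # illustrative bucket name
    "mezzanines/asset_123.mov",
    ExtraArgs={
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    },
)
```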
How this architecture differs from "just FFmpeg with a queue"
You can absolutely run a 2 PB migration with a Python script wrapping FFmpeg and SQS. Some teams have. The places it hurts at scale:
- No audit trail per asset — when ops asks "why is asset 4,532,012 still failing?" you have to grep stderr files. With MpegFlow, the audit table answers in one query.
- No encoder version pinning — you upgrade FFmpeg mid-project and half your archive is encoded with v6.0, half with v6.1. QC might catch the differences; it might not.
- Single-stage thinking — each asset goes through probe → encode → emit, but with a bare script the stages aren't independently retryable. Failure in emit means re-encoding from scratch.
- No DAG, no parallelism guarantee — script-based bulk processing tends to serialize where it shouldn't.
These aren't dealbreakers; they're tax. Over a 12-month project, the tax compounds.
How to evaluate this architecture for your team
If you're planning an archive migration in the petabyte band:
- Estimate your input distribution (formats, durations, sizes)
- Estimate your stage-in throughput ceiling (LTO drive count, NAS bandwidth, network)
- Pick your target output format(s) and run the per-asset cost math (a quick helper follows this list)
- Pick your storage tiering policy
- Decide your acceptance threshold for non-recoverable assets (97%, 99%, 99.5% — be honest, 100% isn't a thing for legacy archives)
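For the per-asset cost math, a back-of-the-envelope helper. The defaults echo the worked example above; the one-million asset count is an assumption (2 PB at roughly 2 GB per asset), so replace all three inputs with your own estimates:

```python
def per_asset_cost(monthly_cloud_cost: float = 27_000,   # midpoint of the estimate above
                   project_months: int = 12,
                   asset_count: int = 1_000_000) -> float:
    """Total project cloud spend divided across migrated assets."""
    return monthly_cloud_cost * project_months / asset_count

print(f"${per_asset_cost():.2f} per asset")   # ~$0.32 per asset at these defaults
```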
If the architecture maps to where you'd want to be, apply to the design partner program — archive migration is one of the highest-value workloads we'd onboard ahead of GA.