You're evaluating video orchestration platforms — the layer between your application and your encoders. Maybe it's MpegFlow. Maybe it's a hand-rolled queue + workers your team is considering replacing. Maybe it's a vendor that promises "AI-powered video workflows" and a glossy demo.
The vendor pitches will cover features. The vendor pitches will not cover the things that decide whether the platform survives your production traffic. This post is the seven questions that surface those things — vendor-neutral, written so you can ask them in any sales call and any RFP, and recognize the warning signs in the answers.
If you're the buyer, print this and bring it to the next vendor demo. If you're the vendor, these questions are coming.
Question 1: Where does my data live?
The seemingly simple question that exposes the deepest architectural differences.
The right answer: "Your mezzanine assets and outputs live in your storage. Our workers receive presigned URLs to read and write. We never relay your bytes through our infrastructure. Our control plane stores job metadata, audit logs, and pipeline definitions only." That's the strict-broker pattern — workers carry zero credentials, and a worker compromise can read only the in-flight job's mezzanine, nothing else.
Yellow flag: "We replicate your data to our managed storage during processing." Translation: your bytes flow through their infrastructure. Now you're trusting their security posture, their incident response, their data-residency story, and their billing relationship for storage costs.
Red flag: "Our managed storage is included." Translation: lock-in via data gravity. Migration off the platform requires a multi-petabyte data transfer.
The follow-up: ask for the network diagram. If they can't show you where every byte of customer data flows on a single page, the answer to "where does my data live" is "wherever the implementation happens to put it that day."
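To make the strict-broker handoff concrete, here is a minimal sketch of the worker side, assuming presigned S3-style URLs delivered in the job spec (field names like input_url are hypothetical, not any particular vendor's API). The point: the worker holds two time-limited URLs and zero credentials.

```python
import urllib.request

def transcode(path: str) -> bytes:
    # Stand-in for the actual encoder invocation; see Question 3
    # for what a real one should record in the audit log.
    with open(path, "rb") as f:
        return f.read()

def run_job(job: dict) -> None:
    # Fetch the mezzanine via a time-limited presigned GET.
    # No IAM role, no storage keys, no service-mesh identity.
    with urllib.request.urlopen(job["input_url"]) as resp, \
            open("/tmp/mezzanine.mxf", "wb") as f:
        f.write(resp.read())

    output = transcode("/tmp/mezzanine.mxf")

    # Write the rendition via a presigned PUT. If this worker is
    # compromised, these two URLs are the entire blast radius.
    req = urllib.request.Request(job["output_url"], data=output, method="PUT")
    urllib.request.urlopen(req)
```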
Question 2: What happens when a worker dies mid-encode?
The question that exposes operational maturity.
The right answer: "The job is re-enqueued. The previous worker's partial output is GC'd. The retry runs on a fresh worker, with the same job spec but a different correlation ID. The audit log records both attempts. If the failure class indicates a deterministic error (bad input, missing codec) we don't retry — we mark failed_user so you don't burn cycles on a job that will fail again."
Yellow flag: "The Kubernetes Job resource handles retries automatically." Means the platform is leaning on K8s primitive retries, which restart the entire encode from scratch — fine for a 30-second job, terrible for a 4-hour archive transcode where you've burned 3 hours of compute that you'll repeat.
Red flag: "We have a Slack channel where you can ping us if a job is stuck." That's not a retry semantic — that's escalation theater.
The follow-up: ask for the failure-class taxonomy. A real platform will have a documented list (oom, bad_input, network_timeout, codec_unsupported, user_cancelled) with retry behavior per class. We covered the operational shape in running FFmpeg at scale. If they can't give you the taxonomy, retries are heuristic.
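To make "retry behavior per class" concrete, here is a sketch of what that taxonomy can look like in code. The class names are taken from the list above; the policy values are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryPolicy:
    retry: bool        # is this failure class worth retrying at all?
    max_attempts: int  # total attempts, including the first

# Transient failures get bounded retries on a fresh worker; deterministic
# failures are marked failed_user and never retried.
RETRY_TAXONOMY = {
    "oom":               RetryPolicy(retry=True,  max_attempts=3),
    "network_timeout":   RetryPolicy(retry=True,  max_attempts=5),
    "bad_input":         RetryPolicy(retry=False, max_attempts=1),
    "codec_unsupported": RetryPolicy(retry=False, max_attempts=1),
    "user_cancelled":    RetryPolicy(retry=False, max_attempts=1),
}

def should_retry(failure_class: str, attempt: int) -> bool:
    # Unknown failure classes default to a single cautious retry.
    policy = RETRY_TAXONOMY.get(failure_class)
    if policy is None:
        return attempt < 2
    return policy.retry and attempt < policy.max_attempts
```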
Question 3: Can I see what FFmpeg actually did?
The encoder-visibility question.
The right answer: "Every job emits an audit log entry with the full FFmpeg invocation, the encoder version (or container hash), the input/output hashes, the runtime, the resource usage, and any retry attempts. You can correlate any output back to the exact bits that produced it." Encoder-version pinning is non-negotiable for broadcast or contractual delivery — when QC asks "did the same encoder produce these two ladders?" you need an answer.
Yellow flag: "We provide CloudWatch logs for every job." Translation: you can correlate by parsing logs. Possible but operationally exhausting.
Red flag: "The encoder is abstracted away — we handle that for you." Translation: you cannot see what produced your output. For broadcast, archive, or compliance work this disqualifies the vendor.
The follow-up: ask to see a real audit log entry. The fields that matter: encoder_version, command_args, input_hash, output_hash, start_time, end_time, retry_count, worker_id, failure_class (if applicable). Anything less is partial provenance.
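For reference, a hypothetical entry with those fields (every value here is invented; the shape is what matters):

```python
audit_entry = {
    "job_id": "job_9f3c2a",
    "worker_id": "wrk-east-042",
    "encoder_version": "ffmpeg-6.1.1 @ sha256:4be1d7...",  # pinned container hash
    "command_args": ["-i", "in.mxf", "-c:v", "libx264",
                     "-preset", "slow", "out.mp4"],
    "input_hash": "sha256:8d0f41...",
    "output_hash": "sha256:a91bc3...",
    "start_time": "2025-06-01T14:02:11Z",
    "end_time": "2025-06-01T14:38:47Z",
    "retry_count": 0,
    "failure_class": None,  # populated only when the job fails
}
```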
Question 4: What's the multi-tenant security failure mode?
The honest version of "is the platform secure?"
The right answer: "If a worker process is compromised through an FFmpeg CVE, a malicious input, or a container escape, the blast radius is limited to the in-flight job's mezzanine. The worker has no IAM credentials, no DB password, no service-mesh identity — so there's nothing for the attacker to pivot to." That's the architectural answer, not the process answer. We documented our version in the strict-broker security architecture.
Yellow flag: "Our security is SOC 2 Type II certified." Compliance certificates describe process. They do not describe what happens when the FFmpeg-of-the-month CVE drops next Tuesday.
Red flag: "Workers have IAM roles for performance reasons." A compromised worker → every tenant's storage. The attacker uses your encoder fleet to exfiltrate every customer's mezzanine assets while encoding their own. The vendor finds out a week later when AWS Cost Anomaly fires.
The follow-up: ask how webhook signing works. HMAC-SHA256 with replay-prevention timestamps is the right answer. "We sign with our private key" without a timestamp is replay-vulnerable. "We don't sign webhooks" means anyone who learns your endpoint URL can forge job-completion events.
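A minimal sketch of receiver-side verification, assuming the sender signs timestamp-plus-body with a shared secret (the header layout is hypothetical; check the vendor's docs for the real one):

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # reject anything older than five minutes

def verify_webhook(secret: bytes, body: bytes,
                   timestamp: str, signature: str) -> bool:
    # Replay prevention: a validly signed but stale request is still an attack.
    if abs(time.time() - int(timestamp)) > TOLERANCE_SECONDS:
        return False
    # The timestamp is inside the signed payload, so it can't be swapped out.
    expected = hmac.new(secret, f"{timestamp}.".encode() + body,
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, signature)
```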
Question 5: What's the self-hosted vs managed parity?
The compliance / sovereignty question, which often arrives later than it should.
The right answer: "Same binary, same APIs, same primitives. Self-hosted is a deployment choice — managed and self-hosted run identical control planes. You can validate on managed and graduate to self-hosted (or the reverse) by changing config, not by rewriting your pipelines."
Yellow flag: "Our on-premise product is feature-complete with the cloud product, with some lag on the latest features." Means you'll be 6-12 months behind the cloud version. For most workloads tolerable; for broadcast, where workflow features ship in response to customer pain, the lag is real friction.
Red flag: "Self-hosted is a different product with a different name." Means it's a different codebase. Migration between them is a re-platform, not a config change.
The follow-up: ask if their managed service runs the same binary as the on-prem one. If the answer is "of course," the next follow-up is "can I see the version dashboard for both?" That's the verifiable test.
Question 6: How do I migrate off?
The lock-in question, which honest vendors don't dodge.
The right answer: "Your job specs export as JSON. Your audit logs export as a structured event stream. Your storage was always yours. Your webhook receivers stay in place — they need a one-line signature-verification update if you swap to a different vendor's signing key. Migration is a business decision; it's not blocked by data we hold hostage."
Yellow flag: "We have an export tool." Translation: it works for the simple cases and breaks on the complex ones, and you'll find out which is which on the migration weekend.
Red flag: "Most customers don't migrate." Not an answer. Could mean the product is sticky in good ways (you love it) or sticky in bad ways (the migration cost approximates rebuilding from scratch).
The follow-up: ask if the platform's job-spec format is documented or vendor-shaped. Documented = portable. Vendor-shaped = bespoke translation work to migrate. We have migration notes from MediaConvert / Bitmovin / Mux on each comparison page, and the test is "could a competent engineer write the migration parser in a week."
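The week-long-parser test is concrete. If the spec format is documented, migration is a mapping exercise like the toy sketch below (both the source and target formats are invented for illustration); if it's vendor-shaped, every field is a reverse-engineering project.

```python
def translate_job_spec(spec: dict) -> dict:
    # Map a hypothetical documented job spec onto a hypothetical
    # internal format. Documented inputs keep this function boring.
    return {
        "input": spec["source"]["url"],
        "outputs": [
            {
                "codec": r["codec"],
                "height": r["resolution"]["height"],
                "bitrate_kbps": r["bitrate"] // 1000,
            }
            for r in spec["renditions"]
        ],
        "webhook": spec.get("notifications", {}).get("on_complete"),
    }
```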
Question 7: What's the worst-day failure mode you've actually seen?
The question that filters serious vendors from theater.
The right answer is a story. "Last summer, AWS us-east-1 had a 4-hour partial outage that affected our Redis primary. Workers couldn't dequeue. We failed over to us-west-2 in 8 minutes. 14% of in-flight jobs needed re-enqueue from the dead-letter audit log; the rest resumed transparently. Customer-visible impact: ~12 minutes of elevated latency on new submissions, no failed encodes that wouldn't have failed anyway." Specific, factual, includes the parts they didn't handle perfectly.
Yellow flag: "We've had 99.95% uptime over the last 12 months." Generic SLA recitation. Tells you nothing about how they handle real failure, which is what you actually want to know.
Red flag: "We haven't had any major incidents." Either they're new (fine, just say so) or they're hiding incidents (alarming).
The follow-up: ask for their incident response runbook. Real vendors will share at least the high-level shape (escalation tiers, communication SLAs, post-mortem cadence) under NDA. Vendors who can't will hide behind "that's confidential" — which usually means it doesn't exist in writing.
Bonus question: what doesn't this platform do?
The honesty test.
A serious vendor will name 3-5 things their platform doesn't do well, and tell you when to pair with someone else. Ours: we don't ship a player; we don't bundle a CDN; we're pre-GA on live; our DRM packaging is roadmap. That answer earns more trust than any feature checklist because it tells you the vendor is honest about scope.
If a vendor claims to do everything well, either they're lying, they're spread thin, or they're describing roadmap as if it's product. Any of the three is a deal-breaker for production-serious teams.
Closing
These seven questions surface the engineering posture, the architecture, and the business model behind any orchestration platform. They map closely onto how MpegFlow is built — that's not a coincidence; we wrote them from the position of having watched teams hit each question the hard way and wishing they'd asked it before signing.
If you're using this checklist to evaluate MpegFlow specifically, our answers are documented in the trust page, the strict-broker security architecture, the running FFmpeg at scale post, and the alternatives pages — every one of those exists because a real buyer asked some version of these questions and we didn't want the answer to be "trust us."
If you're using this checklist for a different vendor, good luck. Most of them will pass questions 1-3 and stumble on questions 4-7. The vendors that pass all seven are worth what they charge.