A worker pod is running customer FFmpeg commands. The customer's input file might be a hand-crafted MP4 designed to exploit a libavformat bug. If that worker has any credential — a database password, an S3 key, a service-mesh certificate — a successful exploit means the attacker can read other tenants' data.
The standard response is "harden the worker." Better libavformat patches, better seccomp, better sandboxing. All worth doing.
The MpegFlow response is more aggressive: the worker has no credentials at all. Not stripped-down credentials, not least-privileged credentials. Zero. The strict-broker pattern isn't "the worker has slightly fewer secrets than before"; it's that the worker is structurally incapable of accessing anything beyond its current job's inputs and outputs, because it doesn't know how.
This document covers the multi-tenant security architecture in production.
Use case in scope
Any deployment where:
- More than one customer / tenant / org runs jobs on the same physical fleet
- Customer-supplied input files run through FFmpeg (which has had real CVEs over the years — CVE-2023-49502, CVE-2024-7272, etc.)
- A breach of tenant isolation would be material — contractual penalty, reputational damage, regulatory exposure
This is most video infrastructure. Even single-tenant SaaS (your customers each get a dedicated cluster) often runs co-tenant pipelines for cost reasons.
The threat model
We model an adversary who:
- Submits jobs as a legitimate customer (has valid API credentials for their organization)
- Crafts input files designed to exploit FFmpeg vulnerabilities — buffer overflow, integer underflow, malformed container parsing, etc.
- Achieves arbitrary code execution within a worker process
- Wants to: read other tenants' files, modify other tenants' jobs, exfiltrate credentials, persist on the host, pivot laterally
The defenses are layered:
```mermaid
graph TB
A["Adversary submits<br/>malicious input"]
A --> B["Worker fetches input<br/>via presigned URL only"]
B --> C{"Input parses<br/>cleanly?"}
C -->|"No"| F["FFmpeg fails fast<br/>(no exec)"]
C -->|"Yes (or compromised)"| D["FFmpeg runs<br/>under seccomp +<br/>resource limits"]
D --> E{"Compromised?"}
E -->|"No"| OK1["Job completes<br/>normally"]
E -->|"Yes — RCE"| G["Worker has<br/>no credentials"]
G --> H{"Try to access<br/>other tenant data?"}
H -->|"Network: blocked"| BLOCK1["No path to DB"]
H -->|"S3 keys: none"| BLOCK2["No credentials<br/>to assume"]
H -->|"Lateral: blocked"| BLOCK3["No service-mesh<br/>identity"]
BLOCK1 --> CONTAIN["Worker process killed<br/>by health-check timeout"]
BLOCK2 --> CONTAIN
BLOCK3 --> CONTAIN
```
The critical claim: even with arbitrary code execution inside the worker, the attacker cannot read another tenant's data. Defense-in-depth, but the structural defense is the credential vacuum.
How the strict-broker works
Step 1: Worker boots with **no** credentials
Every other process in your stack has something. A database password. An S3 IAM role. A service mesh cert. A vault token.
A MpegFlow worker has:
- A worker authentication token (shared secret, scoped only to "this is a real worker") — used to authenticate to the gRPC coordinator. Cannot read or write anything else.
- A container image hash (so you can verify what version it ran)
- That's it. No DB credentials. No S3 IAM role. No service-mesh identity.
Step 2: Coordinator generates per-job presigned URLs
When the coordinator assigns a job to a worker, it generates short-lived presigned URLs for only the assets that job needs:
- Input URLs: presigned `GET` URLs for each input asset. TTL: 1 hour. Single-use semantics aren't enforceable in S3, but the short window sharply limits replay exposure.
- Output URLs: presigned `PUT` URLs for each output the job will produce. TTL: 1 hour.
The presigned URLs encode the exact resource path. They are not credentials in the lateral-movement sense — even if exfiltrated, they only authorize access to this specific asset, for this short window.
```mermaid
sequenceDiagram
participant Coord as Coordinator<br/>(has S3 signing key)
participant Worker as Worker<br/>(no credentials)
participant S3 as S3 / MinIO
Worker->>Coord: gRPC GetWorkAssignment(worker_token)
Coord->>Coord: Authenticate worker
Coord->>Coord: Pick a pending job for this pool
Coord->>S3: GeneratePresignedURL(input.mp4, GET, 1h)
Coord->>S3: GeneratePresignedURL(output.mp4, PUT, 1h)
Coord-->>Worker: JobAssignment {<br/>input_urls: [...],<br/>output_urls: [...],<br/>ffmpeg_cmd: ...<br/>}
Worker->>S3: HTTP GET input.mp4 (presigned URL only)
S3-->>Worker: input bytes
Worker->>Worker: Run FFmpeg (sandboxed)
Worker->>S3: HTTP PUT output.mp4 (presigned URL only)
S3-->>Worker: 200 OK
Worker->>Coord: gRPC ReportJobStatus(completed)
```
Step 3: Worker's network is locked down
Workers can reach:
- The gRPC coordinator endpoint (mTLS optional, worker token required)
- The presigned-URL host (S3, MinIO) — over public TLS, no IAM headers from the worker
Workers cannot reach:
- The Postgres database directly
- The Redis instance directly
- Other tenants' worker pods (NetworkPolicy denies pod-to-pod traffic)
- The host's metadata service (IMDS, including IMDSv2, blocked at the pod-network level — no AWS role fishing)
- Any internal service (no service-mesh sidecar with internal cert)
This is enforced by Kubernetes NetworkPolicy and node-level firewall rules, not at the application level. A compromised worker can't bypass it.
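The reach / cannot-reach split above maps naturally onto a Kubernetes NetworkPolicy. The following is a sketch, not a drop-in manifest: the namespace, labels, and port numbers are hypothetical, and a real cluster also needs a DNS egress rule (UDP/TCP 53) for the S3 hostname to resolve.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: worker-egress-allowlist   # hypothetical name
  namespace: mpegflow-workers     # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      role: worker
  policyTypes: ["Ingress", "Egress"]
  ingress: []                     # nothing may dial into a worker
  egress:
    - to:                         # coordinator gRPC only
        - podSelector:
            matchLabels:
              role: coordinator
      ports:
        - port: 50051             # hypothetical gRPC port
          protocol: TCP
    - to:                         # object storage over TLS
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32   # block the cloud metadata service
      ports:
        - port: 443
          protocol: TCP
```

Note the empty `ingress` list: with `Ingress` in `policyTypes`, it denies all inbound pod traffic, which covers the pod-to-pod isolation row above.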
Step 4: FFmpeg runs under sandboxing
Inside the worker container:
- Read-only root filesystem — write access only to the working directory
- Memory cap — kills jobs that balloon (often the symptom of a malformed input)
- CPU cgroup — limits one job's CPU steal to other jobs in the pool
- Seccomp profile — restricts syscalls to the set FFmpeg actually needs
- Drop all capabilities — no `CAP_NET_ADMIN`, no `CAP_SYS_PTRACE`, etc.
- Non-root UID — worker process and FFmpeg both run as a non-root user
These don't prevent code execution within the seccomp-allowed surface, but they prevent escape into the broader host context.
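From the worker process itself, the memory and CPU caps can be sketched with POSIX rlimits (Unix/Linux only). This is illustrative, not the production mechanism: seccomp, capability drops, and the read-only rootfs are applied by the container runtime, cgroup caps come from the pod spec, and the specific limits and function names here are hypothetical.

```python
import resource
import subprocess
import sys

MEM_BYTES = 1024 * 1024 * 1024  # hypothetical per-job address-space cap
CPU_SECONDS = 300               # hypothetical per-job CPU-time cap

def _apply_limits():
    # Runs in the child between fork() and exec(): cap memory and CPU time,
    # never raising the cap above the inherited hard limit.
    for limit, cap in ((resource.RLIMIT_AS, MEM_BYTES),
                       (resource.RLIMIT_CPU, CPU_SECONDS)):
        _, hard = resource.getrlimit(limit)
        if hard != resource.RLIM_INFINITY:
            cap = min(cap, hard)
        resource.setrlimit(limit, (cap, hard))

def run_job(cmd: list[str], workdir: str = ".") -> subprocess.CompletedProcess:
    """Run an FFmpeg command under rlimits; a ballooning job dies on its own."""
    return subprocess.run(
        cmd,
        cwd=workdir,            # only writable path under a read-only rootfs
        preexec_fn=_apply_limits,
        capture_output=True,    # stderr is retained for the audit trail
        timeout=3600,           # wall-clock backstop on top of the CPU limit
    )
```

Treat this as in-process belt-and-suspenders on top of the cgroup caps, not a replacement for them.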
Strict-broker vs the alternatives
The pattern is unusual in video infrastructure. More common alternatives:
Pattern: workers have IAM roles ("trusted workers")
The default for most teams. Worker pods run with an IAM role granting S3 read/write to the input/output buckets. Simple, fast, the obvious thing.
Problem: the IAM role grants access to the bucket, not the specific object. A compromised worker reads any tenant's data in the bucket. Mitigation requires per-tenant bucket separation, which adds operational complexity (and most teams don't actually do it).
Pattern: per-tenant clusters ("dedicated tenancy")
Each customer gets their own cluster. No shared workers, no co-tenancy. Strongest isolation.
Problem: expensive. A 100-customer SaaS becomes 100 clusters, each underutilized. Most providers don't offer this except at very high tiers.
Pattern: VM-level isolation per job
Each FFmpeg invocation runs in a fresh VM (Firecracker, Kata Containers). Hardware-level isolation between jobs.
Problem: cold-start latency. VMs take 100ms-2s to provision; for short jobs that doubles total latency.
Pattern: **Strict-broker (MpegFlow's choice)**
Workers run shared, but they have no credentials. Compromise of the worker process can't reach beyond the current job.
Trade-off: every file transfer goes through presigned URL generation (one extra round-trip per asset). Adds ~10-50ms per job in coordinator load. For broadcast workloads where jobs run minutes-to-hours, this is invisible.
Webhook security
Outbound webhooks from MpegFlow to your application are HMAC-SHA256 signed. The signature header (X-MpegFlow-Signature) is verifiable using the webhook secret you configured at webhook creation time.
```mermaid
sequenceDiagram
participant Bus as EventBus
participant WH as WebhookService
participant Exec as Webhook Executor
participant App as Your Application
Bus->>WH: emit(JobCompleted)
WH->>WH: Build payload (resource snapshot)
WH->>WH: Sign with HMAC-SHA256(secret, body)
WH->>Exec: Queue webhook delivery
Exec->>App: POST /your/webhook<br/>Headers:<br/> X-MpegFlow-Signature: hmac=...<br/> X-MpegFlow-Timestamp: ...<br/> X-MpegFlow-Event: job.completed
App->>App: Verify signature<br/>(reject if invalid)
App-->>Exec: 200 OK
alt Failure
Exec->>Exec: Backoff (1m → 5m → 30m)
alt 10+ consecutive failures
Exec->>WH: Disable webhook<br/>(circuit breaker)
end
end
```
Notes:
- The signature includes a timestamp to prevent replay attacks. Reject any delivery older than 5 minutes.
- Failed deliveries retry with exponential backoff. After 10 consecutive failures, the webhook is automatically disabled (circuit breaker) — your application stays healthy even when your webhook handler is broken.
- Failed deliveries persist in the `webhook_deliveries` table; you can replay them via API.
Audit trail — what's recorded per job
The strict-broker pattern is only useful if you can verify it worked. Every job records:
| Field | Why it matters |
|---|---|
| `job.id`, `job.org_id`, `job.workflow_id` | Tenant isolation auditability |
| Worker assignment timestamp + worker_id | Which physical worker ran this job |
| FFmpeg command (full) | Exact reproducibility — can replay any job |
| FFmpeg version + container hash | Know which binary actually ran |
| Input asset hashes (SHA-256) | Confirm exactly which file was processed |
| Output asset hashes | Confirm exactly what was produced |
| Stage-by-stage timestamps | Validation, encode, package each tracked |
| Exit code + stderr (full, compressed) | Debug failures + flag suspicious patterns |
| Retry history (if any) | Each attempt's worker, params, outcome |
These records are append-only in PostgreSQL (`audit_logs` table), backed up nightly, retained per your contractual requirements (often 7+ years for broadcast).
For regulated industries, this trail satisfies most "we need to know exactly what happened to this content" questions without additional tooling.
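The input/output hashes in the table above are plain SHA-256 digests; the one implementation detail worth getting right is streaming, so a multi-gigabyte mezzanine never loads fully into memory. A minimal sketch (the function name is ours):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks and return the hex SHA-256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```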
Defense-in-depth checklist
When deploying for a multi-tenant production environment, verify:
| Layer | Check |
|---|---|
| Worker credentials | Worker pod IAM role: empty. Service account: bound to worker-only Role with zero access to tenant data. |
| Network | Kubernetes NetworkPolicy denies pod-to-pod between tenants. Egress allowlist: only coordinator gRPC + S3 endpoints. IMDS blocked. |
| Container | Read-only rootfs. Drop all capabilities. Run as non-root. Seccomp profile restricting syscalls to FFmpeg-needed set. Memory + CPU cgroups. |
| Image | Worker image scanned on every deploy (Trivy, Snyk, etc.). Signed via cosign or equivalent. Pinned to specific FFmpeg version. |
| Coordinator | Presigned URL TTL ≤ 1 hour. URL generation logs every request. Signing key rotation cadence documented. |
| Webhooks | HMAC-SHA256 signing enforced on every delivery. Timestamp window ≤ 5 minutes. Circuit breaker disabling failed endpoints. |
| Audit | Full provenance per job (encoder version, params, input/output hashes). Append-only table. Backups. Retention matched to contracts. |
| Observability | Worker process resource exhaustion alerts. Suspicious-syscall alerts (auditd or eBPF). Coordinator authentication failure rate. |
Defense-in-depth — what this architecture mitigates and where the next layer lives
Strict-broker is one layer of a multi-layer defense. Top-tier security teams expect to see how the whole stack composes; here's how.
| Concern | Strict-broker contribution | Where the next defense layer is |
|---|---|---|
| Worker compromise via FFmpeg exploit | Worker has no credentials → blast radius capped at the current job | Container hardening (seccomp, drop capabilities, read-only rootfs) — applied today |
| Coordinator compromise | Coordinator is the trust anchor — runs in a hardened tier with strict ingress controls | Coordinator-tier hardening: limited egress, mTLS, per-deploy audit, separated infrastructure-as-code |
| PostgreSQL compromise | Audit log + metadata co-located by design (operational simplicity) | Encryption at rest with customer-managed KMS, access logging, periodic rotation. For highest-threat workloads: per-tenant database isolation (Enterprise tier) |
| Side channels (Spectre/Meltdown class) | Multi-tenant workers run on shared nodes; structurally vulnerable to CPU-level side channels | For high-threat workloads, request dedicated nodes per tier — supported in Enterprise plans |
| Denial of service | Per-pool resource caps prevent a single tenant from exhausting shared workers | Application-level rate limiting + quotas (Free/Starter/Pro plan tiers); CDN-level DDoS protection at the perimeter |
| Insider threats | All admin actions logged to append-only audit table | SOC 2 Type II controls (least privilege, periodic access review, separation of duties). Audit window opens 2026 Q4 |
Each row reads as: "MpegFlow's structural choice + the standard production hardening that goes with it." For procurement security questionnaires, this is the framing your security review will use anyway — we're just being explicit about it up front.
How to evaluate this architecture for your team
If you're on a team where multi-tenant security is a procurement gate:
- Ask vendors directly: "Do your workers have IAM roles?" If yes, ask how they isolate at the bucket level.
- Ask: "What does the worker know after a successful exploit?" If the answer is anything more than "the URL of the input file currently being processed," you have lateral-movement exposure.
- Ask for the audit trail per job. The depth of provenance tells you how seriously they took multi-tenancy.
- Run the defense-in-depth checklist above against any platform you evaluate. Most fail at least 3 of those rows.
If our approach matches what your security team needs, apply to the design partner program — multi-tenant security is one of the conversations we have most often with broadcast and OTT operators.