MpegFlow with Datadog: metrics, APM, log aggregation
How MpegFlow integrates with Datadog — OpenMetrics scraping, distributed tracing via OTLP, log aggregation, and the dashboards that matter for video pipeline ops.
Datadog is one of the most widely used observability platforms in B2B SaaS — metrics, APM, log aggregation, and infrastructure monitoring in one product. MpegFlow integrates via Prometheus-style scraping in the OpenMetrics format (Datadog's OpenMetrics check), OTLP traces (Datadog APM), and log forwarding through the Datadog Agent.
How the integration works
The Datadog Agent runs on every K8s node (DaemonSet). The Agent's OpenMetrics check scrapes MpegFlow's /metrics endpoint at 30-second intervals. Traces export via OTLP/gRPC to the Agent on port 4317. Logs are tailed from container stdout via the Agent's log collection. No Datadog SDK is needed in application code; OpenMetrics and OTLP are vendor-neutral standard interfaces.
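As a sketch, Datadog Autodiscovery annotations on the MpegFlow pod can configure the OpenMetrics check; the container name (`mpegflow`), metrics port (9090), and metric namespace below are assumptions:

```yaml
# Pod annotations — Autodiscovery config for the OpenMetrics check
# (container name, port, and namespace are assumptions, not MpegFlow defaults)
ad.datadoghq.com/mpegflow.checks: |
  {
    "openmetrics": {
      "instances": [
        {
          "openmetrics_endpoint": "http://%%host%%:9090/metrics",
          "namespace": "mpegflow",
          "metrics": [".*"],
          "min_collection_interval": 30
        }
      ]
    }
  }
```

The `%%host%%` template variable resolves to the pod IP at scrape time, and `min_collection_interval: 30` matches the 30-second cadence described above.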
Common patterns
Standard metrics dashboard
A Datadog dashboard for video pipelines monitors: per-pool queue depth, per-pool active workers, jobs/min throughput, p50/p95/p99 job duration, retry rate by failure class, webhook delivery success rate. We provide a sample dashboard JSON in the Helm chart.
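In production the duration panels come from Datadog distribution metrics, but a toy nearest-rank percentile over raw job durations (sample values are illustrative) shows what p50/p95/p99 summarize:

```python
# Toy nearest-rank percentile for the p50/p95/p99 duration panels.
# Real dashboards compute these server-side from distribution metrics.
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct/100 * n)
    return ordered[rank - 1]

durations = [1.2, 3.4, 2.2, 8.0, 2.9, 3.1, 45.0, 2.5, 3.3, 2.8]  # seconds
p50, p95, p99 = (percentile(durations, p) for p in (50, 95, 99))
# One 45s outlier leaves p50 untouched but dominates p95/p99 — which is
# exactly why the tail percentiles belong on the dashboard.
```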
Per-tenant cost attribution
For multi-tenant deployments, tag every metric + trace + log with customer_id. Datadog's tagging dimensions let you slice cost by customer for accurate billing and capacity planning.
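A back-of-envelope version of that cost slicing (customer names, minute counts, and the fleet cost are made up) splits shared encode-fleet spend by per-customer encode minutes pulled from `customer_id`-tagged metrics:

```python
# Toy cost attribution: divide shared fleet cost proportionally by the
# encode minutes each customer_id tag accounts for (all numbers illustrative).
def cost_by_customer(minutes_by_customer, fleet_cost):
    total = sum(minutes_by_customer.values())
    return {cust: round(fleet_cost * mins / total, 2)
            for cust, mins in minutes_by_customer.items()}

usage = {"acme": 600_000, "globex": 300_000, "initech": 100_000}
shares = cost_by_customer(usage, fleet_cost=10_000.0)
# acme carries 60% of the fleet cost, globex 30%, initech 10%
```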
Anomaly detection on encode duration
Datadog's anomaly detection on p99 encode duration catches regressions early. A 4× spike in p99 against the baseline is usually an upstream issue (worker OOM, bad input), and Datadog alerts before SLA breaches.
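Datadog's anomaly monitors use seasonal baselines, but a much simpler static sketch of the same idea (window size and readings are assumptions) compares the latest p99 against a rolling average and flags a 4× spike:

```python
from collections import deque

# Simplified stand-in for an anomaly monitor: flag when the current p99
# exceeds 4x the rolling-average baseline (window size is an assumption).
class SpikeDetector:
    def __init__(self, window=12, factor=4.0):
        self.history = deque(maxlen=window)
        self.factor = factor

    def observe(self, p99_seconds):
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(p99_seconds)
        return baseline is not None and p99_seconds > self.factor * baseline

detector = SpikeDetector()
readings = [30, 32, 31, 29, 33, 140]  # final reading is a ~4.5x spike
alerts = [detector.observe(r) for r in readings]
# Only the 140s reading trips the detector.
```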
Trace-correlated logs
OTLP traces include trace_id; MpegFlow's logs include the same trace_id. Datadog correlates them automatically. From any error log, click through to the full distributed trace across coordinator + worker + webhook receiver.
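A minimal sketch of the log side of that correlation, assuming JSON logs to stdout (the field names and the hex trace_id here are illustrative; in a real service the id comes from the active OpenTelemetry context rather than a parameter):

```python
import json

# Sketch: structured log records carry the active trace_id so the log
# backend can link each record to its distributed trace.
def log_event(trace_id, level, message, **fields):
    record = {"trace_id": trace_id, "level": level, "message": message, **fields}
    line = json.dumps(record)
    print(line)  # the Agent tails stdout and forwards the JSON record
    return line

line = log_event("4bf92f3577b34da6", "error", "webhook delivery failed",
                 job_id="job-123", attempt=3)
```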
Pitfalls
- Datadog can be expensive at video-pipeline scale: high-cardinality tags (e.g., a per-job_id tag) explode metric volume. Use tags sparingly; per-customer tagging is usually right, per-job tagging is usually wrong.
- Log volume from FFmpeg stderr can dwarf operational logs. Either parse stderr into structured events, or sample raw FFmpeg output (e.g., 1% sampling for completed jobs, 100% for errors).
- Datadog APM's default sampling can miss interesting traces. For low-volume but high-importance traces (job failures, webhook delivery failures), use retention filters to keep them at 100%.
- OTLP span export can drop data during traffic bursts if the exporter's in-memory queue overflows — raise the batch processor's queue limits, and prefer OTLP/gRPC with retries for more reliable delivery at the cost of extra Agent CPU.
- Datadog Agent on K8s nodes consumes ~200-500MB RAM per node — budget node-group sizing accordingly.
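The FFmpeg stderr sampling rule above can be made deterministic by hashing the job id, so a given job's output is either all kept or all dropped (the function name and 1% default are assumptions):

```python
import zlib

# Sketch of the sampling pitfall's mitigation: keep all FFmpeg stderr for
# failed jobs, and a deterministic ~1% slice (by job_id hash) for successes.
def keep_ffmpeg_logs(job_id, job_failed, sample_pct=1):
    if job_failed:
        return True  # errors are always kept at 100%
    # crc32 is stable across runs, so sampling is consistent per job
    return zlib.crc32(job_id.encode()) % 100 < sample_pct

keep_ffmpeg_logs("job-123", job_failed=True)   # always True for failures
```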
At production scale
Datadog at video-pipeline scale typically lands at 5-15% of total infrastructure cost — meaningful but acceptable. The cost optimizations that matter: tag hygiene (don't tag per-job), log sampling (don't store every FFmpeg log line), and metric-cardinality budgets (hosts × tag values, watch the multiplication). For workloads above ~100M minutes/month with full Datadog observability, expect $30-50K/month in Datadog costs alone.
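The cardinality multiplication is worth making concrete. A toy budget calculation (all counts are illustrative, not MpegFlow defaults) shows why a per-job tag is catastrophic while a per-customer tag is fine:

```python
# Back-of-envelope cardinality budget: timeseries count multiplies across
# every tag dimension, so one high-cardinality tag dominates the bill.
def series_count(metric_count, **tag_cardinalities):
    total = metric_count
    for values in tag_cardinalities.values():
        total *= values
    return total

per_customer = series_count(50, pool=8, customer_id=200)    # 80,000 series
per_job = series_count(50, pool=8, job_id=1_000_000)        # 400,000,000 series
```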
- datadog
- observability
- metrics
- apm
- integration