MpegFlow with Prometheus + Grafana: open-source observability
How MpegFlow integrates with Prometheus + Grafana — the open-source observability stack. Native OpenMetrics, recording rules, the dashboards that work, and when this beats Datadog.
Prometheus + Grafana is the open-source observability stack — typically self-hosted or run via managed services like Grafana Cloud, Amazon Managed Service for Prometheus, or Google Cloud Managed Service for Prometheus. MpegFlow exports metrics in OpenMetrics format (Prometheus-native), so the integration is direct: add a scrape config, import a Grafana dashboard, and you're done.
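For a non-Kubernetes deployment, the scrape config is a few lines. A minimal sketch — the job name, port, and target hosts below are illustrative; only the /metrics path is MpegFlow's:

```yaml
# prometheus.yml (fragment) — adjust hosts and ports for your deployment
scrape_configs:
  - job_name: mpegflow
    metrics_path: /metrics          # MpegFlow's OpenMetrics endpoint
    scrape_interval: 15s
    static_configs:
      - targets:
          - mpegflow-coordinator:9640   # hypothetical host:port
          - mpegflow-worker-0:9640
```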
How the integration works
MpegFlow exposes /metrics on each coordinator and worker pod in OpenMetrics format. Prometheus scrapes via a standard ServiceMonitor or PodMonitor (the kube-prometheus-stack pattern); Grafana queries Prometheus and renders dashboards. The whole stack is self-hostable for sovereign-cloud requirements, or runs as a managed service for operational simplicity.
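On Kubernetes, the ServiceMonitor is the only MpegFlow-specific piece. A sketch — the namespace, label selector, and port name are assumptions about your MpegFlow Service, not fixed names:

```yaml
# ServiceMonitor for the kube-prometheus-stack pattern
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mpegflow
  namespace: mpegflow
  labels:
    release: kube-prometheus-stack   # must match Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: mpegflow
  endpoints:
    - port: metrics        # named Service port exposing /metrics
      path: /metrics
      interval: 15s
```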
Common patterns
kube-prometheus-stack deployment
The standard K8s pattern: install kube-prometheus-stack via Helm (one chart deploys Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics), then add a ServiceMonitor pointing at MpegFlow's metrics endpoints. Done. A well-provisioned single Prometheus node can ingest on the order of ~1M samples per second; beyond that, shard.
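A common values override worth setting at install time — these are real kube-prometheus-stack chart values; without them, Prometheus only discovers ServiceMonitors carrying the chart's own release label:

```yaml
# values.yaml for kube-prometheus-stack
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false   # discover ServiceMonitors in any namespace
    podMonitorSelectorNilUsesHelmValues: false
    retention: 30d
```

Apply with `helm install monitoring prometheus-community/kube-prometheus-stack -f values.yaml`.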
Recording rules for expensive queries
Some queries (p99 over 7 days across all pools) are expensive to compute on every Grafana refresh. Use Prometheus recording rules to pre-compute them at 1-minute intervals. Dashboards query the recorded series instead of the raw histogram_quantile expression.
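A sketch of such a rule — the `mpegflow_job_duration_seconds` histogram and `pool` label are assumed names for illustration:

```yaml
# Pre-compute the 7-day p99 once a minute; dashboards query the recorded series
groups:
  - name: mpegflow.recording
    interval: 1m
    rules:
      - record: mpegflow:job_duration_seconds:p99_7d
        expr: |
          histogram_quantile(0.99,
            sum by (le, pool) (rate(mpegflow_job_duration_seconds_bucket[7d])))
```

Dashboards then query `mpegflow:job_duration_seconds:p99_7d` directly instead of re-evaluating the raw `histogram_quantile` expression on every refresh.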
Long-term storage with Thanos / Mimir
Prometheus retention is typically 15-90 days locally. For longer retention (compliance, capacity planning over years), pair it with Thanos or Grafana Mimir for object-storage-backed long-term storage. It's a small additional layer that pays for itself on any workload needing trend analysis beyond the local retention window.
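For the Thanos route, the sidecar needs an object-storage config. A sketch — bucket name and endpoint are placeholders; the endpoint could be the same self-hosted MinIO mentioned below for sovereign deployments:

```yaml
# Thanos objstore.yml — S3-compatible backend for long-term block storage
type: S3
config:
  bucket: mpegflow-metrics-longterm
  endpoint: s3.example.internal:9000   # e.g. self-hosted MinIO
  access_key: <redacted>
  secret_key: <redacted>
  insecure: false
```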
Sovereign-cloud / on-prem deployments
For air-gapped or sovereign-cloud requirements, Prometheus + Grafana runs entirely in your perimeter. No outbound calls to a SaaS observability vendor. Pair with self-hosted MpegFlow + MinIO + on-prem Postgres for a fully-isolated stack.
Pitfalls
- Prometheus keeps every active series indexed in memory on a single node — high-cardinality labels (per-job tags) blow up memory. Use recording rules to aggregate before storing; never tag metrics with high-cardinality job IDs.
- Grafana dashboard queries can be slow on large time ranges. Use $__interval and downsampling to keep dashboards responsive.
- Prometheus federation between clusters introduces lag and complexity. Most multi-cluster deployments use Thanos or Mimir instead of native federation.
- Long-term storage in Thanos/Mimir is operationally non-trivial. Object storage + sidecar pattern works but requires SRE attention.
- The Prometheus + Grafana stack is your responsibility to operate — backup, version upgrades, capacity planning. Managed Datadog removes that operational burden at higher cost.
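On the dashboard-responsiveness point: using Grafana's `$__interval` variable in rate windows scales the query's resolution with the selected time range, so a 30-day view doesn't compute per-15s rates. A sketch, assuming a hypothetical `mpegflow_jobs_completed_total` counter:

```promql
sum by (pool) (rate(mpegflow_jobs_completed_total[$__interval]))
```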
At production scale
Prometheus at MpegFlow production scale handles ~100K-1M active series per node. Above 1M series, sharding via federation or Thanos becomes necessary. For sovereign-cloud deployments where Datadog isn't an option, Prometheus + Grafana + Thanos handles 10M+ minutes/month workloads with proper sharding. Operational cost: ~0.5 SRE-FTE for a production-grade Prometheus stack across multiple clusters.
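A back-of-the-envelope sharding check against the ~1M-active-series-per-node ceiling quoted above — the 70% headroom factor is an assumption, left as a parameter:

```python
import math

def prometheus_shards(active_series: int,
                      per_node_limit: int = 1_000_000,
                      headroom: float = 0.7) -> int:
    """Prometheus shards needed, keeping each node at ~70% of its
    active-series ceiling to leave room for churn and query load."""
    return max(1, math.ceil(active_series / (per_node_limit * headroom)))

# A workload emitting ~2.5M active series:
print(prometheus_shards(2_500_000))  # → 4
```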
- prometheus
- grafana
- observability
- metrics
- integration
- self-hosted