
VMAF — Netflix's quality metric and the modern reference for video quality measurement

Practical reference on Video Multi-Method Assessment Fusion — Netflix's perceptual quality metric, training methodology, libvmaf usage, BD-rate calculation, and the limits of automated metrics.

By MpegFlow Engineering Team · Quality · May 7, 2026
In this topic
  1. What VMAF is
  2. libvmaf — the production tool
  3. Training and model variants
  4. VMAF for encoder evaluation
  5. VMAF for live ABR optimization
  6. VMAF vs PSNR vs SSIM
  7. The limits of automated quality metrics
  8. VMAF in BD-rate analysis
  9. What MpegFlow does with VMAF

VMAF — Video Multi-Method Assessment Fusion — is the quality metric Netflix open-sourced in 2016, and it has since become the de facto standard for automated video quality measurement. Where PSNR and SSIM optimize for mathematical signal fidelity, VMAF was trained on actual human perceptual judgments to predict what viewers will rate as "high quality." For codec evaluation, encoder tuning, and ABR ladder design at most production streaming services in 2026, VMAF is the metric that matters. This page is the engineering reference.

#What VMAF is

VMAF is a perceptual video quality metric — it predicts a Mean Opinion Score (MOS) on a 0-100 scale that approximates how human viewers would rate a given video's quality. The internal architecture combines several lower-level metrics:

  • VIF (Visual Information Fidelity) at multiple scales — measures how much information from the reference is preserved in the distorted version.
  • DLM (Detail Loss Metric) — measures the loss of fine detail.
  • Motion features — captures temporal aspects (motion compensation quality, frame-rate effects).

These features are fused by a support vector regression (SVR) model trained on human mean opinion scores. The training set is the "VMAF dataset" — a corpus of source/distorted video pairs with collected human ratings. The trained model maps the low-level features to a predicted MOS.

The numerical scale: 0-100, where 100 is "indistinguishable from source" and 0 is "completely degraded." Practical interpretation:

  • VMAF 95+ — typically perceptually transparent. Viewers can't reliably tell it apart from the source on consumer playback.
  • VMAF 85-95 — high quality. Some difference from source visible to expert reviewers; consumer-quality acceptable.
  • VMAF 70-85 — acceptable quality. Visible quality reduction but watchable. Mid-tier streaming.
  • VMAF 50-70 — noticeable quality issues. Lower-tier streaming, fallback content.
  • VMAF below 50 — clearly degraded. Emergency fallback or downlevel-only.

These ranges are guides, not strict thresholds. A VMAF of 92 on action content is different from VMAF 92 on talking-head content; both qualify as "high" but the user-experience implications differ.

#libvmaf — the production tool

libvmaf is the reference open-source implementation, maintained by Netflix. It's a C library with a Python wrapper, and it's integrated into ffmpeg as the libvmaf filter in builds configured with --enable-libvmaf.

The basic CLI invocation (the filter's first input is the distorted "main" stream; the second is the reference):

ffmpeg -i distorted.mp4 -i reference.mp4 -lavfi libvmaf -f null -

This computes a VMAF score per frame and aggregates across the clip. Output:

VMAF score: 87.523412

For more detailed output (per-frame scores, intermediate features):

ffmpeg -i distorted.mp4 -i reference.mp4 -lavfi libvmaf=log_path=vmaf.json:log_fmt=json -f null -

The JSON log contains per-frame VMAF along with the per-frame VIF, DLM, and motion features — useful when an aggregate score hides variance.
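
A minimal sketch of mining that log in Python — the key paths ("frames", "metrics", "vmaf") follow libvmaf 2.x's JSON schema; older builds lay the log out slightly differently, so verify against your build's output:

import json

# Load the per-frame log written by libvmaf (log_fmt=json).
with open("vmaf.json") as f:
    log = json.load(f)

# libvmaf 2.x nests per-frame scores under frames[i]["metrics"]["vmaf"].
scores = [frame["metrics"]["vmaf"] for frame in log["frames"]]

mean = sum(scores) / len(scores)
# The harmonic mean punishes low-scoring frames harder than the arithmetic mean.
harmonic = len(scores) / sum(1.0 / max(s, 1e-6) for s in scores)

print(f"mean={mean:.2f} harmonic={harmonic:.2f} worst_frame={min(scores):.2f}")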

Performance characteristics: as of 2026, libvmaf computes VMAF at roughly real time on a single CPU core for 1080p content (depending on the model and features enabled). For batch evaluation of large content libraries, this is fast enough to be practical.

#Training and model variants

VMAF is a trained model; multiple trained models exist:

  • vmaf_v0.6.1 — the original 2016 model, trained on Netflix's source quality dataset. The default in libvmaf for years; widely cited in academic papers.
  • vmaf_v0.6.1neg — the "no enhancement gain" (NEG) variant. Clips the enhancement gain in the VIF and DLM features so that sharpening and contrast boosting can't inflate scores; the safer choice when comparing encoders that may apply enhancement.
  • vmaf_4k_v0.6.1 — trained specifically for 4K content. Different optimal viewing distance than HD content; the 4K model accounts for this.
  • vmaf_4k_v0.6.1neg — the 4K model with the NEG modification.

The model choice matters when you're comparing across publications or running standardized benchmarks. The default vmaf_v0.6.1 is what most contemporary work uses; switching models changes scores by 1-3 points typically.

For production use, pick one model and stay with it across your encoding decisions. Comparing VMAF scores across models is meaningless; comparing within a single model is the operational use case.
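
Newer ffmpeg builds let you pin the model in the filter string by version or by file path (option syntax varies across ffmpeg versions — check ffmpeg -h filter=libvmaf). A small Python wrapper, sketched under that assumption, keeps the pinned model in one place:

import subprocess

# Pin one model for every measurement; scores are only comparable within a model.
MODEL = "version=vmaf_v0.6.1neg"  # or "path=/models/vmaf_v0.6.1neg.json" (illustrative path)

def measure(distorted: str, reference: str, log: str = "vmaf.json") -> None:
    # Distorted ("main") input first, reference second, per the filter docs.
    subprocess.run(
        ["ffmpeg", "-i", distorted, "-i", reference,
         "-lavfi", f"libvmaf=model={MODEL}:log_path={log}:log_fmt=json",
         "-f", "null", "-"],
        check=True,
    )

measure("encode.mp4", "source.mp4")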

#VMAF for encoder evaluation

The most common production use: comparing encoder decisions. The pattern (a code sketch follows the list):

  1. Encode the same source content with two different encoder configurations (different presets, different rate-control methods, different codecs).
  2. Decode both back to YUV.
  3. Compute VMAF of each against the original source.
  4. Compare scores.
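
A sketch of that loop, with assumptions labeled: x264 presets stand in for the two configurations, filenames are invented, and the pooled mean is read from libvmaf 2.x's JSON log ("pooled_metrics"):

import json
import subprocess

SOURCE = "source.mp4"  # mezzanine reference (hypothetical filename)

def encode(preset: str, out: str) -> None:
    # Two configurations differing only in preset; rate control held constant.
    subprocess.run(["ffmpeg", "-y", "-i", SOURCE, "-c:v", "libx264",
                    "-preset", preset, "-crf", "23", out], check=True)

def vmaf(distorted: str) -> float:
    log = f"{distorted}.vmaf.json"
    # ffmpeg decodes both inputs back to YUV internally (steps 2-3 of the pattern).
    subprocess.run(["ffmpeg", "-i", distorted, "-i", SOURCE,
                    "-lavfi", f"libvmaf=log_path={log}:log_fmt=json",
                    "-f", "null", "-"], check=True)
    with open(log) as f:
        return json.load(f)["pooled_metrics"]["vmaf"]["mean"]

for preset in ("fast", "slow"):
    out = f"out_{preset}.mp4"
    encode(preset, out)
    print(preset, vmaf(out))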

For BD-rate (Bjontegaard delta-rate) analysis, encode each configuration at multiple bitrates, compute VMAF at each, and use the BD-rate calculation to derive "what bitrate savings does configuration A provide at equivalent VMAF to configuration B." BD-rate VMAF is the standard codec-comparison output in 2026 academic and industry work.

#VMAF for live ABR optimization

In live streaming, you can't iteratively tune encoding decisions per-content. But VMAF still has uses:

  • Stream-level monitoring — periodic VMAF spot-checks of live output against a reference (delayed by a few seconds) to detect quality regressions.
  • Rate-control tuning — A/B testing of rate-control modes (CBR, VBV-constrained ABR, capped CRF) on representative content, then deploying the winner.
  • Ladder optimization — for live ABR, computing VMAF of each rung against a high-quality reference helps identify rungs that aren't pulling their weight.

Real-time VMAF computation during live encoding is compute-expensive (you're decoding the source, decoding the encode, and running the comparison simultaneously). Most live deployments compute VMAF post-hoc on representative samples rather than in real time on every frame.
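
A sketch of the sampling pattern, assuming you already have time-aligned captures of the source feed and the encoder output (alignment is the hard part and is glossed over here; filenames and window spacing are illustrative):

import subprocess

REF = "capture_source.ts"   # aligned capture of the contribution feed (assumed)
OUT = "capture_encode.ts"   # aligned capture of the encoder output (assumed)

# Score a 10-second window every 10 minutes instead of every frame.
for start in range(0, 3600, 600):
    subprocess.run(
        ["ffmpeg",
         "-ss", str(start), "-t", "10", "-i", OUT,   # distorted window
         "-ss", str(start), "-t", "10", "-i", REF,   # reference window
         "-lavfi", f"libvmaf=log_path=window_{start}.json:log_fmt=json",
         "-f", "null", "-"],
        check=True,
    )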

#VMAF vs PSNR vs SSIM

The three major automated quality metrics, compared:

Metric | What it measures                                    | Computation cost | Production use
PSNR   | Pixel-level signal-to-noise ratio                   | Cheap, fast      | Encoder benchmarking, especially codec research
SSIM   | Structural similarity (luma + contrast + structure) | Modest           | Quality monitoring; less common in modern production
VMAF   | Perceptual quality via a trained model              | Higher           | Production codec selection, ABR ladder design

Why VMAF tends to win for production decisions: PSNR optimizes for mathematical fidelity, which doesn't always match perceptual quality. SSIM is better than PSNR but still based on local statistics. VMAF is trained on actual human judgments, so it captures perceptual aspects (texture preservation, motion artifacts, banding) that the others miss.

The cases where VMAF doesn't win:

  • Very fast computation needed — PSNR is much faster.
  • Established benchmarks — codec research papers typically use BD-rate PSNR for historical comparison.
  • VMAF's training distribution doesn't match your content — if your content (e.g., screen recordings, anime) is meaningfully different from VMAF's training set, the metric may not generalize well.

#The limits of automated quality metrics

VMAF is the best automated metric available, but it's still an automated metric. The honest limitations:

  • Trained on Netflix-style content — the training corpus skews toward high-production-value live action and animation. Less representative for screen content, news graphics, or sports.
  • Average viewing conditions — VMAF assumes a typical viewing distance and display (the default model targets a 1080p display at roughly three screen heights). Mobile-heavy distribution may have different optimal scoring.
  • Can be optimized against — encoders that explicitly optimize for VMAF can game the metric. SSIM-tuned encoders historically did this; VMAF-tuned encoders have started to as well (the NEG model exists partly to blunt this).
  • Misses some perceptual issues — banding, color shifts, audio sync. VMAF measures video quality; comprehensive quality monitoring needs additional measurements.
  • High score variance per shot — a 90-second clip might have shots scoring 95 and shots scoring 70 due to content complexity. The aggregate score hides variance that may matter for user experience.

The right way to use VMAF: as one input to encoding decisions, supplemented by spot-check golden-eyes review (expert visual inspection), and validated against actual user quality-of-experience telemetry from production deployments. VMAF correlates with viewer perception, but it's a model — not the truth.

#VMAF in BD-rate analysis

BD-rate (Bjontegaard delta-rate) is the standard codec-comparison output. The procedure:

  1. Encode the same source at multiple bitrates with codec A.
  2. Encode the same source at multiple bitrates with codec B.
  3. Compute VMAF (or PSNR / SSIM) for each encoded version vs source.
  4. Plot quality (VMAF) against log-bitrate for both codecs and fit a curve through each set of points.
  5. Integrate the gap between the fitted curves over the overlapping quality range and convert it to an average percent bitrate difference; this is the BD-rate (see the sketch below).

A negative BD-rate VMAF (e.g., "codec B has -25% BD-rate VMAF vs codec A") means codec B achieves the same VMAF at 25% lower bitrate than codec A. This is the standard way to express "codec B is 25% more efficient than codec A at perceptual quality parity."
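
A minimal sketch of the classic Bjontegaard calculation — cubic fit of quality against log-bitrate, integrated over the overlapping quality range (numpy assumed; the four-point ladders are illustrative):

import numpy as np

def bd_rate(rates_a, scores_a, rates_b, scores_b):
    """Average % bitrate change of B vs A at equal quality (negative = B cheaper)."""
    la, lb = np.log(rates_a), np.log(rates_b)
    # Cubic fit of quality -> log(bitrate), per the standard Bjontegaard method.
    pa, pb = np.polyfit(scores_a, la, 3), np.polyfit(scores_b, lb, 3)
    lo = max(min(scores_a), min(scores_b))  # only the overlapping quality range counts
    hi = min(max(scores_a), max(scores_b))
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_b = np.polyval(np.polyint(pb), hi) - np.polyval(np.polyint(pb), lo)
    return (np.exp((int_b - int_a) / (hi - lo)) - 1) * 100

# Codec B hits the same VMAF at 20% lower bitrate at every rung -> about -20%.
print(bd_rate([1000, 2000, 4000, 8000], [70, 80, 88, 94],
              [800, 1600, 3200, 6400], [70, 80, 88, 94]))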

Most published codec comparisons use BD-rate PSNR (historical), BD-rate SSIM, and BD-rate VMAF. The VMAF version is the most perceptually relevant; the PSNR version is the most directly comparable to historical literature.

#What MpegFlow does with VMAF

VMAF runs as a discrete measurement stage in MpegFlow's DAG runtime via the FFmpeg libvmaf filter (worker images compiled with --enable-libvmaf). The quality-analysis node accepts a model_path parameter that the worker resolves on its filesystem: the customer specifies which model to use (HD vs 4K vs no-enhancement-gain) as an explicit configuration field, and the worker must have that model file present. Auto-selection by output resolution is on the roadmap.

Cross-stage data flow wires the encode output and reference source into the measurement stage's inputs; per-stage retry handles transient failures. Results land as structured per-(rung, bitrate) VMAF scores in the workflow's metadata storage, suitable for downstream stages that read them or for operator review.

For customers running per-title encoding (see per-title-encoding), VMAF is the primary signal driving per-title bitrate selection. The pipeline encodes a content asset at multiple bitrates as parallel sibling stages, runs libvmaf measurement on each, and lands per-bitrate VMAF scores; an operator-configured threshold (e.g., VMAF 93 for the top tier, VMAF 80 for the mid tier) then selects against them. Full automation of the selection step (a decision node that selects without operator review) is on the roadmap; the measurement and per-rung scoring are shipping today.
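
The threshold selection itself is simple; a hedged sketch of what such a decision node might do with the per-bitrate scores (bitrates, scores, and thresholds invented for illustration):

# Per-bitrate VMAF scores landed by the measurement stages (illustrative values).
scores = {2500: 78.4, 4500: 87.1, 6500: 93.6, 9000: 95.2}  # kbps -> VMAF

def pick_bitrate(scores: dict, threshold: float) -> int:
    """Lowest bitrate whose VMAF clears the operator-configured threshold."""
    passing = [kbps for kbps, v in sorted(scores.items()) if v >= threshold]
    # Fall back to the best-scoring rung if nothing clears the bar.
    return passing[0] if passing else max(scores, key=scores.get)

print(pick_bitrate(scores, 93.0))  # top tier -> 6500
print(pick_bitrate(scores, 80.0))  # mid tier -> 4500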

VMAF computation is exposed as a discrete stage rather than baked into encoding, so customers can opt in per-content-class. For high-volume ingest where per-asset analysis would dominate compute cost, sampling-based VMAF (compute on a fraction of content) is the operational pattern.

The cluster of tools around VMAF — per-rung VMAF dashboards, BD-rate analysis for encoder A/B testing, occasional MOS validation — is where serious ladder optimization happens. We help customers stand up these tools when they're ready to move from default ladders to content-aware ladders. The payoff is real: per-title encoding typically saves 20-40% bandwidth at equivalent VMAF, which adds up quickly at streaming scale.

Tags
  • vmaf
  • quality
  • netflix
  • libvmaf
  • psnr
  • ssim
  • perceptual
See also

  • SSIM — the structural similarity metric and its multi-scale variants
  • PSNR — the classic quality metric, why it persists, and where it fails
  • BD-rate calculation — how to compare codecs and encoder configurations
Building on this?

Join the MpegFlow beta.

We're shipping the encoder MVP this quarter. If you're wrangling quality in production, the beta is built for you — no card, no console waiting.
