
VMAF cross-validation with MOS — calibrating quality metrics against viewers

Practical guide to validating VMAF against subjective MOS testing — ITU-R BT.500 methodology, golden viewer panel selection, when VMAF disagrees with MOS, calibration procedures.

By MpegFlow Engineering Team · Quality · May 9, 2026 · 9 min read · 1,716 words
In this topic
  1. What MOS is
  2. ITU-R BT.500 methodology
  3. Golden-eye panels
  4. VMAF vs MOS correlation
  5. When VMAF disagrees with MOS
  6. Validation procedure
  7. Sample test design
  8. Outsourcing subjective testing
  9. VMAF calibration to your content
  10. When subjective testing isn't worth it
  11. Operational considerations
  12. What MpegFlow does with VMAF/MOS validation

VMAF is a model trained to predict human quality perception, but it's still a model. For premium streaming where quality matters, periodic cross-validation against actual subjective MOS (Mean Opinion Score) testing keeps VMAF aligned with real viewer perception. The validation procedure follows ITU-R BT.500 methodology — controlled viewing environments, calibrated displays, golden-eye panels of trained reviewers. This page is the engineering reference for how to validate VMAF against MOS in production.

#What MOS is

MOS — Mean Opinion Score — is the average quality rating from a panel of human viewers. Typically scored on a 5-point scale:

  • 5: Excellent — quality indistinguishable from source.
  • 4: Good — quality slightly worse than source, but still good.
  • 3: Fair — quality acceptable; visible degradation.
  • 2: Poor — quality clearly degraded.
  • 1: Bad — unacceptable quality.

For a video sequence, multiple viewers each give their score. The mean across viewers is the MOS.
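
A minimal sketch of that computation, with a made-up 15-viewer panel and a normal-approximation confidence interval alongside the mean:

```python
import math
import statistics

def mos_with_ci(scores, z=1.96):
    """Mean Opinion Score and a ~95% confidence interval for one test clip."""
    mos = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
    return mos, (mos - z * sem, mos + z * sem)

# Hypothetical panel: 15 viewers rating one encoded clip on the 5-point scale.
panel_scores = [4, 5, 4, 4, 3, 5, 4, 4, 4, 5, 3, 4, 4, 5, 4]
mos, ci = mos_with_ci(panel_scores)
print(f"MOS = {mos:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The confidence interval matters when comparing encodes: two MOS values whose intervals overlap heavily should not be treated as meaningfully different.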

MOS values fall into ranges based on quality:

  • MOS 4.5+ — premium streaming target.
  • MOS 4.0-4.5 — high-quality streaming.
  • MOS 3.5-4.0 — acceptable mainstream streaming.
  • MOS 3.0-3.5 — lower-tier streaming.
  • MOS below 3.0 — poor; user complaints expected.

The scale is subjective; comparable results require a controlled testing methodology.

#ITU-R BT.500 methodology

ITU-R BT.500 (latest revision published in 2023) is the international standard for subjective video quality testing. Key requirements:

Display setup:

  • Reference monitor with calibrated colorimetry.
  • Specific peak luminance (typically 100 cd/m² for SDR).
  • Specific viewing distance (typically 3-4× picture height).
  • Controlled ambient lighting.

Viewer selection:

  • Trained viewers (golden eyes) for expert evaluation.
  • Non-experts for naive viewer evaluation.
  • At least 15 viewers per study (15-30 is typical) for statistical significance.
  • Vision testing (visual acuity, color vision) for golden-eye panels.

Test methodology:

  • Double-Stimulus Continuous Quality Scale (DSCQS) — show reference and test sequence, viewer scores both.
  • Single-Stimulus Continuous Quality Evaluation (SSCQE) — show test only, viewer scores in real-time.
  • Pair Comparison (PC) — show pairs of test sequences; the viewer picks the better one.

Content selection:

  • Standard test sequences (specific clips chosen for stress testing).
  • Custom content matching production characteristics.

The full methodology is dense; ITU-R BT.500 is hundreds of pages. For practical pipeline use, the protocol is simplified while preserving the core (controlled environment, calibrated display, multiple viewers, structured scoring).
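
One small piece that is easy to get wrong in a simplified protocol is presentation order: each viewer should see the trials in a different randomized sequence so ordering and fatigue effects average out. A sketch, with hypothetical clip and quality-level names (this shows a single-stimulus style ordering, not the full DSCQS reference/test pairing):

```python
import itertools
import random

# Hypothetical test matrix: each (clip, quality level) pair is one trial.
CLIPS = ["news_studio", "football", "animation_short"]
LEVELS = ["q1_low", "q2", "q3", "q4", "q5_high"]
TRIALS = list(itertools.product(CLIPS, LEVELS))

def presentation_order(viewer_id: int, seed: int = 2026) -> list:
    """Per-viewer randomized trial order, reproducible so the study can be documented."""
    rng = random.Random(seed + viewer_id)
    order = TRIALS.copy()
    rng.shuffle(order)
    return order

for viewer in range(3):
    print(viewer, presentation_order(viewer)[:3])  # first three trials for each viewer
```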

#Golden-eye panels

Golden-eye panels are trained viewers used for premium quality evaluation. Characteristics:

  • Visual acuity — verified normal or corrected-to-normal vision.
  • Color vision — Ishihara plate testing for color vision; reject color-deficient viewers for chromatic content evaluation.
  • Trained on artifact recognition — taught to identify blocking, ringing, banding, color shifts, blurring.
  • Calibrated baseline — known scoring patterns from prior studies; consistency across studies.

For production pipelines, maintaining a stable golden-eye panel pays dividends — same viewers, same scoring conventions, comparable results across studies.

For one-off or naive testing, broader panels (general public viewers) provide the "average viewer" perspective. Both types of testing have value.

#VMAF vs MOS correlation

VMAF was trained on a large MOS dataset and produces scores that approximate MOS:

  • VMAF 95-100 → MOS 4.5-5.0
  • VMAF 85-95 → MOS 4.0-4.5
  • VMAF 75-85 → MOS 3.5-4.0
  • VMAF 65-75 → MOS 3.0-3.5
  • VMAF 50-65 → MOS 2.5-3.0
  • VMAF below 50 → MOS below 2.5

The mapping isn't exact — your content, viewing conditions, and viewers differ from VMAF's training distribution. For specific production content, the actual VMAF-to-MOS mapping may differ.
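
For back-of-the-envelope use, the ranges above can be treated as anchor points for a piecewise-linear interpolation. A sketch; the anchors just restate those ranges and are not a calibrated model:

```python
import bisect

# Approximate anchor points restating the ranges above; not a calibrated model.
VMAF_POINTS = [50, 65, 75, 85, 95, 100]
MOS_POINTS = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

def approx_mos(vmaf: float) -> float:
    """Piecewise-linear interpolation from a VMAF score to an approximate MOS."""
    if vmaf <= VMAF_POINTS[0]:
        return MOS_POINTS[0]
    if vmaf >= VMAF_POINTS[-1]:
        return MOS_POINTS[-1]
    i = bisect.bisect_right(VMAF_POINTS, vmaf)
    x0, x1 = VMAF_POINTS[i - 1], VMAF_POINTS[i]
    y0, y1 = MOS_POINTS[i - 1], MOS_POINTS[i]
    return y0 + (y1 - y0) * (vmaf - x0) / (x1 - x0)

print(approx_mos(90))  # ~4.25 under these assumed anchors
```

Treat the output as a rough expectation to compare against measured MOS, not as a substitute for it.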

For premium streaming, validating VMAF against MOS on YOUR content is worth doing. The validation answers: "is VMAF 90 actually MOS 4.5 for our content, or is it something else?"

#When VMAF disagrees with MOS

VMAF can disagree with MOS in specific cases:

Case 1: Content outside VMAF's training distribution.

VMAF was trained on Netflix-style content (live action, animation, sports). For genres VMAF didn't see during training (terminal recordings, pixel art, retro game footage), VMAF may not predict MOS well.

Case 2: Specific perceptual artifacts.

VMAF measures certain artifact types better than others. Banding in flat regions, color shifts, certain temporal artifacts — VMAF's sensitivity is content-dependent.

Case 3: Codec-specific artifacts.

If a codec produces unusual artifacts (AV1 film grain synthesis, Dolby Vision-specific renderings), VMAF may score them differently than a viewer perceives.

Case 4: Viewing condition differences.

VMAF's default model assumes living-room TV viewing of a 1080p display; Netflix also publishes a phone model. For mobile playback (smaller screens, brighter ambient light), perception differs from the default model's assumptions.

For these cases, MOS validation reveals the disagreement; you can then either accept it (calibrate VMAF thresholds for your content) or supplement VMAF with subjective spot-checks.

#Validation procedure

A production validation procedure:

Step 1: Select test content.

Pick a representative subset of your production content. 10-20 clips, 30-60 seconds each. Cover the variety of genres, technical characteristics, and quality ranges you produce.

Step 2: Encode at quality range.

For each clip, encode at multiple bitrates spanning low to high quality. The same range you'd use for BD-rate analysis (~75 to ~95 VMAF).
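
A sketch of producing those test encodes by shelling out to FFmpeg from Python; the bitrate ladder, codec choice (libx264), and file naming are assumptions, not a recommendation:

```python
import subprocess

# Hypothetical bitrate ladder spanning low to high quality for one clip.
BITRATES_KBPS = [800, 1500, 2500, 4000, 6000]

def encode_ladder(source: str) -> list:
    """Encode one source clip at several bitrates (libx264 chosen only as an example)."""
    outputs = []
    stem = source.rsplit(".", 1)[0]
    for kbps in BITRATES_KBPS:
        out = f"{stem}_{kbps}k.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-i", source,
             "-c:v", "libx264", "-b:v", f"{kbps}k", "-an", out],
            check=True, capture_output=True,
        )
        outputs.append(out)
    return outputs

# encodes = encode_ladder("news_studio.mov")
```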

Step 3: Compute VMAF for each.

Run VMAF measurement on each encoded version. Record values.
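
A sketch of driving the FFmpeg libvmaf filter from Python and pulling the pooled mean score out of its JSON log. The file names are placeholders, and the JSON layout shown matches recent libvmaf versions; check your build's output:

```python
import json
import subprocess

def measure_vmaf(distorted: str, reference: str, log_path: str = "vmaf.json") -> float:
    """Run FFmpeg's libvmaf filter and return the pooled mean VMAF score.

    Assumes an FFmpeg build with --enable-libvmaf; filter options and the
    JSON log structure can differ between libvmaf versions.
    """
    cmd = [
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log_path}",
        "-f", "null", "-",
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    with open(log_path) as f:
        result = json.load(f)
    return result["pooled_metrics"]["vmaf"]["mean"]

# score = measure_vmaf("news_studio_2500k.mp4", "news_studio.mov")
```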

Step 4: Run subjective study.

Run a structured MOS evaluation following ITU-R BT.500 (or simplified protocol). Multiple viewers; controlled environment; structured scoring.

Step 5: Compute correlation.

Calculate the Pearson correlation coefficient between VMAF scores and MOS scores (rank correlation such as Spearman's is also commonly reported). Strong correlation (r > 0.85) indicates VMAF tracks MOS well for your content; lower correlation indicates calibration issues.
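
A sketch of the correlation step with SciPy; the paired arrays are hypothetical stand-ins for the study results:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired results: one entry per (clip, bitrate) trial.
vmaf_scores = [58, 67, 74, 82, 88, 93, 96, 71, 79, 86]
mos_scores = [2.7, 3.1, 3.4, 3.9, 4.2, 4.5, 4.7, 3.2, 3.7, 4.1]

r, p = pearsonr(vmaf_scores, mos_scores)      # linear correlation
rho, _ = spearmanr(vmaf_scores, mos_scores)   # rank correlation
print(f"Pearson r = {r:.3f} (p = {p:.3g}), Spearman rho = {rho:.3f}")
```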

Step 6: Identify outliers.

Find the specific (clip, bitrate) combinations where VMAF and MOS disagree most, and investigate them for systematic patterns.
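
One way to surface those outliers is to fit a simple line through the paired data and rank trials by their residual. A sketch with hypothetical labels and scores:

```python
import numpy as np

# Hypothetical study results: one label, VMAF, and MOS per (clip, bitrate) trial.
labels = ["news@2.5M", "football@2.5M", "anim@1.5M", "anim@4M", "news@800k"]
vmaf = [88, 86, 79, 93, 62]
mos = [4.3, 3.6, 3.8, 4.6, 2.9]

def worst_outliers(n=3):
    """Rank trials by absolute residual from a linear VMAF -> MOS fit."""
    a, b = np.polyfit(vmaf, mos, deg=1)
    residuals = np.abs(np.array(mos) - (a * np.array(vmaf) + b))
    order = np.argsort(residuals)[::-1][:n]
    return [(labels[i], round(float(residuals[i]), 2)) for i in order]

print(worst_outliers())
```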

Step 7: Calibrate VMAF thresholds.

If VMAF and MOS disagree systematically, adjust the VMAF thresholds you use for production decisions. E.g., target VMAF 92 (instead of 90) for top-tier content if that's what produces MOS 4.5.
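
A sketch of deriving an adjusted threshold from the study data: fit MOS as a linear function of VMAF, then invert the fit at the target MOS. Assumes NumPy and reuses the hypothetical arrays from the correlation step:

```python
import numpy as np

def vmaf_threshold_for_mos(vmaf, mos, target_mos=4.5):
    """Fit MOS ~ a*VMAF + b on the study data and solve for the VMAF hitting target_mos."""
    a, b = np.polyfit(vmaf, mos, deg=1)
    return (target_mos - b) / a

vmaf_scores = [58, 67, 74, 82, 88, 93, 96, 71, 79, 86]
mos_scores = [2.7, 3.1, 3.4, 3.9, 4.2, 4.5, 4.7, 3.2, 3.7, 4.1]
print(f"VMAF target for MOS 4.5 ≈ {vmaf_threshold_for_mos(vmaf_scores, mos_scores):.1f}")
```

In practice, sanity-check the fit (the correlation from Step 5) before trusting the inverted threshold.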

#Sample test design

A practical small-scale validation study:

  • 15 viewers (mix of golden-eye and naive).
  • 10 source clips representing your production content.
  • 5 quality levels per clip (very low to very high).
  • Total trials: 50 (10 clips × 5 quality levels).
  • Duration per trial: 30-second clip + 10 seconds of scoring ≈ 40 seconds.
  • Total session length: 50 × 40s + breaks = ~45 minutes per viewer.

15 viewers × 45 minutes = ~11 hours of viewing time.

Plus setup, training viewers, processing data: budget 1-2 days for a study of this size.

Conduct this quarterly, or after any major encoder change.

#Outsourcing subjective testing

For pipelines without in-house panel infrastructure, outsourcing options:

Specialized labs — companies like Witbe, Fraunhofer, and BBC R&D offer subjective video quality testing as a service. Premium pricing; expert results.

Crowd-sourced platforms — Subjectify, Crowd4Test, others. Lower cost; less controlled environment; appropriate for naive-viewer studies.

University collaborations — research labs sometimes do subjective testing for streaming services.

For premium streaming, occasional subjective testing is worth the investment. For mass-market or budget streaming, automated metrics with periodic spot-checks may be sufficient.

#VMAF calibration to your content

After validation, you may calibrate VMAF for your specific content:

Threshold adjustment:

If your content systematically reaches a given MOS at a VMAF score roughly 3 points lower than the generic mapping predicts (perhaps due to content characteristics):

  • Original threshold: VMAF 90 = MOS 4.5 (production target).
  • Adjusted threshold: VMAF 87 = MOS 4.5 (your content's mapping).
  • Production decisions use the adjusted threshold.

Per-genre thresholds:

Different content types may have different VMAF-MOS mappings:

  • Animation: VMAF 92 = MOS 4.5.
  • Sports: VMAF 90 = MOS 4.5.
  • Talking heads: VMAF 88 = MOS 4.5.

With per-content-type calibration, VMAF thresholds vary by content category, and the pipeline applies the appropriate threshold to each piece of content.
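
A sketch of how per-genre thresholds might be carried as configuration; the genres and numbers just restate the example mappings above:

```python
# Hypothetical per-genre calibration, restating the example mappings above.
VMAF_TARGET_FOR_MOS_4_5 = {
    "animation": 92.0,
    "sports": 90.0,
    "talking_heads": 88.0,
}
DEFAULT_TARGET = 90.0

def passes_quality_gate(genre: str, measured_vmaf: float) -> bool:
    """Compare a measured VMAF against the calibrated target for its genre."""
    return measured_vmaf >= VMAF_TARGET_FOR_MOS_4_5.get(genre, DEFAULT_TARGET)

print(passes_quality_gate("animation", 91.3))  # False: below the 92.0 animation target
```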

Custom VMAF model training:

For specific content niches (e.g., Asian language content, gaming content, classical music recordings), train a custom VMAF model on your content with collected MOS data. Custom models produce more accurate quality predictions for the specific content domain.

This is a significant investment (collecting MOS data for thousands of clips); it's only worth it for specific high-stakes content.

#When subjective testing isn't worth it

For some pipelines, subjective testing is overkill:

  • Free / ad-supported tier — quality threshold is "good enough"; subtle MOS differences don't drive customer behavior.
  • Internal video — corporate communications, e-learning. Quality matters less.
  • Volume-driven services — UGC platforms, social video. Per-content quality testing isn't economical.

For these, automated metrics (VMAF without subjective validation) are sufficient.

For premium streaming, premium broadcast distribution, content licensing where quality matters contractually — subjective validation is worth doing periodically.

#Operational considerations

Things that matter for VMAF validation in production:

  • Cadence — quarterly minimum for premium services; more frequently if encoder configurations change often.
  • Documentation — record viewers, conditions, scores, and methodology. Replicable studies are more useful than one-offs.
  • Content rotation — refresh test content periodically so panel doesn't memorize specific clips.
  • Threshold updates — when calibration changes, document and communicate to pipeline operators.
  • Cross-team alignment — subjective testing results inform encoder, ladder, and pipeline decisions. Communicate findings broadly.

#What MpegFlow does with VMAF/MOS validation

MpegFlow's pipeline runs VMAF natively as a discrete measurement stage via the FFmpeg libvmaf filter (see /topics/quality/vmaf for the architecture). MOS validation is a periodic engineering exercise rather than continuous pipeline behavior — there's no "human gate" decision node in the DAG runtime today, so MOS comparison happens outside the pipeline boundary on data the pipeline produces.

For internal engineering, we run periodic subjective validation studies on a representative content corpus. The pipeline produces the test encodes (parallel sibling stages with different parameter sets) and the VMAF measurements; the subjective testing itself runs in external tooling; findings inform default encoder configurations and quality thresholds via editorial updates to workflow YAML.

For customers running their own quality programs, we provide guidance on study design and recommend tooling for subjective testing integration. The pipeline supports custom VMAF thresholds per workflow when calibration data is available — those flow in through the quality-analysis node configuration as operator-set values.

The strict-broker security model handles subjective testing materials with the same discipline as any content — workers receive content via short-lived presigned URLs, processing follows workflow spec, no special considerations.

The general guidance: VMAF is a great metric; periodic validation against MOS keeps it honest. For premium streaming, invest in occasional subjective testing; for everything else, well-instrumented VMAF with reasonable thresholds is sufficient. Don't trust any single metric without occasional sanity checks.

Tags
  • vmaf
  • mos
  • quality
  • subjective-testing
  • bt-500
  • calibration
See also

  • ABR ladder VMAF calibration — finding the right bitrate per rung for your content
  • Golden-eyes video review — when automated metrics aren't enough
  • BD-rate calculation — how to compare codecs and encoder configurations
Building on this?

Join the MpegFlow beta.

We're shipping the encoder MVP this quarter. If you're wrangling quality in production, the beta is built for you — no card, no console waiting.
