
SSIM — the structural similarity metric and its multi-scale variants

Practical reference on the Structural Similarity Index — luminance, contrast, and structure components, MS-SSIM and SSIMplus variants, ITU-T standardization, and SSIM vs PSNR vs VMAF.

By MpegFlow Engineering Team · Quality
May 7, 2026 · 8 min read · 1,671 words
In this topic
  1. What SSIM is
  2. How SSIM is computed in practice
  3. MS-SSIM (multi-scale SSIM)
  4. SSIMplus
  5. SSIM vs PSNR
  6. SSIM vs VMAF
  7. Where SSIM is still the right tool
  8. SSIM in BD-rate analysis
  9. SSIM implementation gotchas
  10. What MpegFlow does with SSIM

SSIM — Structural Similarity Index — is the perceptual quality metric that bridges PSNR (mathematically simple, perceptually weak) and VMAF (perceptually strong, mathematically complex). Wang, Bovik, Sheikh, and Simoncelli published the original SSIM paper in 2004; the metric has been widely adopted in codec research and production quality monitoring. ITU-T standardized SSIM and its multi-scale variant MS-SSIM in 2008. In 2026, SSIM is the metric that fits the gap between PSNR-as-baseline and VMAF-as-perceptual-truth — modest computational cost, better-than-PSNR perceptual relevance, well-understood. This page is the engineering reference.

#What SSIM is

SSIM measures structural similarity between two images. Unlike PSNR which compares pixel-by-pixel, SSIM compares local statistics in sliding windows across the image. The intuition: human vision is more sensitive to changes in image structure (edges, textures, patterns) than to absolute pixel-value differences.

The SSIM formula combines three components:

  • Luminance similarity (l) — comparison of mean intensity in the windows.
  • Contrast similarity (c) — comparison of standard deviation of intensity.
  • Structure similarity (s) — comparison of cross-covariance, normalized.

The combined formula:

SSIM(x, y) = [l(x, y)^α] * [c(x, y)^β] * [s(x, y)^γ]

Where x and y are corresponding windows in the reference and distorted images, and α, β, γ are weighting parameters (typically all set to 1 in standard SSIM). The window-level SSIM is computed across the image and aggregated.
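
With α = β = γ = 1 and the standard stabilizing constants C1 = (0.01·L)² and C2 = (0.03·L)², the three components collapse into the familiar two-factor form. A minimal single-window sketch (NumPy, assuming 8-bit data; the function name is illustrative):

```python
import numpy as np

def ssim_window(x, y, data_range=255.0):
    """Single-window SSIM with alpha = beta = gamma = 1 and the
    standard constants C1 = (0.01*L)^2, C2 = (0.03*L)^2, which
    collapse l*c*s into the two-factor form."""
    C1 = (0.01 * data_range) ** 2
    C2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x**2 + mu_y**2 + C1) * (var_x + var_y + C2)
    return num / den
```

A full-image SSIM slides this computation across the image (Gaussian-weighted in the reference formulation) and averages the per-window scores.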

The numerical scale: -1 to 1, where 1 is identical and -1 is anti-correlated. In practice, video quality SSIM is in the 0.7-1.0 range:

  • SSIM 0.99+ — perceptually transparent. Indistinguishable from source on consumer playback.
  • SSIM 0.97-0.99 — high quality. Visible difference only on careful expert review.
  • SSIM 0.93-0.97 — acceptable quality. Standard streaming.
  • SSIM 0.85-0.93 — visible quality issues. Mid-tier streaming.
  • SSIM below 0.85 — significant degradation.

Note that SSIM scores are content-dependent; different content can produce different aggregate SSIM at perceptually equivalent quality. Comparisons within a single content asset are meaningful; across-content comparisons are tricky.

#How SSIM is computed in practice

SSIM is computed per color plane (Y, U, V) separately, then aggregated. Most production use cites SSIM-Y (luma only) as the primary number.

ffmpeg's SSIM filter:

ffmpeg -i reference.mp4 -i distorted.mp4 -lavfi ssim -f null -

Output:

SSIM Y:0.97123 U:0.98234 V:0.98345 All:0.97456

The All: value is the per-plane weighted combination. Per-frame SSIM is computed and aggregated; ffmpeg outputs the mean.
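
A small parser for that summary line (a sketch; real ffmpeg output may also append dB-style values in parentheses, which this regex ignores):

```python
import re

def parse_ssim_line(line):
    """Parse an ffmpeg ssim summary line such as
    'SSIM Y:0.97123 U:0.98234 V:0.98345 All:0.97456'
    into a dict of floats. Returns {} if nothing matches."""
    pairs = re.findall(r"(Y|U|V|All):([0-9.]+)", line)
    return {k: float(v) for k, v in pairs}
```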

Computational cost: SSIM is more expensive than PSNR (window-based statistics rather than per-pixel arithmetic) but much cheaper than VMAF. For 1080p content in 2026, SSIM computes at faster than real-time on a single CPU core; VMAF on the same content is closer to real-time.

#MS-SSIM (multi-scale SSIM)

The standard SSIM operates on a single scale — windows of fixed size at the original image resolution. MS-SSIM (Multi-Scale SSIM) computes SSIM at multiple scales and combines them, which better matches human visual processing of multi-resolution detail.

The MS-SSIM procedure:

  1. Compute SSIM at the original resolution.
  2. Downsample the images by 2x.
  3. Compute SSIM at the downsampled resolution.
  4. Repeat (typically 5 levels total).
  5. Combine using empirically-derived per-scale weights.
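
The five steps above can be sketched as follows. This is a simplified sketch: the reference formulation of Wang et al. uses only the contrast·structure terms at the coarser scales and luminance at the final scale, while this version applies full SSIM at every scale; the per-scale weights are the standard published five-scale weights.

```python
import numpy as np

# Standard five-scale weights from the MS-SSIM paper (Wang et al., 2003)
MSSSIM_WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

def downsample2(img):
    """2x2 average pooling: the low-pass + decimate step between scales."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def ms_ssim(x, y, ssim_fn, weights=MSSSIM_WEIGHTS):
    """Simplified MS-SSIM: full SSIM at each scale, combined as a
    weighted geometric mean across scales."""
    score = 1.0
    for w in weights:
        score *= ssim_fn(x, y) ** w
        x, y = downsample2(x), downsample2(y)
    return score
```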

MS-SSIM correlates better with human perception than single-scale SSIM, especially for content with both fine detail and large structures. It's the variant most modern codec evaluation uses when "SSIM" is reported.

The practical numerical scale matches SSIM: for real video content, scores fall between 0 and 1, with 1 meaning identical. The same threshold guidance applies, with MS-SSIM being slightly more discriminating in the high-quality range (0.97-1.0).

ffmpeg has no native MS-SSIM filter, and chaining the ssim filter with scale filters does not reproduce the multi-scale combination. In practice MS-SSIM is computed with dedicated tools — the vmaf CLI ships MS-SSIM as one of its supported feature extractors (feature names vary by libvmaf version; roughly:

vmaf --reference reference.y4m --distorted distorted.y4m --feature float_ms_ssim --output msssim.json

#SSIMplus

SSIMplus is a variant developed by SSIMWAVE (now an IMAX subsidiary) that extends SSIM with additional perceptual considerations:

  • Display-aware — accounts for display size and viewing distance.
  • Color-space-aware — better handling of HDR and wide color gamut.
  • Cross-resolution comparison — designed to compare quality across different resolutions, not just same-resolution distortion.

SSIMplus is proprietary; SSIMWAVE provides commercial tools. It's used in some broadcast workflows and contribution-quality monitoring. For most streaming pipeline use, the open-source SSIM and MS-SSIM are sufficient and the licensing/vendor-dependency of SSIMplus is unwarranted.

#SSIM vs PSNR

The case for SSIM over PSNR is well-established:

  • Better perceptual correlation — SSIM scores correlate more strongly with human MOS ratings than PSNR scores in most published comparisons.
  • Sensitivity to structural artifacts — blocking, ringing, blurring all show up more clearly in SSIM than in PSNR.
  • Less sensitive to global shifts — small uniform brightness or contrast changes don't dramatically affect SSIM the way they would PSNR.
  • Comparable computational cost for production analysis (a few times PSNR cost, well within budget for non-real-time analysis).

The case for keeping PSNR alongside:

  • Historical comparability — codec literature uses PSNR; SSIM doesn't replace PSNR for that use.
  • Different failure modes — PSNR catches some things SSIM misses (extreme outlier pixels, certain compression artifacts that don't change local structure).
  • Codec rate-distortion — codecs internally optimize for PSNR-like metrics; reporting PSNR shows what the codec is doing.

In 2026 codec research and production evaluation, both PSNR and SSIM (or MS-SSIM) are typically reported. They're cheap to compute together; they sometimes disagree informatively.

#SSIM vs VMAF

The cleaner comparison in modern production:

| Dimension | SSIM | VMAF |
| --- | --- | --- |
| Perceptual correlation | Good | Better |
| Computational cost | Modest (~2-3x PSNR) | High (~10-20x PSNR) |
| Implementation variants | SSIM, MS-SSIM, SSIMplus | vmaf_v0.6.1, vmaf_v0.6.1neg, vmaf_4k |
| Trained vs derived | Mathematically derived | Trained on human MOS |
| Generalization | Reasonably consistent across content types | Better within its training distribution; worse outside it |
| Standards status | ITU-T standard | Netflix open source, de facto standard |
| Production use today | Quality monitoring, some encoder evaluation | Codec selection, ABR optimization |

For most production decisions in 2026, VMAF is the right primary metric. SSIM is the right secondary metric — fast enough to compute frequently, reliable enough to catch most issues, well-understood enough to debug when it disagrees with VMAF.

#Where SSIM is still the right tool

The cases where SSIM is the right metric:

Real-time quality monitoring — for live streams, computing VMAF per frame is too compute-expensive. Per-frame SSIM-Y is feasible at modest CPU cost, and works as an "is this stream still healthy?" health signal.
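
A minimal sketch of that health-check pattern — a rolling average of per-frame SSIM-Y against an alert threshold. The class name and thresholds here are hypothetical, not an MpegFlow API; tune the window and threshold per content and ladder rung:

```python
from collections import deque

class SsimHealthMonitor:
    """Rolling per-frame SSIM-Y health check for a live stream
    (illustrative sketch; thresholds are placeholders)."""
    def __init__(self, window=120, alert_below=0.93):
        self.scores = deque(maxlen=window)  # keep the last N frames
        self.alert_below = alert_below

    def push(self, ssim_y):
        self.scores.append(ssim_y)

    def healthy(self):
        # Healthy until we have evidence otherwise: compare the
        # rolling mean against the alert threshold.
        if not self.scores:
            return True
        return sum(self.scores) / len(self.scores) >= self.alert_below
```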

Content-aware encoding decisions where VMAF is overkill — for mid-tier content where you don't need the perceptual precision of VMAF, SSIM gives 80% of the perceptual signal at 20% of the compute cost.

Cross-codec sanity checks — when comparing codecs, having SSIM alongside PSNR and VMAF helps identify cases where the metrics disagree (often perceptually meaningful signal).

Budget-constrained pipelines — startup-tier and mid-scale pipelines often can't justify VMAF computation across all content. SSIM is the upgrade from PSNR that's affordable.

#SSIM in BD-rate analysis

BD-rate SSIM (or BD-rate MS-SSIM) is computed the same way as BD-rate PSNR — encode at multiple bitrates, plot bitrate vs SSIM, compute area between curves. The interpretation is analogous: "codec B has -25% BD-rate MS-SSIM vs codec A" means codec B reaches the same MS-SSIM at 25% lower bitrate.

For codec research papers, BD-rate MS-SSIM is increasingly cited alongside BD-rate PSNR. It provides a perceptually-relevant complement to the PSNR baseline. BD-rate VMAF is the modern third metric.
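
A minimal BD-rate sketch over (bitrate, SSIM) points in the classic Bjontegaard style: fit a cubic polynomial of log-rate as a function of quality for each codec, integrate both over the overlapping quality range, and convert the average log-rate gap to a percentage. This is a compact illustration, not a validated BD-rate implementation:

```python
import numpy as np

def bd_rate(rates_a, ssim_a, rates_b, ssim_b):
    """Bjontegaard delta-rate between two rate-quality curves.
    Negative result => codec B reaches the same quality at lower rate."""
    la, lb = np.log(rates_a), np.log(rates_b)
    # Cubic fit of log-rate against quality for each curve.
    pa = np.polyfit(ssim_a, la, 3)
    pb = np.polyfit(ssim_b, lb, 3)
    # Integrate over the overlapping quality interval only.
    lo = max(min(ssim_a), min(ssim_b))
    hi = min(max(ssim_a), max(ssim_b))
    ia, ib = np.polyint(pa), np.polyint(pb)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_b = (np.polyval(ib, hi) - np.polyval(ib, lo)) / (hi - lo)
    return (np.exp(avg_b - avg_a) - 1) * 100
```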

#SSIM implementation gotchas

A few details that matter when computing SSIM in production:

  • Window size and shape — the original SSIM paper used an 11×11 circular-symmetric Gaussian-weighted window (σ = 1.5); many fast implementations use an 8×8 uniform window instead. Different implementations default to different choices, producing slightly different absolute SSIM values for the same content. Pick one implementation and stay with it across your evaluations.
  • Color plane weighting — when reporting an aggregate SSIM, the per-plane (Y, U, V) weighting matters. Standard weighting follows per-plane pixel counts (4:1:1 for 4:2:0, since the luma plane carries four times the samples of each chroma plane). Some tools report an unweighted average; verify what your tool is doing.
  • Luma-only vs full-color — SSIM-Y is the most-cited single number, but for chroma-heavy artifacts (banding in skies, color compression artifacts) SSIM-U or SSIM-V can be more discriminating. Report all three for diagnostic purposes; report Y for headline numbers.
  • Bit-depth handling — SSIM with 10-bit or 12-bit content needs to account for the larger dynamic range. Some implementations normalize to 0-1 internally; others operate in the native bit depth. Verify behavior on HDR content specifically.
  • Sub-pixel artifacts — SSIM at the pixel level can miss sub-pixel positional artifacts that are visible to viewers. For content with chromatic aberration concerns or sub-pixel rendering, supplement SSIM with golden-eyes review.
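
For the plane-weighting gotcha, a small sketch of pixel-count weighting — weights follow plane sizes for each subsampling scheme (4:1:1 for 4:2:0 because luma has four times the samples of each chroma plane). The function name is illustrative; verify against your tool's documented aggregation before relying on it:

```python
def ssim_all(y, u, v, subsampling="420"):
    """Pixel-count-weighted aggregate of per-plane SSIM scores."""
    weights = {"420": (4, 1, 1), "422": (2, 1, 1), "444": (1, 1, 1)}[subsampling]
    wy, wu, wv = weights
    return (wy * y + wu * u + wv * v) / (wy + wu + wv)
```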

#What MpegFlow does with SSIM

SSIM (or MS-SSIM) runs as a discrete measurement stage in MpegFlow's DAG runtime via the FFmpeg ssim filter, exposed through the quality-analysis node alongside PSNR and VMAF. The stage executes on an FfmpegExecutor worker; cross-stage data flow wires the encode output and reference source into measurement input. Customers running comprehensive quality analysis configure all three metrics as parallel sibling measurement stages in the same workflow.

The operational pattern: PSNR for fast regression detection, SSIM for perceptual sanity checks, VMAF for production decision-making. PSNR catches encoder-version regressions cheaply; SSIM catches perceptual issues PSNR misses; VMAF makes the production calls. Each metric has a role; running them together is cheap enough that the redundancy is worth the cross-validation.

The strict-broker security model treats quality computation like any other workflow stage — workers receive content via short-lived presigned URLs, compute the metric, write results to customer-controlled metadata, and dispose of access. Quality metrics aren't sensitive in the security sense, but the discipline is consistent.

For customers building their quality programs, the standing recommendation is: compute all three (PSNR, SSIM, VMAF) where compute budget allows; use VMAF for production decisions; use SSIM as the cross-validation signal that catches perceptual issues VMAF might miss; use PSNR for fast regression sanity checks. The metrics together are stronger than any one alone.

Tags
  • ssim
  • quality
  • ms-ssim
  • psnr
  • vmaf
  • codec-evaluation
See also

  • PSNR — the classic quality metric, why it persists, and where it fails
  • BD-rate calculation — how to compare codecs and encoder configurations
  • VMAF — Netflix's quality metric and the modern reference for video quality measurement