I-frames, P-frames, and B-frames are the three fundamental frame types in modern video codecs. Every video stream is a sequence of these frame types organized by the encoder's rate-distortion logic. Understanding what each is, how they reference each other, and what tradeoffs they enable is foundational to making good encoder configuration decisions — especially for live streaming where B-frames have latency implications, and for ABR streaming where keyframe placement determines adaptation behavior. This page is the engineering reference.
What each frame type is
I-frame (Intra-coded frame) — a complete picture, encoded standalone without reference to any other frame. Compression uses spatial techniques only (transform coding, intra prediction within the frame). I-frames are the largest frame type but enable random access — the decoder can start at any I-frame without prior context.
P-frame (Predicted frame) — encoded as the difference from one or more previous frames (forward prediction). The encoder finds blocks in the previous reference frame that match blocks in the current frame; the encoded P-frame stores motion vectors plus residual differences. P-frames are typically 25-50% the size of I-frames.
B-frame (Bi-directionally predicted frame) — encoded as the difference from frames in both the past AND future. The encoder can pick blocks from either direction; this gives more compression flexibility, especially for content with smooth motion or temporal symmetry. B-frames are typically the smallest frame type — 10-25% the size of I-frames.
The compression-efficiency hierarchy: B > P > I. The size hierarchy: I > P > B. The encoder uses these tradeoffs to fit the rate-distortion budget — more I-frames means lower compression efficiency but more random access points; more B-frames means better compression but higher decoder complexity and latency.
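To make the size tradeoff concrete, the sketch below estimates the relative size of a GOP from its frame-type pattern. The per-type weights are assumptions taken from inside the ranges quoted above (P at 0.35 and B at 0.15 of an I-frame), not measurements, and `gop_relative_size` is a hypothetical helper.

```python
# Illustrative relative sizes (I-frame = 1.0). These are assumed values
# chosen from inside the ranges quoted in the text, not measured data.
RELATIVE_SIZE = {"I": 1.0, "P": 0.35, "B": 0.15}


def gop_relative_size(pattern: str) -> float:
    """Sum the relative sizes of every frame in a pattern such as 'IBBP'."""
    return sum(RELATIVE_SIZE[frame_type] for frame_type in pattern)


for pattern in ("IPPPPPPPPP", "IBBPBBPBBP"):
    print(pattern, round(gop_relative_size(pattern), 2))
```

With these assumed weights the all-P pattern costs about 4.15 I-frame equivalents and the B-frame pattern about 2.95. Real savings are usually smaller, because frame sizes depend on content and on how far each frame sits from its references.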
Reference frames
P-frames and B-frames reference other frames. The "reference frame" concept:
- A P-frame at time t typically references the most recent I or P frame.
- A B-frame at time t references frames before AND after it in display order.
- Modern codecs (H.264 high profiles, HEVC, AV1) support multiple reference frames — a P-frame can pick blocks from any of the most recent N frames.
The reference distance (how far back the encoder can look) and reference count are codec-specific:
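- H.264: up to 16 reference frames, with the practical ceiling set by the level's decoded picture buffer size at a given resolution.
- HEVC: a decoded picture buffer of up to 16 pictures, again bounded by level and resolution.
- AV1: more flexible; 8 reference frame slots, of which each inter frame can use up to 7, with sophisticated reference management.
- VP9: a similar slot-based design, with 8 reference slots and up to 3 references per frame.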
More references = better compression (more block-match candidates) but more decoder memory and complexity. Production encoders typically use 4-6 references; going higher gives diminishing returns.
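The memory side of this tradeoff is simple arithmetic. The sketch below is a rough lower bound under stated assumptions: planar 4:2:0 storage, no padding or alignment, and no per-picture metadata such as stored motion vectors. `reference_memory_bytes` is a hypothetical helper, and real decoders allocate more than this.

```python
def reference_memory_bytes(width: int, height: int, num_refs: int,
                           bit_depth: int = 8) -> int:
    """Approximate memory for num_refs decoded 4:2:0 reference pictures."""
    # Samples above 8 bits are assumed to be stored in 16-bit words.
    bytes_per_sample = 1 if bit_depth <= 8 else 2
    # 4:2:0 keeps one full-size luma plane plus two quarter-size chroma
    # planes, which comes to 1.5 samples per pixel.
    samples_per_frame = width * height * 3 // 2
    return samples_per_frame * bytes_per_sample * num_refs


for refs in (4, 16):
    mib = reference_memory_bytes(1920, 1080, refs) / (1024 * 1024)
    print(f"{refs} references at 1080p: {mib:.1f} MiB")
```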
Decode order vs display order
B-frames create a distinction between decode order (the order frames must be decoded to produce output) and display order (the order frames are shown to the viewer).
A typical GOP with B-frames in display order:
I B B P B B P B B P
0 1 2 3 4 5 6 7 8 9
In decode order (which the encoder writes to the bitstream):
I P B B P B B P B B
0 3 1 2 6 4 5 9 7 8
The P-frames are decoded first because the B-frames between them need both their past (I or earlier P) and future (later P) as references. The decoder reorders frames after decoding to display them in the correct sequence.
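The reordering above can be reproduced with a few lines of code. This is a minimal sketch for flat (non-pyramid) patterns only; `decode_order` is a hypothetical helper, not part of any encoder API.

```python
def decode_order(display_types: str) -> list[int]:
    """Return display indices in decode order for a flat (non-pyramid) GOP.

    Each run of B-frames is emitted after the I/P anchor that follows it,
    because that anchor is the B-frames' future reference.
    """
    order: list[int] = []
    pending_b: list[int] = []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(index)   # must wait for the next anchor
        else:
            order.append(index)       # the anchor (I or P) is written first
            order.extend(pending_b)   # then the B-frames it unblocks
            pending_b.clear()
    # Trailing B-frames with no later anchor are simply appended here to
    # keep the sketch short.
    order.extend(pending_b)
    return order


print(decode_order("IBBPBBPBBP"))  # [0, 3, 1, 2, 6, 4, 5, 9, 7, 8]
```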
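The implication: decoders need a buffer large enough to hold frames that are waiting to be reordered. The required depth depends on the B-frame structure rather than on run length alone: a flat run of non-reference B-frames needs only one held-back picture, while hierarchical B-frames need more as the hierarchy deepens. This affects: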
- Decoder memory — more reordering = more memory.
- Decode latency — for live, the decoder can't display a frame until all earlier-in-display-order frames are decoded.
- Hardware decoder constraints — some hardware decoders cap the reordering buffer size, limiting acceptable B-frame patterns.
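One way to quantify the reordering requirement is to count, for each frame, how many frames are decoded before it but displayed after it, and take the maximum. That is similar in spirit to the reorder-frame count that H.264 and HEVC bitstreams signal. The sketch below assumes frames are identified by their display index; `reorder_depth` is a hypothetical helper.

```python
def reorder_depth(frames_in_decode_order: list[int]) -> int:
    """Max number of frames decoded before a frame but displayed after it."""
    depth = 0
    for position, frame in enumerate(frames_in_decode_order):
        # Frames already decoded whose display index is later than this one
        # must be held back until this frame has been shown.
        held_back = sum(1 for earlier in frames_in_decode_order[:position]
                        if earlier > frame)
        depth = max(depth, held_back)
    return depth


flat = [0, 3, 1, 2, 6, 4, 5, 9, 7, 8]   # I B B P B B P B B P
pyramid = [0, 4, 2, 1, 3]               # I B B B P with a middle reference B
print(reorder_depth(flat), reorder_depth(pyramid))
```

For the flat pattern this gives 1, and for the small pyramid it gives 2. Decoder-side reordering therefore grows with hierarchy depth rather than with the raw number of B-frames in a flat run.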
B-pyramid (hierarchical B-frames)
Modern codecs (H.264 high profile, HEVC, AV1) support B-pyramid encoding. Instead of a flat B-frame layer between P-frames, B-pyramid uses hierarchical B-frame layers:
I B B B P B B B P
Level 0: I, P, P (anchor frames)
Level 1: middle B of each run (references both surrounding I/P anchors)
Level 2: outer B-frames of each run (reference an I/P anchor and the middle B)
The hierarchical structure provides:
- Better compression — outer B-frames have more reference options.
- Temporal scalability — dropping outer layers produces a lower-frame-rate version of the same stream. Useful for adaptive streaming.
- Encoder parallelism: B-frames in the same layer do not reference each other, so once the layers below them are done they can be encoded in parallel.
B-pyramid is the standard approach in modern HEVC/AV1 encoders. It adds modest decoder complexity for meaningful compression gains.
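The layering can be sketched by repeatedly bisecting the run of B-frames between two anchors. This is an illustrative dyadic assignment under the assumption of a perfectly balanced hierarchy; real encoders choose their own structure, and `pyramid_levels` is a hypothetical helper.

```python
def pyramid_levels(num_b: int) -> list[int]:
    """Assign a hierarchy level to each B-frame between two anchors.

    Level 1 is the middle B-frame; each deeper level bisects the gaps left
    by the previous one. Anchors (I/P) are level 0 and are not returned.
    """
    levels = [0] * num_b

    def split(lo: int, hi: int, level: int) -> None:
        # lo and hi are inclusive positions within the B-frame run.
        if lo > hi:
            return
        mid = (lo + hi) // 2
        levels[mid] = level
        split(lo, mid - 1, level + 1)
        split(mid + 1, hi, level + 1)

    split(0, num_b - 1, 1)
    return levels


print(pyramid_levels(3))  # [2, 1, 2]
print(pyramid_levels(7))  # [3, 2, 3, 1, 3, 2, 3]
```

Dropping every frame at the deepest level of the 3-frame run leaves anchor, middle B, anchor, which is the half-frame-rate stream that temporal scalability refers to.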
Frame type selection by the encoder
The encoder decides frame type per frame based on rate-distortion optimization. Factors:
- Position in GOP — first frame of GOP must be I; specific positions might be P or B based on configured pattern.
- Scene-change detection — if scene-change-keyframe insertion is enabled, scene changes become I-frames mid-GOP.
- Look-ahead analysis — the encoder can analyze upcoming frames to decide where to place B-frames most effectively.
- B-frame budget — configured maximum B-frames between P-frames.
- Reference picture buffer state — whether the necessary references are available.
The encoder configuration sets the constraint envelope; the encoder picks specifics within that envelope.
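The idea of a constraint envelope can be illustrated with a deliberately simplified policy. The sketch below is a toy, not how x264 or any real encoder decides: it has no rate-distortion or look-ahead logic and simply fills each GOP with as many B-frames as the configured maximum allows. `assign_frame_types` and its parameters are hypothetical.

```python
def assign_frame_types(num_frames: int, gop_size: int, max_b: int,
                       scene_changes: frozenset[int] = frozenset()) -> str:
    """Toy frame-type assignment in display order (no rate-distortion logic)."""
    # Pass 1: keyframes at GOP boundaries and at detected scene changes.
    key_positions: set[int] = set()
    last_key = None
    for index in range(num_frames):
        if last_key is None or index - last_key >= gop_size or index in scene_changes:
            key_positions.add(index)
            last_key = index

    # Pass 2: fill the gaps, respecting the B-frame budget.
    types: list[str] = []
    b_run = 0
    for index in range(num_frames):
        # A B-frame needs a later anchor inside its own GOP, so the frame
        # just before a keyframe (or the end of the stream) becomes a P.
        next_is_boundary = index + 1 == num_frames or index + 1 in key_positions
        if index in key_positions:
            types.append("I")
            b_run = 0
        elif b_run >= max_b or next_is_boundary:
            types.append("P")
            b_run = 0
        else:
            types.append("B")
            b_run += 1
    return "".join(types)


print(assign_frame_types(10, gop_size=10, max_b=2))
print(assign_frame_types(10, gop_size=10, max_b=2, scene_changes=frozenset({5})))
```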
B-frame configuration
ffmpeg encoder configuration for B-frames:
x264:
-c:v libx264 -bf 3 -b_strategy 2
-bf 3 allows up to 3 B-frames between P-frames; -b_strategy 2 enables adaptive B-frame placement based on lookahead analysis.
x265:
-c:v libx265 -x265-params "bframes=8:b-adapt=2:b-pyramid=1"
bframes=8 allows up to 8 B-frames; b-adapt=2 enables adaptive placement; b-pyramid=1 enables hierarchical B-frames.
SVT-AV1:
B-frame management in AV1 uses different terminology. AV1's reference structure is more flexible; the encoder configuration uses hierarchical-levels for B-pyramid-like structures:
-c:v libsvtav1 -svtav1-params "hierarchical-levels=4"
NVENC HEVC:
-c:v hevc_nvenc -bf 4 -b_ref_mode middle
-bf 4 allows 4 B-frames; -b_ref_mode middle lets the middle B-frame of each run serve as a reference, which gives a hierarchical B-frame structure.
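When these options are driven from a script, passing them as an argument list avoids shell-quoting problems with the colon-separated parameter strings. The sketch below only assembles the command. The option values are copied from the lines above, while `build_command`, `input.mp4`, and `output.mp4` are placeholders introduced for illustration.

```python
import shlex

# B-frame options per encoder, copied from the command lines in the text.
B_FRAME_ARGS = {
    "libx264": ["-bf", "3", "-b_strategy", "2"],
    "libx265": ["-x265-params", "bframes=8:b-adapt=2:b-pyramid=1"],
    "libsvtav1": ["-svtav1-params", "hierarchical-levels=4"],
    "hevc_nvenc": ["-bf", "4", "-b_ref_mode", "middle"],
}


def build_command(encoder: str, src: str, dst: str) -> list[str]:
    """Build an ffmpeg argument list; raises KeyError for unknown encoders."""
    return ["ffmpeg", "-i", src, "-c:v", encoder, *B_FRAME_ARGS[encoder], dst]


# Print the command for inspection; pass the list to subprocess.run to execute.
print(shlex.join(build_command("libx265", "input.mp4", "output.mp4")))
```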
B-frames and live latency
B-frames have a meaningful latency cost in live streaming. Because B-frames reference future frames, the encoder must look ahead to encode them — and the decoder must wait for the reference future frame before it can decode the current B-frame.
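For a B-frame pattern of "I B B B P", the encoder cannot emit the first B-frame until it has received and encoded the P-frame that follows the run, so all 5 frames must be in hand first. At 30 fps, those 5 frames span ~167 ms; the first B-frame itself is held back 3 frame intervals (~100 ms) waiting for its future reference, before any encode time or additional look-ahead is counted.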
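The arithmetic behind these figures can be sketched as follows. Both helpers are hypothetical and ignore encode time and any extra look-ahead. `mini_gop_span_ms` counts the whole anchor-to-anchor group (5 frames for "I B B B P"), while `b_frame_wait_ms` is a narrower measure: how long the first B-frame waits for its future anchor to be captured.

```python
def mini_gop_span_ms(consecutive_b: int, fps: float) -> float:
    """Duration of anchor + B-run + anchor, e.g. 5 frames for 'I B B B P'."""
    return (consecutive_b + 2) * 1000.0 / fps


def b_frame_wait_ms(consecutive_b: int, fps: float) -> float:
    """Time the first B-frame in a run waits for its future anchor."""
    # The first B-frame is captured `consecutive_b` frame intervals before
    # the anchor that it needs as a future reference.
    return consecutive_b * 1000.0 / fps


print(round(mini_gop_span_ms(3, 30), 1))  # 166.7
print(round(b_frame_wait_ms(3, 30), 1))   # 100.0
```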
For low-latency live, the typical approach is to disable B-frames:
-c:v libx264 -bf 0 -tune zerolatency
-bf 0 disables B-frames entirely; -tune zerolatency applies several latency-reducing optimizations including B-frame disabling. The compression cost is real (~10-20% bitrate increase for equivalent quality) but the latency reduction matters more for live use cases.
For low-latency live with HW encoders, similar disable:
-c:v hevc_nvenc -bf 0 -preset p1
For VOD or non-latency-critical live, enable B-frames for the compression benefit.
Hardware decoder constraints
Hardware decoders sometimes have limits that affect B-frame and reference frame choices:
- Reference picture buffer size — typically 2-16 reference frames depending on hardware generation. Modern hardware (2018+) has higher limits; older hardware caps lower.
- B-frame depth limits — some hardware can't decode beyond a certain hierarchical B-frame depth.
- Profile / level enforcement — H.264 levels and HEVC tiers/levels have constraints on max reference count, max B-frames, etc.
For consumer streaming, staying within H.264 High level 4.1 / HEVC Main 10 level 5.1 keeps you within broadly supported hardware decoder limits. Levels work by capping the decoded picture buffer, which bounds both the reference count and the B-frame reordering depth at a given resolution; exceeding those limits can cause playback failures on older hardware.
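For H.264, the level limit on references can be checked with the decoded picture buffer formula from Annex A of the specification. The sketch below is an illustration under assumptions: the `MAX_DPB_MBS` values are transcribed from the H.264 level table for a subset of levels and should be verified against the specification before being relied on, and `max_dpb_frames` is a hypothetical helper.

```python
import math

# MaxDpbMbs (decoded picture buffer size in macroblocks) for selected H.264
# levels. Transcribed from the level limits table; verify before relying on it.
MAX_DPB_MBS = {
    "3.1": 18000,
    "4.0": 32768,
    "4.1": 32768,
    "4.2": 34816,
    "5.0": 110400,
    "5.1": 184320,
}


def max_dpb_frames(level: str, width: int, height: int) -> int:
    """Frames that fit in the level's decoded picture buffer (capped at 16)."""
    mbs_wide = math.ceil(width / 16)   # macroblocks are 16x16 luma samples
    mbs_high = math.ceil(height / 16)
    return min(MAX_DPB_MBS[level] // (mbs_wide * mbs_high), 16)


print(max_dpb_frames("4.1", 1920, 1080))  # 4
print(max_dpb_frames("4.1", 1280, 720))   # 9
```

Under these values a 1080p stream at level 4.1 can hold 4 frames, so a 1080p encode configured with more references than that would be signaled at a higher level.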
Frame types in HLS/DASH manifests
The frame type structure isn't directly signaled in manifests — manifests describe segments, not individual frames. But frame type structure affects manifest behavior:
- Segment durations — segments must start at IDR keyframes (closed GOP boundaries). Frame type structure within segments is encoder-internal.
- Trick play (HLS I-frame playlists) — HLS supports separate I-frame-only playlists for trick play (fast-forward, scrubbing). Generated by extracting I-frames from the main streams.
- Adaptive switching — can occur only at IDR keyframes. The B-frame structure between IDRs is decoder-internal.
For most pipeline operators, frame type choice within a GOP is encoder configuration; manifest concerns are about GOP boundaries (IDR placement), not internal structure.
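A quick sanity check on GOP boundaries is to confirm that every segment start coincides with a keyframe timestamp. The sketch below assumes the IDR timestamps have already been extracted by some other tool and are given in seconds; `misaligned_boundaries` and its tolerance value are hypothetical.

```python
import math


def misaligned_boundaries(keyframe_times: list[float], segment_duration: float,
                          total_duration: float,
                          tolerance: float = 1e-3) -> list[float]:
    """Return segment start times that do not coincide with a keyframe."""
    misaligned: list[float] = []
    num_segments = math.ceil(total_duration / segment_duration)
    for n in range(num_segments):
        boundary = n * segment_duration
        # A boundary is aligned if some keyframe lies within the tolerance.
        if not any(abs(boundary - t) <= tolerance for t in keyframe_times):
            misaligned.append(boundary)
    return misaligned


# Keyframes every 2 s line up with 4 s segments; keyframes every 2.5 s do not.
print(misaligned_boundaries([0.0, 2.0, 4.0, 6.0, 8.0, 10.0], 4.0, 12.0))
print(misaligned_boundaries([0.0, 2.5, 5.0, 7.5, 10.0], 4.0, 12.0))
```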
Operational considerations
Things that matter for frame type configuration in production:
- B-frame disable for low-latency live — disable B-frames or use very few when targeting sub-2s latency.
- B-pyramid for VOD — enable for the compression benefit when latency isn't a concern.
- Reference frame count tuning — 4-6 references is the sweet spot for most content. Higher rarely helps.
- Hardware decoder testing — verify your B-frame configuration plays correctly on target devices, especially older hardware in the long tail.
- Encoder version updates — frame type heuristics can change between encoder versions. Test on representative content after upgrades.
Frame types and compression efficiency
A rough quality-vs-frame-types tradeoff at 1080p H.264 medium preset:
| Configuration | VMAF at 4 Mbps | Latency | Decoder complexity |
|---|---|---|---|
| No B-frames (-bf 0) | 88.5 | Low | Lowest |
| 2 B-frames (-bf 2) | 89.7 | Modest | Modest |
| 3 B-frames + B-pyramid | 90.5 | Higher | Higher |
| 8 B-frames + B-pyramid (HEVC) | 91.2 | Highest | Highest |
The numbers vary by content but the pattern is consistent: more B-frames = better compression at the cost of latency and decoder complexity.
For production decisions:
- Live, low-latency: -bf 0. Accept the compression cost.
- Live, standard latency: 2-3 B-frames. Balance.
- VOD baseline: 3 B-frames + B-pyramid. Good compression, moderate complexity.
- VOD premium: 4-8 B-frames + B-pyramid. Maximum compression for premium content.
What MpegFlow does with frame types
MpegFlow's DAG runtime configures B-frame counts and B-pyramid per rendition through workflow YAML; each value flows into the corresponding FfmpegExecutor stage's parameters. The partitioner persists each rendition stage to job_stages with dependency tracking; per-stage retry handles transient failures; per-rung tuning is straightforward (different B-frame counts on different tiers) because each rendition is its own stage.
Default configurations:
- Live workflows: B-frames disabled (-bf 0) for standard low-latency operation.
- Standard live: 2 B-frames for non-latency-critical live.
- VOD baseline: 3 B-frames + B-pyramid for good compression at moderate complexity.
- VOD premium: 4-8 B-frames + B-pyramid for premium content where compression matters most.
The packaging stage relies on segment-aligned IDRs from the encode stage (see GOP/keyframes for the force-key-frames discipline that makes this work); trick-play I-frame-only playlist generation, where required, runs as an additional stage with its own executor configuration.
The strict-broker security model treats frame-type configuration like any pipeline payload — workers carry no ambient credentials; content access flows through short-lived presigned URLs scoped per stage; access is disposed on completion.
For customers tuning frame types for their pipeline, the standing recommendation: disable B-frames for low-latency live (the compression cost is worth the latency savings); enable B-pyramid for VOD (the compression gain is worth the modest decoder complexity); test on actual target devices after configuration changes (hardware decoder constraints can cause playback failures that don't show in encoder testing). Frame type choice is one of the encoder configurations where defaults work for most cases; tuning matters for the edge cases where they don't.