MP4 is the most-deployed video container format in the world. Virtually every device that plays video opens MP4 files. Every major streaming service produces MP4 in some variant (progressive MP4, fragmented MP4, CMAF). Nearly every major codec ships in MP4 as one of its primary containers. Understanding what MP4 actually is — the box-based structure, the difference between progressive and fragmented variants, the relationship to ISOBMFF and MOV — is foundational to working in video pipelines. This page is the engineering reference.
What MP4 is
MP4 (formally ISO/IEC 14496-14, MPEG-4 Part 14) is a specific profile of the ISO Base Media File Format (ISOBMFF, formally ISO/IEC 14496-12). The relationship:
- ISOBMFF is the base container specification. It defines the structure but not the specific content.
- MP4 is the MPEG-4-specific profile that constrains ISOBMFF for MPEG-4 audio and video carriage.
- MOV is Apple's QuickTime container, the ancestor of ISOBMFF. Practically interchangeable with MP4 for most purposes.
- CMAF (Common Media Application Format) is a specific profile of fragmented ISOBMFF for streaming.
- 3GP (used in mobile) is another ISOBMFF profile.
These all share the underlying box-based structure. Most tools that read MP4 also read MOV, fragmented MP4, CMAF, and 3GP — they're variations on the same theme.
The box structure
ISOBMFF (and therefore MP4) is built from boxes — also called atoms in the older MOV terminology. Each box has:
- Size (4 bytes) — total size including the header.
- Type (4 bytes) — four-character code (FourCC) identifying the box type.
- Optional extended size — when the 32-bit size field is 1, a 64-bit size follows the type (used for boxes over 4 GB).
- Payload — content specific to the box type.
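The header layout above can be sketched as a minimal parser. This is an illustrative Python sketch, assuming a well-formed buffer; it handles the 32-bit size, the FourCC type, and the 64-bit extended size, but ignores special cases like `uuid` boxes:

```python
import struct

def parse_box_header(buf, offset=0):
    """Parse one ISOBMFF box header at `offset` in a bytes buffer.

    Returns (box_type, total_size, header_size). A 32-bit size of 1
    means a 64-bit extended size follows the type; a size of 0 means
    the box extends to the end of the file (common for a final mdat).
    """
    size, = struct.unpack_from(">I", buf, offset)          # 4-byte size
    box_type = buf[offset + 4:offset + 8].decode("ascii")  # 4-byte FourCC
    header_size = 8
    if size == 1:                                          # extended size
        size, = struct.unpack_from(">Q", buf, offset + 8)
        header_size = 16
    return box_type, size, header_size
```

Because the size field counts the header itself, a reader can always advance `total_size` bytes to reach the next sibling box.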
Boxes can contain other boxes (nested), creating a tree structure. The major top-level boxes in a typical MP4 file:
ftyp (file type)
moov (movie metadata)
├── mvhd (movie header)
├── trak (track) — one per video/audio/subtitle track
│ ├── tkhd (track header)
│ ├── mdia (media)
│ │ ├── mdhd (media header)
│ │ ├── hdlr (handler reference)
│ │ └── minf (media information)
│ │ └── stbl (sample table) — describes how samples are stored
│ └── edts (edit list) — optional, defines edits/offsets
└── udta (user data) — optional metadata
mdat (media data) — the actual encoded bits
The moov box is the metadata describing what's in the file; mdat is the actual media payload. The stbl (sample table) within each track is the index that maps from time/sample number to byte offset in mdat.
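To make the stbl-as-index idea concrete, here is a hedged sketch of the lookup it enables: mapping a sample number to a byte offset in mdat from simplified stsc (sample-to-chunk), stco (chunk offsets), and stsz (sample sizes) contents. The inputs are pre-decoded Python lists, not the raw box payloads, and the helper name is illustrative:

```python
def sample_offset(n, stsc, stco, stsz):
    """Byte offset of sample n (0-based), from simplified sample-table data.

    stsc: list of (first_chunk, samples_per_chunk) runs, 1-based first_chunk
          as stored in the box; stco: chunk byte offsets; stsz: sample sizes.
    """
    # Expand stsc runs into a samples-per-chunk value for every chunk.
    per_chunk = []
    for i, (first, count) in enumerate(stsc):
        last = stsc[i + 1][0] if i + 1 < len(stsc) else len(stco) + 1
        per_chunk += [count] * (last - first)
    # Walk chunks until we find the one containing sample n.
    sample = 0
    for chunk, count in enumerate(per_chunk):
        if n < sample + count:
            # Chunk start plus the sizes of earlier samples in this chunk.
            return stco[chunk] + sum(stsz[sample:n])
        sample += count
    raise IndexError("sample out of range")
```

A real demuxer does the same walk (plus stts for timing), which is exactly why a player needs moov before it can locate any media in a progressive file.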
Progressive MP4 vs fragmented MP4
MP4 has two structural variants for streaming:
Progressive MP4 (the original) — moov followed by mdat. Player must read all of moov before it can start playing (because moov contains the byte offsets needed to find media samples). For streaming, this means the player downloads metadata first, then media. Workable for VOD; awkward for live.
Fragmented MP4 (fMP4) — small moov followed by sequence of moof + mdat pairs (movie fragments). Each fragment has self-contained metadata and media data; players can start playing as fragments arrive without waiting for the whole moov. Designed for streaming.
The fMP4 structure:
ftyp
moov (initialization — codec config, but no per-fragment indexing)
moof (fragment metadata)
mdat (fragment media)
moof (next fragment metadata)
mdat (next fragment media)
...
For modern streaming (HLS with fMP4 segments, DASH, CMAF), fMP4 is what's used. Progressive MP4 still works for VOD download, but streaming infrastructure is built around fMP4.
Faststart for progressive MP4
For progressive MP4 used over HTTP, the moov box position matters. By default, encoders write moov at the END of the file (because the encoder needs to know all sample positions before it can write the index). For HTTP streaming, this means the player has to download the entire file before it can start playback.
The fix: faststart mode — relocate moov to the start of the file after encoding. Then the player downloads moov first (small), starts playback, and downloads mdat (large) progressively.
ffmpeg with faststart:
ffmpeg -i input -c:v libx264 -movflags +faststart output.mp4
For VOD progressive download, faststart is essentially mandatory. Without it, playback can't start until the whole file is downloaded — terrible UX.
For fragmented MP4, faststart isn't needed — the structure is inherently streaming-friendly.
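Whether a file is faststart can be verified by scanning the top-level box order. A Python sketch under the same header layout described earlier (assumes a well-formed file loaded as bytes; `check_faststart` is an illustrative name):

```python
import struct

def check_faststart(data: bytes) -> bool:
    """True if moov precedes mdat among top-level boxes."""
    types, offset = [], 0
    while offset + 8 <= len(data):
        size, = struct.unpack_from(">I", data, offset)
        types.append(data[offset + 4:offset + 8].decode("ascii", "replace"))
        if size == 1:      # 64-bit extended size follows the type
            size, = struct.unpack_from(">Q", data, offset + 8)
        elif size == 0:    # box runs to the end of the file
            size = len(data) - offset
        offset += size     # size includes the header, so this skips the box
    return ("moov" in types and "mdat" in types
            and types.index("moov") < types.index("mdat"))
```

The same scan is a useful smoke test in a delivery pipeline: run it on encoder output before publishing progressive VOD assets.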
CMAF — the streaming profile
CMAF (Common Media Application Format, ISO/IEC 23000-19) is a specific profile of fragmented MP4 designed for streaming. It constrains the fMP4 structure to ensure interoperability between HLS and DASH players:
- Single track per segment — each CMAF segment contains exactly one media track.
- Specific box structure — only certain box types and arrangements are allowed.
- Defined timing model — tfdt (track fragment decode time) for absolute timing.
- Encryption profile — Common Encryption (CENC) with specific cbcs configuration for multi-DRM compatibility.
Same underlying fMP4 structure; tighter rules for streaming compatibility. See the CMAF topic for the full reference.
Codec compatibility
MP4 supports a wide range of codecs:
Video:
- H.264 (AVC) — universal.
- H.265 (HEVC) — widely supported on modern devices, strongest in the Apple ecosystem; browser support is narrower than H.264.
- AV1 — supported in CMAF and modern MP4.
- VP9 — supported but uncommon (typically WebM is preferred for VP9).
- VVC (H.266) — supported in newer MP4 spec versions.
- MPEG-4 Part 2 (DivX, Xvid) — historical; rarely used now.
- ProRes — supported but typically delivered in MOV.
Audio:
- AAC (LC, HE, HE v2, xHE) — universal.
- AC-3 (Dolby Digital) — broadcast and surround content.
- E-AC-3 (Dolby Digital Plus) — Atmos delivery via E-AC-3 JOC.
- ALAC (Apple Lossless) — supported.
- Opus — supported (since iOS 11+, modern Android).
- MP3 — supported (legacy).
Subtitles / timed text:
- TTML / IMSC — supported.
- WebVTT — supported.
- TX3G (Apple's MP4-native subtitle format) — supported.
Other:
- ID3 timed metadata.
- HEIF (still images, used for poster frames).
The codec carriage uses standardized FourCC identifiers: avc1 for H.264, hev1/hvc1 for HEVC, av01 for AV1, mp4a for AAC, etc.
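A lookup table for the common identifiers (an illustrative subset, not an exhaustive registry; the avc1/avc3 and hvc1/hev1 pairs differ in whether parameter sets live in the sample entry or in-band):

```python
# Common MP4 sample-entry FourCCs and the codecs they signal.
# Illustrative subset only; descriptions are informal.
FOURCC_CODECS = {
    "avc1": "H.264/AVC (parameter sets in the sample entry)",
    "avc3": "H.264/AVC (parameter sets in-band)",
    "hvc1": "H.265/HEVC (parameter sets in the sample entry)",
    "hev1": "H.265/HEVC (parameter sets in-band)",
    "av01": "AV1",
    "vp09": "VP9",
    "mp4a": "AAC (audio object type carried in the esds box)",
    "ac-3": "AC-3 (Dolby Digital)",
    "ec-3": "E-AC-3 (Dolby Digital Plus)",
    "Opus": "Opus",
    "mp4v": "MPEG-4 Part 2 video",
}
```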
Edit lists
MP4 supports edit lists — metadata that describes how to play back the media. An edit list can:
- Skip ranges — play sample 0 through 100, skip 101-200, play 201-300.
- Repeat ranges — play sample 0 through 100 twice.
- Time offset — start playback at sample 50, treating it as time 0.
Edit lists are common in editorial workflows (where the source has trimmed sections that should be hidden from playback) and in some streaming contexts (where leading frames need to be skipped during decoding).
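The playback mapping an edit list defines can be sketched as a presentation-time to media-time lookup. This is a simplified model of elst semantics: media_rate is ignored, timescales are passed explicitly (elst stores durations in the movie timescale and media_time in the media timescale), and the names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Edit:
    segment_duration: int  # presentation duration, in the movie timescale
    media_time: int        # start point in the media timescale; -1 = empty edit

def presentation_to_media(pt, edits, movie_ts, media_ts):
    """Map a presentation time (movie timescale) to a media time.

    Empty edits (media_time == -1) map to None: nothing plays there.
    """
    offset = 0
    for e in edits:
        if pt < offset + e.segment_duration:
            if e.media_time == -1:
                return None
            within = pt - offset  # movie-timescale units into this edit
            return e.media_time + within * media_ts // movie_ts
        offset += e.segment_duration
    return None  # past the end of the edit list
```

The skip/offset behaviors described above fall out of this mapping, which is also why a player that ignores elst entirely plays different content than one that honors it.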
The catch: not all players honor edit lists correctly. Some assume edit lists are absent and play all samples in mdat. Some honor them but with bugs. For streaming reliability, content with edit lists should be carefully tested across target players.
DRM and encryption
MP4 supports content encryption via Common Encryption (CENC, ISO/IEC 23001-7):
- Encryption boxes within trak describe the encryption parameters.
- pssh (Protection System Specific Header) boxes carry DRM-specific data (Widevine, PlayReady).
- Sample-level encryption with optional patterns.
- AES-CTR and AES-CBC (cbcs) modes supported.
For streaming, CENC + cbcs in CMAF segments is the multi-DRM unlock that lets one set of encrypted content serve Widevine, FairPlay, and PlayReady.
Streaming use of MP4
Different MP4 variants for different streaming patterns:
- Progressive MP4 with faststart — VOD download/progressive play. Single file delivery. Used by Twitter, Reddit (for short-form video), some social embed playback.
- Fragmented MP4 — modern streaming. Used as CMAF segments by modern HLS and DASH.
- fMP4 with byte ranges — single-file fMP4 with players using HTTP byte-range requests to fetch fragments. Used by some VOD platforms.
For 2026 production streaming, CMAF segments (fMP4 with the CMAF profile constraints) is the dominant pattern.
MP4 vs MKV/WebM
The container choice question often comes up. The honest comparison:
| Dimension | MP4 (ISOBMFF) | MKV/WebM |
|---|---|---|
| Industry adoption | Universal | Web ecosystem strong |
| Codec flexibility | Wide but defined | Very flexible (any codec) |
| Streaming spec | CMAF defined for streaming | No formal streaming profile |
| Browser support | Universal native | Most browsers via WebM |
| Apple ecosystem | Native | Limited |
| Open source community | Common | Preferred |
| File size overhead | Modest | Slightly less |
For streaming infrastructure: MP4 (specifically CMAF). For open-source community use, archival, or Linux desktop video: MKV/WebM is often preferred.
MP4 vs MOV
MOV is the QuickTime container; MP4 derived from MOV. The differences are mostly historical:
- MOV calls its building blocks "atoms"; MP4 calls them "boxes." Same concept, different terminology.
- MOV supports some legacy features (older Apple-specific codecs, richer edit-list semantics) that aren't in MP4.
- MP4 supports CMAF profile and standardized streaming features that MOV doesn't define.
- ProRes is officially carried in MOV; technically it can be in MP4 but tooling is built around MOV ProRes.
For most practical purposes, MP4 and MOV are interchangeable. ffmpeg uses the same demuxer (mov,mp4,m4a,3gp,3g2,mj2) for all of them. See the MOV topic for the editorial-workflow context.
Operational considerations
Things that matter for MP4 in production:
- Codec FourCC selection — avc1 vs avc3 for H.264, hvc1 vs hev1 for HEVC. Subtle differences in how parameter sets are signaled. Some players prefer specific FourCCs.
- Faststart for progressive — verify moov is at file start for VOD delivery.
- Edit list testing — content with edit lists needs cross-player testing.
- Box ordering — some players are strict about box order; non-standard orderings can cause playback issues.
- Chunked transfer encoding compatibility — for fMP4 over HTTP, chunked transfer is common. Ensure CDN supports it.
- Color metadata preservation — colr, mdcv, clli boxes for HDR signaling must be preserved through any re-multiplexing.
What MpegFlow does with MP4
MpegFlow's DAG runtime expresses MP4 muxing as part of the per-rendition encode stage; each FfmpegExecutor rendition produces an MP4 (progressive or fragmented per workflow spec) which downstream stages consume via cross-stage data flow. The partitioner persists each stage to job_stages with explicit dependency tracking; per-stage retry handles transient failures; rendition-level partial-success reporting surfaces granular state when a subset fails.
For streaming workflows, the default output is CMAF / fragmented MP4 — the fMP4 profile that the downstream HLS and DASH packaging stages reference from one rendition output. Today's CMAF emission is FfmpegExecutor-driven (the FFmpeg CMAF muxer producing ftyp, moov init, and moof+mdat media boxes).
For VOD delivery (downloads, progressive web embeds), progressive MP4 with faststart is supported as an output type via FFmpeg's -movflags +faststart; the muxer rewrites the file with moov at the start.
Advanced MP4 packaging behaviors that depend on dedicated packagers — multi-DRM CENC signaling, sophisticated edit-list authoring, certain low-latency CMAF variants — are part of the Phase 2D / Shaka Packager roadmap and are not currently shipped runtime executors. Color metadata, codec configuration, and SEI-carried metadata pass through whatever the FFmpeg muxer preserves; timestamp-preservation discipline at stage boundaries keeps timing consistent across the workflow.
The strict-broker security model handles MP4 muxing like any pipeline payload — workers carry no ambient credentials; content access flows through short-lived presigned URLs scoped per stage; access is disposed on completion.
For customers with specific MP4 requirements (faststart for embedded playback, edit list use cases, specific FourCC choices), the workflow YAML supports per-output configuration on the encode stage. Defaults produce streaming-optimized CMAF; alternative MP4 forms are configured explicitly.
The MP4 container is one of the parts of video infrastructure that "just works" once you understand the box structure. The complexity is in the codec inside the container, not the container itself. Get the codec configuration right; let the container do its job.