Fragmented MP4 (fMP4) is the segment format that powers modern streaming. CMAF segments are fMP4 with specific profile constraints. Modern HLS uses fMP4. DASH uses fMP4. Understanding the internal structure — what each box does, how segments self-describe, how timing is encoded — is essential for debugging streaming pipelines, building custom packagers, and interpreting segment-level issues. This page is the engineering reference.
What fMP4 segments are
A fragmented MP4 segment is a self-contained fMP4 file that can be decoded standalone (given the corresponding initialization segment). Unlike progressive MP4 (single moov + single mdat), fMP4 uses repeating moof + mdat pairs:
ftyp ← file type identifier
moov ← initialization metadata (codec params, no per-fragment indexing)
moof ← fragment 1 metadata
mdat ← fragment 1 media data
moof ← fragment 2 metadata
mdat ← fragment 2 media data
moof ← fragment 3 metadata
mdat ← fragment 3 media data
...
Each moof + mdat pair is one fragment. Fragments can be encoded sequentially as content is processed; players can decode fragments as they arrive without needing the whole file.
For streaming, fragments are typically delivered as separate HTTP segments — each segment containing one or more moof + mdat pairs. The streaming structure becomes:
init.mp4 (one per variant):
ftyp + moov
segment-001.m4s, segment-002.m4s, ... (continuous):
moof + mdat
The init segment contains codec configuration (SPS/PPS for H.264/HEVC, codec-specific config for AV1, etc.) and is loaded once per variant. Media segments are loaded sequentially during playback.
Initialization segment structure
The init segment contains:
ftyp ← file type
- major brand, minor version, compatible brands
moov
├── mvhd ← movie header
├── trak ← track (one per stream)
│ ├── tkhd ← track header
│ ├── mdia ← media
│ │ ├── mdhd ← media header
│ │ ├── hdlr ← handler reference
│ │ └── minf ← media information
│ │ └── stbl ← sample table (mostly empty for fMP4)
│ └── ...
└── mvex ← movie extends — signals "this file uses fragments"
└── trex ← track extends — default sample properties
Key boxes:
- mvex (movie extends) — this is what makes the file fragmented. Without mvex, the file is treated as progressive MP4. With mvex, players know to look for moof boxes after moov.
- trex (track extends) — default sample flags, duration, and size that fragments inherit unless overridden.
- stbl (sample table) — present but mostly empty in fMP4. The actual samples are in fragments, not indexed by stbl like in progressive MP4.
The init segment is small (typically a few KB) and is loaded once per variant. Players cache it and reuse for every media segment of that variant.
Media segment structure
Each media segment contains one or more fragments:
styp ← segment type (optional but recommended)
- major brand, compatible brands (signals segment self-description)
moof ← movie fragment
├── mfhd ← movie fragment header
│ - sequence number
└── traf ← track fragment
├── tfhd ← track fragment header
│ - track ID, default sample properties
├── tfdt ← track fragment decode time (CMAF requires this)
│ - absolute decode time
├── trun ← track run
│ - sample count, data offset, sample sizes/durations
└── (encryption boxes if encrypted)
mdat ← media data
- actual encoded video/audio bytes for this fragment
Key boxes:
- styp — segment type box. Identifies this as a CMAF segment, MSF (Media Segment Format) segment, etc. Not strictly required but improves interoperability.
- mfhd — sequence number for this fragment. Useful for re-ordering and validation.
- tfhd — track fragment header. Inherits defaults from trex in the init segment.
- tfdt — track fragment decode time. CMAF requires this; it provides the absolute timing reference for the fragment.
- trun — track run. Lists per-sample data: size, duration, flags. The mapping from samples to bytes in mdat.
- mdat — the actual media bytes. Encoded video/audio data.
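As a concrete illustration of the nesting, a minimal walker can dig moof → traf → tfdt out of a segment's bytes. This is a hypothetical sketch, not a production parser; it assumes well-formed 32-bit box sizes and ignores largesize/extended-type edge cases:

```python
import struct

def read_tfdt(segment_bytes):
    """Return the tfdt baseMediaDecodeTime from a media segment's bytes,
    or None if no tfdt is found. Container boxes hold child boxes directly,
    so moof and traf are recursed into."""
    def walk(data):
        off = 0
        while off + 8 <= len(data):
            size, box_type = struct.unpack('>I4s', data[off:off + 8])
            if size < 8:
                break  # malformed or largesize/EOF case; stop the sketch here
            body = data[off + 8:off + size]
            if box_type in (b'moof', b'traf'):
                found = walk(body)  # recurse into container boxes
                if found is not None:
                    return found
            elif box_type == b'tfdt':
                # full box: 1 byte version + 3 bytes flags, then the time
                version = body[0]
                if version == 1:
                    return struct.unpack('>Q', body[4:12])[0]  # 64-bit time
                return struct.unpack('>I', body[4:8])[0]       # 32-bit time
            off += size
        return None
    return walk(segment_bytes)
```

In practice you'd read the segment from disk or an HTTP response and pass the raw bytes in; the same walk pattern generalizes to any box you want to extract.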
tfdt — absolute timing
The tfdt (track fragment decode time) box is critical for CMAF and modern fMP4. It provides the absolute decode time of the first sample in the fragment, in the timescale defined in the init segment.
For example, if timescale is 90000 (common for video at 25 fps where each frame is 3600 timescale units):
- Fragment 1: tfdt = 0 (first frame)
- Fragment 2: tfdt = 360000 (4-second segment, 4 × 90000 = 360,000 units)
- Fragment 3: tfdt = 720000 (8 seconds)
The absolute timing lets players:
- Synchronize across multiple variants (audio + video alignment).
- Resume playback after seek without ambiguity.
- Handle live edge tracking with stable time references.
For CMAF, tfdt is mandatory in every fragment. For older fMP4, it's optional but strongly recommended.
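The arithmetic above can be sketched directly. The 90 kHz timescale and 4-second segment duration are just the example values from this section, not fixed constants:

```python
# Expected tfdt for fixed-duration segments, using the example values above.
TIMESCALE = 90000        # timescale units per second (set in the init segment)
SEGMENT_SECONDS = 4      # nominal segment duration

def expected_tfdt(segment_index: int) -> int:
    """Absolute decode time of the first sample in a 0-based segment index."""
    return segment_index * SEGMENT_SECONDS * TIMESCALE

print([expected_tfdt(i) for i in range(3)])  # [0, 360000, 720000]
```

Comparing expected_tfdt against the tfdt actually present in each segment is the core of most timing-continuity checks.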
styp box
The segment type box describes the segment's purpose and compatible profiles:
styp brand=cmfc compatibility=[isom, msdh, msix]
Common brand values:
- cmfc — CMAF segment.
- msdh — MSF (Media Segment Format) for DASH.
- msix — Indexed MSF for DASH.
- dash — DASH segment.
- iso6 — generic fMP4.
The styp box is technically optional (the segment can be recognized from the moof box that follows), but having it improves interoperability with strict players and packagers.
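Reading the brands back out is mechanical, since styp is just a size/type header followed by major brand, minor version, and a list of 4-byte compatible brands. A minimal sketch (read_styp_brands is a hypothetical helper, shown against a synthetic in-memory box):

```python
import io
import struct

def read_styp_brands(f):
    """Return (major_brand, compatible_brands) from the styp box at the
    start of a media segment stream, or None if the first box isn't styp."""
    size, box_type = struct.unpack('>I4s', f.read(8))
    if box_type != b'styp':
        return None
    body = f.read(size - 8)
    major = body[:4].decode('ascii')
    # bytes 4..8 are minor_version; the rest is a list of 4-byte brands
    brands = [body[i:i + 4].decode('ascii') for i in range(8, len(body), 4)]
    return major, brands

# Example against a synthetic styp box (cmfc major, isom + msdh compatible):
styp = struct.pack('>I4s', 24, b'styp') + b'cmfc' + bytes(4) + b'isommsdh'
print(read_styp_brands(io.BytesIO(styp)))  # ('cmfc', ['isom', 'msdh'])
```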
sidx — Segment Index
The Segment Index box is optional but useful for byte-range fetching:
sidx
- reference_id (track this index covers)
- subsegment_duration
- referenced_size (byte size)
sidx provides a byte-range index of subsegments within a segment. This lets players fetch specific time ranges via HTTP byte-range requests rather than downloading whole segments.
For DASH, sidx is sometimes used (notably for on-demand profiles addressed by byte range); for CMAF, it's optional. HLS doesn't use sidx: byte-range addressing in HLS is driven by EXT-X-BYTERANGE in the playlist, not by an index inside the segment.
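Because the version-0 sidx layout is fixed, pulling out the byte-range table is straightforward. This sketch assumes you already have the box payload (the bytes after the 8-byte size/type header) and ignores the version-1 64-bit variant:

```python
import struct

def parse_sidx_v0(payload):
    """Parse a version-0 sidx payload (bytes after the size/type header).
    Returns (timescale, earliest_presentation_time, refs) where each ref
    is (referenced_size, subsegment_duration) in bytes / timescale units."""
    assert payload[0] == 0, "sketch handles version 0 only"
    # After version+flags: reference_ID, timescale, earliest_presentation_time,
    # first_offset (all u32 in v0), then reserved u16 and reference_count u16.
    (_ref_id, timescale, ept, _first_offset,
     _reserved, ref_count) = struct.unpack('>IIIIHH', payload[4:24])
    refs, off = [], 24
    for _ in range(ref_count):
        # Each reference: type bit + 31-bit size, duration, SAP info word.
        word, duration, _sap = struct.unpack('>III', payload[off:off + 12])
        refs.append((word & 0x7FFFFFFF, duration))  # mask off reference_type
        off += 12
    return timescale, ept, refs
```

With the sizes in hand, a client can turn "seconds 4–8" into an HTTP Range request by summing referenced_size values up to the wanted subsegment.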
Sample format considerations
Within mdat, samples are stored sequentially. For each sample, the moof's trun provides:
- size — bytes in the sample.
- duration — sample duration in timescale units.
- flags — sample-specific flags (keyframe, dependency).
- composition time offset — for B-frames where decode order differs from display order.
The trun flags determine which fields are present. The default-sample-flags field in tfhd allows samples to inherit defaults rather than carrying redundant per-sample flag info.
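The optional-field bits in the trun flags word (bit values per ISO/IEC 14496-12) can be decoded with a small lookup; the function name here is illustrative:

```python
# Which optional fields a trun box carries, from its 24-bit flags word.
TRUN_FLAG_FIELDS = {
    0x000001: 'data-offset',
    0x000004: 'first-sample-flags',
    0x000100: 'sample-duration',
    0x000200: 'sample-size',
    0x000400: 'sample-flags',
    0x000800: 'sample-composition-time-offset',
}

def trun_fields_present(flags):
    """List the optional trun fields signalled by the given flags word."""
    return [name for bit, name in sorted(TRUN_FLAG_FIELDS.items())
            if flags & bit]

print(trun_fields_present(0x000301))
# ['data-offset', 'sample-duration', 'sample-size']
```

A trun with flags 0x000301 is the common case for CMAF video: one data offset for the run, plus per-sample duration and size, with sample flags inherited from tfhd/trex defaults.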
For encrypted content (CENC), additional boxes appear:
- saiz — sample auxiliary information sizes.
- saio — sample auxiliary information offsets.
- senc — sample encryption (initialization vectors, subsample boundaries).
These describe per-sample encryption parameters that the decoder needs to decrypt.
Common construction mistakes
Mistakes that cause segment playback issues:
Missing tfdt — players that require absolute timing can't decode. CMAF strictness varies; some players tolerate missing tfdt, others don't.
Wrong tfdt values — if tfdt doesn't match expected media time, audio/video sync breaks. Off-by-one frame at segment boundaries causes audible/visible glitches.
Inconsistent timescale across segments — timescale is set in the init segment; all media segments must use the same timescale. Mismatches cause time math errors.
Missing styp — most players tolerate a missing styp; some strict validators and downstream tools reject segments without it.
Wrong mfhd sequence numbers — sequence numbers should be monotonically increasing. Out-of-sequence fragments confuse players.
Mismatched track IDs — trak IDs in moov must match tfhd track IDs in moof. Mismatch = fragment associated with non-existent track.
Padding in mdat — some encoders insert padding bytes for alignment. trun's data offset must point to the correct byte.
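Several of these mistakes (wrong tfdt values, inconsistent durations) show up as discontinuities across consecutive segments, which is easy to check mechanically. A sketch, assuming you've already extracted each fragment's tfdt and total duration by whatever means; the function name and tolerance are illustrative:

```python
def check_tfdt_continuity(tfdts, durations, tolerance=0):
    """tfdts[i] is fragment i's decode time and durations[i] its total
    duration, both in timescale units. Returns (index, gap) pairs where
    a fragment's tfdt deviates from the previous fragment's end time."""
    problems = []
    for i in range(1, len(tfdts)):
        expected = tfdts[i - 1] + durations[i - 1]
        gap = tfdts[i] - expected  # positive = gap, negative = overlap
        if abs(gap) > tolerance:
            problems.append((i, gap))
    return problems

# Well-aligned 4-second segments at a 90 kHz timescale produce no problems:
print(check_tfdt_continuity([0, 360000, 720000], [360000, 360000, 360000]))
# []
```

Running a check like this over every segment of a variant during pipeline validation catches boundary glitches before a player does.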
Inspecting fMP4 segments
For debugging, useful tools:
ffprobe with packet inspection:
ffprobe -v error -show_packets -of json segment.m4s
Lists each packet's timing, size, position.
MP4Box / MP4 dump utilities:
MP4Box -info segment.m4s
mp4dump segment.m4s
These show box structure and contents.
Custom Python parsing:
import struct

def parse_box(f, depth=0):
    """Print one top-level box header; return (pos, size), or None at EOF."""
    pos = f.tell()
    header = f.read(8)
    if len(header) < 8:
        return None  # end of file
    size = struct.unpack('>I', header[:4])[0]
    box_type = header[4:8].decode('ascii', errors='replace')
    if size == 1:
        # 64-bit "largesize" follows the type field
        size = struct.unpack('>Q', f.read(8))[0]
    print(f"{'  ' * depth}{box_type} @ {pos} size={size}")
    return pos, size

with open('segment.m4s', 'rb') as f:
    while True:
        box = parse_box(f)
        if box is None:
            break
        pos, size = box
        if size == 0:
            break  # size 0 means the box extends to end of file
        f.seek(pos + size)  # jump to the next top-level box
Programmatic inspection is useful for automated pipeline validation.
Encoder vs packager generation
fMP4 segments can be generated by:
The encoder directly — ffmpeg with appropriate flags produces fragmented MP4:
ffmpeg -i input.mp4 -c copy -movflags +frag_keyframe+empty_moov+default_base_moof -f mp4 output.mp4
Flags:
- frag_keyframe — fragment at every keyframe.
- empty_moov — produce an empty moov (so mvex/trex work, no mdat in moov).
- default_base_moof — use moof as the base offset for sample data (CMAF-friendly).
A separate packager — Shaka Packager, MP4Box, custom code. The encoder produces a regular MP4 or raw bitstream; the packager creates fMP4 segments.
For production pipelines, separate packagers are common because they can produce HLS + DASH manifests + CMAF segments in one pass.
fMP4 vs progressive MP4
Quick comparison:
| Dimension | Progressive MP4 | Fragmented MP4 (fMP4) |
|---|---|---|
| Structure | Single moov + single mdat | Sequential moof+mdat pairs |
| Streaming | Faststart needed | Native streaming |
| Random access | Via stbl indexing | Via fragment boundaries |
| Live encoding | Awkward | Native |
| Editing | Easy (whole file at once) | Harder (fragment boundaries) |
| Use case | File-based VOD | Streaming (HLS, DASH) |
For 2026 streaming, fMP4 (specifically CMAF profile) is the dominant pattern. Progressive MP4 is for download/embed delivery.
Operational considerations
Things that matter for fMP4 in production:
- Packager version pinning — different packager versions may produce slightly different fMP4 structure. Pin to known-good version.
- Box order strictness — some players are strict about moof/mdat ordering; others are lenient. Test with target players.
- Timescale consistency — all segments of one variant must share timescale. Manage carefully if pipeline involves multiple processing stages.
- Encryption metadata — encrypted segments add saiz/saio/senc boxes; ensure these are correctly populated.
- Segment durations — variable segment durations (e.g., GOP-aligned segments from VFR source) work but cause complexity. Prefer fixed segment durations.
- HTTP range support — fMP4 segments are typically delivered as whole files via HTTP. Byte-range fetching of subsegments is rare.
What MpegFlow does with fMP4 construction
MpegFlow's DAG runtime expresses fMP4 packaging as a discrete stage downstream of the parallel rendition encodes. The partitioner persists each rendition stage and the packaging stage to job_stages with explicit dependency tracking; cross-stage data flow wires the rendition outputs into the packager's input. Today the fMP4-emitting executor is FfmpegExecutor (FFmpeg's CMAF muxer); per-stage retry handles transient failures; sibling cancellation propagates so dependents don't run on broken renditions; rendition-level partial-success reporting surfaces granular per-stage state.
Default fMP4 behavior the FFmpeg path produces:
- Init segment per variant with proper mvex/trex setup.
- Media segments with styp, moof, mdat structure.
- tfdt populated correctly per fragment.
CENC encryption and advanced packaging features (multi-DRM signaling, certain LL-CMAF chunked variants, sophisticated multi-period structures) are part of the Phase 2D / Shaka Packager roadmap, not currently shipped runtime executors. Customers needing those features today handle them in their own packaging tooling alongside MpegFlow.
The packager tracks segment continuity (sequence numbers, timing alignment) through whatever the muxer provides; we exercise box-order, tfdt monotonicity, and size-consistency checks against pipeline output during regression validation.
The strict-broker security model handles fMP4 packaging like any pipeline payload — workers carry no ambient credentials; content access flows through short-lived presigned URLs scoped per stage; access is disposed on completion.
For customers debugging fMP4 issues in production, the standing recommendation: inspect with ffprobe + MP4Box; verify init segment structure; verify media segment box ordering; verify tfdt continuity; verify timescale consistency. fMP4 problems are mostly mechanical; once you know what to look for, they're tractable.
The general guidance: fMP4 is precise but not arcane. Understanding the box structure pays off when debugging streaming issues; the alternative (treating fMP4 as a black box) leads to "works on my player but not theirs" production incidents.