WebVTT — Web Video Text Tracks, specified by the W3C — is the caption format every major browser supports natively. It evolved from the text-based SRT (SubRip) format, adding styling, positioning, and metadata. For HTML5 video, HLS streaming, DASH streaming, and most modern web-based video delivery, WebVTT is the de facto caption format. This page is the engineering reference.
What WebVTT is
WebVTT is a text-based caption file format. The structure is line-based: a header, optional metadata, then a sequence of caption cues. Each cue specifies a timestamp range and the text to display.
A minimal WebVTT file:
WEBVTT
00:00:00.000 --> 00:00:04.000
Welcome to the engineering reference.
00:00:04.500 --> 00:00:08.000
This is a sample WebVTT file.
00:00:08.500 --> 00:00:12.000
Each cue has a timestamp range and text.
The file starts with the mandatory WEBVTT header. Each cue has:
- Timestamp range — start and end times in HH:MM:SS.mmm format.
- Text payload — what to display during the timestamp range.
Empty lines separate cues. The format is intentionally simple — easy to parse, easy to author, easy to debug.
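That simplicity shows in how little code a basic parser takes. A minimal sketch in Python (the `parse_vtt` function and `Cue` type are illustrative, not a standard API; it assumes full HH:MM:SS.mmm timestamps and ignores cue settings and inline tags):

```python
import re
from dataclasses import dataclass

@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str

TS = re.compile(r"(\d{2,}):(\d{2}):(\d{2})\.(\d{3})")

def ts_to_seconds(ts: str) -> float:
    h, m, s, ms = TS.match(ts).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_vtt(content: str) -> list[Cue]:
    # Blank lines separate blocks; the first block is the header.
    blocks = content.strip().split("\n\n")
    if not blocks or not blocks[0].startswith("WEBVTT"):
        raise ValueError("missing WEBVTT header")
    cues = []
    for block in blocks[1:]:
        lines = block.strip().splitlines()
        if lines[0].startswith("NOTE"):
            continue  # comment block
        if "-->" not in lines[0]:
            lines = lines[1:]  # optional cue identifier line
        start, _, end = lines[0].split(None, 3)[:3]
        cues.append(Cue(ts_to_seconds(start), ts_to_seconds(end),
                        "\n".join(lines[1:])))
    return cues
```

A production parser also has to handle the optional-hours timestamp form (MM:SS.mmm), cue settings, and inline markup, but the block structure above is the whole skeleton of the format.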
Cue identifiers
Cues can have optional identifiers that help with referencing or styling:
WEBVTT
cue1
00:00:00.000 --> 00:00:04.000
Welcome to the engineering reference.
cue2
00:00:04.500 --> 00:00:08.000
This is a sample WebVTT file.
Cue identifiers are useful for:
- Programmatic reference (player APIs that target specific cues).
- Style targeting via CSS pseudo-selectors.
- Cross-references in subtitle authoring tools.
Styling
WebVTT supports limited inline styling and CSS integration. Inline styling tags:
WEBVTT
00:00:00.000 --> 00:00:04.000
This is <b>bold</b> and <i>italic</i> text.
00:00:04.500 --> 00:00:08.000
This text has a <c.character>character class</c>.
Supported inline tags:
- <b>, <i>, <u> — bold, italic, underline.
- <c.classname>...</c> — apply a CSS class.
- <v Speaker>...</v> — voice/speaker marker.
- <lang xx>...</lang> — language tag.
- <00:00:02.000> — internal timestamp markers (for karaoke-style highlighting).
CSS styling targets WebVTT via the ::cue pseudo-element:
::cue {
background-color: rgba(0, 0, 0, 0.6);
color: white;
font-family: Arial, sans-serif;
}
::cue(c.character) {
color: yellow;
}
The styling capability is more limited than full HTML/CSS — WebVTT is a caption format, not a layout system. Use it for the styling captions need; don't try to do creative typography in WebVTT.
Positioning
WebVTT supports positioning via cue settings appended to the timestamp line:
WEBVTT
00:00:00.000 --> 00:00:04.000 line:0 position:50% align:center
Top-of-screen centered caption
00:00:04.500 --> 00:00:08.000 line:80% position:50% align:center
Bottom-of-screen centered caption (default)
00:00:08.500 --> 00:00:12.000 line:50% position:10% align:start
Mid-left caption
Settings:
- line — vertical position. line:0 is the top, line:100% is the bottom. Default depends on the player.
- position — horizontal position of the cue's anchor point.
- align — text alignment within the cue (start, center, end).
- size — width of the cue area.
- vertical — vertical text orientation (for languages written top-to-bottom).
Positioning matters for accessibility (avoiding interference with on-screen elements) and aesthetic considerations (avoiding subtitle overlap with important visual content).
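When generating cues programmatically, the settings are just space-separated name:value pairs appended to the timing line. A small illustrative helper (not a standard API):

```python
def cue_header(start: str, end: str, **settings: str) -> str:
    """Build a WebVTT cue timing line with optional cue settings.

    Keyword names map directly to WebVTT setting names, e.g.
    line="0", position="50%", align="center".
    """
    parts = [f"{start} --> {end}"]
    parts += [f"{name}:{value}" for name, value in settings.items()]
    return " ".join(parts)

# cue_header("00:00:00.000", "00:00:04.000",
#            line="0", position="50%", align="center")
```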
Notes and metadata
WebVTT supports comments and metadata:
WEBVTT
Kind: captions
Language: en
NOTE
This is a comment, ignored by players.
00:00:00.000 --> 00:00:04.000
First caption.
Header lines after WEBVTT and before the first blank line are file metadata. NOTE blocks anywhere in the file are comments.
For accessibility-focused content, Kind: is meaningful — captions (for hearing-impaired audience), subtitles (translation), descriptions (audio description for visually impaired), chapters (chapter markers).
WebVTT in HLS
HLS handles WebVTT subtitles via separate variant streams:
#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",DEFAULT=YES,AUTOSELECT=YES,LANGUAGE="en",URI="subs/en.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="Spanish",DEFAULT=NO,AUTOSELECT=YES,LANGUAGE="es",URI="subs/es.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2",SUBTITLES="subs"
720p.m3u8
The subtitle media playlist (subs/en.m3u8) is itself a manifest of WebVTT segment files:
#EXTM3U
#EXT-X-VERSION:5
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:6.0,
en-001.vtt
#EXTINF:6.0,
en-002.vtt
...
Each .vtt segment contains the WebVTT cues for the corresponding time range of the video. The player aligns subtitle segments with video segments based on timestamps.
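Segmenting a caption track for HLS means grouping cues into fixed-duration windows; a cue that spans a window boundary is typically repeated in every window it overlaps. A sketch of that grouping, under those assumptions (function name and cue tuples are illustrative):

```python
def segment_cues(cues: list[tuple[float, float, str]],
                 target_duration: float = 6.0) -> list[list[tuple[float, float, str]]]:
    """Group (start, end, text) cues into fixed-duration HLS-style
    segment windows. A boundary-crossing cue appears in every window
    it overlaps."""
    if not cues:
        return []
    total = max(end for _, end, _ in cues)
    segments = []
    t = 0.0
    while t < total:
        window = [c for c in cues
                  if c[0] < t + target_duration and c[1] > t]
        segments.append(window)
        t += target_duration
    return segments
```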
In HLS, subtitle segments need timestamp synchronization — WebVTT's local 00:00:00.000 base must be mapped onto the video's media timeline. This is signaled with an X-TIMESTAMP-MAP header:
WEBVTT
X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000
00:00:01.000 --> 00:00:04.000
First caption in this segment.
The X-TIMESTAMP-MAP tells the player how to map the WebVTT timestamps to the video's MPEG-TS timestamps.
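The arithmetic is simple: MPEG-TS timestamps tick at 90 kHz, so the header above anchors LOCAL 00:00:00.000 to 900000 / 90000 = 10 seconds of media time. A sketch of the mapping (illustrative helper; note that real players also handle the 33-bit PTS wraparound, omitted here):

```python
MPEGTS_HZ = 90_000  # MPEG-TS clock rate

def local_to_media(local_seconds: float, mpegts: int,
                   local_base: float = 0.0) -> float:
    """Map a WebVTT cue time to the media timeline using the
    X-TIMESTAMP-MAP anchor (MPEGTS in 90 kHz ticks, LOCAL in seconds)."""
    offset = mpegts / MPEGTS_HZ - local_base
    return local_seconds + offset

# For X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000,
# a cue at 00:00:01.000 lands at 11.0 s on the media timeline.
```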
WebVTT in DASH
DASH supports WebVTT subtitles via AdaptationSets with mimeType="text/vtt":
<AdaptationSet mimeType="text/vtt" lang="en" id="3">
<Representation id="webvtt-en" bandwidth="0">
<BaseURL>subs/en.vtt</BaseURL>
</Representation>
</AdaptationSet>
<AdaptationSet mimeType="text/vtt" lang="es" id="4">
<Representation id="webvtt-es" bandwidth="0">
<BaseURL>subs/es.vtt</BaseURL>
</Representation>
</AdaptationSet>
Single-file WebVTT delivery is common in DASH (one .vtt file per language for the whole content, rather than segmented). For very long content or live streams, segmented WebVTT is also supported.
For sidecar WebVTT outside any manifest, browsers can load it via the <track> HTML element:
<video controls>
<source src="video.mp4" type="video/mp4">
<track src="subs/en.vtt" srclang="en" label="English" kind="captions" default>
<track src="subs/es.vtt" srclang="es" label="Español" kind="captions">
</video>
This is the simplest WebVTT integration — works without HLS or DASH manifests.
WebVTT vs SRT
SRT (SubRip) is the simpler ancestor format. WebVTT extends SRT:
| Feature | SRT | WebVTT |
|---|---|---|
| Header | None | WEBVTT mandatory |
| Cue identifiers | Numeric (1, 2, 3...) | Optional named identifiers |
| Styling | None | Inline tags + CSS |
| Positioning | None | Full cue settings |
| HTML5 support | Limited (player-dependent) | Native via <track> element |
| HLS/DASH support | No | Yes (HLS native, DASH via mimeType) |
| Metadata | None | Header metadata + NOTE comments |
Conversion between SRT and WebVTT is straightforward and mostly mechanical — add the WEBVTT header and change the comma to a period in millisecond timestamps. Many tools handle the conversion automatically.
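The mechanical part of the conversion fits in a few lines. A sketch (illustrative; real converters also handle BOMs, encodings, and styling tags):

```python
import re

def srt_to_vtt(srt: str) -> str:
    """Convert SRT text to WebVTT: add the WEBVTT header, switch the
    millisecond separator from comma to period, and drop the numeric
    sequence counters (WebVTT cue identifiers are optional)."""
    out = ["WEBVTT", ""]
    for block in srt.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]  # drop the SRT sequence number
        if lines:
            lines[0] = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})",
                              r"\1.\2", lines[0])
            out.extend(lines)
            out.append("")
    return "\n".join(out).rstrip() + "\n"
```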
For modern web delivery, WebVTT is the right format. SRT remains common in offline workflows (subtitling tools, archive formats), but streaming and browser-native delivery use WebVTT.
WebVTT vs TTML/IMSC
TTML (Timed Text Markup Language) and its IMSC profile are XML-based caption formats with richer styling and positioning capabilities. The comparison:
| Dimension | WebVTT | TTML/IMSC |
|---|---|---|
| Format | Text-based | XML |
| Browser support | Native | Via JavaScript polyfill |
| Styling capability | Modest | Rich (full CSS-like styling) |
| Positioning capability | Modest | Precise pixel-level positioning |
| File size | Smaller | Larger |
| HLS/DASH support | Both | Both (TTML/IMSC widely in DASH, less in HLS) |
| Use cases | Web streaming, mass-market | Premium broadcast, accessibility-critical |
For browser-native streaming, WebVTT is the simpler answer. For broadcast-grade caption delivery (DVB, HbbTV, premium streaming), TTML/IMSC offers more capability. Many premium streaming services ship both — WebVTT for browser/HLS reach, IMSC for broadcast and premium TV apps.
Operational considerations
Things that matter for production WebVTT:
- Timestamp accuracy — captions must align with the video's actual timing. Off-by-a-second captions are a visible bug.
- Encoding (UTF-8) — WebVTT is UTF-8. Non-Latin scripts and special characters require correct encoding throughout the pipeline.
- Line breaks and length — long lines wrap awkwardly on small screens. Most subtitle conventions cap at ~32-42 characters per line, 1-2 lines per cue.
- Reading speed — viewers need time to read. A common pacing guideline is around 17 characters per second; respect it when authoring.
- Player rendering differences — WebVTT styling renders mostly consistently across major browsers, but subtle differences exist. Test on Safari, Chrome, and Firefox.
- Live caption insertion — for live workflows, WebVTT segments are produced in real-time. Timing accuracy matters more than polish.
- Forced narrative subtitles — content with foreign-language dialogue often needs forced narrative subtitles (always on for non-original-language portions). Mark these via metadata or separate WebVTT files.
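The reading-speed and line-length checks above are easy to automate as a QC pass. A sketch (function names and the 17 cps default are illustrative; tune per style guide):

```python
def min_cue_duration(text: str, cps: float = 17.0) -> float:
    """Minimum display time (seconds) for a cue at a given reading
    speed in characters per second. Line breaks don't count toward
    reading load here (a simplification)."""
    visible = len(text.replace("\n", ""))
    return visible / cps

def flag_fast_cues(cues: list[tuple[float, float, str]],
                   cps: float = 17.0) -> list[tuple[float, float, str]]:
    """Return (start, end, text) cues displayed for less time than
    the reading-speed minimum."""
    return [c for c in cues
            if (c[1] - c[0]) < min_cue_duration(c[2], cps)]
```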
A note on accessibility regulations
Captions aren't just a quality-of-life feature — for many distribution channels they're a legal requirement. The major frameworks:
- ADA (Americans with Disabilities Act) — covers accessibility for video content distributed by entities subject to ADA. Requires captions for video with audible speech.
- CVAA (21st Century Communications and Video Accessibility Act) — US federal law requiring captioning for video programming distributed via internet that previously aired on US TV.
- EAA (European Accessibility Act) — EU directive requiring caption availability for many digital services starting 2025.
- AODA (Accessibility for Ontarians with Disabilities Act) and similar Canadian provincial regulations.
- DDA (Disability Discrimination Act) in Australia; in the UK, the Equality Act 2010 (successor to the UK DDA).
Compliance varies by content category and distribution channel. WebVTT is the practical caption format for compliance because it's broadly supported and easy to author/produce. Failure to ship captions where legally required can result in lawsuits and regulatory penalties; build captions into pipeline defaults rather than as opt-in features.
What MpegFlow does with WebVTT
MpegFlow's DAG runtime handles WebVTT as a first-class caption format expressed through discrete stages. The partitioner places caption-producing work on appropriate executors — CcextractorExecutor for CEA-608/708 source extraction, FfmpegExecutor for caption-format conversion — and each stage is persisted to job_stages with explicit dependency tracking and per-stage retry. The executor proto field tells the ExecutorRegistry which binary to dispatch.
For pipelines ingesting captions from other formats (CEA-608/708 from broadcast, TTML from premium production), cross-stage data flow wires the conversion stage's WebVTT output into the downstream HLS packaging stage; sibling cancellation propagates if conversion fatally fails so dependent encodes don't waste compute.
For live workflows, WebVTT segments are produced alongside video segments through the same DAG runtime; rendition-level partial-success reporting means caption-segment failures don't necessarily fail the whole job — the customer sees granular per-stage state via the job_stages projection.
X-TIMESTAMP-MAP control is not currently a customer-facing knob. The pipeline emits whatever the underlying caption tooling produces from the source PTS. Operators with strict X-TIMESTAMP-MAP requirements for specific player targets handle the post-processing in their own tooling today; native pipeline-level control over X-TIMESTAMP-MAP synthesis is on the backlog.
The strict-broker security model treats WebVTT as workflow content — workers carry no ambient credentials; content access is via short-lived presigned URLs scoped per stage; access is disposed on completion. WebVTT files don't typically need encryption, but encryption is supported for use cases that require it.
For customers building their first multi-language caption workflow, the conversation focuses on translation/localization (typically out-of-pipeline, customer-managed), file format selection (WebVTT for web reach, IMSC for premium where supported), and integration with player UI (default language selection, language switching). The pipeline side is solved; the editorial and localization integration is where customer-specific work happens.