MpegFlow Blog

WebVTT — the W3C caption format every browser speaks

Practical reference on WebVTT — file structure, styling, positioning support, integration with HLS and DASH, vs SRT and TTML, and when WebVTT is the right caption format.

By MpegFlow Engineering Team · Captions · May 8, 2026 · 9 min read · 1,716 words
In this topic
  1. What WebVTT is
  2. Cue identifiers
  3. Styling
  4. Positioning
  5. Notes and metadata
  6. WebVTT in HLS
  7. WebVTT in DASH
  8. WebVTT vs SRT
  9. WebVTT vs TTML/IMSC
  10. Operational considerations
  11. A note on accessibility regulations
  12. What MpegFlow does with WebVTT

WebVTT — Web Video Text Tracks, formally W3C WebVTT 1.0 — is the caption format every browser supports natively. It evolved from the SRT (SubRip) text-based format with significant additions for styling, positioning, and metadata. For HTML5 video, HLS streaming, DASH streaming, and most modern web-based video delivery, WebVTT is the de facto caption format. This page is the engineering reference.

#What WebVTT is

WebVTT is a text-based caption file format. The structure is line-based: a header, optional metadata, then a sequence of caption cues. Each cue specifies a timestamp range and the text to display.

A minimal WebVTT file:

WEBVTT

00:00:00.000 --> 00:00:04.000
Welcome to the engineering reference.

00:00:04.500 --> 00:00:08.000
This is a sample WebVTT file.

00:00:08.500 --> 00:00:12.000
Each cue has a timestamp range and text.

The file starts with WEBVTT (mandatory header). Each cue has:

  • Timestamp range — start and end times in HH:MM:SS.mmm format (the hours component is optional; MM:SS.mmm is also valid).
  • Text payload — what to display during the timestamp range.

Empty lines separate cues. The format is intentionally simple — easy to parse, easy to author, easy to debug.
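To make the structure concrete, here is a sketch of a cue parser in Python. It handles the simple cues shown above — it is illustrative, not a full spec-compliant WebVTT parser:

```python
import re

# One timestamp: HH:MM:SS.mmm (hours shown here for simplicity).
TS = r"(\d{2,}):(\d{2}):(\d{2})\.(\d{3})"
CUE_TIMING = re.compile(rf"{TS} --> {TS}")

def parse_cues(text: str):
    """Return a list of (start_seconds, end_seconds, payload) per cue."""
    cues = []
    for block in text.strip().split("\n\n"):          # blank lines separate cues
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            m = CUE_TIMING.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                cues.append((start, end, "\n".join(lines[i + 1:])))
                break
    return cues
```

The WEBVTT header block simply fails the timing-line match and is skipped, which is why the parser doesn't need a special case for it.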

#Cue identifiers

Cues can have optional identifiers that help with referencing or styling:

WEBVTT

cue1
00:00:00.000 --> 00:00:04.000
Welcome to the engineering reference.

cue2
00:00:04.500 --> 00:00:08.000
This is a sample WebVTT file.

Cue identifiers are useful for:

  • Programmatic reference (player APIs that target specific cues).
  • Style targeting via CSS pseudo-selectors.
  • Cross-references in subtitle authoring tools.

#Styling

WebVTT supports limited inline styling and CSS integration. Inline styling tags:

WEBVTT

00:00:00.000 --> 00:00:04.000
This is <b>bold</b> and <i>italic</i> text.

00:00:04.500 --> 00:00:08.000
This text has a <c.character>character class</c>.

Supported inline tags:

  • <b>, <i>, <u> — bold, italic, underline.
  • <c.classname>...</c> — apply CSS class.
  • <v Speaker>...</v> — voice/speaker marker.
  • <lang xx>...</lang> — language tag.
  • <00:00:02.000> — internal timestamp markers (for karaoke-style highlighting).

CSS styling targets WebVTT via the ::cue pseudo-element:

::cue {
  background-color: rgba(0, 0, 0, 0.6);
  color: white;
  font-family: Arial, sans-serif;
}

::cue(c.character) {
  color: yellow;
}

The styling capability is more limited than full HTML/CSS — WebVTT is a caption format, not a layout system. Use it for the styling captions need; don't try to do creative typography in WebVTT.
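Pipelines sometimes need the plain text back out of a styled cue (for search indexing, QC, or translation handoff). A rough sketch that strips anything angle-bracketed — good enough for the inline tags listed above, though not a real tokenizer:

```python
import re

def strip_vtt_tags(payload: str) -> str:
    """Strip WebVTT inline tags (<b>, <i>, <c.cls>, <v Name>, internal
    timestamps) to recover the plain caption text. Naive sketch: it drops
    any <...> span, so literal angle brackets in text would be lost too."""
    return re.sub(r"<[^>]*>", "", payload)
```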

#Positioning

WebVTT supports positioning via cue settings appended to the timestamp line:

WEBVTT

00:00:00.000 --> 00:00:04.000 line:0 position:50% align:center
Top-of-screen centered caption

00:00:04.500 --> 00:00:08.000 line:80% position:50% align:center
Bottom-of-screen centered caption (default)

00:00:08.500 --> 00:00:12.000 line:50% position:10% align:start
Mid-left caption

Settings:

  • line — vertical position. As an integer it counts lines (line:0 is the top line; negative values count from the bottom); as a percentage, line:100% is the bottom edge. Default depends on player.
  • position — horizontal position of the cue's anchor point.
  • align — text alignment within the cue (start, center, end).
  • size — width of the cue area.
  • vertical — vertical text orientation (for languages that read top-to-bottom).

Positioning matters for accessibility (avoiding interference with on-screen elements) and aesthetic considerations (avoiding subtitle overlap with important visual content).
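When generating WebVTT programmatically, cue settings are just space-separated key:value pairs appended to the timing line. A minimal helper (illustrative only — it does not validate setting names or values):

```python
def timing_line(start: str, end: str, **settings) -> str:
    """Format a WebVTT timing line with optional cue settings.
    Settings are emitted verbatim, e.g. line='0', position='50%'."""
    parts = [f"{start} --> {end}"]
    parts += [f"{key}:{value}" for key, value in settings.items()]
    return " ".join(parts)
```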

#Notes and metadata

WebVTT supports comments and metadata:

WEBVTT
Kind: captions
Language: en

NOTE
This is a comment, ignored by players.

00:00:00.000 --> 00:00:04.000
First caption.

Header lines after WEBVTT and before the first blank line are file metadata. NOTE blocks anywhere in the file are comments.

For accessibility-focused content, Kind: is meaningful — captions (for hearing-impaired audience), subtitles (translation), descriptions (audio description for visually impaired), chapters (chapter markers).

#WebVTT in HLS

HLS handles WebVTT subtitles via separate variant streams:

#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",DEFAULT=YES,AUTOSELECT=YES,LANGUAGE="en",URI="subs/en.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="Spanish",DEFAULT=NO,AUTOSELECT=YES,LANGUAGE="es",URI="subs/es.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2",SUBTITLES="subs"
720p.m3u8

The subtitle media playlist (subs/en.m3u8) is itself a manifest of WebVTT segment files:

#EXTM3U
#EXT-X-VERSION:5
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:6.0,
en-001.vtt
#EXTINF:6.0,
en-002.vtt
...

Each .vtt segment contains the WebVTT cues for the corresponding time range of the video. The player aligns subtitle segments with video segments based on timestamps.

To play in sync, HLS subtitle segments need timestamp adjustment — WebVTT's local 00:00:00.000 time base must be mapped onto the video's MPEG-TS media timeline. This is signaled with the X-TIMESTAMP-MAP header:

WEBVTT
X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000

00:00:01.000 --> 00:00:04.000
First caption in this segment.

The X-TIMESTAMP-MAP tells the player how to map the WebVTT timestamps to the video's MPEG-TS timestamps.
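The arithmetic is simple: MPEG-TS runs on a 90 kHz clock, so MPEGTS:900000 with LOCAL:00:00:00.000 pins local time zero to 10 seconds on the media timeline. A sketch of the mapping (ignoring 33-bit PTS wraparound, which a real player must handle):

```python
MPEG_TS_CLOCK = 90_000  # MPEG-TS PTS ticks at 90 kHz

def media_time(local_seconds: float, mpegts: int, local_base: float = 0.0) -> float:
    """Map a WebVTT cue time onto the MPEG-TS media timeline using the
    X-TIMESTAMP-MAP header values. Sketch only: no PTS wraparound handling."""
    offset = mpegts / MPEG_TS_CLOCK - local_base
    return local_seconds + offset

# X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000
# -> a cue starting at local 1.0 s sits at 11.0 s on the media timeline
```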

#WebVTT in DASH

DASH supports WebVTT subtitles via AdaptationSets with mimeType="text/vtt":

<AdaptationSet mimeType="text/vtt" lang="en" id="3">
  <Representation id="webvtt-en" bandwidth="0">
    <BaseURL>subs/en.vtt</BaseURL>
  </Representation>
</AdaptationSet>
<AdaptationSet mimeType="text/vtt" lang="es" id="4">
  <Representation id="webvtt-es" bandwidth="0">
    <BaseURL>subs/es.vtt</BaseURL>
  </Representation>
</AdaptationSet>

Single-file WebVTT delivery is common in DASH (one .vtt file per language for the whole content, rather than segmented). For very long content or live streams, segmented WebVTT is also supported.

For sidecar WebVTT outside any manifest, browsers can load it via the <track> HTML element:

<video controls>
  <source src="video.mp4" type="video/mp4">
  <track src="subs/en.vtt" srclang="en" label="English" kind="captions" default>
  <track src="subs/es.vtt" srclang="es" label="Español" kind="captions">
</video>

This is the simplest WebVTT integration — works without HLS or DASH manifests.

#WebVTT vs SRT

SRT (SubRip) is the simpler ancestor format. WebVTT extends SRT:

| Feature | SRT | WebVTT |
| --- | --- | --- |
| Header | None | WEBVTT mandatory |
| Cue identifiers | Numeric (1, 2, 3...) | Optional named identifiers |
| Styling | None | Inline tags + CSS |
| Positioning | None | Full cue settings |
| HTML5 support | Limited (player-dependent) | Native via <track> element |
| HLS/DASH support | No | Yes (HLS native, DASH via mimeType) |
| Metadata | None | Header metadata + NOTE comments |

Conversion between SRT and WebVTT is mostly mechanical — add the WEBVTT header, drop the numeric cue counters, and change the comma before milliseconds to a period in timestamps. Many tools handle the conversion automatically.
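The mechanical conversion can be sketched in a few lines of Python — add the header, drop the counters, fix the millisecond separator on timing lines. Illustrative only; real tools also handle encoding quirks and malformed input:

```python
import re

def srt_to_vtt(srt: str) -> str:
    """Mechanical SRT -> WebVTT conversion sketch."""
    out = ["WEBVTT", ""]
    for block in srt.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]  # drop the SRT numeric cue counter
        # SRT uses a comma before milliseconds; WebVTT uses a period.
        # Only touch timing lines so commas in caption text survive.
        lines = [
            re.sub(r"(\d{2}),(\d{3})", r"\1.\2", line) if "-->" in line else line
            for line in lines
        ]
        out.extend(lines + [""])
    return "\n".join(out).rstrip() + "\n"
```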

For modern web delivery, WebVTT is the right format. SRT remains common in offline workflows (subtitling tools, archive formats), but streaming and browser-native delivery use WebVTT.

#WebVTT vs TTML/IMSC

TTML (Timed Text Markup Language) and its IMSC profile are XML-based caption formats with richer styling and positioning capabilities. The comparison:

| Dimension | WebVTT | TTML/IMSC |
| --- | --- | --- |
| Format | Text-based | XML |
| Browser support | Native | Via JavaScript polyfill |
| Styling capability | Modest | Rich (full CSS-like styling) |
| Positioning capability | Modest | Precise pixel-level positioning |
| File size | Smaller | Larger |
| HLS/DASH support | Both | Both (TTML/IMSC widely used in DASH, less in HLS) |
| Use cases | Web streaming, mass-market | Premium broadcast, accessibility-critical |

For browser-native streaming, WebVTT is the simpler answer. For broadcast-grade caption delivery (DVB, HbbTV, premium streaming), TTML/IMSC offers more capability. Many premium streaming services ship both — WebVTT for browser/HLS reach, IMSC for broadcast and premium TV apps.

#Operational considerations

Things that matter for production WebVTT:

  • Timestamp accuracy — captions must align with the video's actual timing. Off-by-a-second captions are a visible bug.
  • Encoding (UTF-8) — WebVTT is UTF-8. Non-Latin scripts and special characters require correct encoding throughout the pipeline.
  • Line breaks and length — long lines wrap awkwardly on small screens. Most subtitle conventions cap at ~32-42 characters per line, 1-2 lines per cue.
  • Reading speed — viewers need time to read. Common style guides cap caption pacing at roughly 15-20 characters per second; respect that when authoring.
  • Player rendering differences — WebVTT styling renders consistently across major browsers but subtle differences exist. Test on Safari, Chrome, Firefox.
  • Live caption insertion — for live workflows, WebVTT segments are produced in real-time. Timing accuracy matters more than polish.
  • Forced narrative subtitles — content with foreign-language dialogue often needs forced narrative subtitles (always on for non-original-language portions). Mark these via metadata or separate WebVTT files.
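Several of these checks are easy to automate in QC. A sketch of a cue linter enforcing the line-length and reading-speed conventions above — the exact limits are style-guide choices, not spec requirements:

```python
def lint_cue(start: float, end: float, payload: str,
             max_chars_per_line: int = 42, max_cps: float = 20.0):
    """Flag cues that break common subtitle conventions.
    Limits default to 42 chars/line and 20 chars/second, both adjustable."""
    problems = []
    lines = payload.splitlines()
    if len(lines) > 2:
        problems.append("more than 2 lines")
    for line in lines:
        if len(line) > max_chars_per_line:
            problems.append(f"line too long ({len(line)} chars)")
    duration = end - start
    cps = len(payload.replace("\n", "")) / duration if duration > 0 else float("inf")
    if cps > max_cps:
        problems.append(f"reading speed too high ({cps:.1f} cps)")
    return problems
```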

#A note on accessibility regulations

Captions aren't just a quality-of-life feature for many distributions — they're a legal requirement. The major frameworks:

  • ADA (Americans with Disabilities Act) — covers accessibility for video content distributed by entities subject to ADA. Requires captions for video with audible speech.
  • CVAA (21st Century Communications and Video Accessibility Act) — US federal law requiring captioning for video programming distributed via internet that previously aired on US TV.
  • EAA (European Accessibility Act) — EU directive requiring caption availability for many digital services starting 2025.
  • AODA (Accessibility for Ontarians with Disabilities Act) and similar Canadian provincial regulations.
  • DDA (Disability Discrimination Act) in Australia; in the UK the Equality Act 2010 superseded the earlier DDA.

Compliance varies by content category and distribution channel. WebVTT is the practical caption format for compliance because it's broadly supported and easy to author/produce. Failure to ship captions where legally required can result in lawsuits and regulatory penalties; build captions into pipeline defaults rather than as opt-in features.

#What MpegFlow does with WebVTT

MpegFlow's DAG runtime handles WebVTT as a first-class caption format expressed through discrete stages. The partitioner places caption-producing work on appropriate executors — CcextractorExecutor for CEA-608/708 source extraction, FfmpegExecutor for caption-format conversion — and each stage is persisted to job_stages with explicit dependency tracking and per-stage retry. The executor proto field tells the ExecutorRegistry which binary to dispatch.

For pipelines ingesting captions from other formats (CEA-608/708 from broadcast, TTML from premium production), cross-stage data flow wires the conversion stage's WebVTT output into the downstream HLS packaging stage; sibling cancellation propagates if conversion fatally fails so dependent encodes don't waste compute.

For live workflows, WebVTT segments are produced alongside video segments through the same DAG runtime; rendition-level partial-success reporting means caption-segment failures don't necessarily fail the whole job — the customer sees granular per-stage state via the job_stages projection.

X-TIMESTAMP-MAP control is not currently a customer-facing knob. The pipeline emits whatever the underlying caption tooling produces from the source PTS. Operators with strict X-TIMESTAMP-MAP requirements for specific player targets handle the post-processing in their own tooling today; native pipeline-level control over X-TIMESTAMP-MAP synthesis is on the backlog.

The strict-broker security model treats WebVTT as workflow content — workers carry no ambient credentials; content access is via short-lived presigned URLs scoped per stage; access is disposed on completion. WebVTT files don't typically need encryption, but encryption is supported for use cases that require it.

For customers building their first multi-language caption workflow, the conversation focuses on translation/localization (typically out-of-pipeline, customer-managed), file format selection (WebVTT for web reach, IMSC for premium where supported), and integration with player UI (default language selection, language switching). The pipeline side is solved; the editorial and localization integration is where customer-specific work happens.

Tags
  • webvtt
  • captions
  • subtitles
  • w3c
  • hls
  • dash
See also

Related topics and reading

  • HLS X-TIMESTAMP-MAP — WebVTT subtitle timing alignment for HLS
  • TTML and IMSC — XML-based timed text for premium video and broadcast
  • Burn-in vs soft subtitles — when to render captions into video vs deliver as separate tracks
Building on this?

Join the MpegFlow beta.

We're shipping the encoder MVP this quarter. If you're wrangling captions in production, the beta is built for you — no card, no console waiting.

© 2026 MpegFlow, Inc.