Extract audio from video with FFmpeg: AAC, MP3, WAV, transcription-ready
Pull audio out of a video file with three real recipes — stream-copy AAC for podcasts, MP3 for compatibility, mono 16kHz WAV for transcription pipelines.
When to use this
You extract audio when you need it independently of the video: pulling a podcast out of a video recording, feeding ML pipelines (speech-to-text tools such as Whisper, Deepgram, and AWS Transcribe only use the audio track), audio-only delivery, or dataset preprocessing. Three common targets: stream-copy AAC (when the original audio is already AAC and you just need a separate file), high-quality MP3 (broader compatibility), and mono 16kHz WAV (a widely accepted input for transcription services).
Command variants
Stream-copy (no re-encode):
ffmpeg -i input.mp4 \
  -vn -c:a copy \
  output.aac
Near-instant. The output extension should match the source codec: .aac for AAC, .mp3 for MP3, .opus for Opus.

MP3:
ffmpeg -i input.mp4 \
  -vn -c:a libmp3lame -q:a 2 \
  output.mp3
-q:a 2 is VBR at roughly 190 kbps, near-CD quality. Use -b:a 128k instead for a fixed bitrate.

Mono 16kHz WAV for transcription:
ffmpeg -i input.mp4 \
  -vn -ar 16000 -ac 1 -c:a pcm_s16le \
  output.wav
A safe input format for OpenAI Whisper, AWS Transcribe, Google Speech-to-Text, and most ML transcription pipelines. Many of these also accept compressed audio, but 16kHz mono 16-bit PCM avoids format surprises.
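The stream-copy recipe only works when the output extension fits the source codec, so it can help to look the codec up first. The following is a minimal sketch, assuming ffprobe is installed alongside ffmpeg. The function names (ext_for_codec, audio_codec, copy_audio) and the codec-to-extension table are illustrative assumptions, not part of FFmpeg.

```shell
# Map an FFmpeg audio codec name to a file extension that can hold it.
# This table is an illustrative assumption; extend it for your own sources.
ext_for_codec() {
  case "$1" in
    aac)       echo aac  ;;
    mp3)       echo mp3  ;;
    opus)      echo opus ;;
    vorbis)    echo ogg  ;;
    flac)      echo flac ;;
    pcm_s16le) echo wav  ;;
    *)         echo mka  ;;  # Matroska audio accepts most codecs
  esac
}

# Print the codec name of the first audio stream (for example "aac").
audio_codec() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Stream-copy the first audio track of $1 into "$2.<ext>".
copy_audio() {
  local input="$1" stem="$2" codec
  codec="$(audio_codec "$input")" || return 1
  [ -n "$codec" ] || return 1          # no audio stream found
  ffmpeg -i "$input" -map 0:a:0 -vn -c:a copy \
    "${stem}.$(ext_for_codec "$codec")"
}
```

Calling copy_audio input.mp4 output would write output.aac for an AAC source and output.opus for an Opus source. The explicit -map 0:a:0 keeps the copied track aligned with the one ffprobe inspected.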
What each parameter does
-vn: No video. Tells FFmpeg to drop the video stream and process only audio.
-c:a copy: Stream-copy the audio without re-encoding. Fastest option, and it preserves the original quality exactly. Works only when the output container supports the source codec.
-q:a 2: libmp3lame VBR quality. Range 0-9; 2 ≈ 190 kbps, 4 ≈ 165 kbps. Lower number = higher quality and a larger file.
-ar 16000: Audio sample rate of 16kHz. Standard for transcription; speech recognition gains little from higher rates, and lower rates reduce file size.
-ac 1: Mono (1 channel). Most transcription models work on a single channel, so downmixing up front keeps results predictable.
-c:a pcm_s16le: Uncompressed signed 16-bit little-endian PCM, the usual sample format of WAV files. It avoids lossy-compression artifacts and is accepted by virtually every transcription tool.
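To confirm that -ar, -ac, and -c:a took effect, the output can be inspected with ffprobe. This is a small sketch under the assumption that ffprobe is available; is_transcription_ready is a hypothetical helper name.

```shell
# Succeeds only if the first audio stream of "$1" is 16 kHz mono pcm_s16le.
# ffprobe prints the requested fields in its own fixed order:
# codec_name, sample_rate, channels.
is_transcription_ready() {
  local info
  info="$(ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name,sample_rate,channels \
    -of csv=p=0 "$1")" || return 1
  [ "$info" = "pcm_s16le,16000,1" ]
}
```

A pipeline could call is_transcription_ready output.wav before upload and re-extract only the files that fail the check.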
What this outputs
An audio-only file. Stream-copy produces a file containing exactly the same encoded audio as the source, with no quality loss. MP3 conversion produces a compact file at a slight quality loss from re-encoding. WAV-for-transcription produces a larger, uncompressed file in the format ML tools expect.
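The size difference between the targets can be estimated with simple arithmetic. The sketch below uses two hypothetical helpers, pcm_bytes and compressed_bytes. It ignores container overhead such as the WAV header, and treats the MP3 figure as an average because VBR output varies with content.

```shell
# Raw PCM payload in bytes:
#   seconds * sample_rate * channels * bits_per_sample / 8
pcm_bytes() {
  echo $(( $1 * $2 * $3 * $4 / 8 ))
}

# Approximate size in bytes of compressed audio at an average bitrate (kbps).
compressed_bytes() {
  echo $(( $1 * $2 * 1000 / 8 ))
}

pcm_bytes 3600 16000 1 16      # one hour, 16 kHz mono 16-bit  -> 115200000
compressed_bytes 3600 190      # one hour, MP3 at ~190 kbps    -> 85500000
```

One hour of transcription-ready WAV is therefore about 115 MB, against roughly 86 MB for a -q:a 2 MP3 of the same length.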
Pitfalls
- Stream-copy fails when the output container doesn't support the source codec. AAC into .mp3 won't work; convert with libmp3lame instead.
- For transcription pipelines, mono is the safe default. Some tools (Whisper, for example) downmix to mono internally, while some services treat each channel separately or behave unpredictably with multichannel input. Passing -ac 1 removes the ambiguity.
- Higher sample rates than 16kHz waste storage for transcription without improving accuracy. Most speech recognition models downsample to 16kHz internally.
- MP3 VBR mode (-q:a) and CBR mode (-b:a) produce different file sizes for the same perceived quality. VBR is more efficient; CBR is more predictable.
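- Some video files have multiple audio tracks (commentary, language alternatives). Without an explicit -map, FFmpeg selects a single audio stream automatically: the one with the most channels, or the first one when several tie. Use -map 0:a:0, -map 0:a:1, and so on to choose a specific track.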
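For the multi-track pitfall, one approach is to list the audio streams first and then extract a specific one by position. This is a sketch, assuming ffprobe is available; list_audio_tracks and extract_track are hypothetical helper names.

```shell
# List audio streams, one per line, as "index,codec,language".
# "index" is the global stream index; the language tag may be missing.
list_audio_tracks() {
  ffprobe -v error -select_streams a \
    -show_entries stream=index,codec_name:stream_tags=language \
    -of csv=p=0 "$1"
}

# Stream-copy the Nth audio track (0-based, counted among audio streams only).
# The Nth line printed by list_audio_tracks corresponds to -map 0:a:N.
extract_track() {
  local input="$1" n="$2" output="$3"
  ffmpeg -i "$input" -map "0:a:${n}" -vn -c:a copy "$output"
}
```

For example, extract_track movie.mkv 1 commentary.aac would copy the second audio track, provided that track is AAC.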
At production scale
Stream-copy extraction is I/O-bound rather than compute-bound, so even at petabyte scale its compute cost is negligible; MP3 conversion adds modest encoding overhead. For transcription pipelines processing millions of hours, batch-extracting WAV upstream of the transcription service avoids much of the overhead of retrying after format errors. Many ML pipelines benefit from a dedicated extraction stage that produces a transcription-ready format once, regardless of how the audio is later consumed.
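A dedicated extraction stage can start as a simple rerunnable loop. The sketch below is one possible shape, not a production design: batch_extract_wav is a hypothetical function, it only looks at .mp4 files directly inside the source directory, and it writes to a temporary name first so an interrupted run never leaves a truncated WAV under the final name.

```shell
# Convert every .mp4 directly inside $1 into a 16 kHz mono WAV inside $2.
# Existing non-empty outputs are skipped, so reruns only do the missing work.
batch_extract_wav() {
  local src_dir="$1" out_dir="$2" f out partial
  mkdir -p "$out_dir"
  for f in "$src_dir"/*.mp4; do
    [ -e "$f" ] || continue                  # glob matched nothing
    out="$out_dir/$(basename "$f" .mp4).wav"
    partial="$out_dir/$(basename "$f" .mp4).partial.wav"
    [ -s "$out" ] && continue                # already extracted
    if ffmpeg -nostdin -loglevel error -y -i "$f" \
         -vn -ar 16000 -ac 1 -c:a pcm_s16le "$partial"; then
      mv "$partial" "$out"                   # publish only complete files
    else
      echo "failed: $f" >&2
      rm -f "$partial"
    fi
  done
}
```

Running batch_extract_wav ./videos ./wav (both paths are placeholders) a second time does nothing for files that already succeeded, which makes the stage cheap to retry.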
MpegFlow models audio extraction as a parallel DAG stage that can run alongside video transcoding. The same source file feeds the video pipeline and an audio-extraction pipeline simultaneously, which is more efficient than sequential extraction-then-transcoding.
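Plain FFmpeg can approximate the same idea in a single process by writing two outputs from one read of the input. The sketch below is not a description of how MpegFlow works internally. transcode_and_extract is a hypothetical helper, and the libx264 and aac encoder choices are assumptions about what the FFmpeg build provides.

```shell
# One FFmpeg process, one read of the input, two outputs:
#   1) an H.264/AAC video rendition
#   2) a 16 kHz mono WAV for transcription
# Options placed before each output file apply to that output only.
transcode_and_extract() {
  local input="$1" video_out="$2" audio_out="$3"
  ffmpeg -i "$input" \
    -map 0:v:0 -map 0:a:0 -c:v libx264 -c:a aac "$video_out" \
    -map 0:a:0 -vn -ar 16000 -ac 1 -c:a pcm_s16le "$audio_out"
}
```

The source is demuxed and decoded once, which avoids a second pass over large inputs.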