Extract audio from video with FFmpeg: AAC, MP3, WAV, transcription-ready

Pull audio out of a video file with three real recipes — stream-copy AAC for podcasts, MP3 for compatibility, mono 16kHz WAV for transcription pipelines.

By MpegFlow Engineering Team · FFmpeg recipe · 3 variants · May 9, 2026

When to use this

Extract audio when you need it independently of the video: pulling a podcast out of a video recording, feeding ML pipelines (Whisper, Deepgram, and AWS Transcribe all work on the audio track), audio-only delivery, or dataset preprocessing. Three common targets: stream-copy AAC (when the original audio is already AAC and you just need a separate file), high-quality MP3 (broader compatibility), and mono 16kHz WAV (a standard input for transcription services).

Command variants

Stream copy (preserves original codec, fastest)

ffmpeg -i input.mp4 \
  -vn -c:a copy \
  output.aac

Near-instant, because nothing is re-encoded. FFmpeg picks the output container from the file extension, so the extension should match the source codec: .aac for AAC, .mp3 for MP3, .opus for Opus.
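Picking the extension by hand is easy to get wrong, so one option is to ask ffprobe for the codec first. The following is a minimal sketch, not part of the original recipe: ext_for_codec, audio_codec, and extract_copy are hypothetical helper names, the codec table covers only a few common cases, and anything unrecognized falls back to Matroska audio (.mka), which accepts most codecs.

```shell
#!/bin/sh
# Map an FFmpeg codec name to a file extension suited to stream copy.
# The table is illustrative, not exhaustive; unknown codecs fall back to .mka.
ext_for_codec() {
  case "$1" in
    aac)  echo aac ;;
    mp3)  echo mp3 ;;
    opus) echo opus ;;
    flac) echo flac ;;
    *)    echo mka ;;
  esac
}

# Print the codec name of the first audio stream (for example "aac").
audio_codec() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Stream-copy the first audio stream of $1 into <input-without-extension>.<ext>.
# -map 0:a:0 pins the same stream that audio_codec inspected.
extract_copy() {
  codec=$(audio_codec "$1") || return 1
  [ -n "$codec" ] || { echo "no audio stream in $1" >&2; return 1; }
  out="${1%.*}.$(ext_for_codec "$codec")"
  ffmpeg -nostdin -i "$1" -map 0:a:0 -vn -c:a copy "$out"
}
```

With these functions loaded, extract_copy input.mp4 would write input.aac for an AAC source and input.mka for a codec the table does not list.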

Convert to MP3 (broad compatibility)

ffmpeg -i input.mp4 \
  -vn -c:a libmp3lame -q:a 2 \
  output.mp3

-q:a 2 selects VBR at roughly 190 kbps, near-CD quality. For a fixed bitrate, replace it with -b:a 128k.

Mono 16kHz WAV (transcription-ready)

ffmpeg -i input.mp4 \
  -vn -ar 16000 -ac 1 -c:a pcm_s16le \
  output.wav

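A safe, widely accepted input for OpenAI Whisper, AWS Transcribe, Google Speech-to-Text, and most ML transcription pipelines. Many of them also accept compressed formats, but 16 kHz mono 16-bit PCM avoids format surprises.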
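Before handing a file to a transcription service, it can be worth confirming that the output really is 16 kHz mono 16-bit PCM. A minimal sketch, assuming ffprobe is installed; probe_audio and is_transcription_ready are hypothetical helper names, and the check covers only the three properties set by the command above.

```shell
#!/bin/sh
# Probe the first audio stream of $1 and print key=value lines,
# for example "codec_name=pcm_s16le".
probe_audio() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name,sample_rate,channels \
    -of default=noprint_wrappers=1 "$1"
}

# Succeed only if the key=value lines on stdin describe
# 16 kHz mono signed 16-bit little-endian PCM.
is_transcription_ready() {
  props=$(cat)
  for want in codec_name=pcm_s16le sample_rate=16000 channels=1; do
    # -x matches whole lines, so channels=1 does not match channels=10.
    printf '%s\n' "$props" | grep -qx "$want" || return 1
  done
}

# Usage:
#   probe_audio output.wav | is_transcription_ready && echo "ready"
```

Splitting the probe from the check keeps the check testable without FFmpeg and lets the same check run on cached probe output.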

What each parameter does

-vn
No video. Tells FFmpeg to drop the video stream entirely and only process audio.
-c:a copy
Stream-copy audio without re-encoding. Fastest, preserves original quality exactly. Works only when output container supports the source codec.
-q:a 2
libmp3lame VBR quality. Range 0-9; 2 ≈ 190 kbps, 4 ≈ 165 kbps. Lower number = higher quality + larger file.
-ar 16000
Audio sample rate 16kHz. Standard for transcription; speech doesn't need higher rates and lower rates reduce file size.
-ac 1
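Mono (1 channel). FFmpeg downmixes multichannel audio to a single channel. Most transcription models work on mono; stereo input can trigger per-channel handling or errors in some systems.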
-c:a pcm_s16le
Uncompressed signed 16-bit little-endian PCM, the usual payload of a WAV file. Nothing is lossy-compressed, so the transcription model sees the audio without added codec artifacts.

What this outputs

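An audio-only file. Stream copy keeps the encoded audio packets exactly as they were in the source; only the container around them changes. MP3 conversion re-encodes, so it costs some quality, and the file size follows the bitrate you choose. WAV for transcription is uncompressed: 16 kHz mono 16-bit PCM takes 32,000 bytes per second (about 115 MB per hour), larger than typical compressed audio but the format ML tools handle most reliably.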
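The size trade-off can be estimated with plain arithmetic before running anything. A minimal sketch; pcm_bytes and cbr_bytes are hypothetical helper names, the PCM figure ignores the small WAV header, and the MP3 figure assumes constant bitrate (VBR output will vary around it).

```shell
#!/bin/sh
# Bytes of raw PCM audio: seconds * sample_rate * channels * bits / 8.
pcm_bytes() {
  echo $(( $1 * $2 * $3 * $4 / 8 ))
}

# Approximate bytes for a constant-bitrate stream: seconds * kbps * 1000 / 8.
cbr_bytes() {
  echo $(( $1 * $2 * 1000 / 8 ))
}

pcm_bytes 3600 16000 1 16   # one hour of 16 kHz mono 16-bit PCM -> 115200000
cbr_bytes 3600 128          # one hour of 128 kbps MP3           -> 57600000
```

So an hour of transcription-ready WAV is about 115 MB, roughly twice the size of the same hour as 128 kbps MP3.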

Pitfalls

Stream-copy fails when the output container doesn't support the source codec. AAC into .mp3 won't work; convert with libmp3lame instead.
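For transcription pipelines, mono is the safe default. Whisper's reference loader downmixes to mono and resamples to 16 kHz on its own, but other services may handle stereo differently, for example by transcribing each channel separately when a multichannel option is enabled. Pass -ac 1 unless you deliberately want per-channel transcripts.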
Higher sample rates than 16kHz waste storage for transcription without improving accuracy. Most speech recognition models downsample to 16kHz internally.
MP3 VBR mode (-q:a) and CBR mode (-b:a) produce different file sizes for the same perceived quality. VBR is more efficient; CBR is more predictable.
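Some video files have multiple audio tracks (commentary, language alternatives). Without an explicit -map, FFmpeg picks one audio stream automatically: the one with the most channels, or the first if there is a tie. Use -map 0:a:0, -map 0:a:1, and so on to choose a specific track.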
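For the multi-track pitfall, it helps to list the audio streams before choosing one. A minimal sketch, assuming ffprobe is available; list_audio_streams and audio_map are hypothetical helper names, and input.mkv is a placeholder file.

```shell
#!/bin/sh
# Print one CSV line per audio stream with its index, codec, channel count,
# and language tag (when the file carries one).
list_audio_streams() {
  ffprobe -v error -select_streams a \
    -show_entries stream=index,codec_name,channels:stream_tags=language \
    -of csv=p=0 "$1"
}

# Build the -map specifier for the Nth audio stream (0-based) of the first input.
audio_map() {
  case "$1" in
    ''|*[!0-9]*)
      echo "audio_map: expected a non-negative integer" >&2
      return 1 ;;
    *)
      echo "0:a:$1" ;;
  esac
}

# Usage: extract the second audio track as mono 16 kHz WAV.
#   list_audio_streams input.mkv
#   ffmpeg -i input.mkv -map "$(audio_map 1)" -vn \
#     -ar 16000 -ac 1 -c:a pcm_s16le track1.wav
```

Note that the index ffprobe prints is the stream's position in the whole file, while the number in 0:a:N counts audio streams only, so the two can differ when a video stream comes first.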

At production scale

Stream-copy extraction is I/O-bound rather than compute-bound: FFmpeg only demuxes and remuxes packets, so even at petabyte scale the compute cost is negligible. MP3 conversion and WAV resampling add modest CPU overhead. For transcription pipelines processing millions of hours, batch-extracting WAV upstream of the transcription service avoids retries caused by format errors. Many ML pipelines benefit from a dedicated extraction stage that produces a transcription-ready format once, regardless of how the audio is later consumed.
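A dedicated extraction stage can start as a small batch script. The following is a minimal sketch under stated assumptions: wav_name, extract_one, and extract_dir are hypothetical helper names, inputs are .mp4 files in a single directory, and files are processed sequentially (a real pipeline would add a job queue or parallelism).

```shell
#!/bin/sh
# Derive the output path: same base name, .wav extension, inside directory $2.
wav_name() {
  base=$(basename "$1")
  echo "$2/${base%.*}.wav"
}

# Convert one file unless a non-empty output already exists, so reruns are cheap.
# Writing to a .part file and renaming keeps interrupted runs from leaving
# truncated .wav files; -f wav is needed because .part has no known format.
extract_one() {
  out=$(wav_name "$1" "$2")
  [ -s "$out" ] && return 0
  ffmpeg -nostdin -loglevel error -y -i "$1" \
    -vn -ar 16000 -ac 1 -c:a pcm_s16le -f wav "$out.part" &&
    mv "$out.part" "$out"
}

# Extract every .mp4 in directory $1 into directory $2, logging failures
# instead of stopping the batch.
extract_dir() {
  mkdir -p "$2" || return 1
  for f in "$1"/*.mp4; do
    [ -e "$f" ] || continue
    extract_one "$f" "$2" || echo "failed: $f" >&2
  done
}

# Usage: extract_dir /data/videos /data/wav
```

Skipping existing outputs makes the stage idempotent, which matters more than raw speed when a batch is restarted partway through.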

How MpegFlow handles this

MpegFlow models audio extraction as a parallel DAG stage that can run alongside video transcoding. The same source file feeds the video pipeline and an audio-extraction pipeline simultaneously, which is more efficient than sequential extraction-then-transcoding.
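Plain FFmpeg can approximate the same idea by writing several outputs from one invocation, so the source is read once. This sketch is not MpegFlow's implementation: transcode_and_extract is a hypothetical helper name, the libx264 and AAC settings are placeholder choices that assume a build with libx264, and DRY_RUN=1 only prints the command.

```shell
#!/bin/sh
# Run (or, with DRY_RUN=1, print) one FFmpeg process that reads the input once
# and writes two outputs: a transcoded video and a transcription-ready WAV.
# Options placed before each output file apply to that output only.
transcode_and_extract() {
  in=$1; video_out=$2; audio_out=$3
  set -- ffmpeg -nostdin -i "$in" \
    -map 0:v:0 -map 0:a:0 -c:v libx264 -crf 23 -c:a aac "$video_out" \
    -map 0:a:0 -vn -ar 16000 -ac 1 -c:a pcm_s16le "$audio_out"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: transcode_and_extract input.mp4 video.mp4 audio.wav
```

Compared with running two separate commands, this avoids reading and demuxing the source twice, which matters most when the input sits on slow or remote storage.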

Topics

FFmpeg
audio
extraction
transcription
audio-operations