2026-05-11|20 min read|[FFmpeg, Video, Audio, Multimedia, C, Architecture]

FFmpeg, Inside Out

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

1. Overview

FFmpeg is a comprehensive open-source multimedia framework that can decode, encode, transcode, mux, demux, stream, filter, and play almost any media format in use today. It underpins an enormous ecosystem of applications and services, from desktop players like VLC to browsers, streaming platforms (YouTube, Netflix), video editors, chat apps, and automated media pipelines.

Goal of this Guide: Understanding FFmpeg at depth requires unpacking both its conceptual model (formats, codecs, streams, timestamps, filters) and its software architecture (libraries, data structures, pipelines). This guide maps out the framework from a high-level view down to microscopic details, providing a conceptual map similar to "The Illustrated Transformer".

2. What FFmpeg Actually Is

FFmpeg is not just a single command-line binary. It is a suite of libraries and tools designed for multimedia processing. The core is written in C and highly optimized.

The Core Libraries

Library        Primary Responsibility
libavutil      Common utilities: data structures, math helpers, pixel formats, color space utilities, logging, memory management.
libavcodec     Codecs: encode/decode audio, video, subtitles; bitstream parsing; hardware-accelerated codecs.
libavformat    Containers and I/O: demuxers, muxers, streaming protocols (file, HTTP, RTMP, HLS, etc.).
libavfilter    Filter framework: directed graphs of audio/video filters operating on decoded raw frames.
libavdevice    Capture and playback device abstraction (webcams, microphones, screens).
libswscale     Video scaling and pixel format conversion (e.g., YUV to RGB).
libswresample  Audio resampling, sample format, and channel layout conversion.
The Modular Pipeline Architecture

Input File → Demux (libavformat) → Decode (libavcodec) → Filter (libavfilter) → Encode (libavcodec) → Mux (libavformat) → Output

On top of these libraries sit the familiar command-line tools: ffmpeg (the main processing engine), ffprobe (metadata and stream inspector), and ffplay (SDL-based player).

Microscopic Details: How libavformat "Probes" a File (Layer 3 Depth)

When you pass a file to FFmpeg, how does it know it's an MP4 or MKV without relying purely on file extensions? It uses a mechanism called Probing.

Internally, `libavformat` calls av_probe_input_format3(). This function iterates through every registered demuxer. Each demuxer implements a read_probe function. FFmpeg reads the first few kilobytes of the file (up to probesize) and passes this buffer to every demuxer.

Each demuxer looks for "magic numbers" (e.g., ftyp atoms for MP4) and returns a score from 0 to 100 (AVPROBE_SCORE_MAX). The demuxer with the highest score wins.
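The scoring mechanism can be sketched as a toy read_probe implementation (an illustrative model that checks the MP4 ftyp magic; this is not FFmpeg's actual mov demuxer code):

```c
#include <stdint.h>
#include <string.h>

#define AVPROBE_SCORE_MAX 100  /* same ceiling FFmpeg uses */

/* Toy MP4 probe: an MP4 file begins with a box whose 32-bit size occupies
 * bytes 0-3 and whose type tag "ftyp" occupies bytes 4-7. */
static int toy_mp4_read_probe(const uint8_t *buf, int buf_size)
{
    if (buf_size < 8)
        return 0;                      /* not enough data to decide */
    if (memcmp(buf + 4, "ftyp", 4) == 0)
        return AVPROBE_SCORE_MAX;      /* unambiguous magic: full score */
    return 0;                          /* definitely not ours */
}
```

Every registered demuxer runs a function shaped like this over the same probe buffer, and the highest score wins.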

3. Multimedia Foundations: Formats, Streams, and Codecs

3.1 Containers vs. Codecs

A recurring source of confusion is the difference between containers and codecs.

  • Container (Format): A file structure (like a ZIP file) that bundles multiple streams (video, audio, subtitles), defines how they are interleaved, and stores metadata/timing. Handled by libavformat. Examples: MP4, MKV, AVI, WebM, TS.
  • Codec: An algorithm for compressing/decompressing a single stream of media. Handled by libavcodec. Examples: H.264, VP9, AAC, Opus.

An .mp4 file might contain an H.264 video stream and an AAC audio stream. If you move those same streams into an .mkv container unchanged, that is a "stream copy": no quality loss, and it is nearly instant because nothing is re-encoded.

3.2 Elementary Streams

Inside a container, media is separated into elementary streams:

  • Video Streams: Contain properties like width, height, pixel format (e.g., YUV420p), framerate.
  • Audio Streams: Contain properties like sample rate (e.g., 48kHz), channels (stereo, 5.1), sample format.
  • Subtitle Streams: SRT, ASS, bitmap subtitles.

3.3 Codecs: Intra, Inter, and GOPs

Video compression relies heavily on temporal compression—finding what stays the same between frames. This creates a Group of Pictures (GOP) structure:

Visualizing a GOP (Group of Pictures):

I (Keyframe) → B → B → P (Predicted)

I-frame (Intra): Self-contained, like a JPEG. Crucial for seeking.
P-frame (Predicted): Stores only changes from previous frames.
B-frame (Bi-directional): Looks forward AND backward. Highest compression, but forces frames out of chronological order in the bitstream!

4. FFmpeg’s Core Libraries and Architecture

Applications (including the ffmpeg CLI) build pipelines by combining libraries. Here are the core C data structures that flow through these pipelines:

Key Data Structures Lifecycle

AVFormatContext & AVStream

Represents the container (e.g., MP4) and its internal streams (audio track, video track). Holds I/O context and format-specific metadata.

AVPacket

A piece of compressed data (e.g., one encoded H.264 video chunk). It contains timestamps (PTS/DTS) and a stream index.

AVFrame

A decoded, raw audio or video frame. Contains actual pixel data (YUV, RGB) or raw audio samples, plus metadata like dimensions and format.

The Developer API Flow: You read AVPackets from an AVFormatContext via av_read_frame(), send them to a decoder via avcodec_send_packet(), and pull out raw AVFrames using avcodec_receive_frame().

Microscopic Details: Reference Counting & AVBufferRef (Layer 4 Depth)

Both AVPacket and AVFrame are large data structures. Copying a 4K raw video frame (which can be over 24MB of data) from the decoder to a filter would destroy performance.

Instead, FFmpeg uses Reference Counting via the AVBufferRef struct. The actual pixel data sits in a single block of heap memory. Multiple AVFrame instances can point to this exact same memory block. The AVBufferRef keeps an atomic counter of how many frames point to it.

When you call av_frame_free(), it decrements the counter. The heavy payload memory is only freed when the counter hits zero. This allows "zero-copy" passing of frames between decoders, filters, and encoders.
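The mechanism can be modeled in a few lines of plain C (a toy sketch mirroring the idea, not FFmpeg's actual libavutil/buffer.h implementation):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model of AVBufferRef: one heavy heap payload, many cheap references. */
typedef struct Buffer {
    uint8_t   *data;      /* the heavy payload (e.g. megabytes of pixels) */
    size_t     size;
    atomic_int refcount;  /* how many references point at this payload */
} Buffer;

typedef struct BufferRef {
    Buffer *buf;
} BufferRef;

static BufferRef *buffer_create(size_t size)
{
    Buffer *b = malloc(sizeof(*b));
    b->data = malloc(size);
    b->size = size;
    atomic_init(&b->refcount, 1);
    BufferRef *ref = malloc(sizeof(*ref));
    ref->buf = b;
    return ref;
}

/* "Copying" a frame just adds a reference: zero bytes of pixel data move. */
static BufferRef *buffer_ref(BufferRef *src)
{
    atomic_fetch_add(&src->buf->refcount, 1);
    BufferRef *ref = malloc(sizeof(*ref));
    ref->buf = src->buf;
    return ref;
}

/* Dropping a reference frees the payload only when the count hits zero. */
static void buffer_unref(BufferRef **pref)
{
    BufferRef *ref = *pref;
    if (!ref)
        return;
    if (atomic_fetch_sub(&ref->buf->refcount, 1) == 1) {
        free(ref->buf->data);
        free(ref->buf);
    }
    free(ref);
    *pref = NULL;
}
```

A decoder, a filter, and an encoder can each hold their own BufferRef to the same pixels; whichever unrefs last pays the cost of the free.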

5. The `ffmpeg` CLI Mental Model

The ffmpeg binary is a pipeline builder. It parses your command line, constructs an internal graph of the libraries we just discussed, and runs until EOF.

5.1 The Syntax Structure

ffmpeg [global_options] [input_options] -i input_file [output_options] output_file

A massive point of confusion is order of operations: options apply to the file that comes after them. Options placed before -i apply to that input; options placed after the inputs apply to the next output file.

5.2 The `-map` Option

By default, FFmpeg tries to be smart and picks the "best" video and audio stream. But if you have multiple inputs or tracks, you must use -map to route streams manually:

# Take video from input 0, and all audio from input 1
ffmpeg -i video.mp4 -i external_audio.m4a -map 0:v:0 -map 1:a -c:v copy -c:a aac output.mp4

5.3 Copy vs. Re-encode

Stream Copy (-c copy)

Directly remuxes the encoded AVPackets from input to output container. Extremely fast. Lossless. No filters or scaling allowed.

ffmpeg -i in.mkv -c copy out.mp4

Re-encode (-c:v libx264)

Decodes to AVFrame, applies filters, and re-encodes back to new AVPackets. Slower, but required for changing resolution or codec.

ffmpeg -i in.mkv -c:v libx264 out.mp4

6. Filtergraphs and libavfilter

The libavfilter library allows you to apply transformations (scaling, cropping, color correction, watermarks) to uncompressed frames.

6.1 Simple vs. Complex Filtergraphs

  • Simple (-vf or -af): One input, one output. It forms a linear chain.
    ffmpeg -i in.mp4 -vf "scale=1920:1080,hue=s=0" out.mp4
  • Complex (-filter_complex): Multiple inputs, multiple outputs, branching, and merging.
    ffmpeg -i bg.mp4 -i logo.png -filter_complex "[0:v][1:v]overlay=main_w-overlay_w-10:10[out]" -map "[out]" out.mp4
Graph Routing Architecture:

[0:v] → scale → [v0]
[v0] + [1:v] → overlay → [out]

The link labels (like [v0] or [out]) act like pipes connecting the nodes in the graph.

Microscopic Details: AVFilterLink Format Negotiation (Layer 3 Depth)

When you connect two filters, like decode → scale, they communicate through an AVFilterLink. Before processing starts, `libavfilter` performs a "configuration phase".

The output pad of the decoder negotiates with the input pad of the scale filter over the AVFilterLink. It asks: "I output YUV420p, can you accept that?". If the next filter only accepts RGB24, the filtergraph automatically inserts a hidden format conversion filter (using libswscale) right in the middle of the link.
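The negotiation step boils down to intersecting two capability lists (a toy sketch; the real logic lives in libavfilter's format-query machinery):

```c
#include <stddef.h>

/* Toy pixel-format negotiation. Each pad advertises the formats it supports;
 * the link picks the first common one. (Illustrative model only.) */
enum PixFmt { PIX_YUV420P, PIX_RGB24, PIX_NONE = -1 };

static enum PixFmt negotiate(const enum PixFmt *out_pad, size_t n_out,
                             const enum PixFmt *in_pad,  size_t n_in)
{
    for (size_t i = 0; i < n_out; i++)
        for (size_t j = 0; j < n_in; j++)
            if (out_pad[i] == in_pad[j])
                return out_pad[i];   /* agreement: use this format on the link */
    return PIX_NONE;  /* no overlap: the graph must auto-insert a convert filter */
}
```

When the result is "no overlap", libavfilter splices a conversion filter into the link instead of failing, which is why filtergraphs usually just work.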

The injection points into the graph are special invisible filters called buffersrc (where decoded frames are pushed in) and buffersink (where finished frames are pulled out).

7. Time, Timestamps, and Synchronization

One of the most complex parts of multimedia programming is keeping audio and video synchronized. Because B-frames are encoded out of order, timestamps are critical.

Time Base (time_base)

Timestamps aren't stored as floats (seconds). They are stored as integers representing ticks. The time_base is a fraction (e.g., 1/90000). To get seconds: PTS * time_base.
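In code, the conversion is a single multiply (a minimal sketch of what av_q2d() makes convenient in libavutil; the helper name here is illustrative):

```c
#include <stdint.h>

/* Convert an integer timestamp to seconds using its stream's time_base
 * fraction (tb_num / tb_den), e.g. 1/90000 for MPEG-TS video. */
static double ts_to_seconds(int64_t pts, int tb_num, int tb_den)
{
    return (double)pts * tb_num / tb_den;
}
```

For example, a PTS of 450000 in a 1/90000 time base is 5.0 seconds; the same PTS in a 1/1000 time base would be 450 seconds, which is why timestamps are meaningless without their time_base.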

PTS (Presentation Time Stamp)

When the frame should be shown on the screen. This is what the player uses for A/V sync. Chronological (1, 2, 3, 4...).

DTS (Decoding Time Stamp)

When the frame must be fed into the decoder. Because B-frames need future frames decoded first, DTS is ordered differently than PTS.
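A worked example, assuming a closed GOP that displays as I B B P: the B-frames reference the P-frame, so the encoder must emit P earlier in the bitstream. The toy function below (illustrative only; real decoders rely on DTS and reorder buffers) hoists each anchor frame ahead of the B-run that depends on it:

```c
#include <string.h>

/* Toy reordering for one closed GOP given in display order, where every run
 * of B-frames is predicted from the non-B anchor frame that follows it.
 * Input and output are strings of 'I', 'P', 'B' frame types. */
static void display_to_decode_order(const char *display, char *decode)
{
    size_t n = strlen(display), out = 0, i = 0;
    while (i < n) {
        size_t j = i;
        while (display[j] == 'B')
            j++;                         /* find the anchor after the B-run */
        decode[out++] = display[j];      /* anchor (I or P) must decode first */
        for (size_t k = i; k < j; k++)
            decode[out++] = display[k];  /* then the B-frames that use it */
        i = j + 1;
    }
    decode[out] = '\0';
}
```

So a display (PTS) order of I B B P becomes a bitstream (DTS) order of I P B B: the viewer sees frames chronologically, but the decoder receives them out of order.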

Why does this matter? If you blindly chop a file using a hex editor, or if your container metadata gets corrupted, PTS and DTS become mismatched. The player will decode frames, but show them in the wrong order, resulting in visual glitching (stuttering back and forth).

Microscopic Details: H.264 NAL Units and POC (Layer 4 Depth)

At the lowest bitstream level for codecs like H.264/HEVC, the data isn't just "frames", it's a sequence of NAL units (Network Abstraction Layer).

Inside these NAL units, specifically within the Slice Header, there is a field called the POC (Picture Order Count). The POC is the codec's internal version of PTS. When FFmpeg demuxes a container (like MP4) that has broken or missing PTS metadata, the `libavcodec` decoder parses the NAL units, extracts the POC, and reconstructs the correct PTS values to ensure the video doesn't stutter.

8. Hardware Acceleration

Moving data back and forth between the CPU (System RAM) and the GPU (VRAM) over the PCIe bus is slow. The goal of hardware acceleration is to keep the video in VRAM for the entire pipeline.

Hardware Backends

  • NVENC/NVDEC: NVIDIA's dedicated encode/decode chips.
  • VAAPI / QSV: Intel Quick Sync and generic Linux VAAPI for Intel/AMD.
  • VideoToolbox: Apple's native macOS/iOS hardware framework.

The Ideal Zero-Copy Pipeline

In a perfect pipeline, the hardware decoder decodes straight into VRAM. A hardware filter operates on that VRAM surface, and the hardware encoder reads it directly.

# Perfect VAAPI zero-copy pipeline
ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi -hwaccel_output_format vaapi -i in.mp4 -vf "scale_vaapi=w=1920:h=1080" -c:v h264_vaapi out.mp4

Beware `hwupload` and `hwdownload`: If you use a CPU-based filter (like `drawtext`) but a GPU encoder, FFmpeg must explicitly move the frame to the GPU using the hwupload filter. If you hardware decode but software encode, it uses hwdownload. This crossing of the PCIe bus bottlenecks performance.

9. Using FFmpeg as a Library

If you're writing C/C++ or using bindings (like PyAV), you interact with the lifecycle manually. Here is the canonical flow:

  1. avformat_open_input(): Opens the file and parses the container.
  2. avformat_find_stream_info(): Reads enough packets to figure out codecs.
  3. avcodec_open2(): Initializes your decoder instance.
  4. The Loop:
    • av_read_frame(): Pull an AVPacket.
    • avcodec_send_packet(): Send to decoder.
    • avcodec_receive_frame(): Get raw AVFrame.
  5. Feed frames into an AVFilterGraph (if filtering).
  6. Send frames to the encoder (avcodec_send_frame).
  7. Get encoded packets (avcodec_receive_packet) and mux (av_write_frame).

Memory Management: FFmpeg relies heavily on reference counting. If you don't call av_packet_unref() or av_frame_free() correctly, you will cause massive memory leaks.

10. Common FFmpeg Jargon Dictionary

Term                    Definition & Context
Demuxer                 Reads the container format, splits the interleaved chunks, and routes them to streams. Handled by libavformat.
Muxer                   The reverse of a demuxer. Takes independent streams of encoded packets and packages them into a container file (.mp4, .mkv).
BSF (Bitstream Filter)  Modifies encoded packets without decoding them, e.g. converting H.264 between the MP4 and TS packaging conventions (h264_mp4toannexb). Used via -bsf.
GOP                     Group of Pictures. A sequence beginning with an I-frame. Crucial for seeking and streaming.
Time base               The clock tick unit used to interpret PTS/DTS timestamps, defined per stream.

11. Real World Systems & Pipelines

FFmpeg is the invisible engine powering modern media.

  • VLC Player: A GUI and playback engine built on top of libavcodec and libavformat. When VLC plays a file, FFmpeg libraries are doing the heavy lifting to decode it.
  • HLS / DASH (Streaming Platforms): Netflix and YouTube use Adaptive Bitrate Streaming. The source video is transcoded into an "ABR Ladder" (e.g., 1080p, 720p, 480p, 360p) by FFmpeg. libavformat then slices the video into 2-10 second chunks (.ts or .m4s) and generates a manifest file (.m3u8). The player dynamically downloads the chunk that fits the user's current internet speed.
  • Browsers: Some browsers compile FFmpeg via WebAssembly to run decoding/encoding directly on the client side.
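
As a concrete illustration of that manifest, here is a minimal sketch of an HLS master playlist describing a three-rung ABR ladder (paths and bandwidth numbers are hypothetical):

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1400000,RESOLUTION=854x480
480p/index.m3u8
```

The player measures its download throughput and picks whichever variant playlist its bandwidth can sustain, switching rungs between chunks as conditions change.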

12. Conceptual Pipeline Examples

Example 1: Transcode, Scale, and Lower Bitrate
ffmpeg -i in.mp4 -vf "scale=1280:-1" -c:v libx264 -crf 23 -c:a aac -b:a 128k out.mp4

Flow: Demux MP4 → Decode Video & Audio → Scale Video width to 1280 (keep aspect ratio) → Encode Video to H.264 (CRF 23 quality) → Encode Audio to AAC (128k) → Mux to MP4.

Example 2: Adding a Watermark (Filtergraph)
ffmpeg -i video.mp4 -i logo.png -filter_complex "[0:v][1:v]overlay=10:10[out]" -map "[out]" -map 0:a -c:v libx264 -c:a copy out.mp4

Flow: Read video and image. Pass both to the overlay filter. The result is mapped as the output video. The audio is mapped directly from input 0 and stream-copied (no re-encoding!).


░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░