Marc R Leonard | Developer

How TS embeds metadata in every packet while CMAF front-loads it in the init segment

When it comes to streaming video over the internet, two fundamentally different approaches have emerged for how to package and describe media data. Transport Stream (MPEG-TS) and Common Media Application Format (CMAF) represent contrasting philosophies: one that continuously describes itself as it goes, and another that provides all the instructions upfront and then sends pure media data.

Understanding this distinction is crucial for anyone working with streaming technologies, as it impacts everything from stream parsing to error recovery to adaptive bitrate switching.

The Core Philosophical Difference

🔴 Transport Stream: Self-Describing Packets

Every packet carries metadata. TS was designed for broadcast scenarios where you might tune in at any moment. Each 188-byte TS packet contains header information that helps decoders understand what's inside and how to process it. You must actively decode and inspect these headers to know what you're dealing with.

🔵 CMAF: Initialization + Pure Bitstream

One-time setup, then data. CMAF uses fragmented MP4 (fMP4) where an initialization segment (init segment) contains all the codec parameters, track information, and decoding instructions. After that, media segments contain essentially raw media data with minimal overhead.

Analogy: Think of TS like receiving letters where each envelope has detailed instructions about the sender, priority, and how to read it. CMAF is more like receiving an instruction manual once, then getting unmarked packages that you process according to those initial instructions.

How Transport Stream Headers Work

Transport Stream packets are fixed at 188 bytes, and each one begins with a header that must be parsed to understand the payload. The decoder can't just skip ahead—it needs to actively interpret this metadata.

TS Packet Structure

TS Packet (188 bytes):
├── Sync Byte (0x47) - 1 byte
├── Header - 3+ bytes
│   ├── PID (Packet Identifier) - 13 bits
│   ├── Transport Error Indicator - 1 bit
│   ├── Payload Unit Start Indicator - 1 bit
│   ├── Transport Priority - 1 bit
│   ├── Continuity Counter - 4 bits
│   └── Adaptation Field Control - 2 bits
├── Optional Adaptation Field - variable
└── Payload - remaining bytes

What This Means in Practice

When you receive a TS packet, you must:

Find the sync byte (0x47) to identify packet boundaries
Extract the PID to determine which elementary stream this belongs to (video, audio, subtitles, metadata)
Check the continuity counter to detect dropped packets
Parse the adaptation field if present (contains timing info, random access indicators, etc.)
Look for PAT and PMT tables periodically to understand the stream structure

Key Insight: You cannot just read TS packets as raw bitstream. Each packet header tells you what's inside and how to interpret it. The stream is self-describing but requires active parsing.

The Role of PSI Tables

Transport Stream also uses Program Specific Information (PSI) tables that are periodically inserted into the stream:

PAT (Program Association Table): Maps program numbers to PMT PIDs
PMT (Program Map Table): Lists all elementary streams in a program and their PIDs
CAT (Conditional Access Table): Information about scrambling/encryption

These tables are themselves carried in TS packets and must be parsed to understand the overall stream structure.

Example TS Stream Flow:
[PAT packet] [PMT packet] [Video PID=256] [Audio PID=257] [Video PID=256] 
[Video PID=256] [Audio PID=257] [PAT packet] [Video PID=256] ...

How CMAF's Init Segment Works

CMAF takes a radically different approach. Instead of embedding metadata in every packet, it front-loads all the structural and codec information into an initialization segment (often called the "init segment" or "header").

Init Segment Contents

The init segment is a fragmented MP4 file containing several crucial boxes:

Init Segment Structure:
├── ftyp (File Type Box)
│   └── Brand and compatibility information
├── moov (Movie Box) - THE CRITICAL PART
    ├── mvhd (Movie Header)
    │   └── Timescale, duration
    ├── trak (Track Box) - one per media track
    │   ├── tkhd (Track Header)
    │   ├── mdia (Media Box)
    │   │   ├── mdhd (Media Header)
    │   │   ├── hdlr (Handler Reference)
    │   │   └── minf (Media Information)
    │   │       └── stbl (Sample Table)
    │   │           ├── stsd (Sample Description)
    │   │           │   └── CODEC INFO HERE (avc1, hev1, mp4a, etc.)
    │   │           ├── stts (empty - time-to-sample)
    │   │           ├── stsc (empty - sample-to-chunk)
    │   │           └── stco (empty - chunk offset)
    │   └── ...
    └── mvex (Movie Extends Box)
        └── trex (Track Extends) - defaults for fragments

What's in the Init Segment?

The initialization segment tells the decoder:

Codec parameters: SPS/PPS for H.264, VPS/SPS/PPS for H.265, decoder config for AAC
Track structure: How many tracks, their types (video/audio/text)
Timescale: How to interpret timestamps
Encryption info: If using DRM, the encryption scheme and key IDs
Sample entry information: Bit depth, channels, sample rate, resolution, etc.

The crucial point: Once you've parsed the init segment, you have everything you need to decode the media. The init segment is typically only a few KB.

Media Segments: Pure Bitstream

After the init segment, CMAF media segments (also called "chunks" or "fragments") contain primarily media data with minimal overhead:

Media Segment Structure:
├── moof (Movie Fragment Box)
│   ├── mfhd (Movie Fragment Header)
│   │   └── Sequence number
│   └── traf (Track Fragment)
│       ├── tfhd (Track Fragment Header)
│       │   └── Track ID, default flags
│       ├── tfdt (Track Fragment Decode Time)
│       │   └── Base media decode time (timestamp)
│       └── trun (Track Run)
│           └── Sample count, sizes, flags, composition offsets
└── mdat (Media Data Box)
    └── RAW MEDIA SAMPLES (the actual video/audio bitstream)

The moof box is small (often <200 bytes) and primarily contains timing and sample metadata. The mdat box contains the actual compressed video or audio data.

Key Insight: Unlike TS, you don't need to parse headers for every packet. The init segment told you how to interpret the bitstream. The media segments just give you timestamps and point you to the raw data.

The Fundamental Contrast

Let's make this concrete with a comparison:

Decoding a Transport Stream

For each 188-byte packet:
  1. Find sync byte (0x47)
  2. Parse PID from header
  3. Check if this is PAT/PMT
     - If PAT: Update program table
     - If PMT: Update elementary stream mappings
  4. Check continuity counter
  5. Parse adaptation field if present
  6. Extract payload
  7. If payload unit start indicator set:
     - This is the start of a PES packet
     - Parse PES header for PTS/DTS
  8. Accumulate payload into elementary stream buffer
  9. Pass to codec when complete frame available

This happens for every single packet. The stream cannot be understood without continuous header parsing.

Decoding CMAF

ONE TIME (on stream start or quality switch):
  1. Fetch init segment
  2. Parse moov box
  3. Extract codec parameters from stsd
  4. Initialize decoder with these parameters
  5. Store track metadata

FOR EACH MEDIA SEGMENT:
  1. Parse moof box (small, lightweight)
  2. Extract timestamp from tfdt
  3. Extract sample count and sizes from trun
  4. Read mdat box
  5. Pass samples directly to decoder (already configured)

The heavy lifting happens once. After that, it's mostly just reading bitstream data.

Why This Matters: Practical Implications

1. Stream Switching (ABR)

Transport Stream: When switching between quality levels in HLS, you need to: - Parse the new stream's PAT/PMT - Identify the new PIDs - Handle potential PID conflicts - Re-sync based on continuity counters

CMAF: When switching quality levels in DASH or HLS with fMP4: - Fetch the new init segment - Reconfigure decoder with new parameters - Start reading media segments - Track IDs remain consistent, timestamps align cleanly

CMAF makes adaptive bitrate switching significantly cleaner.

2. Random Access / Seeking

Transport Stream: To seek to a specific time: - Find the nearest sync byte - Hope you're at a random access point (IDR frame) - Parse packets until you find PAT/PMT - Build up the stream structure - Locate the correct PID - Start decoding

CMAF: To seek to a specific time: - You already have the init segment - Calculate which media segment contains that timestamp - Fetch that segment (if it starts with an IDR) - Start decoding immediately

3. Multiplexing Multiple Streams

Transport Stream: This is where TS actually shines. TS can multiplex video, audio, subtitles, data carousels, EPG info, and more into a single stream. The PID mechanism allows arbitrary interleaving:

[V1] [A1] [V1] [SUB] [V1] [A2] [V1] [EPG] [A1] ...

Each component has its own PID, and receivers can selectively decode what they need.

CMAF: Each track is typically in its own file/URL. Multiplexing happens at the manifest level (DASH MPD or HLS master playlist), not in the format itself. You request separate init segments and media segments for each track.

video/init.mp4
video/segment1.m4s
audio/en/init.mp4
audio/en/segment1.m4s
audio/es/init.mp4
audio/es/segment1.m4s

This is more HTTP-friendly but requires multiple requests.

4. Error Recovery

Transport Stream: TS was designed for lossy broadcast transmission. The sync byte and continuity counter help detect and recover from errors. You can re-sync by finding the next sync byte and continue parsing.

CMAF: Missing or corrupted data in a media segment often makes the entire segment unusable (unless using fragment-level recovery mechanisms). However, the next segment can be decoded independently if it starts with an IDR frame.

Where Each Format Excels

Transport Stream is Better For:

Broadcast scenarios: Where receivers tune in at random times
Linear streaming: Traditional TV-like experiences
Single-stream multiplexing: When you want video, audio, and metadata in one stream
Legacy infrastructure: Set-top boxes, broadcast equipment
Low-latency scenarios: With careful tuning, TS can achieve very low latency
Unreliable networks: Better error recovery mechanisms built-in

CMAF is Better For:

HTTP-based streaming: Designed for modern CDNs and web delivery
Adaptive bitrate: Clean switching between qualities
Multi-language support: Easy to offer multiple audio tracks
Modern codecs: Better support for new formats (AV1, VP9, etc.)
Efficient storage: Init segment shared across qualities saves space
Player simplicity: Less complex parsing logic needed
DRM integration: Common Encryption (CENC) is native to fMP4

A More General-Purpose Format?

You mentioned that TS might be more general-purpose than CMAF, and you're right in several ways:

Transport Stream's Generality: - Can carry any type of data (not just audio/video) - DVB subtitles, teletext, data carousels, EPG, etc. - Designed for broadcast, cable, satellite, and IPTV - Can multiplex 100+ channels in a single transport stream (MPTS) - Used in broadcast equipment, cameras, encoders, professional workflows

CMAF's Specialization: - Specifically designed for HTTP adaptive streaming - Optimized for video-on-demand and live streaming over IP - Part of the modern streaming stack (DASH, HLS) - Less suited for traditional broadcast workflows - Focused on web and mobile delivery

In this sense, TS is indeed more "general-purpose" as a media container, while CMAF is more specialized for internet streaming.

Summary: Two Different Design Goals

Aspect	Transport Stream	CMAF
Metadata Strategy	Continuous (every packet)	Front-loaded (init segment)
Parser Complexity	High (active header parsing)	Low (one-time setup)
Stream Structure	Self-contained, tunable anytime	Requires init segment first
Switching Efficiency	Moderate (re-parse stream)	High (swap init segment)
Multiplexing	Built-in (PIDs)	Manifest-level (separate files)
Error Recovery	Excellent (sync byte, counters)	Moderate (segment-level)
Primary Use Case	Broadcast, linear TV	HTTP adaptive streaming
Design Era	1990s (MPEG-2)	2010s (modern web)

Conclusion

Transport Stream and CMAF represent two fundamentally different philosophies for delivering media:

TS says: "I will continuously tell you what I am, in every packet. You can join me at any time, and I'll help you figure out what's going on."

CMAF says: "Let me give you all the instructions once. After that, I'm just sending you data. Trust the init segment, and we'll be efficient together."

Neither is universally better—they were designed for different worlds. TS excels in broadcast and linear streaming where self-describing streams matter. CMAF excels in HTTP-based adaptive streaming where initialization overhead is acceptable and efficiency matters.

Understanding this fundamental difference—headers in every packet vs. init segment + bitstream—is key to working with either format effectively. TS requires you to actively decode and parse continuously. CMAF lets you set up once and then consume mostly raw media data.

In the modern streaming landscape, we're seeing a shift toward CMAF for internet delivery, but TS remains deeply embedded in broadcast infrastructure and will continue to be relevant for years to come. Many workflows even involve converting between the two formats, which requires understanding these fundamental architectural differences.

marc leonard

Transport Stream vs CMAF: Different Philosophies for Streaming Metadata