Transport Stream vs CMAF: Different Philosophies for Streaming Metadata
How TS embeds metadata in every packet while CMAF front-loads it in the init segment
When it comes to streaming video over the internet, two fundamentally different approaches have emerged for how to package and describe media data. Transport Stream (MPEG-TS) and Common Media Application Format (CMAF) represent contrasting philosophies: one that continuously describes itself as it goes, and another that provides all the instructions upfront and then sends pure media data.
Understanding this distinction is crucial for anyone working with streaming technologies, as it impacts everything from stream parsing to error recovery to adaptive bitrate switching.
The Core Philosophical Difference
π΄ Transport Stream: Self-Describing Packets
Every packet carries metadata. TS was designed for broadcast scenarios where you might tune in at any moment. Each 188-byte TS packet contains header information that helps decoders understand what's inside and how to process it. You must actively decode and inspect these headers to know what you're dealing with.
π΅ CMAF: Initialization + Pure Bitstream
One-time setup, then data. CMAF uses fragmented MP4 (fMP4) where an initialization segment (init segment) contains all the codec parameters, track information, and decoding instructions. After that, media segments contain essentially raw media data with minimal overhead.
Analogy: Think of TS like receiving letters where each envelope has detailed instructions about the sender, priority, and how to read it. CMAF is more like receiving an instruction manual once, then getting unmarked packages that you process according to those initial instructions.
How Transport Stream Headers Work
Transport Stream packets are fixed at 188 bytes, and each one begins with a header that must be parsed to understand the payload. The decoder can't just skip aheadβit needs to actively interpret this metadata.
TS Packet Structure
TS Packet (188 bytes):
βββ Sync Byte (0x47) - 1 byte
βββ Header - 3+ bytes
β βββ PID (Packet Identifier) - 13 bits
β βββ Transport Error Indicator - 1 bit
β βββ Payload Unit Start Indicator - 1 bit
β βββ Transport Priority - 1 bit
β βββ Continuity Counter - 4 bits
β βββ Adaptation Field Control - 2 bits
βββ Optional Adaptation Field - variable
βββ Payload - remaining bytes
What This Means in Practice
When you receive a TS packet, you must:
- Find the sync byte (0x47) to identify packet boundaries
- Extract the PID to determine which elementary stream this belongs to (video, audio, subtitles, metadata)
- Check the continuity counter to detect dropped packets
- Parse the adaptation field if present (contains timing info, random access indicators, etc.)
- Look for PAT and PMT tables periodically to understand the stream structure
Key Insight: You cannot just read TS packets as raw bitstream. Each packet header tells you what's inside and how to interpret it. The stream is self-describing but requires active parsing.
The Role of PSI Tables
Transport Stream also uses Program Specific Information (PSI) tables that are periodically inserted into the stream:
- PAT (Program Association Table): Maps program numbers to PMT PIDs
- PMT (Program Map Table): Lists all elementary streams in a program and their PIDs
- CAT (Conditional Access Table): Information about scrambling/encryption
These tables are themselves carried in TS packets and must be parsed to understand the overall stream structure.
Example TS Stream Flow:
[PAT packet] [PMT packet] [Video PID=256] [Audio PID=257] [Video PID=256]
[Video PID=256] [Audio PID=257] [PAT packet] [Video PID=256] ...
How CMAF's Init Segment Works
CMAF takes a radically different approach. Instead of embedding metadata in every packet, it front-loads all the structural and codec information into an initialization segment (often called the "init segment" or "header").
Init Segment Contents
The init segment is a fragmented MP4 file containing several crucial boxes:
Init Segment Structure:
βββ ftyp (File Type Box)
β βββ Brand and compatibility information
βββ moov (Movie Box) - THE CRITICAL PART
βββ mvhd (Movie Header)
β βββ Timescale, duration
βββ trak (Track Box) - one per media track
β βββ tkhd (Track Header)
β βββ mdia (Media Box)
β β βββ mdhd (Media Header)
β β βββ hdlr (Handler Reference)
β β βββ minf (Media Information)
β β βββ stbl (Sample Table)
β β βββ stsd (Sample Description)
β β β βββ CODEC INFO HERE (avc1, hev1, mp4a, etc.)
β β βββ stts (empty - time-to-sample)
β β βββ stsc (empty - sample-to-chunk)
β β βββ stco (empty - chunk offset)
β βββ ...
βββ mvex (Movie Extends Box)
βββ trex (Track Extends) - defaults for fragments
What's in the Init Segment?
The initialization segment tells the decoder:
- Codec parameters: SPS/PPS for H.264, VPS/SPS/PPS for H.265, decoder config for AAC
- Track structure: How many tracks, their types (video/audio/text)
- Timescale: How to interpret timestamps
- Encryption info: If using DRM, the encryption scheme and key IDs
- Sample entry information: Bit depth, channels, sample rate, resolution, etc.
The crucial point: Once you've parsed the init segment, you have everything you need to decode the media. The init segment is typically only a few KB.
Media Segments: Pure Bitstream
After the init segment, CMAF media segments (also called "chunks" or "fragments") contain primarily media data with minimal overhead:
Media Segment Structure:
βββ moof (Movie Fragment Box)
β βββ mfhd (Movie Fragment Header)
β β βββ Sequence number
β βββ traf (Track Fragment)
β βββ tfhd (Track Fragment Header)
β β βββ Track ID, default flags
β βββ tfdt (Track Fragment Decode Time)
β β βββ Base media decode time (timestamp)
β βββ trun (Track Run)
β βββ Sample count, sizes, flags, composition offsets
βββ mdat (Media Data Box)
βββ RAW MEDIA SAMPLES (the actual video/audio bitstream)
The moof box is small (often <200 bytes) and primarily contains timing and sample metadata. The mdat box contains the actual compressed video or audio data.
Key Insight: Unlike TS, you don't need to parse headers for every packet. The init segment told you how to interpret the bitstream. The media segments just give you timestamps and point you to the raw data.
The Fundamental Contrast
Let's make this concrete with a comparison:
Decoding a Transport Stream
For each 188-byte packet:
1. Find sync byte (0x47)
2. Parse PID from header
3. Check if this is PAT/PMT
- If PAT: Update program table
- If PMT: Update elementary stream mappings
4. Check continuity counter
5. Parse adaptation field if present
6. Extract payload
7. If payload unit start indicator set:
- This is the start of a PES packet
- Parse PES header for PTS/DTS
8. Accumulate payload into elementary stream buffer
9. Pass to codec when complete frame available
This happens for every single packet. The stream cannot be understood without continuous header parsing.
Decoding CMAF
ONE TIME (on stream start or quality switch):
1. Fetch init segment
2. Parse moov box
3. Extract codec parameters from stsd
4. Initialize decoder with these parameters
5. Store track metadata
FOR EACH MEDIA SEGMENT:
1. Parse moof box (small, lightweight)
2. Extract timestamp from tfdt
3. Extract sample count and sizes from trun
4. Read mdat box
5. Pass samples directly to decoder (already configured)
The heavy lifting happens once. After that, it's mostly just reading bitstream data.
Why This Matters: Practical Implications
1. Stream Switching (ABR)
Transport Stream: When switching between quality levels in HLS, you need to: - Parse the new stream's PAT/PMT - Identify the new PIDs - Handle potential PID conflicts - Re-sync based on continuity counters
CMAF: When switching quality levels in DASH or HLS with fMP4: - Fetch the new init segment - Reconfigure decoder with new parameters - Start reading media segments - Track IDs remain consistent, timestamps align cleanly
CMAF makes adaptive bitrate switching significantly cleaner.
2. Random Access / Seeking
Transport Stream: To seek to a specific time: - Find the nearest sync byte - Hope you're at a random access point (IDR frame) - Parse packets until you find PAT/PMT - Build up the stream structure - Locate the correct PID - Start decoding
CMAF: To seek to a specific time: - You already have the init segment - Calculate which media segment contains that timestamp - Fetch that segment (if it starts with an IDR) - Start decoding immediately
3. Multiplexing Multiple Streams
Transport Stream: This is where TS actually shines. TS can multiplex video, audio, subtitles, data carousels, EPG info, and more into a single stream. The PID mechanism allows arbitrary interleaving:
[V1] [A1] [V1] [SUB] [V1] [A2] [V1] [EPG] [A1] ...
Each component has its own PID, and receivers can selectively decode what they need.
CMAF: Each track is typically in its own file/URL. Multiplexing happens at the manifest level (DASH MPD or HLS master playlist), not in the format itself. You request separate init segments and media segments for each track.
video/init.mp4
video/segment1.m4s
audio/en/init.mp4
audio/en/segment1.m4s
audio/es/init.mp4
audio/es/segment1.m4s
This is more HTTP-friendly but requires multiple requests.
4. Error Recovery
Transport Stream: TS was designed for lossy broadcast transmission. The sync byte and continuity counter help detect and recover from errors. You can re-sync by finding the next sync byte and continue parsing.
CMAF: Missing or corrupted data in a media segment often makes the entire segment unusable (unless using fragment-level recovery mechanisms). However, the next segment can be decoded independently if it starts with an IDR frame.
Where Each Format Excels
Transport Stream is Better For:
- Broadcast scenarios: Where receivers tune in at random times
- Linear streaming: Traditional TV-like experiences
- Single-stream multiplexing: When you want video, audio, and metadata in one stream
- Legacy infrastructure: Set-top boxes, broadcast equipment
- Low-latency scenarios: With careful tuning, TS can achieve very low latency
- Unreliable networks: Better error recovery mechanisms built-in
CMAF is Better For:
- HTTP-based streaming: Designed for modern CDNs and web delivery
- Adaptive bitrate: Clean switching between qualities
- Multi-language support: Easy to offer multiple audio tracks
- Modern codecs: Better support for new formats (AV1, VP9, etc.)
- Efficient storage: Init segment shared across qualities saves space
- Player simplicity: Less complex parsing logic needed
- DRM integration: Common Encryption (CENC) is native to fMP4
A More General-Purpose Format?
You mentioned that TS might be more general-purpose than CMAF, and you're right in several ways:
Transport Stream's Generality: - Can carry any type of data (not just audio/video) - DVB subtitles, teletext, data carousels, EPG, etc. - Designed for broadcast, cable, satellite, and IPTV - Can multiplex 100+ channels in a single transport stream (MPTS) - Used in broadcast equipment, cameras, encoders, professional workflows
CMAF's Specialization: - Specifically designed for HTTP adaptive streaming - Optimized for video-on-demand and live streaming over IP - Part of the modern streaming stack (DASH, HLS) - Less suited for traditional broadcast workflows - Focused on web and mobile delivery
In this sense, TS is indeed more "general-purpose" as a media container, while CMAF is more specialized for internet streaming.
Summary: Two Different Design Goals
| Aspect | Transport Stream | CMAF |
|---|---|---|
| Metadata Strategy | Continuous (every packet) | Front-loaded (init segment) |
| Parser Complexity | High (active header parsing) | Low (one-time setup) |
| Stream Structure | Self-contained, tunable anytime | Requires init segment first |
| Switching Efficiency | Moderate (re-parse stream) | High (swap init segment) |
| Multiplexing | Built-in (PIDs) | Manifest-level (separate files) |
| Error Recovery | Excellent (sync byte, counters) | Moderate (segment-level) |
| Primary Use Case | Broadcast, linear TV | HTTP adaptive streaming |
| Design Era | 1990s (MPEG-2) | 2010s (modern web) |
Conclusion
Transport Stream and CMAF represent two fundamentally different philosophies for delivering media:
TS says: "I will continuously tell you what I am, in every packet. You can join me at any time, and I'll help you figure out what's going on."
CMAF says: "Let me give you all the instructions once. After that, I'm just sending you data. Trust the init segment, and we'll be efficient together."
Neither is universally betterβthey were designed for different worlds. TS excels in broadcast and linear streaming where self-describing streams matter. CMAF excels in HTTP-based adaptive streaming where initialization overhead is acceptable and efficiency matters.
Understanding this fundamental differenceβheaders in every packet vs. init segment + bitstreamβis key to working with either format effectively. TS requires you to actively decode and parse continuously. CMAF lets you set up once and then consume mostly raw media data.
In the modern streaming landscape, we're seeing a shift toward CMAF for internet delivery, but TS remains deeply embedded in broadcast infrastructure and will continue to be relevant for years to come. Many workflows even involve converting between the two formats, which requires understanding these fundamental architectural differences.