Zstandard Compression Format Description
========================================

### Notices

Copyright (c) 2016 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

0.0.1 (30/06/2016 - Work in progress - unfinished)


Introduction
------------

The purpose of this document is to define a lossless compressed data format,
that is independent of CPU type, operating system,
file system and character set, and is suitable for
file compression, and for pipe and streaming compression,
using the [Zstandard algorithm](http://www.zstandard.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the Zstandard compression method,
and an optional [xxHash-64 checksum method](http://www.xxhash.org),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into Zstandard format and/or decompress data from Zstandard format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It doesn't need to support all options though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore informative fields, such as the checksum.
Whenever it does not support a parameter defined in the compressed stream,
it must produce a non-ambiguous error code and associated error message
explaining which parameter is unsupported.


Definitions
-----------
Content compressed by Zstandard is transformed into a Zstandard __frame__.
Multiple frames can be appended into a single file or stream.
A frame is totally independent, has a defined beginning and end,
and a set of parameters which tells the decoder how to decompress it.

A frame encapsulates one or multiple __blocks__.
Each block can be compressed or not,
and has a guaranteed maximum content size, which depends on frame parameters.
Unlike frames, each block depends on previous blocks for proper decoding.
However, each block can be decompressed without waiting for its successor,
allowing streaming operations.


General Structure of Zstandard Frame format
-------------------------------------------

| MagicNb | F. Header  | Block | (More blocks) | EndMark |
|:-------:|:----------:| ----- | ------------- | ------- |
| 4 bytes | 2-14 bytes |       |               | 3 bytes |

__Magic Number__

4 Bytes, Little endian format.
Value : 0xFD2FB527

__Frame Header__

2 to 14 Bytes, to be detailed in the next part.

__Data Blocks__

To be detailed later on.
That's where compressed data is stored.

__EndMark__

The flow of blocks ends when the last block header brings an _end signal_.
This last block header may optionally host a __Content Checksum__.

__Content Checksum__

The Content Checksum verifies that the frame content has been regenerated correctly.
The content checksum is the result
of the [xxh64() hash function](https://www.xxHash.com)
digesting the original (decoded) data as input, with a seed of zero.
Bits from 11 to 32 (included) are extracted to form a 22-bit checksum
stored into the last block header.
```
contentChecksum = (XXH64(content, size, 0) >> 11) & ((1<<22)-1);
```
Content checksum is only present when its associated flag
is set in the frame descriptor.
Its usage is optional.
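The bit extraction above can be exercised directly. The sketch below is illustrative only (the function name is this sketch's own, and the digest value used in the example is arbitrary, not a real XXH64 output):

```python
def content_checksum(xxh64_digest):
    # Keep bits 11 to 32 (inclusive) of the 64-bit digest: a 22-bit value.
    return (xxh64_digest >> 11) & ((1 << 22) - 1)

# An all-ones digest yields an all-ones 22-bit checksum.
print(hex(content_checksum(0xFFFFFFFFFFFFFFFF)))  # prints 0x3fffff
```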

__Frame Concatenation__

In some circumstances, it may be required to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame brings its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file is left outside of this specification.
As an example, the reference `zstd` command line utility is able
to decode all concatenated frames in their sequential order,
delivering the final decompressed result as if it was a single content.


Frame Header
-------------

| FHD     | (WD)      | (Content Size) | (dictID)  |
| ------- | --------- |:--------------:| --------- |
| 1 byte  | 0-1 byte  | 0-8 bytes      | 0-4 bytes |

The frame header has a variable size, using a minimum of 2 bytes,
and up to 14 bytes depending on optional parameters.

__FHD byte__ (Frame Header Descriptor)

The first header byte is called the Frame Header Descriptor.
It tells which other fields are present.
Decoding this byte is enough to get the full size of the Frame Header.

| BitNb   | 7-6    | 5       | 4      | 3        | 2        | 1-0    |
| ------- | ------ | ------- | ------ | -------- | -------- | ------ |
|FieldName| FCSize | Segment | Unused | Reserved | Checksum | dictID |

In this table, bit 7 is the highest bit, while bit 0 is the lowest.

__Frame Content Size flag__

This is a 2-bits flag (`= FHD >> 6`),
specifying if the decompressed data size is provided within the header.

| Value   | 0   | 1   | 2   | 3   |
| ------- | --- | --- | --- | --- |
|FieldSize| 0-1 | 2   | 4   | 8   |

Value 0 has a double meaning :
it either means `0` (size not provided) _if_ the `WD` byte is present,
or it means `1` byte (size <= 255 bytes) if the `WD` byte is not present.

__Single Segment__

If this flag is set,
data shall be regenerated within a single continuous memory segment.
In that case, the `WD` byte __is not present__,
but the `Frame Content Size` field necessarily is.

As a consequence, the decoder must allocate a memory segment
of size `>= Frame Content Size`.

In order to protect the decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond the decoder's authorized range.

For broader compatibility, decoders are recommended to support
memory sizes of at least 8 MB.
However, this is merely a recommendation,
and each decoder is free to support higher or lower limits,
depending on local limitations.

__Unused bit__

The value of this bit is unimportant
and not interpreted by a decoder compliant with this specification version.
It may be used in a future revision,
to signal a property which is not required to properly decode the frame.

__Reserved bit__

This bit is reserved for some future feature.
Its value _must be zero_.
A decoder compliant with this specification version must ensure it is not set.
This bit may be used in a future revision,
to signal a feature that must be interpreted in order to decode the frame.

__Content checksum flag__

If this flag is set, a content checksum will be present in the EndMark.
The checksum is a 22-bit value extracted from the XXH64() of the data.
See __Content Checksum__.

__Dictionary ID flag__

This is a 2-bits flag (`= FHD & 3`),
telling if a dictionary ID is provided within the header.

| Value   | 0   | 1   | 2   | 3   |
| ------- | --- | --- | --- | --- |
|FieldSize| 0   | 1   | 2   | 4   |
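Putting the descriptor fields together, a decoder's first step can be sketched as follows. This is illustrative Python, not reference code; the function name and the returned tuple layout are this sketch's own choices:

```python
def parse_fhd(fhd):
    """Split the Frame Header Descriptor byte into its fields,
    following the bit layout described above (a sketch)."""
    fcsize_flag    = fhd >> 6        # bits 7-6 : Frame Content Size flag
    single_segment = (fhd >> 5) & 1  # bit 5
    reserved       = (fhd >> 3) & 1  # bit 3 : must be zero
    checksum_flag  = (fhd >> 2) & 1  # bit 2
    dict_id_flag   = fhd & 3         # bits 1-0
    if reserved:
        raise ValueError("reserved bit set: frame not decodable")
    fcs_field_size = {0: 0, 1: 2, 2: 4, 3: 8}[fcsize_flag]
    if fcsize_flag == 0 and single_segment:
        fcs_field_size = 1           # value 0 means a 1-byte field when WD is absent
    dict_id_size = {0: 0, 1: 1, 2: 2, 3: 4}[dict_id_flag]
    return fcs_field_size, single_segment, checksum_flag, dict_id_size
```

Note how decoding this single byte is indeed enough to compute the full frame header size: `1 + (0 if single_segment else 1) + fcs_field_size + dict_id_size`.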

__WD byte__ (Window Descriptor)

Provides guarantees on the maximum back-reference distance
that will be present within compressed data.
This information is useful for decoders to allocate enough memory.

| BitNb     | 7-3      | 2-0      |
| --------- | -------- | -------- |
| FieldName | Exponent | Mantissa |

Maximum distance is given by the following formula :
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
windowAdd = (windowBase / 8) * Mantissa;
windowSize = windowBase + windowAdd;
```
The minimum window size is 1 KB.
The maximum window size is 15*(2^38) bytes, which is about 3.75 TB.
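The formula above translates directly into code. A minimal sketch (the function name is this sketch's own):

```python
def window_size(wd):
    """Compute the maximum back-reference distance from the WD byte,
    using the Exponent/Mantissa formula above."""
    exponent = wd >> 3           # bits 7-3
    mantissa = wd & 7            # bits 2-0
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add
```

With `wd = 0` this yields the 1 KB minimum; with `wd = 0xFF` (Exponent 31, Mantissa 7) it yields the 15*(2^38)-byte maximum.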

To properly decode compressed data,
a decoder will need to allocate a buffer of at least `windowSize` bytes.

Note that the `WD` byte is optional. It's not present in `single segment` mode.
In that case, the maximum back-reference distance is the content size itself,
which can be any value from 1 to 2^64-1 bytes (16 EB).

In order to protect the decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond the decoder's authorized range.

For better interoperability,
decoders are recommended to be compatible with window sizes of 8 MB,
and encoders are recommended to not request more than 8 MB.
It's merely a recommendation though :
decoders are free to support higher or lower limits,
depending on local limitations.

__Frame Content Size__

This is the original (uncompressed) size.
This information is optional, and only present if the associated flag is set.
Content size is provided using 1, 2, 4 or 8 Bytes.
Format is Little endian.

| Field Size | Range      |
| ---------- | ---------- |
| 0          | unknown    |
| 1          | 0 - 255    |
| 2          | 256 - 65791|
| 4          | 0 - 2^32-1 |
| 8          | 0 - 2^64-1 |

When the field size is 1, 4 or 8 bytes, the value is read directly.
When the field size is 2, _an offset of 256 is added_.
It's allowed to represent a small size (ex: `18`) using the 8-bytes variant.
A field size of `0` means `content size is unknown`.
In which case, the `WD` byte will necessarily be present,
and becomes the only hint to determine memory allocation.
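The decoding rules above, including the offset of 256 for the 2-byte variant, can be sketched as (illustrative code; the function name is this sketch's own):

```python
def decode_frame_content_size(field):
    """Decode the little-endian Frame Content Size field.
    `field` holds the 0, 1, 2, 4 or 8 bytes of the field;
    an empty field means the content size is unknown."""
    if len(field) == 0:
        return None                      # content size unknown
    value = int.from_bytes(field, "little")
    if len(field) == 2:
        value += 256                     # 2-byte variant covers 256 - 65791
    return value
```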

In order to protect the decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond the decoder's authorized range.

__Dictionary ID__

This is a variable size field, which contains a single ID.
It is used to check that the correct dictionary is used for decoding.
Note that this field is optional. If it's not present,
it's up to the caller to make sure it uses the correct dictionary.

Field size depends on the __Dictionary ID flag__.
1 byte can represent an ID 0-255.
2 bytes can represent an ID 0-65535.
4 bytes can represent an ID 0-(2^32-1).

It's allowed to represent a small ID (for example `13`)
with a large 4-bytes dictionary ID, losing some efficiency in the process.


Data Blocks
-----------

| B. Header | data   |
|:---------:| ------ |
| 3 bytes   |        |


__Block Header__

This field uses 3 bytes, in __big-endian__ format.

The 2 highest bits represent the `block type`,
while the remaining 22 bits represent the (compressed) block size.

There are 4 block types :

| Value      | 0          | 1   | 2   | 3       |
| ---------- | ---------- | --- | --- | ------- |
| Block Type | Compressed | Raw | RLE | EndMark |

- Compressed : this is a Zstandard compressed block,
  detailed in a later part of this specification.
  "block size" is the compressed size.
  Decompressed size is unknown,
  but its maximum possible value is guaranteed (see below).
- Raw : this is an uncompressed block.
  "block size" is the number of bytes to read and copy.
- RLE : this is a single byte, repeated N times.
  In this case, "block size" is the size to regenerate,
  while the "compressed" block is just 1 byte (the byte to repeat).
- EndMark : this is not a block. It signals the end of the frame.
  The rest of the field may be optionally filled by a checksum
  (see frame checksum).

Block sizes must respect a few rules :
- In compressed mode, the compressed size is always strictly `< contentSize`.
- Block decompressed size is necessarily <= maximum back-reference distance.
- Block decompressed size is necessarily <= 128 KB.
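Splitting the 3-byte big-endian header into its two fields can be sketched as (the function name is this sketch's own):

```python
def parse_block_header(header):
    """Parse the 3-byte big-endian block header: the 2 highest bits
    are the block type, the remaining 22 bits the block size."""
    value = int.from_bytes(header[:3], "big")
    block_type = value >> 22               # 0=Compressed, 1=Raw, 2=RLE, 3=EndMark
    block_size = value & ((1 << 22) - 1)
    return block_type, block_size
```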


__Data__

This is where the actual data to decode is stored.
It might be compressed or not, depending on previous field indications.
A data block is not necessarily "full" :
since an arbitrary flush may happen anytime,
block content can be any size, up to Block Maximum Size.
Block Maximum Size is the smallest of :
- Max back-reference distance
- 128 KB


Skippable Frames
----------------

| Magic Number | Frame Size | User Data |
|:------------:|:----------:| --------- |
| 4 bytes      | 4 bytes    |           |

Skippable frames allow the insertion of user-defined data
into a flow of concatenated frames.
Their design is pretty straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.

Skippable frames defined in this specification are compatible with LZ4 ones.


__Magic Number__ :

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__Frame Size__ :

This is the size, in bytes, of the following User Data
(without including the magic number nor the size field itself).
4 Bytes, Little endian format, unsigned 32-bits.
This means User Data can't be bigger than (2^32-1) Bytes.

__User Data__ :

User Data can be anything. Data will just be skipped by the decoder.
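The skipping logic is short enough to sketch in full. This is illustrative code, not reference code (the function name and the "return the unchanged position when no skippable frame is found" convention are this sketch's own):

```python
def skip_frame(stream, pos):
    """If a skippable frame starts at `pos` in `stream`, return the
    position just past it; otherwise return `pos` unchanged."""
    if len(stream) - pos < 8:
        return pos
    magic = int.from_bytes(stream[pos:pos + 4], "little")
    # Any magic from 0x184D2A50 to 0x184D2A5F identifies a skippable frame.
    if (magic & 0xFFFFFFF0) != 0x184D2A50:
        return pos
    frame_size = int.from_bytes(stream[pos + 4:pos + 8], "little")
    return pos + 8 + frame_size            # magic + size field + user data
```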


Compressed block format
-----------------------
This specification details the content of a _compressed block_.
A compressed block has a size, which must be known in order to decode it.
It also has a guaranteed maximum regenerated size,
in order to properly allocate the destination buffer.
See "Frame format" for more details.

A compressed block consists of 2 sections :
- Literals section
- Sequences section

### Prerequisite
For proper decoding, a compressed block requires access to the following elements :
- Previously decoded blocks, up to a distance of `windowSize`,
  or all previous blocks in the same frame when in "single segment" mode.
- The list of "recent offsets" from the previous compressed block.


### Compressed Literals

Literals are compressed using order-0 huffman compression.
During the sequence phase, literals are entangled with match copy operations.
All literals are regrouped in the first part of the block.
They can be decoded first, and then copied during sequence operations,
or they can be decoded on the fly, as needed by the sequences.

| Header | (Tree Description) | Stream1 | (Stream2) | (Stream3) | (Stream4) |
| ------ | ------------------ | ------- | --------- | --------- | --------- |

Literals can be compressed, or uncompressed.
When compressed, an optional tree description can be present,
followed by 1 or 4 streams.

#### Block Literal Header

The header describes precisely how literals are packed.
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
using big-endian convention.

| BlockType | sizes format | (compressed size) | regenerated size |
| --------- | ------------ | ----------------- | ---------------- |
| 2 bits    | 1 - 2 bits   | 0 - 18 bits       | 5 - 20 bits      |

__Block Type__ :

This is a 2-bits field, describing 4 different block types :

| Value      | 0          | 1      | 2   | 3   |
| ---------- | ---------- | ------ | --- | --- |
| Block Type | Compressed | Repeat | Raw | RLE |

- Compressed : This is a standard huffman-compressed block,
  starting with a huffman tree description.
  See details below.
- Repeat : This is a huffman-compressed block,
  using the huffman tree from the previous huffman-compressed block.
  The huffman tree description is skipped.
  The compressed stream is equivalent to the "compressed" block type.
- Raw : Literals are stored uncompressed.
- RLE : Literals consist of a single byte value repeated N times.

__Sizes format__ :

Sizes formats are divided into 2 families :

- For compressed blocks, both the compressed size and the decompressed size
  must be decoded, along with the number of streams.
- For Raw or RLE blocks, it's enough to decode the size to regenerate.

For values spanning several bytes, the convention is big-endian.

__Sizes format for Raw or RLE block__ :

- Value : 0x : Regenerated size uses 5 bits (0-31).
               Total literal header size is 1 byte.
               `size = h[0] & 31;`
- Value : 10 : Regenerated size uses 12 bits (0-4095).
               Total literal header size is 2 bytes.
               `size = ((h[0] & 15) << 8) + h[1];`
- Value : 11 : Regenerated size uses 20 bits (0-1048575).
               Total literal header size is 3 bytes.
               `size = ((h[0] & 15) << 16) + (h[1]<<8) + h[2];`

Note : it's allowed to represent a short value (ex : `13`)
using a long format, accepting the reduced compactness.
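The three Raw/RLE formulas above can be combined into one decoder step. A sketch (the function name and the `(size, header_length)` return convention are this sketch's own):

```python
def raw_rle_literal_size(h):
    """Decode the regenerated size from a Raw/RLE literal header.
    `h` holds the header bytes; the sizes-format bits sit just
    below the 2-bit block type. Returns (size, header_length)."""
    fmt = (h[0] >> 4) & 3
    if fmt & 2 == 0:    # '0x' : 5-bit size, 1-byte header
        return h[0] & 31, 1
    if fmt == 2:        # '10' : 12-bit size, 2-byte header
        return ((h[0] & 15) << 8) + h[1], 2
    # '11' : 20-bit size, 3-byte header
    return ((h[0] & 15) << 16) + (h[1] << 8) + h[2], 3
```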

__Sizes format for Compressed Block__ :

Note : this format is also applicable to "repeat" blocks.
- Value : 00 : 4 streams.
               Compressed and regenerated sizes use 10 bits (0-1023).
               Total literal header size is 3 bytes.
- Value : 01 : _Single stream_.
               Compressed and regenerated sizes use 10 bits (0-1023).
               Total literal header size is 3 bytes.
- Value : 10 : 4 streams.
               Compressed and regenerated sizes use 14 bits (0-16383).
               Total literal header size is 4 bytes.
- Value : 11 : 4 streams.
               Compressed and regenerated sizes use 18 bits (0-262143).
               Total literal header size is 5 bytes.

Compressed and regenerated size fields follow the big-endian convention.
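A decoding sketch for this header follows. It is illustrative only: the function name and return convention are this sketch's own, and the field order (compressed size before regenerated size, per the layout table at the top of this section) is an assumption of the sketch:

```python
def compressed_literal_header(h):
    """Decode the literal header of a Compressed/Repeat block.
    Returns (num_streams, compressed_size, regenerated_size, header_length)."""
    fmt = (h[0] >> 4) & 3                 # sizes-format bits, below the block type
    if fmt in (0, 1):
        bits, length = 10, 3              # 10-bit sizes, 3-byte header
    elif fmt == 2:
        bits, length = 14, 4              # 14-bit sizes, 4-byte header
    else:
        bits, length = 18, 5              # 18-bit sizes, 5-byte header
    streams = 1 if fmt == 1 else 4        # '01' is the single-stream variant
    value = int.from_bytes(h[:length], "big")
    mask = (1 << bits) - 1
    compressed = (value >> bits) & mask   # assumed order: compressed size first
    regenerated = value & mask
    return streams, compressed, regenerated, length
```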

#### Huffman Tree description

This section is only present when the block type is _compressed_ (`0`).
It describes the different leaf nodes of the huffman tree,
and their relative weights.

##### Representation

All byte values from zero (included) to the last present one (excluded)
are represented by `weight` values, from 0 to `maxBits`.
Transformation from `weight` to `nbBits` follows this formula :
`nbBits = weight ? maxBits + 1 - weight : 0;` .
The last symbol's weight is deduced from previously decoded ones,
by completing to the nearest power of 2.
This power of 2 gives `maxBits`, the depth of the current tree.

__Example__ :
Let's presume the following huffman tree must be described :

| Value  | 0 | 1 | 2 | 3 | 4 | 5 |
| ------ | - | - | - | - | - | - |
| nbBits | 1 | 2 | 3 | 0 | 4 | 4 |

The tree depth is 4, since its longest codes use 4 bits.
Value `5` will not be listed, nor will values above `5`.
Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
The weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
It gives the following series of weights :

| weight | 4 | 3 | 2 | 0 | 1 |
| ------ | - | - | - | - | - |
| Value  | 0 | 1 | 2 | 3 | 4 |

The decoder will do the inverse operation :
having collected weights of symbols from `0` to `4`,
it knows the last symbol, `5`, is present with a non-zero weight.
The weight of `5` can be deduced by completing to the nearest power of 2.
The sum of 2^(weight-1) (excluding 0) is :
8 + 4 + 2 + 0 + 1 = 15.
The nearest power of 2 is 16.
Therefore, `maxBits = 4` and `weight[5] = 1`.
It can then proceed to transform weights back into nbBits :
`nbBits = weight ? maxBits + 1 - weight : 0;` .
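The worked example above can be sketched end-to-end. This is illustrative code (function name and return convention are this sketch's own); it assumes, as the text states, that the last symbol always carries a non-zero weight:

```python
def weights_to_nbbits(weights):
    """Given the decoded weights of all symbols but the last,
    deduce maxBits and the last weight by completing the sum of
    2^(weight-1) to the nearest power of 2, then rebuild nbBits.
    Returns (nbbits_list, max_bits)."""
    total = sum(1 << (w - 1) for w in weights if w > 0)
    max_bits = 1
    while (1 << max_bits) < total:
        max_bits += 1
    last_weight_pow = (1 << max_bits) - total   # this is 2^(lastWeight-1)
    last_weight = last_weight_pow.bit_length()  # power of 2, so log2 + 1
    all_weights = weights + [last_weight]
    return [max_bits + 1 - w if w else 0 for w in all_weights], max_bits
```

Running it on the example's weights `[4, 3, 2, 0, 1]` recovers `nbBits = [1, 2, 3, 0, 4, 4]` with `maxBits = 4`, matching the first table.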

##### Huffman Tree header

This is a single byte value (0-255), which tells how to decode the tree.

- if headerByte >= 242 : this is one of 14 pre-defined weight distributions :
  + 242 : 1x1 (+ 1x1)
  + 243 : 2x1 (+ 1x2)
  + 244 : 3x1 (+ 1x1)
  + 245 : 4x1 (+ 1x4)
  + 246 : 7x1 (+ 1x1)
  + 247 : 8x1 (+ 1x8)
  + 248 : 15x1 (+ 1x1)
  + 249 : 16x1 (+ 1x16)
  + 250 : 31x1 (+ 1x1)
  + 251 : 32x1 (+ 1x32)
  + 252 : 63x1 (+ 1x1)
  + 253 : 64x1 (+ 1x64)
  + 254 : 127x1 (+ 1x1)
  + 255 : 128x1 (+ 1x128)

- if headerByte >= 128 : this is a direct representation,
  where each weight is written directly as a 4-bits field (0-15).
  The full representation occupies ((nbSymbols+1)/2) bytes,
  meaning it uses a full last byte even if nbSymbols is odd.
  `nbSymbols = headerByte - 127;`

- if headerByte < 128 :
  the series of weights is compressed by FSE.
  The length of the compressed series is `headerByte` (0-127).
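The three cases above can be dispatched with a small classifier. A sketch (the function name, the string tags, and the `(kind, parameter)` return shape are this sketch's own):

```python
def tree_header_kind(header_byte):
    """Classify the huffman tree header byte into the three cases above.
    Returns (kind, parameter)."""
    if header_byte >= 242:
        # One of the 14 pre-defined weight distributions (index 0-13).
        return "predefined", header_byte - 242
    if header_byte >= 128:
        # Direct representation: parameter is nbSymbols, 4 bits per weight.
        return "direct", header_byte - 127
    # FSE-compressed series of weights: parameter is its byte length.
    return "fse", header_byte
```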

##### FSE (Finite State Entropy) compression of huffman weights

The series of weights is compressed using standard FSE compression.
It's a single bitstream with 2 interleaved states,
using a single distribution table.

To decode an FSE bitstream, it is necessary to know its compressed size.
The compressed size is provided by `headerByte`.
It's also necessary to know its maximum decompressed size.
In this case, it's `255`, since literal values range from `0` to `255`,
and the last symbol's value is not represented.

An FSE bitstream starts with a header, describing the probability distribution.
The result is used to build a decoding table.
It is necessary to know the maximum accuracy of the distribution
to properly allocate space for the table.
For a list of huffman weights, this maximum is 8 bits.

FSE headers and bitstreams are described in a separate chapter.

##### Conversion from weights to huffman prefix codes




Version changes
---------------