Blame - zstd_compression_format.md - external_zstd

blob: b203dd1013f7afc5c059c598aab5048019c72d86 [file] [log] [blame] [view]

Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1	Zstandard Compression Format Description
				2	========================================
				3
				4	### Notices
				5
				6	Copyright (c) 2016 Yann Collet
				7
				8	Permission is granted to copy and distribute this document
				9	for any purpose and without charge,
				10	including translations into other languages
				11	and incorporation into compilations,
				12	provided that the copyright notice and this notice are preserved,
				13	and that any substantive changes or deletions from the original
				14	are clearly marked.
				15	Distribution of this document is unlimited.
				16
				17	### Version
				18
				19	0.1.0 (30/06/2016 - unfinished)
				20
				21
				22	Introduction
				23	------------
				24
				25	The purpose of this document is to define a lossless compressed data format,
				26	that is independent of CPU type, operating system,
				27	file system and character set, suitable for
				28	File compression, Pipe and streaming compression
				29	using the [Zstandard algorithm](http://www.zstandard.org).
				30
				31	The data can be produced or consumed,
				32	even for an arbitrarily long sequentially presented input data stream,
				33	using only an a priori bounded amount of intermediate storage,
				34	and hence can be used in data communications.
				35	The format uses the Zstandard compression method,
				36	and optional [xxHash-64 checksum method](http://www.xxhash.org),
				37	for detection of data corruption.
				38
				39	The data format defined by this specification
				40	does not attempt to allow random access to compressed data.
				41
				42	This specification is intended for use by implementers of software
				43	to compress data into Zstandard format and/or decompress data from Zstandard format.
				44	The text of the specification assumes a basic background in programming
				45	at the level of bits and other primitive data representations.
				46
				47	Unless otherwise indicated below,
				48	a compliant compressor must produce data sets
				49	that conform to the specifications presented here.
				50	It doesn’t need to support all options though.
				51
				52	A compliant decompressor must be able to decompress
				53	at least one working set of parameters
				54	that conforms to the specifications presented here.
				55	It may also ignore informative fields, such as checksum.
				56	Whenever it does not support a parameter defined in the compressed stream,
				57	it must produce a non-ambiguous error code and associated error message
				58	explaining which parameter is unsupported.
				59
				60
				61	Definitions
				62	-----------
				63	A content compressed by Zstandard is transformed into a Zstandard __frame__.
				64	Multiple frames can be appended into a single file or stream.
				65	A frame is totally independent, has a defined beginning and end,
				66	and a set of parameters which tells the decoder how to decompress it.
				67
				68	A frame encapsulates one or multiple __blocks__.
				69	Each block can be compressed or not,
				70	and has a guaranteed maximum content size, which depends on frame parameters.
				71	Unlike frames, each block depends on previous blocks for proper decoding.
				72	However, each block can be decompressed without waiting for its successor,
				73	allowing streaming operations.
				74
				75
				76	General Structure of Zstandard Frame format
				77	-------------------------------------------
				78
				79	\| MagicNb \| F. Header \| Block \| (More blocks) \| EndMark \|
				80	\|:-------:\|:----------:\| ----- \| ------------- \| ------- \|
				81	\| 4 bytes \| 2-14 bytes \| \| \| 3 bytes \|
				82
				83	__Magic Number__
				84
				85	4 Bytes, Little endian format.
				86	Value : 0xFD2FB527
				87
				88	__Frame Header__
				89
				90	2 to 14 Bytes, to be detailed in the next part.
				91
				92	__Data Blocks__
				93
				94	To be detailed later on.
				95	That’s where compressed data is stored.
				96
				97	__EndMark__
				98
				99	The flow of blocks ends when the last block header brings an _end signal_ .
				100	This last block header may optionally host a __Content Checksum__ .
				101
				102	__Content Checksum__
				103
				104	Content Checksum verify that frame content has been regenrated correctly.
				105	The content checksum is the result
				106	of [xxh64() hash function](https://www.xxHash.com)
				107	digesting the original (decoded) data as input, and a seed of zero.
				108	Bits from 11 to 32 (included) are extracted to form a 22 bits checksum
				109	stored into the last block header.
				110	```
				111	contentChecksum = (XXH64(content, size, 0) >> 11) & (1<<22)-1);
				112	```
				113	Content checksum is only present when its associated flag
				114	is set in the frame descriptor.
				115	Its usage is optional.
				116
				117	__Frame Concatenation__
				118
				119	In some circumstances, it may be required to append multiple frames,
				120	for example in order to add new data to an existing compressed file
				121	without re-framing it.
				122
				123	In such case, each frame brings its own set of descriptor flags.
				124	Each frame is considered independent.
				125	The only relation between frames is their sequential order.
				126
				127	The ability to decode multiple concatenated frames
				128	within a single stream or file is left outside of this specification.
				129	As an example, the reference `zstd` command line utility is able
				130	to decode all concatenated frames in their sequential order,
				131	delivering the final decompressed result as if it was a single content.
				132
				133
				134	Frame Header
				135	-------------
				136
				137	\| FHD \| (WD) \| (Content Size) \| (dictID) \|
				138	\| ------- \| --------- \|:--------------:\| --------- \|
				139	\| 1 byte \| 0-1 byte \| 0 - 8 bytes \| 0-4 bytes \|
				140
				141	Frame header has a variable size, which uses a minimum of 2 bytes,
				142	and up to 14 bytes depending on optional parameters.
				143
				144	__FHD byte__ (Frame Header Descriptor)
				145
				146	The first Header's byte is called the Frame Header Descriptor.
				147	It tells which other fields are present.
				148	Decoding this byte is enough to get the full size of the Frame Header.
				149
				150	\| BitNb \| 7-6 \| 5 \| 4 \| 3 \| 2 \| 1-0 \|
				151	\| ------- \| ------ \| ------- \| ------ \| -------- \| -------- \| -------- \|
				152	\|FieldName\| FCSize \| Segment \| Unused \| Reserved \| Checksum \| dictID \|
				153
				154	In this table, bit 7 is highest bit, while bit 0 is lowest.
				155
				156	__Frame Content Size flag__
				157
				158	This is a 2-bits flag (`= FHD >> 6`),
				159	specifying if decompressed data size is provided within the header.
				160
				161	\| Value \| 0 \| 1 \| 2 \| 3 \|
				162	\| ------- \| --- \| --- \| --- \| --- \|
				163	\|FieldSize\| 0-1 \| 2 \| 4 \| 8 \|
				164
				165	Value 0 has a double meaning :
				166	it either means `0` (size not provided) _if_ the `WD` byte is present,
				167	or it means `1` byte (size <= 255 bytes).
				168
				169	__Single Segment__
				170
				171	If this flag is set,
				172	data shall be regenerated within a single continuous memory segment.
				173	In which case, `WD` byte __is not present__,
				174	but `Frame Content Size` field necessarily is.
				175
				176	As a consequence, the decoder must allocate a memory segment
				177	of size `>= Frame Content Size`.
				178
				179	In order to preserve the decoder from unreasonable memory requirement,
				180	a decoder can refuse a compressed frame
				181	which requests a memory size beyond decoder's authorized range.
				182
				183	For broader compatibility, decoders are recommended to support
				184	memory sizes of 8 MB at least.
				185	However, this is merely a recommendation,
				186	and each decoder is free to support higher or lower limits,
				187	depending on local limitations.
				188
				189	__Unused bit__
				190
				191	The value of this bit is unimportant
				192	and not interpreted by a decoder compliant with this specification version.
				193	It may be used in a future revision,
				194	to signal a property which is not required to properly decode the frame.
				195
				196	__Reserved bit__
				197
				198	This bit is reserved for some future feature.
				199	Its value _must be zero_.
				200	A decoder compliant with this specification version must ensure it is not set.
				201	This bit may be used in a future revision,
				202	to signal a feature that must be interpreted in order to decode the frame.
				203
				204	__Content checksum flag__
				205
				206	If this flag is set, a content checksum will be present into the EndMark.
				207	The checksum is a 22 bits value extracted from the XXH64() of data.
				208	See __Content Checksum__ .
				209
				210	__Dictionary ID flag__
				211
				212	This is a 2-bits flag (`= FHD & 3`),
				213	telling if a dictionary ID is provided within the header
				214
				215	\| Value \| 0 \| 1 \| 2 \| 3 \|
				216	\| ------- \| --- \| --- \| --- \| --- \|
				217	\|FieldSize\| 0 \| 1 \| 2 \| 4 \|
				218
				219	__WD byte__ (Window Descriptor)
				220
				221	Provides guarantees on maximum back-reference distance
				222	that will be present within compressed data.
				223	This information is useful for decoders to allocate enough memory.
				224
				225	\| BitNb \| 7-3 \| 0-2 \|
				226	\| --------- \| -------- \| -------- \|
				227	\| FieldName \| Exponent \| Mantissa \|
				228
				229	Maximum distance is given by the following formulae :
				230	```
				231	windowLog = 10 + Exponent;
				232	windowBase = 1 << windowLog;
				233	windowAdd = (windowBase / 8) * Mantissa;
				234	windowSize = windowBase + windowAdd;
				235	```
				236	The minimum window size is 1 KB.
				237	The maximum size is (15*(2^38))-1 bytes, which is almost 1.875 TB.
				238
				239	To properly decode compressed data,
				240	a decoder will need to allocate a buffer of at least `windowSize` bytes.
				241
				242	Note that `WD` byte is optional. It's not present in `single segment` mode.
				243	In which case, the maximum back-reference distance is the content size itself,
				244	which can be any value from 1 to 2^64-1 bytes (16 EB).
				245
				246	In order to preserve decoder from unreasonable memory requirements,
				247	a decoder can refuse a compressed frame
				248	which requests a memory size beyond decoder's authorized range.
				249
				250	For better interoperability,
				251	decoders are recommended to be compatible with window sizes of 8 MB.
				252	Encoders are recommended to not request more than 8 MB.
				253	It's merely a recommendation though,
				254	decoders are free to support larger or lower limits,
				255	depending on local limitations.
				256
				257	__Frame Content Size__
				258
				259	This is the original (uncompressed) size.
				260	This information is optional, and only present if associated flag is set.
				261	Content size is provided using 1, 2, 4 or 8 Bytes.
				262	Format is Little endian.
				263
				264	\| Field Size \| Range \|
				265	\| ---------- \| ---------- \|
				266	\| 0 \| 0 \|
				267	\| 1 \| 0 - 255 \|
				268	\| 2 \| 256 - 65791\|
				269	\| 4 \| 0 - 2^32-1 \|
				270	\| 8 \| 0 - 2^64-1 \|
				271
				272	When field size is 1, 4 or 8 bytes, the value is read directly.
				273	When field size is 2, _an offset of 256 is added_.
				274	It's allowed to represent a small size (ex: `18`) using the 8-bytes variant.
				275	A size of `0` means `content size is unknown`.
				276	In which case, the `WD` byte will necessarily be present,
				277	and becomes the only hint to determine memory allocation.
				278
				279	In order to preserve decoder from unreasonable memory requirement,
				280	a decoder can refuse a compressed frame
				281	which requests a memory size beyond decoder's authorized range.
				282
				283	__Dictionary ID__
				284
				285	This is a variable size field, which contains a single ID.
				286	It checks if the correct dictionary is used for decoding.
				287	Note that this field is optional. If it's not present,
				288	it's up to the caller to make sure it uses the correct dictionary.
				289
				290	Field size depends on __Dictionary ID flag__.
				291	1 byte can represent an ID 0-255.
				292	2 bytes can represent an ID 0-65535.
				293	4 bytes can represent an ID 0-(2^32-1).
				294
				295	It's allowed to represent a small ID (for example `13`)
				296	with a large 4-bytes dictionary ID, losing some efficiency in the process.
				297
				298
				299	Data Blocks
				300	-----------
				301
				302	\| B. Header \| data \|
				303	\|:---------:\| ------ \|
				304	\| 3 bytes \| \|
				305
				306
				307	__Block Header__
				308
				309	This field uses 3-bytes, format is __big-endian__.
				310
				311	The 2 highest bits represent the `block type`,
				312	while the remaining 22 bits represent the (compressed) block size.
				313
				314	There are 4 block types :
				315
				316	\| Value \| 0 \| 1 \| 2 \| 3 \|
				317	\| ---------- \| ---------- \| --- \| --- \| ------- \|
				318	\| Block Type \| Compressed \| Raw \| RLE \| EndMark \|
				319
				320	- Compressed : this is a Zstandard compressed block,
				321	detailed in a later part of this specification.
				322	"block size" is the compressed size.
				323	Decompressed size is unknown,
				324	but its maximum possible value is guaranteed (see below)
				325	- Raw : this is an uncompressed block.
				326	"block size" is the number of bytes to read and copy.
				327	- RLE : this is a single byte, repeated N times.
				328	In which case, "block size" is the size to regenerate,
				329	while the "compressed" block is just 1 byte (the byte to repeat).
				330	- EndMark : this is not a block. Signal the end of the frame.
				331	The rest of the field may be optionally filled by a checksum
				332	(see frame checksum).
				333
				334	Block sizes must respect a few rules :
				335	- In compressed mode, compressed size if always strictly `< contentSize`.
				336	- Block decompressed size is necessarily <= maximum back-reference distance .
				337	- Block decompressed size is necessarily <= 128 KB
				338
				339
				340	__Data__
				341
				342	Where the actual data to decode stands.
				343	It might be compressed or not, depending on previous field indications.
				344	A data block is not necessarily "full" :
				345	since an arbitrary “flush” may happen anytime,
				346	block content can be any size, up to Block Maximum Size.
				347	Block Maximum Size is the smallest of :
				348	- Max back-reference distance
				349	- 128 KB
				350
				351
				352	Skippable Frames
				353	----------------
				354
				355	\| Magic Number \| Frame Size \| User Data \|
				356	\|:------------:\|:----------:\| --------- \|
				357	\| 4 bytes \| 4 bytes \| \|
				358
				359	Skippable frames allow the insertion of user-defined data
				360	into a flow of concatenated frames.
				361	Its design is pretty straightforward,
				362	with the sole objective to allow the decoder to quickly skip
				363	over user-defined data and continue decoding.
				364
				365	Skippable frames defined in this specification are compatible with LZ4 ones.
				366
				367
				368	__Magic Number__ :
				369
				370	4 Bytes, Little endian format.
				371	Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
				372	All 16 values are valid to identify a skippable frame.
				373
				374	__Frame Size__ :
				375
				376	This is the size, in bytes, of the following User Data
				377	(without including the magic number nor the size field itself).
				378	4 Bytes, Little endian format, unsigned 32-bits.
				379	This means User Data can’t be bigger than (2^32-1) Bytes.
				380
				381	__User Data__ :
				382
				383	User Data can be anything. Data will just be skipped by the decoder.
				384
				385
				386	Compressed block format
				387	-----------------------
				388	This specification details the content of a _compressed block_.
				389	A compressed block has a size, which must be known in order to decode it.
				390	It also has a guaranteed maximum regenerated size,
				391	in order to properly allocate destination buffer.
				392	See "Frame format" for more details.
				393
				394	A compressed block consists of 2 sections :
				395	- Literals section
				396	- Sequences section
				397
				398	### Compressed Literals
				399
				400	Literals are compressed using order-0 huffman compression.
				401	During sequence phase, literals will be entangled with match copy operations.
				402	All literals are regrouped in the first part of the block.
				403	They can be decoded first, and then copied during sequence operations,
				404	or they can be decoded on the flow, as needed by sequences.
				405
				406	\| Header \| (Tree Description) \| Stream1 \| (Stream2) \| (Stream3) \| (Stream4) \|
				407	\| ------ \| ------------------ \| ------- \| --------- \| --------- \| --------- \|
				408
				409	Literals can be compressed, or uncompressed.
				410	When compressed, an optional tree description can be present,
				411	followed by 1 or 4 streams.
				412
				413	#### Block Literal Header
				414
				415	Header is in charge of describing precisely how literals are packed.
				416	It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
				417	using big-endian convention.
				418
				419	\| BlockType \| sizes format \| (compressed size) \| regenerated size \|
				420	\| --------- \| ------------ \| ----------------- \| ---------------- \|
				421	\| 2 bits \| 1 - 2 bits \| 0 - 18 bits \| 5 - 20 bits \|
				422
				423	__Block Type__ :
				424
				425	This is a 2-bits field, describing 4 different block types :
				426
				427	\| Value \| 0 \| 1 \| 2 \| 3 \|
				428	\| ---------- \| ---------- \| ------ \| --- \| ------- \|
				429	\| Block Type \| Compressed \| Repeat \| Raw \| RLE \|
				430
				431	- Compressed : This is a standard huffman-compressed block,
				432	starting with a huffman tree description.
				433	See details below.
				434	- Repeat Stats : This is a huffman-compressed block,
				435	using huffman tree from previous huffman-compressed block.
				436	Huffman tree description will be skipped.
				437	Compressed stream is equivalent to "compressed" block type.
				438	- Raw : Literals are stored uncompressed.
				439	- RLE : Literals consist of a single byte value repeated N times.
				440
				441	__Sizes format__ :
				442
				443	Sizes format are divided into 2 families :
				444
				445	- For compressed block, it requires to decode both the compressed size
				446	and the decompressed size. It will also decode the number of streams.
				447	- For Raw or RLE blocks, it's enough to decode the size to regenerate.
				448
				449	For values spanning several bytes, convention is Big-endian.
				450
				451	__Sizes format for Raw or RLE block__ :
				452
				453	- Value : 0x : Regenerated size uses 5 bits (0-31).
				454	Total literal header size is 1 byte.
				455	`size = h[0] & 31;`
				456	- Value : 10 : Regenerated size uses 12 bits (0-4095).
				457	Total literal header size is 2 bytes.
				458	`size = ((h[0] & 15) << 8) + h[1];`
				459	- Value : 11 : Regenerated size uses 20 bits (0-1048575).
				460	Total literal header size is 2 bytes.
				461	`size = ((h[0] & 15) << 16) + (h[1]<<8) + h[2];`
				462
				463	Note : it's allowed to represent a short value (ex : `13`)
				464	using a long format, accepting the reduced compacity.
				465
				466	__Sizes format for Compressed Block__ :
				467
				468	Note : also applicable to "repeat-stats" blocks.
				469	- Value : 00 : 4 streams
				470	Compressed and regenerated sizes use 10 bits (0-1023)
				471	Total literal header size is 3 bytes
				472	- Value : 01 : _Single stream_
				473	Compressed and regenerated sizes use 10 bits (0-1023)
				474	Total literal header size is 3 bytes
				475	- Value : 10 : 4 streams
				476	Compressed and regenerated sizes use 14 bits (0-16383)
				477	Total literal header size is 4 bytes
				478	- Value : 10 : 4 streams
				479	Compressed and regenerated sizes use 18 bits (0-262143)
				480	Total literal header size is 5 bytes
				481
				482
				483
				484	Version changes
				485	---------------
				486	0.1 : initial release