Yann Collet	5cc1882	2016-07-03 19:03:13 +0200	[diff] [blame]	1	Zstandard Compression Format
				2	============================
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	3
				4	### Notices
				5
				6	Copyright (c) 2016 Yann Collet
				7
				8	Permission is granted to copy and distribute this document
				9	for any purpose and without charge,
				10	including translations into other languages
				11	and incorporation into compilations,
				12	provided that the copyright notice and this notice are preserved,
				13	and that any substantive changes or deletions from the original
				14	are clearly marked.
				15	Distribution of this document is unlimited.
				16
				17	### Version
				18
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	19	0.0.2 (July 2016 - Work in progress - unfinished)
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	20
				21
				22	Introduction
				23	------------
				24
				25	The purpose of this document is to define a lossless compressed data format,
				26	that is independent of CPU type, operating system,
				27	file system and character set, suitable for
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	28	file compression, pipe and streaming compression,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	29	using the [Zstandard algorithm](http://www.zstandard.org).
				30
				31	The data can be produced or consumed,
				32	even for an arbitrarily long sequentially presented input data stream,
				33	using only an a priori bounded amount of intermediate storage,
				34	and hence can be used in data communications.
				35	The format uses the Zstandard compression method,
				36	and optional [xxHash-64 checksum method](http://www.xxhash.org),
				37	for detection of data corruption.
				38
				39	The data format defined by this specification
				40	does not attempt to allow random access to compressed data.
				41
				42	This specification is intended for use by implementers of software
				43	to compress data into Zstandard format and/or decompress data from Zstandard format.
				44	The text of the specification assumes a basic background in programming
				45	at the level of bits and other primitive data representations.
				46
				47	Unless otherwise indicated below,
				48	a compliant compressor must produce data sets
				49	that conform to the specifications presented here.
				50	It doesn’t need to support all options though.
				51
				52	A compliant decompressor must be able to decompress
				53	at least one working set of parameters
				54	that conforms to the specifications presented here.
				55	It may also ignore informative fields, such as checksum.
				56	Whenever it does not support a parameter defined in the compressed stream,
				57	it must produce a non-ambiguous error code and associated error message
				58	explaining which parameter is unsupported.
				59
				60
				61	Definitions
				62	-----------
				63	A content compressed by Zstandard is transformed into a Zstandard __frame__.
				64	Multiple frames can be appended into a single file or stream.
				65	A frame is totally independent, has a defined beginning and end,
				66	and a set of parameters which tells the decoder how to decompress it.
				67
				68	A frame encapsulates one or multiple __blocks__.
				69	Each block can be compressed or not,
				70	and has a guaranteed maximum content size, which depends on frame parameters.
				71	Unlike frames, each block depends on previous blocks for proper decoding.
				72	However, each block can be decompressed without waiting for its successor,
				73	allowing streaming operations.
				74
				75
				76	General Structure of Zstandard Frame format
				77	-------------------------------------------
				78
| MagicNb | Frame Header  | Block | (More blocks) | EndMark |
|:-------:|:-------------:| ----- | ------------- | ------- |
| 4 bytes |  2-14 bytes   |       |               | 3 bytes |
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	82
				83	__Magic Number__
				84
				85	4 Bytes, Little endian format.
				86	Value : 0xFD2FB527
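
As an illustration, the magic number check can be sketched as follows. The helper names `read_le32` and `is_zstd_frame` are hypothetical and not part of this specification; only the magic value comes from the text above.

```c
#include <stdint.h>

/* Magic number value as given in this specification. */
#define ZSTD_MAGIC 0xFD2FB527u

/* Hypothetical helper: reads a 32-bit little-endian value,
 * independently of the host byte order. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: returns 1 if the 4 bytes at `src`
 * hold the Zstandard magic number, 0 otherwise. */
int is_zstd_frame(const uint8_t *src)
{
    return read_le32(src) == ZSTD_MAGIC;
}
```

Since the field is little-endian, a frame is expected to start with the bytes `27 B5 2F FD`, in that order.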
				87
				88	__Frame Header__
				89
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	90	2 to 14 Bytes, detailed in [next part](#frame-header).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	91
				92	__Data Blocks__
				93
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	94	Detailed in [next chapter](#data-blocks).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	95	That’s where compressed data is stored.
				96
				97	__EndMark__
				98
				99	The flow of blocks ends when the last block header brings an _end signal_ .
				100	This last block header may optionally host a __Content Checksum__ .
				101
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	102	##### __Content Checksum__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	103
The content checksum verifies that the frame content has been regenerated correctly.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	105	The content checksum is the result
				106	of [xxh64() hash function](https://www.xxHash.com)
				107	digesting the original (decoded) data as input, and a seed of zero.
Bits 11 to 32 (inclusive) are extracted to form a 22-bit checksum
stored in the EndMark body.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	110	```
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	111	mask22bits = (1<<22)-1;
				112	contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits;
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	113	```
				114	Content checksum is only present when its associated flag
				115	is set in the frame descriptor.
				116	Its usage is optional.
				117
				118	__Frame Concatenation__
				119
				120	In some circumstances, it may be required to append multiple frames,
				121	for example in order to add new data to an existing compressed file
				122	without re-framing it.
				123
				124	In such case, each frame brings its own set of descriptor flags.
				125	Each frame is considered independent.
				126	The only relation between frames is their sequential order.
				127
				128	The ability to decode multiple concatenated frames
				129	within a single stream or file is left outside of this specification.
				130	As an example, the reference `zstd` command line utility is able
				131	to decode all concatenated frames in their sequential order,
				132	delivering the final decompressed result as if it was a single content.
				133
				134
				135	Frame Header
				136	-------------
				137
| FHD     | (WD)      | (dictID)  | (Content Size) |
| ------- | --------- | --------- |:--------------:|
| 1 byte  | 0-1 byte  | 0-4 bytes |  0 - 8 bytes   |
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	141
				142	Frame header has a variable size, which uses a minimum of 2 bytes,
				143	and up to 14 bytes depending on optional parameters.
				144
				145	__FHD byte__ (Frame Header Descriptor)
				146
				147	The first Header's byte is called the Frame Header Descriptor.
				148	It tells which other fields are present.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	149	Decoding this byte is enough to tell the size of Frame Header.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	150
| BitNb   |  7-6   |    5    |   4    |    3     |    2     |  1-0   |
| ------- | ------ | ------- | ------ | -------- | -------- | ------ |
|FieldName| FCSize | Segment | Unused | Reserved | Checksum | dictID |
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	154
				155	In this table, bit 7 is highest bit, while bit 0 is lowest.
				156
				157	__Frame Content Size flag__
				158
				159	This is a 2-bits flag (`= FHD >> 6`),
				160	specifying if decompressed data size is provided within the header.
				161
|  Value  |  0  |  1  |  2  |  3  |
| ------- | --- | --- | --- | --- |
|FieldSize| 0-1 |  2  |  4  |  8  |
				165
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	166	Value 0 meaning depends on _single segment_ mode :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	167	it either means `0` (size not provided) _if_ the `WD` byte is present,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	168	or `1` (frame content size <= 255 bytes) otherwise.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	169
				170	__Single Segment__
				171
				172	If this flag is set,
				173	data shall be regenerated within a single continuous memory segment.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	174
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	175	In which case, `WD` byte __is not present__,
				176	but `Frame Content Size` field necessarily is.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	177	As a consequence, the decoder must allocate a memory segment
				178	of size `>= Frame Content Size`.
				179
To protect the decoder from unreasonable memory requirements,
a decoder can reject a compressed frame
which requests a memory size beyond the decoder's authorized range.
				183
				184	For broader compatibility, decoders are recommended to support
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	185	memory sizes of at least 8 MB.
This is only a recommendation:
each decoder is free to support higher or lower limits,
depending on local limitations.
				189
				190	__Unused bit__
				191
				192	The value of this bit is unimportant
				193	and not interpreted by a decoder compliant with this specification version.
				194	It may be used in a future revision,
				195	to signal a property which is not required to properly decode the frame.
				196
				197	__Reserved bit__
				198
				199	This bit is reserved for some future feature.
				200	Its value _must be zero_.
				201	A decoder compliant with this specification version must ensure it is not set.
				202	This bit may be used in a future revision,
				203	to signal a feature that must be interpreted in order to decode the frame.
				204
				205	__Content checksum flag__
				206
If this flag is set, a content checksum will be present in the EndMark.
The checksum is a 22-bit value extracted from the XXH64() of data,
and stored in the EndMark. See [__Content Checksum__](#content-checksum).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	210
				211	__Dictionary ID flag__
				212
				213	This is a 2-bits flag (`= FHD & 3`),
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	214	telling if a dictionary ID is provided within the header.
				215	It also specifies the size of this field.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	216
|  Value  |  0  |  1  |  2  |  3  |
| ------- | --- | --- | --- | --- |
|FieldSize|  0  |  1  |  2  |  4  |
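
The claim that the FHD byte alone determines the Frame Header size can be sketched as below. This is a minimal illustration built from the flag tables above; the function names are hypothetical, and it assumes a `WD` byte is present whenever the single segment flag is not set.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the total Frame Header size in bytes
 * (FHD byte included), derived from the FHD byte alone. */
size_t frame_header_size(uint8_t fhd)
{
    static const size_t dictIdSizes[4] = { 0, 1, 2, 4 };
    static const size_t fcsSizes[4]    = { 0, 2, 4, 8 };
    unsigned fcsFlag       = fhd >> 6;        /* bits 7-6 */
    unsigned singleSegment = (fhd >> 5) & 1;  /* bit 5 */
    unsigned dictIdFlag    = fhd & 3;         /* bits 1-0 */
    size_t fcsSize = fcsSizes[fcsFlag];

    /* FCSize value 0 means a 1-byte field in single segment mode. */
    if (fcsFlag == 0 && singleSegment) fcsSize = 1;

    /* WD byte is only present when single segment mode is off. */
    return 1 + (singleSegment ? 0 : 1) + dictIdSizes[dictIdFlag] + fcsSize;
}

/* Hypothetical helper: the Reserved bit (bit 3) must be zero. */
int fhd_reserved_bit_ok(uint8_t fhd)
{
    return (fhd & 0x08) == 0;
}
```

With this sketch, the smallest result is 2 bytes and the largest is 14 bytes, which matches the stated range of the Frame Header.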
				220
				221	__WD byte__ (Window Descriptor)
				222
				223	Provides guarantees on maximum back-reference distance
				224	that will be present within compressed data.
				225	This information is useful for decoders to allocate enough memory.
				226
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	227	`WD` byte is optional. It's not present in `single segment` mode.
				228	In which case, the maximum back-reference distance is the content size itself,
				229	which can be any value from 1 to 2^64-1 bytes (16 EB).
				230
| BitNb     |   7-3    |   0-2    |
| --------- | -------- | -------- |
| FieldName | Exponent | Mantissa |
				234
				235	Maximum distance is given by the following formulae :
				236	```
				237	windowLog = 10 + Exponent;
				238	windowBase = 1 << windowLog;
				239	windowAdd = (windowBase / 8) * Mantissa;
				240	windowSize = windowBase + windowAdd;
				241	```
				242	The minimum window size is 1 KB.
The maximum size is `15*(1<<38)` bytes, which is 3.75 TB.
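
The formulae above translate directly into a small function. This is a sketch only; the name `window_size` is hypothetical, and the bit split follows the `WD` table (Exponent in bits 7-3, Mantissa in bits 2-0).

```c
#include <stdint.h>

/* Hypothetical helper: computes windowSize from the WD byte. */
uint64_t window_size(uint8_t wd)
{
    unsigned exponent   = wd >> 3;  /* bits 7-3 */
    unsigned mantissa   = wd & 7;   /* bits 2-0 */
    unsigned windowLog  = 10 + exponent;
    uint64_t windowBase = (uint64_t)1 << windowLog;
    uint64_t windowAdd  = (windowBase / 8) * mantissa;
    return windowBase + windowAdd;
}
```

For example, `window_size(0x00)` gives 1024 bytes, and `window_size(0xFF)` gives `15*(1<<38)` bytes. A 64-bit type is needed, since the largest values do not fit in 32 bits.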
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	244
				245	To properly decode compressed data,
				246	a decoder will need to allocate a buffer of at least `windowSize` bytes.
				247
To protect the decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond the decoder's authorized range.
				251
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	252	For improved interoperability,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	253	decoders are recommended to be compatible with window sizes of 8 MB.
				254	Encoders are recommended to not request more than 8 MB.
This is only a recommendation:
decoders are free to support higher or lower limits,
depending on local limitations.
				258
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	259	__Dictionary ID__
				260
				261	This is a variable size field, which contains an ID.
It is used to check that the correct dictionary is used for decoding.
				263	Note that this field is optional. If it's not present,
				264	it's up to the caller to make sure it uses the correct dictionary.
				265
				266	Field size depends on __Dictionary ID flag__.
				267	1 byte can represent an ID 0-255.
				268	2 bytes can represent an ID 0-65535.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	269	4 bytes can represent an ID 0-4294967295.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	270
It's allowed to represent a small ID (for example `13`)
with a large 4-byte dictionary ID, at the cost of some compactness.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	273
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	274	__Frame Content Size__
				275
				276	This is the original (uncompressed) size.
				277	This information is optional, and only present if associated flag is set.
				278	Content size is provided using 1, 2, 4 or 8 Bytes.
				279	Format is Little endian.
				280
| Field Size |    Range    |
| ---------- | ----------- |
|     0      |      0      |
|     1      |   0 - 255   |
|     2      | 256 - 65791 |
|     4      | 0 - 2^32-1  |
|     8      | 0 - 2^64-1  |
				288
				289	When field size is 1, 4 or 8 bytes, the value is read directly.
				290	When field size is 2, _an offset of 256 is added_.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	291	It's allowed to represent a small size (ex: `18`) using any compatible variant.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	292	A size of `0` means `content size is unknown`.
				293	In which case, the `WD` byte will necessarily be present,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	294	and becomes the only hint to guide memory allocation.
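
Reading this field can be sketched as below, under the rules stated above (little-endian, with an offset of 256 for the 2-byte variant). The function name is hypothetical, and `fieldSize` is assumed to have been derived beforehand from the Frame Content Size flag.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes the Frame Content Size field.
 * `fieldSize` must be 0, 1, 2, 4 or 8.
 * A result of 0 means the content size is unknown. */
uint64_t read_frame_content_size(const uint8_t *p, size_t fieldSize)
{
    uint64_t value = 0;
    size_t i;

    /* Little-endian: the first byte is the least significant one. */
    for (i = 0; i < fieldSize; i++)
        value |= (uint64_t)p[i] << (8 * i);

    /* The 2-byte variant stores (size - 256). */
    if (fieldSize == 2) value += 256;
    return value;
}
```

The offset explains the `256 - 65791` range of the 2-byte variant: the stored value `0` maps to 256, and `65535` maps to 65791.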
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	295
To protect the decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond the decoder's authorized range.
				299
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	300
				301	Data Blocks
				302	-----------
				303
| B. Header |  data  |
|:---------:| ------ |
|  3 bytes  |        |
				307
				308
				309	__Block Header__
				310
				311	This field uses 3-bytes, format is __big-endian__.
				312
				313	The 2 highest bits represent the `block type`,
				314	while the remaining 22 bits represent the (compressed) block size.
				315
				316	There are 4 block types :
				317
| Value      |     0      |  1  |  2  |    3    |
| ---------- | ---------- | --- | --- | ------- |
| Block Type | Compressed | Raw | RLE | EndMark |
				321
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	322	- Compressed : this is a [Zstandard compressed block](#compressed-block-format),
				323	detailed in another section of this specification.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	324	"block size" is the compressed size.
				325	Decompressed size is unknown,
				326	but its maximum possible value is guaranteed (see below)
				327	- Raw : this is an uncompressed block.
				328	"block size" is the number of bytes to read and copy.
				329	- RLE : this is a single byte, repeated N times.
				330	In which case, "block size" is the size to regenerate,
				331	while the "compressed" block is just 1 byte (the byte to repeat).
- EndMark : this is not a block. It signals the end of the frame.
				333	The rest of the field may be optionally filled by a checksum
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	334	(see [Content Checksum](#content-checksum)).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	335
				336	Block sizes must respect a few rules :
- In compressed mode, compressed size is always strictly `< decompressed size`.
				338	- Block decompressed size is always <= maximum back-reference distance .
				339	- Block decompressed size is always <= 128 KB
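
A block header parser can be sketched as follows. Type and function names are hypothetical. The sketch assumes the first byte carries the block type in its 2 highest bits, as implied by the big-endian layout described above, and that an EndMark is followed by no payload.

```c
#include <stdint.h>

typedef enum {
    BT_COMPRESSED = 0,
    BT_RAW        = 1,
    BT_RLE        = 2,
    BT_ENDMARK    = 3
} BlockType;

typedef struct {
    BlockType type;
    uint32_t  size;  /* 22-bit "block size" field (checksum bits for EndMark) */
} BlockHeader;

/* Hypothetical helper: decodes the 3-byte big-endian block header. */
BlockHeader parse_block_header(const uint8_t *h)
{
    BlockHeader bh;
    bh.type = (BlockType)(h[0] >> 6);
    bh.size = ((uint32_t)(h[0] & 0x3F) << 16)
            | ((uint32_t)h[1] << 8)
            |  (uint32_t)h[2];
    return bh;
}

/* Hypothetical helper: number of bytes stored after the header. */
uint32_t block_payload_size(BlockHeader bh)
{
    switch (bh.type) {
    case BT_RLE:     return 1;  /* only the byte to repeat is stored */
    case BT_ENDMARK: return 0;  /* not a block */
    default:         return bh.size;
    }
}
```

Note the RLE case: the `size` field is the size to regenerate, while only a single byte follows the header.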
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	340
				341
				342	__Data__
				343
				344	Where the actual data to decode stands.
				345	It might be compressed or not, depending on previous field indications.
				346	A data block is not necessarily "full" :
				347	since an arbitrary “flush” may happen anytime,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	348	block decompressed content can be any size,
				349	up to Block Maximum Decompressed Size, which is the smallest of :
				350	- Maximum back-reference distance
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	351	- 128 KB
				352
				353
				354	Skippable Frames
				355	----------------
				356
| Magic Number | Frame Size | User Data |
|:------------:|:----------:| --------- |
|   4 bytes    |  4 bytes   |           |
				360
				361	Skippable frames allow the insertion of user-defined data
				362	into a flow of concatenated frames.
Their design is deliberately simple,
				364	with the sole objective to allow the decoder to quickly skip
				365	over user-defined data and continue decoding.
				366
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	367	Skippable frames defined in this specification are compatible with [LZ4] ones.
				368
				369	[LZ4]:http://www.lz4.org
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	370
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	371	__Magic Number__ :
				372
				373	4 Bytes, Little endian format.
				374	Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
				375	All 16 values are valid to identify a skippable frame.
				376
				377	__Frame Size__ :
				378
				379	This is the size, in bytes, of the following User Data
				380	(without including the magic number nor the size field itself).
				381	4 Bytes, Little endian format, unsigned 32-bits.
				382	This means User Data can’t be bigger than (2^32-1) Bytes.
				383
				384	__User Data__ :
				385
				386	User Data can be anything. Data will just be skipped by the decoder.
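
Skipping such a frame can be sketched as below. The helper names are hypothetical; the sketch assumes at least 8 readable bytes at `src` and returns 0 when the magic number does not identify a skippable frame.

```c
#include <stdint.h>

/* Hypothetical helper: reads a 32-bit little-endian value. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Any value from 0x184D2A50 to 0x184D2A5F identifies a skippable frame. */
int is_skippable_magic(uint32_t magic)
{
    return (magic & 0xFFFFFFF0u) == 0x184D2A50u;
}

/* Hypothetical helper: returns the number of bytes to skip
 * (magic number + size field + user data),
 * or 0 if `src` does not start a skippable frame. */
uint64_t skippable_frame_total_size(const uint8_t *src)
{
    if (!is_skippable_magic(read_le32(src))) return 0;
    return 8 + (uint64_t)read_le32(src + 4);
}
```

The result is returned as a 64-bit value because the 8 header bytes plus a maximal `Frame Size` would overflow 32 bits.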
				387
				388
				389	Compressed block format
				390	-----------------------
				391	This specification details the content of a _compressed block_.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	392	A compressed block has a size, which must be known.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	393	It also has a guaranteed maximum regenerated size,
				394	in order to properly allocate destination buffer.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	395	See [Data Blocks](#data-blocks) for more details.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	396
				397	A compressed block consists of 2 sections :
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	398	- [Literals section](#literals-section)
				399	- [Sequences section](#sequences-section)
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	400
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	401	### Prerequisites
				402	To decode a compressed block, the following elements are necessary :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	403	- Previous decoded blocks, up to a distance of `windowSize`,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	404	or all previous blocks in "single segment" mode.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	405	- List of "recent offsets" from previous compressed block.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	406	- Decoding tables of previous compressed block for each symbol type
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	407	(literals, litLength, matchLength, offset).
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	408
				409
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	410	### Literals section
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	411
Literals are compressed using Huffman compression.
During the sequence phase, literals are interleaved with match copy operations.
All literals are grouped together in the first part of the block.
They can be decoded first, and then copied during sequence operations,
or they can be decoded on the fly, as needed by sequence commands.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	417
| Header | (Tree Description) | Stream1 | (Stream2) | (Stream3) | (Stream4) |
| ------ | ------------------ | ------- | --------- | --------- | --------- |
				420
				421	Literals can be compressed, or uncompressed.
				422	When compressed, an optional tree description can be present,
				423	followed by 1 or 4 streams.
				424
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	425	#### Literals section header
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	426
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	427	Header is in charge of describing how literals are packed.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	428	It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
				429	using big-endian convention.
				430
| BlockType | sizes format | (compressed size) | regenerated size |
| --------- | ------------ | ----------------- | ---------------- |
|  2 bits   |  1 - 2 bits  |    0 - 18 bits    |   5 - 20 bits    |
				434
				435	__Block Type__ :
				436
				437	This is a 2-bits field, describing 4 different block types :
				438
| Value      |     0      |   1    |  2  |   3   |
| ---------- | ---------- | ------ | --- | ----- |
| Block Type | Compressed | Repeat | Raw |  RLE  |
				442
				443	- Compressed : This is a standard huffman-compressed block,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	444	starting with a huffman tree description.
				445	See details below.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	446	- Repeat Stats : This is a huffman-compressed block,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	447	using huffman tree _from previous huffman-compressed literals block_.
				448	Huffman tree description will be skipped.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	449	- Raw : Literals are stored uncompressed.
				450	- RLE : Literals consist of a single byte value repeated N times.
				451
				452	__Sizes format__ :
				453
Sizes formats are divided into 2 families :

- For compressed blocks, both the compressed size and the decompressed size
must be decoded, along with the number of streams.
- For Raw or RLE blocks, it's enough to decode the size to regenerate.
				459
				460	For values spanning several bytes, convention is Big-endian.
				461
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	462	__Sizes format for Raw or RLE literals block__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	463
				464	- Value : 0x : Regenerated size uses 5 bits (0-31).
				465	Total literal header size is 1 byte.
				466	`size = h[0] & 31;`
				467	- Value : 10 : Regenerated size uses 12 bits (0-4095).
				468	Total literal header size is 2 bytes.
				469	`size = ((h[0] & 15) << 8) + h[1];`
				470	- Value : 11 : Regenerated size uses 20 bits (0-1048575).
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	471	Total literal header size is 3 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	472	`size = ((h[0] & 15) << 16) + (h[1]<<8) + h[2];`
				473
				474	Note : it's allowed to represent a short value (ex : `13`)
using a long format, at the cost of reduced compactness.
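
The three variants above can be combined into one decoding routine. This is a sketch with hypothetical names. It assumes the fields are packed from the most significant bit of `h[0]` in the order of the header table (2 bits of block type, then the sizes format), which is consistent with the masks used in the expressions above.

```c
#include <stdint.h>

typedef struct {
    uint32_t regeneratedSize;
    unsigned headerSize;  /* total literal header size, in bytes */
} RawLiteralsHeader;

/* Hypothetical helper: decodes the header of a Raw or RLE literals block.
 * `h` points to the first byte of the literals section header. */
RawLiteralsHeader parse_raw_literals_header(const uint8_t *h)
{
    RawLiteralsHeader r;
    if ((h[0] & 0x20) == 0) {         /* format 0x : 5-bit size */
        r.headerSize = 1;
        r.regeneratedSize = h[0] & 31;
    } else if ((h[0] & 0x10) == 0) {  /* format 10 : 12-bit size */
        r.headerSize = 2;
        r.regeneratedSize = ((uint32_t)(h[0] & 15) << 8) + h[1];
    } else {                          /* format 11 : 20-bit size */
        r.headerSize = 3;
        r.regeneratedSize = ((uint32_t)(h[0] & 15) << 16)
                          + ((uint32_t)h[1] << 8) + h[2];
    }
    return r;
}
```

The literals data then starts `headerSize` bytes after the beginning of the header.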
				476
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	477	__Sizes format for Compressed literals block__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	478
				479	Note : also applicable to "repeat-stats" blocks.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	480	- Value : 00 : 4 streams.
				481	Compressed and regenerated sizes use 10 bits (0-1023).
				482	Total literal header size is 3 bytes.
				483	- Value : 01 : _Single stream_.
				484	Compressed and regenerated sizes use 10 bits (0-1023).
				485	Total literal header size is 3 bytes.
				486	- Value : 10 : 4 streams.
				487	Compressed and regenerated sizes use 14 bits (0-16383).
				488	Total literal header size is 4 bytes.
- Value : 11 : 4 streams.
				490	Compressed and regenerated sizes use 18 bits (0-262143).
				491	Total literal header size is 5 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	492
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	493	Compressed and regenerated size fields follow big endian convention.
				494
				495	#### Huffman Tree description
				496
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	497	This section is only present when literals block type is `Compressed` (`0`).
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	498
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	499	Prefix coding represents symbols from an a priori known alphabet
				500	by bit sequences (codes), one code for each symbol,
				501	in a manner such that different symbols may be represented
				502	by bit sequences of different lengths,
				503	but a parser can always parse an encoded string
				504	unambiguously symbol-by-symbol.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	505
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	506	Given an alphabet with known symbol frequencies,
				507	the Huffman algorithm allows the construction of an optimal prefix code
				508	using the fewest bits of any possible prefix codes for that alphabet.
				509	Such a code is called a Huffman code.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	510
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	511	Prefix code must not exceed a maximum code length.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	512	More bits improve accuracy but cost more header size,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	513	and require more memory for decoding operations.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	514
				515	The current format limits the maximum depth to 15 bits.
				516	The reference decoder goes further, by limiting it to 11 bits.
				517	It is recommended to remain compatible with reference decoder.
				518
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	519
				520	##### Representation
				521
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	522	All literal values from zero (included) to last present one (excluded)
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	523	are represented by `weight` values, from 0 to `maxBits`.
Transformation from `weight` to `nbBits` follows this formula :
				525	`nbBits = weight ? maxBits + 1 - weight : 0;` .
				526	The last symbol's weight is deduced from previously decoded ones,
				527	by completing to the nearest power of 2.
				528	This power of 2 gives `maxBits`, the depth of the current tree.
				529
				530	__Example__ :
				531	Let's presume the following huffman tree must be described :
				532
| literal |  0  |  1  |  2  |  3  |  4  |  5  |
| ------- | --- | --- | --- | --- | --- | --- |
| nbBits  |  1  |  2  |  3  |  0  |  4  |  4  |
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	536
The tree depth is 4, since its longest code uses 4 bits.
				538	Value `5` will not be listed, nor will values above `5`.
				539	Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
				540	Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
It gives the following series of weights :
				542
| weights |  4  |  3  |  2  |  0  |  1  |
| ------- | --- | --- | --- | --- | --- |
| literal |  0  |  1  |  2  |  3  |  4  |
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	546
				547	The decoder will do the inverse operation :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	548	having collected weights of literals from `0` to `4`,
				549	it knows the last literal, `5`, is present with a non-zero weight.
The weight of `5` can be deduced by completing to the nearest power of 2.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	551	Sum of 2^(weight-1) (excluding 0) is :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	552	`8 + 4 + 2 + 0 + 1 = 15`
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	553	Nearest power of 2 is 16.
				554	Therefore, `maxBits = 4` and `weight[5] = 1`.
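
The deduction performed in this example can be sketched in code. Names are hypothetical. The sketch assumes that "nearest power of 2" means the smallest power of 2 strictly greater than the sum (so that the last symbol gets a non-zero weight), and it treats a remainder that is not itself a power of 2 as an error.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned maxBits;     /* depth of the tree, 0 on error */
    unsigned lastWeight;  /* weight of the implicit last symbol, 0 on error */
} TreeInfo;

/* Hypothetical helper: deduces maxBits and the weight of the last symbol
 * from the `count` weights that are explicitly listed. */
TreeInfo deduce_last_weight(const uint8_t *weights, size_t count)
{
    TreeInfo info = { 0, 0 };
    uint32_t sum = 0, rest;
    size_t i;

    for (i = 0; i < count; i++) {
        if (weights[i] > 15) return info;  /* beyond the format's limit */
        if (weights[i] > 0) sum += (uint32_t)1 << (weights[i] - 1);
    }
    if (sum == 0) return info;

    /* Smallest power of 2 strictly greater than the sum. */
    while (((uint32_t)1 << info.maxBits) <= sum) info.maxBits++;
    rest = ((uint32_t)1 << info.maxBits) - sum;

    /* The remainder must be 2^(lastWeight-1). */
    if (rest & (rest - 1)) { info.maxBits = 0; return info; }
    while (((uint32_t)1 << info.lastWeight) <= rest) info.lastWeight++;
    return info;
}

/* Conversion formula given in this section. */
unsigned weight_to_nb_bits(unsigned weight, unsigned maxBits)
{
    return weight ? maxBits + 1 - weight : 0;
}
```

Applied to the weights `4, 3, 2, 0, 1` of the example, the sum is 15, so `maxBits` is 4 and the last weight is 1, which converts back to 4 bits for literal `5`.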
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	555
				556	##### Huffman Tree header
				557
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	558	This is a single byte value (0-255),
				559	which tells how to decode the list of weights.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	560
				561	- if headerByte >= 242 : this is one of 14 pre-defined weight distributions :
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	562
| value    |242|243|244|245|246|247|248|249|250|251|252|253|254|255|
| -------- |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nb of 1s | 1 | 2 | 3 | 4 | 7 | 8 | 15| 16| 31| 32| 63| 64|127|128|
|Complement| 1 | 2 | 1 | 4 | 1 | 8 | 1 | 16| 1 | 32| 1 | 64| 1 |128|
				567
_Note_ : the complement is found using the "complete to nearest power of 2" rule.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	569
- if 128 <= headerByte < 242 : this is a direct representation,
where each weight is written directly as a 4-bit field (0-15).
The full representation occupies `((nbSymbols+1)/2)` bytes,
meaning it uses a full last byte even if nbSymbols is odd.
`nbSymbols = headerByte - 127;`
				575
- if headerByte < 128 :
the series of weights is compressed by FSE.
The length of the compressed series is `headerByte` (0-127).
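
The three cases above can be told apart with a small dispatcher. The sketch below is illustrative only: the enum, the function names and the `count` output parameter are assumptions introduced here.

```c
#include <assert.h>

typedef enum { HUF_HDR_FSE, HUF_HDR_DIRECT, HUF_HDR_PREDEF } HufHeaderKind;

/* Hypothetical helper: classify headerByte and report the associated count :
 *  - HUF_HDR_FSE    : *count = compressed size of the weights, in bytes
 *  - HUF_HDR_DIRECT : *count = nbSymbols (each stored as a 4-bit field)
 *  - HUF_HDR_PREDEF : *count = number of weights equal to 1 */
static HufHeaderKind huf_classifyHeader(unsigned char headerByte, unsigned *count)
{
    /* "Nb of 1s" row of the pre-defined distributions table */
    static const unsigned char nbOnes[14] =
        { 1, 2, 3, 4, 7, 8, 15, 16, 31, 32, 63, 64, 127, 128 };
    assert(count != 0);
    if (headerByte >= 242) { *count = nbOnes[headerByte - 242]; return HUF_HDR_PREDEF; }
    if (headerByte >= 128) { *count = headerByte - 127u; return HUF_HDR_DIRECT; }
    *count = headerByte;
    return HUF_HDR_FSE;
}

/* Bytes occupied by a direct representation of nbSymbols weights. */
static unsigned huf_directSize(unsigned nbSymbols) { return (nbSymbols + 1) / 2; }
```

Testing the highest range first keeps the conditions non-overlapping, since the direct representation only covers values from 128 to 241.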
				579
##### FSE (Finite State Entropy) compression of Huffman weights

The series of weights is compressed using standard FSE compression.
				583	It's a single bitstream with 2 interleaved states,
				584	using a single distribution table.
				585
				586	To decode an FSE bitstream, it is necessary to know its compressed size.
				587	Compressed size is provided by `headerByte`.
				588	It's also necessary to know its maximum decompressed size.
				589	In this case, it's `255`, since literal values range from `0` to `255`,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	590	and last symbol value is not represented.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	591
An FSE bitstream starts with a header describing the probability distribution.
This header is used to create a Decoding Table.
It is necessary to know the maximum accuracy of the distribution
to properly allocate space for the Table.
For a list of Huffman weights, this maximum is 7 bits.

FSE header and bitstreams are described in a separate chapter.
				599
##### Conversion from weights to Huffman prefix codes

All present symbols shall now have a `weight` value.
A `weight` directly represents a `range` of prefix codes,
following the formula : `range = weight ? 1 << (weight-1) : 0 ;`
				605	Symbols are sorted by weight.
				606	Within same weight, symbols keep natural order.
				607	Starting from lowest weight,
				608	symbols are being allocated to a range of prefix codes.
				609	Symbols with a weight of zero are not present.
				610
It is then possible to transform weights into nbBits :
`nbBits = weight ? maxBits + 1 - weight : 0;` .
				613
				614
				615	__Example__ :
Let's presume the following Huffman tree has been decoded :
				617
				618	\| Literal \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
				619	\| ------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				620	\| weight \| 4 \| 3 \| 2 \| 0 \| 1 \| 1 \|
				621
				622	Sorted by weight and then natural order,
				623	it gives the following distribution :
				624
				625	\| Literal \| 3 \| 4 \| 5 \| 2 \| 1 \| 0 \|
				626	\| ------------ \| --- \| --- \| --- \| --- \| --- \| ---- \|
				627	\| weight \| 0 \| 1 \| 1 \| 2 \| 3 \| 4 \|
				628	\| range \| 0 \| 1 \| 1 \| 2 \| 4 \| 8 \|
				629	\| prefix codes \| N/A \| 0 \| 1 \| 2-3 \| 4-7 \| 8-15 \|
				630	\| nb bits \| 0 \| 4 \| 4 \| 3 \| 2 \| 1 \|
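
The allocation shown in this example can be reproduced with a short routine. This is a sketch under stated assumptions: the name `huf_assignCodes` and the two output arrays are introduced here for illustration only.

```c
#include <assert.h>

/* Hypothetical helper: for each of the n symbols, compute its number of bits
 * and the first prefix code of its range, walking weights from lowest to
 * highest and keeping natural order within the same weight. */
static void huf_assignCodes(const unsigned char *weight, int n, int maxBits,
                            unsigned char *nbBits, unsigned *firstCode)
{
    unsigned next = 0;   /* first prefix code not yet allocated */
    int w, s;
    assert(weight != 0 && nbBits != 0 && firstCode != 0);
    for (s = 0; s < n; s++) { nbBits[s] = 0; firstCode[s] = 0; }   /* weight 0 : not present */
    for (w = 1; w <= maxBits; w++)
        for (s = 0; s < n; s++)
            if (weight[s] == w) {
                firstCode[s] = next;
                nbBits[s] = (unsigned char)(maxBits + 1 - w);
                next += 1u << (w - 1);                             /* range = 1 << (weight-1) */
            }
}
```

Applied to the weights `4, 3, 2, 0, 1, 1` with `maxBits = 4`, the ranges start at 8, 4, 2, (none), 0 and 1, matching the table.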
				631
				632
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	633	#### Literals bitstreams
				634
				635	##### Bitstreams sizes
				636
				637	As seen in a previous paragraph,
				638	there are 2 flavors of huffman-compressed literals :
				639	single stream, and 4-streams.
				640
4-streams is useful for CPUs with multiple execution units and out-of-order execution.
				642	Since each stream can be decoded independently,
				643	it's possible to decode them up to 4x faster than a single stream,
				644	presuming the CPU has enough parallelism available.
				645
				646	For single stream, header provides both the compressed and regenerated size.
				647	For 4-streams though,
				648	header only provides compressed and regenerated size of all 4 streams combined.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	649	In order to properly decode the 4 streams,
				650	it's necessary to know the compressed and regenerated size of each stream.
				651
Regenerated size is easiest :
each of the first 3 streams has a size of `(totalSize+3)/4`,
and the last one takes the remainder, up to 3 bytes smaller, to reach `totalSize`.

Compressed size must be provided explicitly : in the 4-streams variant,
bitstreams are preceded by 3 unsigned little-endian 16-bit values.
Each value represents the compressed size of one stream, in order.
The last stream size is deduced from the total compressed size
and from the already known stream sizes :
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
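
A minimal sketch of this size bookkeeping is given below. The function name, the array parameters and the consistency checks are assumptions made for illustration; `totalCSize` is taken to include the 6 bytes of stream sizes, as the formula above implies.

```c
#include <assert.h>

/* Hypothetical helper: src points to the three 16-bit sizes that precede
 * the 4 bitstreams. Fills compressed (cSize) and regenerated (rSize) sizes.
 * Returns 0 on success, -1 if the sizes are inconsistent. */
static int huf_split4Streams(const unsigned char *src,
                             unsigned totalCSize, unsigned totalSize,
                             unsigned cSize[4], unsigned rSize[4])
{
    unsigned i, known = 6, segment = (totalSize + 3) / 4;
    assert(src != 0);
    for (i = 0; i < 3; i++) {
        cSize[i] = src[2 * i] | ((unsigned)src[2 * i + 1] << 8);   /* little-endian 16-bit */
        known += cSize[i];
        rSize[i] = segment;
    }
    if (known > totalCSize || 3 * segment > totalSize) return -1;
    cSize[3] = totalCSize - known;           /* stream4CSize */
    rSize[3] = totalSize - 3 * segment;      /* last stream takes the remainder */
    return 0;
}
```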
				662
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	663	##### Bitstreams read and decode
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	664
				665	Each bitstream must be read _backward_,
				666	that is starting from the end down to the beginning.
				667	Therefore it's necessary to know the size of each bitstream.
				668
It's also necessary to know exactly which _bit_ is the last one.
This is detected by a final bit flag :
the highest set bit of the last byte is a final-bit-flag.
Consequently, a last byte of `0` is not possible.
The final-bit-flag itself is not part of the useful bitstream.
Hence, the last byte contains between 0 and 7 useful bits.
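
Locating the flag can be sketched as below, reading the flag as the highest bit that is set in the last byte (this reading, and the function name, are assumptions of the sketch).

```c
#include <assert.h>

/* Hypothetical helper: return the number of useful bits (0-7) held by the
 * last byte of a bitstream, or -1 if that byte is 0 (no flag present). */
static int bits_usefulInLastByte(unsigned char last)
{
    int n = 7;
    if (last == 0) return -1;
    while (!(last >> n)) n--;   /* n ends at the position of the highest set bit */
    assert(n >= 0 && n <= 7);
    return n;                   /* bits below the flag are the useful ones */
}
```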
				675
				676	Starting from the end,
				677	it's possible to read the bitstream in a little-endian fashion,
				678	keeping track of already used bits.
				679
Reading the last `maxBits` bits,
it's then possible to compare the extracted value to the prefix codes table,
determining the symbol to decode and the number of bits to discard.

The process continues until the required number of symbols per stream has been read.
Each bitstream must be entirely and exactly consumed,
reaching exactly its beginning position with all bits used;
otherwise, the decoding process is considered faulty.
				688
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	689
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	690	### Sequences section
				691
				692	A compressed block is a succession of _sequences_ .
				693	A sequence is a literal copy command, followed by a match copy command.
				694	A literal copy command specifies a length.
				695	It is the number of bytes to be copied (or extracted) from the literal section.
				696	A match copy command specifies an offset and a length.
				697	The offset gives the position to copy from,
				698	which can stand within a previous block.
				699
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	700	There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	701	which are encoded together, interleaved in a single _bitstream_.
				702
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	703	Each symbol is a _code_ in its own context,
				704	which specifies a baseline and a number of bits to add.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	705	_Codes_ are FSE compressed,
				706	and interleaved with raw additional bits in the same bitstream.
				707
The Sequences section starts with a header,
followed by optional Probability tables for each symbol type,
followed by the bitstream.

To decode the Sequences section, it's required to know its size.
This size is deduced from `blockSize - literalSectionSize`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	714
				715
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	716	#### Sequences section header
				717
It consists of 2 items :
				719	- Nb of Sequences
				720	- Flags providing Symbol compression types
				721
				722	__Nb of Sequences__
				723
				724	This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
				725	Let's call its first byte `byte0`.
				726	- `if (byte0 == 0)` : there are no sequences.
				727	The sequence section stops there.
				728	Regenerated content is defined entirely by literals section.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	729	- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
				730	- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
				731	- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
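
The rules above translate into the following reader. It is a sketch only: the function name, the bounds checks and the return convention (bytes consumed, or -1) are assumptions made here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode nbSeqs from the start of the Sequences section.
 * Returns the number of bytes used (1 to 3), or -1 if src is too short. */
static int seq_readNbSeqs(const unsigned char *src, size_t srcSize, unsigned *nbSeqs)
{
    unsigned byte0;
    assert(src != 0 && nbSeqs != 0);
    if (srcSize < 1) return -1;
    byte0 = src[0];
    if (byte0 < 128) { *nbSeqs = byte0; return 1; }   /* includes byte0 == 0 : no sequences */
    if (byte0 < 255) {
        if (srcSize < 2) return -1;
        *nbSeqs = ((byte0 - 128) << 8) + src[1];
        return 2;
    }
    if (srcSize < 3) return -1;
    *nbSeqs = src[1] + ((unsigned)src[2] << 8) + 0x7F00;
    return 3;
}
```

The conditions are tested in the order listed, so each test only sees values not matched by the previous ones.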
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	732
				733	__Symbol compression modes__
				734
				735	This is a single byte, defining the compression mode of each symbol type.
				736
				737	\| BitNb \| 7-6 \| 5-4 \| 3-2 \| 1-0 \|
				738	\| ------- \| ------ \| ------ \| ------ \| -------- \|
				739	\|FieldName\| LLtype \| OFType \| MLType \| Reserved \|
				740
				741	The last field, `Reserved`, must be all-zeroes.
				742
				743	`LLtype`, `OFType` and `MLType` define the compression mode of
				744	Literal Lengths, Offsets and Match Lengths respectively.
				745
				746	They follow the same enumeration :
				747
				748	\| Value \| 0 \| 1 \| 2 \| 3 \|
				749	\| ---------------- \| ------ \| --- \| ------ \| --- \|
				750	\| Compression Mode \| predef \| RLE \| Repeat \| FSE \|
				751
				752	- "predef" : uses a pre-defined distribution table.
				753	- "RLE" : it's a single code, repeated `nbSeqs` times.
				754	- "Repeat" : re-use distribution table from previous compressed block.
				755	- "FSE" : standard FSE compression.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	756	A distribution table will be present.
				757	It will be described in [next part](#distribution-tables).
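
Extracting the three fields from the mode byte can be sketched as follows; the enum and function names are assumptions used only for this illustration.

```c
#include <assert.h>

typedef enum { MODE_PREDEF = 0, MODE_RLE = 1, MODE_REPEAT = 2, MODE_FSE = 3 } SymbolMode;

/* Hypothetical helper: split the mode byte into its three 2-bit fields.
 * Returns 0 on success, -1 if the Reserved bits are not all-zeroes. */
static int seq_readModes(unsigned char modeByte,
                         SymbolMode *ll, SymbolMode *of, SymbolMode *ml)
{
    assert(ll != 0 && of != 0 && ml != 0);
    if (modeByte & 3) return -1;               /* bits 1-0 : Reserved */
    *ll = (SymbolMode)((modeByte >> 6) & 3);   /* bits 7-6 : Literal Lengths */
    *of = (SymbolMode)((modeByte >> 4) & 3);   /* bits 5-4 : Offsets */
    *ml = (SymbolMode)((modeByte >> 2) & 3);   /* bits 3-2 : Match Lengths */
    return 0;
}
```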
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	758
				759	#### Symbols decoding
				760
				761	##### Literal Lengths codes
				762
				763	Literal lengths codes are values ranging from `0` to `35` included.
				764	They define lengths from 0 to 131071 bytes.
				765
				766	\| Code \| 0-15 \|
				767	\| ------ \| ---- \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	768	\| value \| Code \|
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	769	\| nbBits \| 0 \|
				770
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	771
				772	\| Code \| 16 \| 17 \| 18 \| 19 \| 20 \| 21 \| 22 \| 23 \|
				773	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				774	\| Baseline \| 16 \| 18 \| 20 \| 22 \| 24 \| 28 \| 32 \| 40 \|
				775	\| nb Bits \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				776
				777	\| Code \| 24 \| 25 \| 26 \| 27 \| 28 \| 29 \| 30 \| 31 \|
				778	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				779	\| Baseline \| 48 \| 64 \| 128 \| 256 \| 512 \| 1024 \| 2048 \| 4096 \|
				780	\| nb Bits \| 4 \| 6 \| 7 \| 8 \| 9 \| 10 \| 11 \| 12 \|
				781
				782	\| Code \| 32 \| 33 \| 34 \| 35 \|
				783	\| -------- \| ---- \| ---- \| ---- \| ---- \|
				784	\| Baseline \| 8192 \|16384 \|32768 \|65536 \|
				785	\| nb Bits \| 13 \| 14 \| 15 \| 16 \|
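
For illustration, the tables above can be stored as two arrays and combined with the extra bits read from the bitstream. The array and function names below are assumptions; the values are copied from the tables.

```c
#include <assert.h>

/* Baseline of each literal length code (codes 0-15 are their own value). */
static const unsigned ll_baseline[36] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
    16, 18, 20, 22, 24, 28, 32, 40, 48, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536 };

/* Number of additional bits to read for each literal length code. */
static const unsigned char ll_nbBits[36] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 2, 2, 3, 3, 4, 6, 7, 8, 9, 10, 11, 12,
    13, 14, 15, 16 };

/* Hypothetical helper: literal length from a code and its additional bits. */
static unsigned ll_value(unsigned code, unsigned extraBits)
{
    assert(code < 36 && extraBits < (1u << ll_nbBits[code]));
    return ll_baseline[code] + extraBits;
}
```

The largest length is reached with code `35` and all 16 additional bits set, which gives 65536 + 65535 = 131071.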
				786
				787	__Default distribution__
				788
When "compression mode" is "predef",
a pre-defined distribution is used for FSE compression.
				791
				792	Here is its definition. It uses an accuracy of 6 bits (64 states).
				793	```
				794	short literalLengths_defaultDistribution[36] =
				795	{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
				796	2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
				797	-1,-1,-1,-1 };
				798	```
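
A quick sanity check on such a table is that its probabilities add up to the number of states, with each `-1` entry counting as one (as explained in the FSE distribution table section). The helper below is a sketch with an assumed name.

```c
#include <assert.h>

/* Hypothetical helper: cumulated total of a normalized distribution,
 * where a probability of -1 ("less than 1") counts as one. */
static int fse_distributionTotal(const short *proba, int nbSymbols)
{
    int i, total = 0;
    assert(proba != 0);
    for (i = 0; i < nbSymbols; i++)
        total += (proba[i] == -1) ? 1 : proba[i];
    return total;
}
```

For the literal lengths table above, the total is expected to equal `1 << 6`.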
				799
				800	##### Match Lengths codes
				801
				802	Match lengths codes are values ranging from `0` to `52` included.
				803	They define lengths from 3 to 131074 bytes.
				804
				805	\| Code \| 0-31 \|
				806	\| ------ \| -------- \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	807	\| value \| Code + 3 \|
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	808	\| nbBits \| 0 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	809
				810	\| Code \| 32 \| 33 \| 34 \| 35 \| 36 \| 37 \| 38 \| 39 \|
				811	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				812	\| Baseline \| 35 \| 37 \| 39 \| 41 \| 43 \| 47 \| 51 \| 59 \|
				813	\| nb Bits \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				814
| Code     |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline |  67  |  83  |  99  |  131 |  259 |  515 | 1027 | 2051 |
| nb Bits  |   4  |   4  |   5  |   7  |   8  |   9  |  10  |  11  |

| Code     |  48  |  49  |  50  |  51  |  52  |
| -------- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 4099 | 8195 |16387 |32771 |65539 |
| nb Bits  |  12  |  13  |  14  |  15  |  16  |
				824
				825	__Default distribution__
				826
When "compression mode" is defined as "predef",
a pre-defined distribution is used for FSE compression.
				829
				830	Here is its definition. It uses an accuracy of 6 bits (64 states).
				831	```
				832	short matchLengths_defaultDistribution[53] =
				833	{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				834	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
				835	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
				836	-1,-1,-1,-1,-1 };
				837	```
				838
				839	##### Offset codes
				840
				841	Offset codes are values ranging from `0` to `N`,
				842	with `N` being limited by maximum backreference distance.
				843
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	844	A decoder is free to limit its maximum `N` supported.
				845	Recommendation is to support at least up to `22`.
For information, at the time of this writing,
the reference decoder supports a maximum `N` value of `28` in 64-bit mode.

An offset code is also the number of additional bits to read,
and can be translated into an `OFValue` using the following formula :
				851
				852	```
				853	OFValue = (1 << offsetCode) + readNBits(offsetCode);
				854	if (OFValue > 3) offset = OFValue - 3;
				855	```
				856
				857	OFValue from 1 to 3 are special : they define "repeat codes",
				858	which means one of the previous offsets will be repeated.
				859	They are sorted in recency order, with 1 meaning the most recent one.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	860	See [Repeat offsets](#repeat-offsets) paragraph.
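
The translation can be sketched with three small helpers. Their names are assumptions, and the limit of `28` simply follows the maximum quoted for the reference decoder; `unsigned` is assumed to hold at least 32 bits.

```c
#include <assert.h>

/* Hypothetical helper: OFValue from an offset code and its additional bits. */
static unsigned of_value(unsigned offsetCode, unsigned extraBits)
{
    assert(offsetCode <= 28 && extraBits < (1u << offsetCode));
    return (1u << offsetCode) + extraBits;
}

/* OFValue 1 to 3 designate a repeat code rather than a real offset. */
static int of_isRepeat(unsigned ofValue) { return ofValue >= 1 && ofValue <= 3; }

/* Real offset carried by an OFValue above 3. */
static unsigned of_offset(unsigned ofValue)
{
    assert(ofValue > 3);
    return ofValue - 3;
}
```

With code `28` and all additional bits set, `OFValue` is `2^29 - 1`, hence an offset of 536,870,908.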
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	861
				862	__Default distribution__
				863
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	864	When "compression mode" is defined as "predef",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	865	a pre-defined distribution is used for FSE compression.
				866
				867	Here is its definition. It uses an accuracy of 5 bits (32 states),
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	868	and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	869
				870	If any sequence in the compressed block requires an offset larger than this,
				871	it's not possible to use the default distribution to represent it.
				872
				873	```
short offsetCodes_defaultDistribution[29] =
				875	{ 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				876	1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
				877	```
				878
				879	#### Distribution tables
				880
				881	Following the header, up to 3 distribution tables can be described.
				882	They are, in order :
- Literal lengths
- Offsets
- Match lengths
				886
				887	The content to decode depends on their respective compression mode :
				888	- Repeat mode : no content. Re-use distribution from previous compressed block.
				889	- Predef : no content. Use pre-defined distribution table.
				890	- RLE : 1 byte. This is the only code to use across the whole compressed block.
				891	- FSE : A distribution table is present.
				892
				893	##### FSE distribution table : condensed format
				894
				895	An FSE distribution table describes the probabilities of all symbols
				896	from `0` to the last present one (included)
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	897	on a normalized scale of `1 << AccuracyLog` .
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	898
				899	It's a bitstream which is read forward, in little-endian fashion.
				900	It's not necessary to know its exact size,
				901	since it will be discovered and reported by the decoding process.
				902
The bitstream starts by reporting the scale on which it operates.
`AccuracyLog = low4bits + 5;`
In theory, it can define a scale from 5 to 20.
In practice, decoders are allowed to limit the maximum supported `AccuracyLog`.
Recommended maximums are `9` for literal and match lengths, and `8` for offsets.
The reference decoder uses these limits.
				909
Then follows the value of each symbol, from `0` to the last present one.
The number of bits used by each field is variable.
It depends on :
				913
- Remaining probabilities + 1 :
__example__ :
Presuming an AccuracyLog of 8,
and presuming 100 probabilities points have already been distributed,
156 points remain, and a field stores a probability plus one,
so the decoder may read any value from `0` to `256 - 100 + 1 == 157` (included).
Therefore, it must read `log2sup(157) == 8` bits.

- Value decoded : small values use 1 less bit :
__example__ :
Presuming values from 0 to 157 (included) are possible,
255-157 = 98 values are remaining in an 8-bit field.
They are used this way :
first 98 values (hence from 0 to 97) use only 7 bits,
values from 98 to 157 use 8 bits.
This is achieved through this scheme :

| Value read | Value decoded | nb Bits used |
| ---------- | ------------- | ------------ |
|   0 -  97  |   0 -  97     |  7           |
|  98 - 127  |  98 - 127     |  8           |
| 128 - 225  |   0 -  97     |  7           |
| 226 - 255  | 128 - 157     |  8           |
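
The two rules combine into a single field decoder, sketched below. The function name and its parameters are assumptions: `maxValue` is the largest value that may still be read, and `raw` holds the next bits of the stream, already peeked in little-endian fashion.

```c
#include <assert.h>

/* Hypothetical helper: decode one field from peeked bits.
 * Stores the number of bits actually consumed in *bitsUsed. */
static unsigned fse_decodeField(unsigned raw, unsigned maxValue, int *bitsUsed)
{
    int nbBits = 0;
    unsigned nbSmall, low;
    assert(maxValue >= 1 && bitsUsed != 0);
    while ((1u << nbBits) <= maxValue) nbBits++;   /* bits needed to hold maxValue */
    nbSmall = (1u << nbBits) - 1 - maxValue;       /* how many values save one bit */
    low = raw & ((1u << (nbBits - 1)) - 1);
    if (low < nbSmall) { *bitsUsed = nbBits - 1; return low; }
    raw &= (1u << nbBits) - 1;
    *bitsUsed = nbBits;
    /* upper half of the full-width range is shifted down past the small values */
    return (raw >= (1u << (nbBits - 1))) ? raw - nbSmall : raw;
}
```

The field is first tested on its low `nbBits - 1` bits; only when these do not designate a small value is the extra bit consumed.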
				936
				937	Symbols probabilities are read one by one, in order.
				938
Probability is obtained from Value decoded by the following formula :
				940	`Proba = value - 1;`
				941
				942	It means value `0` becomes negative probability `-1`.
				943	`-1` is a special probability, which means `less than 1`.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	944	Its effect on distribution table is described in next paragraph.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	945	For the purpose of calculating cumulated distribution, it counts as one.
				946
				947	When a symbol has a probability of `zero`,
				948	it is followed by a 2-bits repeat flag.
				949	This repeat flag tells how many probabilities of zeroes follow the current one.
				950	It provides a number ranging from 0 to 3.
				951	If it is a 3, another 2-bits repeat flag follows, and so on.
				952
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	953	When last symbol reaches cumulated total of `1 << AccuracyLog`,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	954	decoding is complete.
				955	Then the decoder can tell how many bytes were used in this process,
				956	and how many symbols are present.
				957
The bitstream consumes a whole number of bytes.
Any remaining bits within the last byte are simply unused.
				960
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	961	If the last symbol makes cumulated total go above `1 << AccuracyLog`,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	962	distribution is considered corrupted.
				963
				964	##### FSE decoding : from normalized distribution to decoding tables
				965
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	966	The distribution of normalized probabilities is enough
				967	to create a unique decoding table.
				968
It is built according to the following rules :
				970
				971	The table has a size of `tableSize = 1 << AccuracyLog;`.
				972	Each cell describes the symbol decoded,
				973	and instructions to get the next state.
				974
				975	Symbols are scanned in their natural order for `less than 1` probabilities.
				976	Symbols with this probability are being attributed a single cell,
				977	starting from the end of the table.
				978	These symbols define a full state reset, reading `AccuracyLog` bits.
				979
				980	All remaining symbols are sorted in their natural order.
				981	Starting from symbol `0` and table position `0`,
				982	each symbol gets attributed as many cells as its probability.
Cell allocation is spread, not linear :
each successor position follows this rule :
				985
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	986	```
				987	position += (tableSize>>1) + (tableSize>>3) + 3;
				988	position &= tableSize-1;
				989	```
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	990
				991	A position is skipped if already occupied,
				992	typically by a "less than 1" probability symbol.
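
The allocation rules above can be sketched as follows. The function name and the `symbolAt` output array are assumptions; the sketch relies on the step being odd for any `AccuracyLog` of 5 or more, so that only cells given to "less than 1" symbols can be found occupied.

```c
#include <assert.h>

/* Hypothetical helper: fill symbolAt[0 .. tableSize-1] with the symbol
 * decoded by each cell, from a normalized distribution. */
static void fse_spread(const short *proba, int nbSymbols, int accuracyLog,
                       unsigned char *symbolAt)
{
    unsigned tableSize = 1u << accuracyLog;
    unsigned high = tableSize - 1;   /* last cell still free at the end of the table */
    unsigned step = (tableSize >> 1) + (tableSize >> 3) + 3;
    unsigned pos = 0;
    int s, i;
    assert(proba != 0 && symbolAt != 0 && accuracyLog >= 5);
    for (s = 0; s < nbSymbols; s++)              /* "less than 1" symbols : one cell each, from the end */
        if (proba[s] == -1) symbolAt[high--] = (unsigned char)s;
    for (s = 0; s < nbSymbols; s++)
        for (i = 0; i < proba[s]; i++) {
            symbolAt[pos] = (unsigned char)s;
            do {                                 /* skip cells already occupied */
                pos = (pos + step) & (tableSize - 1);
            } while (pos > high);
        }
}
```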
				993
				994	The result is a list of state values.
				995	Each state will decode the current symbol.
				996
				997	To get the Number of bits and baseline required for next state,
				998	it's first necessary to sort all states in their natural order.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	999	The lower states will need 1 more bit than higher ones.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1000
				1001	__Example__ :
Presuming a symbol has a probability of 5,
it receives 5 state values. States are sorted in natural order.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1004
				1005	Next power of 2 is 8.
				1006	Space of probabilities is divided into 8 equal parts.
				1007	Presuming the AccuracyLog is 7, it defines 128 states.
				1008	Divided by 8, each share is 16 large.
				1009
				1010	In order to reach 8, 8-5=3 lowest states will count "double",
				1011	taking shares twice larger,
				1012	requiring one more bit in the process.
				1013
Numbering starts from higher states using fewer bits.
				1015
				1016	\| state order \| 0 \| 1 \| 2 \| 3 \| 4 \|
				1017	\| ----------- \| ----- \| ----- \| ------ \| ---- \| ----- \|
				1018	\| width \| 32 \| 32 \| 32 \| 16 \| 16 \|
				1019	\| nb Bits \| 5 \| 5 \| 5 \| 4 \| 4 \|
				1020	\| range nb \| 2 \| 4 \| 6 \| 0 \| 1 \|
				1021	\| baseline \| 32 \| 64 \| 96 \| 0 \| 16 \|
				1022	\| range \| 32-63 \| 64-95 \| 96-127 \| 0-15 \| 16-31 \|
				1023
				1024	Next state is determined from current state
				1025	by reading the required number of bits, and adding the specified baseline.
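
The number of bits and baseline of each state can be computed directly from the symbol's probability, as sketched below. The function name and the `order` parameter (rank of the state among the symbol's states, in natural order) are assumptions of this illustration.

```c
#include <assert.h>

/* Hypothetical helper: next-state instructions for the state of rank `order`
 * (0-based) among the `proba` states attributed to a symbol. */
static void fse_nextState(unsigned proba, int accuracyLog, unsigned order,
                          int *nbBits, unsigned *baseline)
{
    unsigned pow2 = 1, nbDouble;
    int log = 0, lowBits;
    assert(proba >= 1 && order < proba && nbBits != 0 && baseline != 0);
    while (pow2 < proba) { pow2 <<= 1; log++; }   /* next power of 2 */
    nbDouble = pow2 - proba;                      /* lowest states taking a double share */
    lowBits = accuracyLog - log;                  /* bits for a single share */
    if (order < nbDouble) {
        *nbBits = lowBits + 1;
        /* double shares are numbered after all single shares */
        *baseline = ((proba - nbDouble) << lowBits) + (order << (lowBits + 1));
    } else {
        *nbBits = lowBits;
        *baseline = (order - nbDouble) << lowBits;
    }
}
```

With a probability of 5 and an AccuracyLog of 7, the sketch yields baselines 32, 64, 96, 0 and 16 for state orders 0 to 4, as in the table.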
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	1026
				1027
				1028	#### Bitstream
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1029
All sequences are stored in a single bitstream, read _backward_.
It is therefore necessary to know the bitstream size,
which is deduced from the compressed block size.

The last useful bit of the stream is followed by a set-bit-flag.
The highest set bit of the last byte is this flag.
It does not belong to the useful part of the bitstream.
Therefore, the last byte has 0-7 useful bits.
Note that it also means that the last byte cannot be `0`.
				1039
				1040	##### Starting states
				1041
				1042	The bitstream starts with initial state values,
				1043	each using the required number of bits in their respective _accuracy_,
				1044	decoded previously from their normalized distribution.
				1045
It starts with `Literal Length State`,
				1047	followed by `Offset State`,
				1048	and finally `Match Length State`.
				1049
				1050	Reminder : always keep in mind that all values are read _backward_.
				1051
				1052	##### Decoding a sequence
				1053
				1054	A state gives a code.
				1055	A code provides a baseline and number of bits to add.
				1056	See [Symbol Decoding] section for details on each symbol.
				1057
				1058	Decoding starts by reading the nb of bits required to decode offset.
				1059	It then does the same for match length,
				1060	and then for literal length.
				1061
				1062	Offset / matchLength / litLength define a sequence, which can be applied.
				1063
				1064	The next operation is to update states.
				1065	Using rules pre-calculated in the decoding tables,
				1066	`Literal Length State` is updated,
				1067	followed by `Match Length State`,
				1068	and then `Offset State`.
				1069
This operation will be repeated `nbSeqs` times.
At the end, the bitstream shall be entirely consumed,
otherwise the bitstream is considered corrupted.
				1073
				1074	[Symbol Decoding]:#symbols-decoding
				1075
				1076	##### Repeat offsets
				1077
				1078	As seen in [Offset Codes], the first 3 values define a repeated offset.
				1079	They are sorted in recency order, with 1 meaning "most recent one".
				1080
There is an exception though, when the current sequence's literal length is `0`.
In this case, 1 would just make the previous match longer.
Therefore, in this case, 1 in fact means 2, and 2 is impossible.
The meaning of 3 is unmodified.
				1085
				1086	Repeat offsets start with the following values : 1, 4 and 8 (in order).
				1087
Then each block receives its start value from the previous compressed block.
Note that non-compressed blocks are skipped :
they do not contribute to offset history.
				1091
				1092	[Offset Codes]: #offset-codes
				1093
				1094	###### Offset updates rules
				1095
When the new offset is a normal one,
offset history is simply shifted by one position,
with the new offset taking first spot.
				1099
				1100	- When repeat offset 1 (most recent) is used, history is unmodified.
				1101	- When repeat offset 2 is used, it's swapped with offset 1.
				1102	- When repeat offset 3 is used, it takes first spot,
				1103	pushing the other ones by one position.
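
These update rules can be sketched as a single function. Its name and the `repeatIndex` convention (0 for a normal offset, 1 to 3 for a repeat offset, after any adjustment for a literal length of `0`) are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical helper: apply one sequence's offset to the history
 * (history[0] is the most recent offset) and return the offset to use. */
static unsigned rep_update(unsigned history[3], int repeatIndex, unsigned newOffset)
{
    unsigned used;
    assert(history != 0 && repeatIndex >= 0 && repeatIndex <= 3);
    switch (repeatIndex) {
    case 0:                       /* normal offset : shift history by one position */
        used = newOffset;
        history[2] = history[1];
        history[1] = history[0];
        break;
    case 1:                       /* most recent : history unmodified */
        return history[0];
    case 2:                       /* swapped with offset 1 */
        used = history[1];
        history[1] = history[0];
        break;
    default:                      /* offset 3 takes first spot, others pushed back */
        used = history[2];
        history[2] = history[1];
        history[1] = history[0];
        break;
    }
    history[0] = used;
    return used;
}
```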
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1104
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1105
				1106
				1107	Version changes
				1108	---------------