Zstandard Compression Format
============================

### Notices

Copyright (c) 2016 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

0.2.0 (22/07/16)


Introduction
------------

The purpose of this document is to define a lossless compressed data format
that is independent of CPU type, operating system,
file system and character set, and suitable for
file compression, pipe and streaming compression,
using the [Zstandard algorithm](http://www.zstandard.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the Zstandard compression method,
and an optional [xxHash-64 checksum method](http://www.xxhash.org),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into Zstandard format and/or decompress data from Zstandard format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It doesn’t need to support all options though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore informative fields, such as the checksum.
Whenever it does not support a parameter defined in the compressed stream,
it must produce a non-ambiguous error code and associated error message
explaining which parameter is unsupported.


Overall conventions
-------------------
In this document, square brackets, i.e. `[` and `]`, are used to indicate optional fields or parameters.


Definitions
-----------
Content compressed by Zstandard is transformed into a Zstandard __frame__.
Multiple frames can be appended into a single file or stream.
A frame is totally independent, has a defined beginning and end,
and a set of parameters which tells the decoder how to decompress it.

A frame encapsulates one or multiple __blocks__.
Each block can be compressed or not,
and has a guaranteed maximum content size, which depends on frame parameters.
Unlike frames, each block depends on previous blocks for proper decoding.
However, each block can be decompressed without waiting for its successor,
allowing streaming operations.


Frame Concatenation
-------------------

In some circumstances, it may be required to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame brings its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file is left outside of this specification.
As an example, the reference `zstd` command line utility is able
to decode all concatenated frames in their sequential order,
delivering the final decompressed result as if it were a single content.


General Structure of Zstandard Frame format
-------------------------------------------
The structure of a single Zstandard frame is as follows:

| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] |`End_Marker`|
|:--------------:|:--------------:|:----------:| ------------------ |:----------:|
| 4 bytes        |  2-14 bytes    | n bytes    |                    | 3 bytes    |

__`Magic_Number`__

4 Bytes, Little endian format.
Value : 0xFD2FB527

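
As an illustration of the `Magic_Number` field, the sketch below reads the first 4 bytes of a buffer as a little-endian value and compares it with the value given above. The function names `read_le32` and `starts_with_frame_magic` are hypothetical helpers introduced for this example; they are not part of any reference implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Magic number of a Zstandard frame, as given in this document. */
#define ZSTD_FRAME_MAGIC 0xFD2FB527u

/* Read 4 bytes as an unsigned little-endian 32-bit value. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Return 1 if the buffer starts with the frame magic number, 0 otherwise. */
static int starts_with_frame_magic(const uint8_t *src, size_t size)
{
    if (size < 4) return 0;   /* not enough bytes to hold the magic number */
    return read_le32(src) == ZSTD_FRAME_MAGIC;
}
```

Because the field is little-endian, the magic number appears in the stream as the byte sequence `27 B5 2F FD`.
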
__`Frame_Header`__

2 to 14 Bytes, detailed in [next part](#the-structure-of-frame_header).

__`Data_Block`__

Detailed in [next chapter](#the-structure-of-data_block).
That’s where compressed data is stored.

__`End_Marker`__

The flow of blocks ends when the last block header brings an _end signal_.
This last block header may optionally host a `Content_Checksum`.

##### __`Content_Checksum`__

`Content_Checksum` allows verifying that the frame content has been regenerated correctly.
The content checksum is the result
of [xxh64() hash function](https://www.xxHash.com)
digesting the original (decoded) data as input, and a seed of zero.
Bits 11 to 32 (inclusive) are extracted to form a 22-bit checksum
stored within `End_Marker`.
```
mask22bits = (1<<22)-1;
contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits;
```
`Content_Checksum` is only present when its associated flag
is set in the frame descriptor.
Its usage is optional.



The structure of `Frame_Header`
-------------------------------
The `Frame_Header` has a variable size, which uses a minimum of 2 bytes,
and up to 14 bytes depending on optional parameters.
The structure of `Frame_Header` is as follows:

| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
| ------------------------- | --------------------- | ----------------- | ---------------------- |
| 1 byte                    | 0-1 byte              | 0-4 bytes         | 0-8 bytes              |

### `Frame_Header_Descriptor`

The first byte of the header is called the `Frame_Header_Descriptor`.
It tells which other fields are present.
Decoding this byte is enough to tell the size of `Frame_Header`.

| Bit number | Field name                |
| ---------- | ------------------------- |
| 7-6        | `Frame_Content_Size_flag` |
| 5          | `Single_Segment_flag`     |
| 4          | `Unused_bit`              |
| 3          | `Reserved_bit`            |
| 2          | `Content_Checksum_flag`   |
| 1-0        | `Dictionary_ID_flag`      |

In this table, bit 7 is the highest bit, while bit 0 is the lowest.

__`Single_Segment_flag`__

If `Single_Segment_flag` is not set, then `Window_Descriptor` is mandatory and `Frame_Content_Size_flag` will be ignored.

If `Single_Segment_flag` is set, then `Window_Descriptor` should be absent and `Frame_Content_Size_flag` will be used along with a mandatory `Frame_Content_Size` field.
As a consequence, the decoder must allocate a single continuous memory segment of size equal to or larger than `Frame_Content_Size`.

To protect the decoder from unreasonable memory requirements,
a decoder can reject a compressed frame
which requests a memory size beyond the decoder's authorized range.

For broader compatibility, decoders are recommended to support
memory sizes of at least 8 MB.
This is just a recommendation:
each decoder is free to support higher or lower limits,
depending on local limitations.

__`Frame_Content_Size_flag`__

This is a 2-bit flag (`= FHD >> 6`), used only if `Single_Segment_flag` is set.
In that case, its value gives the field size, i.e. the number of bytes used by `Frame_Content_Size`, according to the following table:

|  Value   |  0  |  1  |  2  |  3  |
|----------| --- | --- | --- | --- |
|Field size|  1  |  2  |  4  |  8  |

__`Unused_bit`__

The value of this bit should be set to zero.
A decoder compliant with this specification version should not interpret it.
It might be used in a future version,
to signal a property which is not mandatory to properly decode the frame.

__`Reserved_bit`__

This bit is reserved for some future feature.
Its value _must be zero_.
A decoder compliant with this specification version must ensure it is not set.
This bit may be used in a future revision,
to signal a feature that must be interpreted in order to decode the frame.

__`Content_Checksum_flag`__

If this flag is set, a content checksum will be present within `End_Marker`.
The checksum is a 22-bit value extracted from the XXH64() of data,
and stored within `End_Marker`. See [`Content_Checksum`](#content_checksum).

__`Dictionary_ID_flag`__

This is a 2-bit flag (`= FHD & 3`),
telling if a dictionary ID is provided within the header.
It also specifies the size of this field.

|  Value   |  0  |  1  |  2  |  3  |
| -------- | --- | --- | --- | --- |
|Field size|  0  |  1  |  2  |  4  |

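
The flag descriptions above can be combined into a small decoding routine. The following is a minimal sketch, assuming the bit layout and the two field-size tables given in this section. The struct `frame_descriptor` and the function `parse_frame_descriptor` are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Fields decoded from the Frame_Header_Descriptor byte (hypothetical struct). */
typedef struct {
    int single_segment;        /* bit 5 */
    int content_checksum;      /* bit 2 */
    size_t window_desc_size;   /* 0 or 1 byte */
    size_t dict_id_size;       /* 0, 1, 2 or 4 bytes */
    size_t content_size_size;  /* 0, 1, 2, 4 or 8 bytes */
    size_t header_size;        /* total Frame_Header size in bytes */
} frame_descriptor;

/* Decode the descriptor byte.
 * Returns 0 on success, -1 if Reserved_bit (bit 3) is set. */
static int parse_frame_descriptor(uint8_t fhd, frame_descriptor *out)
{
    static const size_t dict_sizes[4] = {0, 1, 2, 4};
    static const size_t fcs_sizes[4]  = {1, 2, 4, 8};

    if (fhd & 0x08) return -1;   /* Reserved_bit must be zero */

    out->single_segment   = (fhd >> 5) & 1;
    out->content_checksum = (fhd >> 2) & 1;
    out->dict_id_size     = dict_sizes[fhd & 3];
    if (out->single_segment) {
        /* No Window_Descriptor; Frame_Content_Size is mandatory. */
        out->window_desc_size  = 0;
        out->content_size_size = fcs_sizes[fhd >> 6];
    } else {
        /* Window_Descriptor is mandatory; Frame_Content_Size_flag is ignored. */
        out->window_desc_size  = 1;
        out->content_size_size = 0;
    }
    out->header_size = 1 + out->window_desc_size
                         + out->dict_id_size + out->content_size_size;
    return 0;
}
```

`Unused_bit` (bit 4) is deliberately not inspected, since a compliant decoder should not interpret it.
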
### `Window_Descriptor`

Provides guarantees on the maximum back-reference distance
that will be present within compressed data.
This information is useful for decoders to allocate enough memory.

The `Window_Descriptor` byte is optional. It should be absent if `Single_Segment_flag` is set.
In this case, the maximum back-reference distance is the content size itself,
which can be any value from 1 to 2^64-1 bytes (16 EB).

| Bit numbers |   7-3    |   2-0    |
| ----------- | -------- | -------- |
| Field name  | Exponent | Mantissa |

Maximum distance is given by the following formulae :
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
windowAdd = (windowBase / 8) * Mantissa;
windowSize = windowBase + windowAdd;
```
The minimum window size is 1 KB.
The maximum size is `15*(1<<38)` bytes, which is 3.75 TB.

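
The formulae above translate directly into C. This is a minimal sketch: the function name `window_size_from_descriptor` is hypothetical, and a 64-bit result is used because the largest window does not fit in 32 bits.

```c
#include <stdint.h>

/* Compute the window size in bytes from the Window_Descriptor byte. */
static uint64_t window_size_from_descriptor(uint8_t wd)
{
    uint32_t exponent   = wd >> 3;   /* bits 7-3 */
    uint32_t mantissa   = wd & 7;    /* bits 2-0 */
    uint32_t window_log = 10 + exponent;
    uint64_t window_base = (uint64_t)1 << window_log;
    uint64_t window_add  = (window_base / 8) * mantissa;
    return window_base + window_add;
}
```

For example, a descriptor byte of `0x68` (Exponent 13, Mantissa 0) yields a window of 8 MB, and `0xFF` yields the maximum of `15*(1<<38)` bytes.
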
To properly decode compressed data,
a decoder will need to allocate a buffer of at least `windowSize` bytes.

To protect the decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond the decoder's authorized range.

For improved interoperability,
decoders are recommended to be compatible with window sizes of 8 MB.
Encoders are recommended to not request more than 8 MB.
It's merely a recommendation though:
decoders are free to support higher or lower limits,
depending on local limitations.

### `Dictionary_ID`

This is a variable size field, which contains
the ID of the dictionary required to properly decode the frame.
Note that this field is optional. When it's not present,
it's up to the caller to make sure it uses the correct dictionary.

Field size depends on `Dictionary_ID_flag`.
1 byte can represent an ID 0-255.
2 bytes can represent an ID 0-65535.
4 bytes can represent an ID 0-4294967295.

It's allowed to represent a small ID (for example `13`)
with a large 4-byte dictionary ID, losing some compactness in the process.

_Reserved ranges :_
If the frame is going to be distributed in a private environment,
any dictionary ID can be used.
However, for public distribution of compressed frames using a dictionary,
the following ranges are reserved for future use and should not be used :
- low range : 1 - 32767
- high range : >= (2^31)


### `Frame_Content_Size`

This is the original (uncompressed) size.
This information is optional, and only present if `Single_Segment_flag` is set.
Content size is provided using 1, 2, 4 or 8 bytes according to `Frame_Content_Size_flag`.
Format is Little endian.

| Field Size |    Range    |
| ---------- | ----------- |
|     1      |   0 - 255   |
|     2      | 256 - 65791 |
|     4      | 0 - 2^32-1  |
|     8      | 0 - 2^64-1  |

When field size is 1, 4 or 8 bytes, the value is read directly.
When field size is 2, _the offset of 256 is added_.
It's allowed to represent a small size (for example `18`) using any compatible variant.

To protect the decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond the decoder's authorized range.


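
The reading rule for `Frame_Content_Size`, including the offset of 256 for the 2-byte variant, can be sketched as follows. The function name `read_frame_content_size` and its error convention are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Read Frame_Content_Size from `p`, stored on `field_size` bytes
 * (1, 2, 4 or 8), little-endian.
 * Returns 0 on success, -1 if `field_size` is not a valid field size. */
static int read_frame_content_size(const uint8_t *p, size_t field_size,
                                   uint64_t *out)
{
    uint64_t value = 0;
    size_t i;

    if (field_size != 1 && field_size != 2 &&
        field_size != 4 && field_size != 8)
        return -1;

    /* Assemble the little-endian value, lowest byte first. */
    for (i = 0; i < field_size; i++)
        value |= (uint64_t)p[i] << (8 * i);

    if (field_size == 2)
        value += 256;   /* the 2-byte variant stores (size - 256) */

    *out = value;
    return 0;
}
```

With this rule, the 2-byte encoding `FF FF` represents 65791, the upper bound shown in the table.
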
The structure of `Data_Block`
-----------------------------
The structure of `Data_Block` is as follows:

| `Block_Type` | `Block_Size` | `Block_Content` |
|:------------:|:------------:|:---------------:|
|   2 bits     |  22 bits     |    n bytes      |

__`Block_Type` and `Block_Size`__

The block header uses 3 bytes, format is __little-endian__.
The 2 highest bits represent the `Block_Type`,
while the remaining 22 bits represent the (compressed) `Block_Size`.

There are 4 block types :

|    Value     |      0      |      1      |         2          |     3     |
| ------------ | ----------- | ----------- | ------------------ | --------- |
| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `EndMark` |

- `Raw_Block` - this is an uncompressed block.
  `Block_Size` is the number of bytes to read and copy.
- `RLE_Block` - this is a single byte, repeated N times.
  In this case, `Block_Size` is the size to regenerate,
  while the "compressed" block is just 1 byte (the byte to repeat).
- `Compressed_Block` - this is a [Zstandard compressed block](#the-format-of-compressed_block),
  detailed in another section of this specification.
  `Block_Size` is the compressed size.
  Decompressed size is unknown,
  but its maximum possible value is guaranteed (see below).
- `EndMark` - this is not a block. It signals the end of the frame.
  The rest of the field may be optionally filled by a checksum
  (see [`Content_Checksum`](#content_checksum)).

Block sizes must respect a few rules :
- In compressed mode, compressed size is always strictly `< decompressed size`.
- Block decompressed size is always <= maximum back-reference distance.
- Block decompressed size is always <= 128 KB.


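
The following sketch decodes a block header under a literal reading of the text above: the 3 bytes are assembled as one 24-bit little-endian integer, whose top 2 bits give `Block_Type` and whose low 22 bits give `Block_Size`. That interpretation, as well as the names `block_header`, `parse_block_header` and `block_content_bytes`, are assumptions made for this example.

```c
#include <stdint.h>

enum block_type {
    RAW_BLOCK = 0,
    RLE_BLOCK = 1,
    COMPRESSED_BLOCK = 2,
    END_MARK = 3
};

typedef struct {
    enum block_type type;
    uint32_t size;   /* 22-bit Block_Size */
} block_header;

/* Decode the 3-byte block header (assumed 24-bit little-endian value). */
static block_header parse_block_header(const uint8_t h[3])
{
    uint32_t v = (uint32_t)h[0]
               | ((uint32_t)h[1] << 8)
               | ((uint32_t)h[2] << 16);
    block_header bh;
    bh.type = (enum block_type)(v >> 22);   /* 2 highest bits */
    bh.size = v & 0x3FFFFFu;                /* remaining 22 bits */
    return bh;
}

/* Number of bytes of Block_Content that follow the header in the stream. */
static uint32_t block_content_bytes(block_header bh)
{
    switch (bh.type) {
    case RLE_BLOCK: return 1;   /* only the byte to repeat is stored */
    case END_MARK:  return 0;   /* not a block; the field may hold a checksum */
    default:        return bh.size;
    }
}
```

Note that for an `RLE_Block`, `size` is the number of bytes to regenerate, not the number of bytes to read.
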
__`Block_Content`__

The `Block_Content` is where the actual data to decode is stored.
It might be compressed or not, depending on previous field indications.
A data block is not necessarily "full" :
since an arbitrary “flush” may happen anytime,
block decompressed content can be any size,
up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
- Maximum back-reference distance
- 128 KB


Skippable Frames
----------------

| `Magic_Number` | `Frame_Size` | `User_Data` |
|:--------------:|:------------:|:-----------:|
|   4 bytes      |  4 bytes     |   n bytes   |

Skippable frames allow the insertion of user-defined data
into a flow of concatenated frames.
Their design is pretty straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.

Skippable frames defined in this specification are compatible with [LZ4] ones.

[LZ4]:http://www.lz4.org

__`Magic_Number`__

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__`Frame_Size`__

This is the size, in bytes, of the following `User_Data`
(not including the magic number or the size field itself).
This field is represented using 4 Bytes, Little endian format, unsigned 32-bit.
This means `User_Data` can’t be bigger than (2^32-1) bytes.

__`User_Data`__

The `User_Data` can be anything. Data will just be skipped by the decoder.


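
A decoder that wants to step over a skippable frame only needs the two fixed fields. The sketch below is one way to do it, assuming the whole 8-byte prefix is available in memory. The function names `read_le32` and `skippable_frame_length` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read 4 bytes as an unsigned little-endian 32-bit value. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* If `src` starts with a skippable frame, return the total number of bytes
 * to skip (Magic_Number + Frame_Size field + User_Data).
 * Return 0 otherwise, including when fewer than 8 bytes are available. */
static uint64_t skippable_frame_length(const uint8_t *src, size_t size)
{
    if (size < 8) return 0;
    /* Any of the 16 values 0x184D2A50..0x184D2A5F identifies the frame. */
    if ((read_le32(src) & 0xFFFFFFF0u) != 0x184D2A50u) return 0;
    return 8 + (uint64_t)read_le32(src + 4);
}
```

The result is returned as a 64-bit value because 8 plus the maximum `Frame_Size` exceeds the 32-bit range.
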
The format of `Compressed_Block`
--------------------------------
The size of `Compressed_Block` must be provided using the `Block_Size` field from `Data_Block`.
The `Compressed_Block` has a guaranteed maximum regenerated size,
in order to properly allocate destination buffer.
See [`Data_Block`](#the-structure-of-data_block) for more details.

A compressed block consists of 2 sections :
- [Literals section](#literals-section)
- [Sequences section](#sequences-section)

### Prerequisites
To decode a compressed block, the following elements are necessary :
- Previous decoded blocks, up to a distance of `windowSize`,
  or all previous blocks when `Single_Segment_flag` is set.
- List of "recent offsets" from previous compressed block.
- Decoding tables of previous compressed block for each symbol type
  (literals, litLength, matchLength, offset).


### Literals section

During the sequence phase, literals will be interleaved with match copy operations.
All literals are grouped in the first part of the block.
They can be decoded first, and then copied during sequence operations,
or they can be decoded on the fly, as needed by sequence commands.

| Literals section header | [Huffman Tree Description] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
| ----------------------- | -------------------------- | ------- | --------- | --------- | --------- |

Literals can be stored uncompressed or compressed using Huffman prefix codes.
When compressed, an optional tree description can be present,
followed by 1 or 4 streams.


#### Literals section header

The header is in charge of describing how literals are packed.
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
using little-endian convention.

| Literals Block Type | sizes format | regenerated size | [compressed size] |
| ------------------- | ------------ | ---------------- | ----------------- |
|   2 bits            |  1 - 2 bits  |    5 - 20 bits   |    0 - 18 bits    |

In this representation, bits on the left are the lowest bits.

__Literals Block Type__ :

This field uses the 2 lowest bits of the first byte, describing 4 different block types :

|        Value        |  0  |  1  |     2      |      3      |
| ------------------- | --- | --- | ---------- | ----------- |
| Literals Block Type | Raw | RLE | Compressed | RepeatStats |

- Raw literals block - Literals are stored uncompressed.
- RLE literals block - Literals consist of a single byte value repeated N times.
- Compressed literals block - This is a standard huffman-compressed block,
  starting with a huffman tree description.
  See details below.
- Repeat Stats literals block - This is a huffman-compressed block,
  using huffman tree _from previous huffman-compressed literals block_.
  Huffman tree description will be skipped.

				464	__Sizes format__ :
				465
				466	Sizes format are divided into 2 families :
				467
				468	- For compressed block, it requires to decode both the compressed size
				469	and the decompressed size. It will also decode the number of streams.
				470	- For Raw or RLE blocks, it's enough to decode the size to regenerate.
				471
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	472	For values spanning several bytes, convention is Little-endian.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	473
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	474	__Sizes format for Raw and RLE literals block__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	475
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	476	- Value : x0 : Regenerated size uses 5 bits (0-31).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	477	Total literal header size is 1 byte.
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	478	`size = h[0]>>3;`
				479	- Value : 01 : Regenerated size uses 12 bits (0-4095).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	480	Total literal header size is 2 bytes.
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	481	`size = (h[0]>>4) + (h[1]<<4);`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	482	- Value : 11 : Regenerated size uses 20 bits (0-1048575).
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	483	Total literal header size is 3 bytes.
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	484	`size = (h[0]>>4) + (h[1]<<4) + (h[2]<<12);`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	485
Note : it's allowed to represent a short value (ex : `13`)
using a long format, at the cost of a larger header.
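
The three formulas above can be gathered into one helper. This is a minimal sketch, not the reference implementation: the function name and the `sizeFormat` parameter (the 2-bit "Value" field, already extracted by the caller) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes the regenerated size of a Raw or RLE literals block.
 * `h` points to the first byte of the literals header.
 * `sizeFormat` is the 2-bit "Value" field listed above (hypothetical
 * parameter : extracting it from the header is left to the caller).
 * Stores the header length in *headerSize and returns the regenerated size. */
uint32_t rawLiteralsSize(const uint8_t* h, unsigned sizeFormat, size_t* headerSize)
{
    assert(sizeFormat <= 3);
    if ((sizeFormat & 1) == 0) {          /* x0 : 5 bits, 1-byte header */
        *headerSize = 1;
        return (uint32_t)h[0] >> 3;
    }
    if (sizeFormat == 1) {                /* 01 : 12 bits, 2-byte header */
        *headerSize = 2;
        return ((uint32_t)h[0] >> 4) + ((uint32_t)h[1] << 4);
    }
    *headerSize = 3;                      /* 11 : 20 bits, 3-byte header */
    return ((uint32_t)h[0] >> 4) + ((uint32_t)h[1] << 4) + ((uint32_t)h[2] << 12);
}
```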
				488
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame^]	489	__Sizes format for Compressed literals block and Repeat Stats literals block__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	490
Yann Collet	c2e1a68	2016-07-22 17:30:52 +0200	[diff] [blame]	491	- Value : 00 : _Single stream_.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	492	Compressed and regenerated sizes use 10 bits (0-1023).
				493	Total literal header size is 3 bytes.
Yann Collet	c2e1a68	2016-07-22 17:30:52 +0200	[diff] [blame]	494	- Value : 01 : 4 streams.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	495	Compressed and regenerated sizes use 10 bits (0-1023).
				496	Total literal header size is 3 bytes.
				497	- Value : 10 : 4 streams.
				498	Compressed and regenerated sizes use 14 bits (0-16383).
				499	Total literal header size is 4 bytes.
Yann Collet	d9cc442	2016-07-22 19:15:27 +0200	[diff] [blame]	500	- Value : 11 : 4 streams.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	501	Compressed and regenerated sizes use 18 bits (0-262143).
				502	Total literal header size is 5 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	503
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame^]	504	Compressed and regenerated size fields follow little-endian convention.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	505
				506	#### Huffman Tree description
				507
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	508	This section is only present when literals block type is `Compressed` (`0`).
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	509
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	510	Prefix coding represents symbols from an a priori known alphabet
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	511	by bit sequences (codewords), one codeword for each symbol,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	512	in a manner such that different symbols may be represented
				513	by bit sequences of different lengths,
				514	but a parser can always parse an encoded string
				515	unambiguously symbol-by-symbol.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	516
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	517	Given an alphabet with known symbol frequencies,
				518	the Huffman algorithm allows the construction of an optimal prefix code
				519	using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	520
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	521	Prefix code must not exceed a maximum code length.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	522	More bits improve accuracy but cost more header size,
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	523	and require more memory or more complex decoding operations.
				524	This specification limits maximum code length to 11 bits.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	525
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	526
				527	##### Representation
				528
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	529	All literal values from zero (included) to last present one (excluded)
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	530	are represented by `weight` values, from 0 to `maxBits`.
Transformation from `weight` to `nbBits` follows this formula :
				532	`nbBits = weight ? maxBits + 1 - weight : 0;` .
				533	The last symbol's weight is deduced from previously decoded ones,
				534	by completing to the nearest power of 2.
				535	This power of 2 gives `maxBits`, the depth of the current tree.
				536
				537	__Example__ :
				538	Let's presume the following huffman tree must be described :
				539
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	540	\| literal \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
				541	\| ------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				542	\| nbBits \| 1 \| 2 \| 3 \| 0 \| 4 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	543
The tree depth is 4, since its longest codes use 4 bits.
Value `5` will not be listed, nor will values above `5`.
Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
It gives the following series of weights :
				549
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	550	\| weights \| 4 \| 3 \| 2 \| 0 \| 1 \|
				551	\| ------- \| --- \| --- \| --- \| --- \| --- \|
				552	\| literal \| 0 \| 1 \| 2 \| 3 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	553
				554	The decoder will do the inverse operation :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	555	having collected weights of literals from `0` to `4`,
				556	it knows the last literal, `5`, is present with a non-zero weight.
The weight of `5` can be deduced by completing to the nearest power of 2.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	558	Sum of 2^(weight-1) (excluding 0) is :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	559	`8 + 4 + 2 + 0 + 1 = 15`
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	560	Nearest power of 2 is 16.
				561	Therefore, `maxBits = 4` and `weight[5] = 1`.
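
The deduction above can be sketched in C as follows. Function names are hypothetical. The sketch assumes that "nearest power of 2" means the smallest power of 2 strictly greater than the sum, since the last symbol must receive a non-zero weight, and it rejects trees deeper than the 11-bit limit stated earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Position of the highest set bit of v (v must be non-zero). */
unsigned highBit(uint32_t v)
{
    unsigned n = 0;
    assert(v != 0);
    while (v >>= 1) n++;
    return n;
}

/* Given the `count` transmitted weights, computes maxBits and the weight of
 * the implicit last symbol. Returns 0 on success, -1 if the weights cannot
 * be completed to a power of 2 by a single extra symbol. */
int completeWeights(const uint8_t* weights, unsigned count,
                    unsigned* maxBits, unsigned* lastWeight)
{
    uint32_t sum = 0;
    for (unsigned i = 0; i < count; i++) {
        if (weights[i] > 11) return -1;          /* code length limit is 11 bits */
        if (weights[i] > 0) sum += (uint32_t)1 << (weights[i] - 1);
    }
    if (sum == 0) return -1;
    unsigned const depth = highBit(sum) + 1;     /* smallest power of 2 above sum */
    if (depth > 11) return -1;
    uint32_t const rest = ((uint32_t)1 << depth) - sum;
    if (rest & (rest - 1)) return -1;            /* the gap must be a power of 2 */
    *maxBits = depth;
    *lastWeight = highBit(rest) + 1;
    return 0;
}

/* nbBits = weight ? maxBits + 1 - weight : 0 */
unsigned weightToNbBits(unsigned weight, unsigned maxBits)
{
    return weight ? maxBits + 1 - weight : 0;
}
```

Called on the weights of the example (`4, 3, 2, 0, 1`), the sum is 15, which gives `maxBits = 4` and a last weight of `1`.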
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	562
				563	##### Huffman Tree header
				564
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	565	This is a single byte value (0-255),
				566	which tells how to decode the list of weights.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	567
- if headerByte >= 128 : this is a direct representation,
where each weight is written directly as a 4-bit field (0-15).
The full representation occupies `((nbSymbols+1)/2)` bytes,
meaning it uses a last full byte even if nbSymbols is odd.
`nbSymbols = headerByte - 127;`.
Note that maximum nbSymbols is 255-127 = 128.
A larger series must necessarily use FSE compression.

- if headerByte < 128 :
the series of weights is compressed by FSE.
The length of the FSE-compressed series is `headerByte` (0-127).
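
As an illustration, the interpretation of `headerByte` can be sketched like this. The `HufHeaderInfo` type and the function name are hypothetical helpers, not part of the format. The sketch only classifies the header and does not unpack the weights themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      isDirect;     /* 1 : 4-bit weights, 0 : FSE-compressed weights */
    unsigned nbSymbols;    /* number of weights listed (direct mode only, else 0) */
    size_t   payloadSize;  /* bytes following the header byte */
} HufHeaderInfo;           /* hypothetical helper type */

HufHeaderInfo parseHufHeader(uint8_t headerByte)
{
    HufHeaderInfo info;
    if (headerByte >= 128) {
        info.isDirect    = 1;
        info.nbSymbols   = (unsigned)headerByte - 127;   /* 1 to 128 */
        info.payloadSize = (info.nbSymbols + 1) / 2;     /* two weights per byte */
    } else {
        info.isDirect    = 0;
        info.nbSymbols   = 0;                            /* unknown until decoded */
        info.payloadSize = headerByte;                   /* 0 to 127 */
    }
    assert(info.payloadSize <= 127);
    return info;
}
```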
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	579
				580	##### FSE (Finite State Entropy) compression of huffman weights
				581
The series of weights is compressed using FSE compression.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	583	It's a single bitstream with 2 interleaved states,
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	584	sharing a single distribution table.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	585
				586	To decode an FSE bitstream, it is necessary to know its compressed size.
				587	Compressed size is provided by `headerByte`.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	588	It's also necessary to know its _maximum possible_ decompressed size,
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	589	which is `255`, since literal values span from `0` to `255`,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	590	and last symbol value is not represented.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	591
An FSE bitstream starts with a header, describing the probability distribution.
This header is used to build a Decoding Table.
The table must be pre-allocated, which requires supporting a maximum accuracy.
For a list of huffman weights, maximum accuracy is 7 bits.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	596
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	597	FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format),
				598	and so is [FSE bitstream](#bitstream).
				599	The main difference is that Huffman header compression uses 2 states,
				600	which share the same FSE distribution table.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	601	Bitstream contains only FSE symbols (no interleaved "raw bitfields").
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	602	The number of symbols to decode is discovered
				603	by tracking bitStream overflow condition.
				604	When both states have overflowed the bitstream, end is reached.
				605
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	606
				607	##### Conversion from weights to huffman prefix codes
				608
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	609	All present symbols shall now have a `weight` value.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	610	It is possible to transform weights into nbBits, using this formula :
`nbBits = weight ? maxBits + 1 - weight : 0;` .
				612
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	613	Symbols are sorted by weight. Within same weight, symbols keep natural order.
				614	Symbols with a weight of zero are removed.
				615	Then, starting from lowest weight, prefix codes are distributed in order.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	616
				617	__Example__ :
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	618	Let's presume the following list of weights has been decoded :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	619
				620	\| Literal \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
				621	\| ------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				622	\| weight \| 4 \| 3 \| 2 \| 0 \| 1 \| 1 \|
				623
				624	Sorted by weight and then natural order,
				625	it gives the following distribution :
				626
				627	\| Literal \| 3 \| 4 \| 5 \| 2 \| 1 \| 0 \|
				628	\| ------------ \| --- \| --- \| --- \| --- \| --- \| ---- \|
				629	\| weight \| 0 \| 1 \| 1 \| 2 \| 3 \| 4 \|
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	630	\| nb bits \| 0 \| 4 \| 4 \| 3 \| 2 \| 1 \|
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	631	\| prefix codes \| N/A \| 0000\| 0001\| 001 \| 01 \| 1 \|
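
One way to produce the prefix codes of this example is sketched below. It is an illustration under stated assumptions, not the reference implementation: the function name is hypothetical, and the code value is expressed as an `nbBits`-long binary number, written the same way as in the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Assigns prefix codes from weights, following the ordering described above.
 * weights[s] is the weight of symbol s (0 = absent), for s < nbSymbols.
 * On return, nbBits[s] and code[s] hold the code length and the code value. */
void assignPrefixCodes(const uint8_t* weights, unsigned nbSymbols, unsigned maxBits,
                       uint8_t* nbBits, uint16_t* code)
{
    uint32_t next = 0;   /* next free slot, in units of one maxBits-long code */
    assert(maxBits >= 1 && maxBits <= 11);
    for (unsigned s = 0; s < nbSymbols; s++) { nbBits[s] = 0; code[s] = 0; }
    for (unsigned w = 1; w <= maxBits; w++) {          /* lowest weight first */
        for (unsigned s = 0; s < nbSymbols; s++) {     /* natural order within a weight */
            if (weights[s] != w) continue;
            nbBits[s] = (uint8_t)(maxBits + 1 - w);
            code[s]   = (uint16_t)(next >> (w - 1));
            next     += (uint32_t)1 << (w - 1);        /* a weight-w symbol covers 2^(w-1) slots */
        }
    }
    assert(next == (uint32_t)1 << maxBits);            /* a complete tree fills every slot */
}
```

With the weights `4, 3, 2, 0, 1, 1` and `maxBits = 4`, literals `4` and `5` receive the 4-bit codes `0000` and `0001`, literal `2` receives `001`, literal `1` receives `01`, and literal `0` receives `1`.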
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	632
				633
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	634	#### Literals bitstreams
				635
				636	##### Bitstreams sizes
				637
				638	As seen in a previous paragraph,
				639	there are 2 flavors of huffman-compressed literals :
				640	single stream, and 4-streams.
				641
4-streams is useful for CPUs with multiple execution units and out-of-order execution.
				643	Since each stream can be decoded independently,
				644	it's possible to decode them up to 4x faster than a single stream,
				645	presuming the CPU has enough parallelism available.
				646
				647	For single stream, header provides both the compressed and regenerated size.
				648	For 4-streams though,
				649	header only provides compressed and regenerated size of all 4 streams combined.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	650	In order to properly decode the 4 streams,
				651	it's necessary to know the compressed and regenerated size of each stream.
				652
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	653	Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
				654	except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	655
Compressed size is provided explicitly : in the 4-streams variant,
bitstreams are preceded by 3 unsigned little-endian 16-bit values.
Each value represents the compressed size of one stream, in order.
The last stream size is deduced from total compressed size
and from previously decoded stream sizes :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	661	`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
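
The size bookkeeping for the 4-streams variant can be sketched as follows. Names are hypothetical. The sketch assumes `totalCSize` includes the 6 bytes holding the three stream sizes, as the formula above implies, and it treats any inconsistent combination as an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a 16-bit little-endian value. */
unsigned readLE16(const uint8_t* p) { return (unsigned)p[0] | ((unsigned)p[1] << 8); }

/* Splits a 4-streams literals payload.
 * `src` points to the three 16-bit stream sizes, `totalCSize` and `totalSize`
 * are the combined compressed and regenerated sizes from the literals header.
 * Fills cSize[4] and rSize[4]; returns 0 on success, -1 on inconsistent sizes. */
int splitFourStreams(const uint8_t* src, size_t totalCSize, size_t totalSize,
                     size_t cSize[4], size_t rSize[4])
{
    if (totalCSize < 6) return -1;
    cSize[0] = readLE16(src);
    cSize[1] = readLE16(src + 2);
    cSize[2] = readLE16(src + 4);
    size_t const used = 6 + cSize[0] + cSize[1] + cSize[2];
    if (used > totalCSize) return -1;
    cSize[3] = totalCSize - used;             /* stream4CSize, deduced */

    size_t const segment = (totalSize + 3) / 4;
    if (3 * segment > totalSize) return -1;   /* last stream would be negative */
    rSize[0] = rSize[1] = rSize[2] = segment;
    rSize[3] = totalSize - 3 * segment;       /* up to 3 bytes smaller */
    assert(rSize[0] + rSize[1] + rSize[2] + rSize[3] == totalSize);
    return 0;
}
```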
				662
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	663	##### Bitstreams read and decode
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	664
				665	Each bitstream must be read _backward_,
				666	that is starting from the end down to the beginning.
				667	Therefore it's necessary to know the size of each bitstream.
				668
It's also necessary to know exactly which _bit_ is the last one.
This is detected by a final bit flag :
the highest set bit of the last byte is a final-bit-flag.
Consequently, a last byte of `0` is not possible.
The final-bit-flag itself is not part of the useful bitstream.
Hence, the last byte contains between 0 and 7 useful bits.
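
Locating the final-bit-flag can be sketched as below. The function name is hypothetical, and the sketch assumes the flag is the highest bit set to `1` in the last byte, which is consistent with a last byte of `0` being invalid.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the number of useful bits (0-7) in the last byte of a bitstream,
 * i.e. the bits located below the final-bit-flag,
 * or -1 when the last byte is 0 (no flag present, corrupted stream). */
int usefulBitsInLastByte(uint8_t lastByte)
{
    int n = 7;
    if (lastByte == 0) return -1;
    while (((lastByte >> n) & 1) == 0) n--;   /* find the highest set bit */
    assert(n >= 0 && n <= 7);
    return n;
}
```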
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	675
				676	Starting from the end,
				677	it's possible to read the bitstream in a little-endian fashion,
				678	keeping track of already used bits.
				679
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	680	Reading the last `maxBits` bits,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	681	it's then possible to compare extracted value to decoding table,
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	682	determining the symbol to decode and number of bits to discard.
				683
The process continues until the required number of symbols per stream has been read.
The bitstream must then be entirely and exactly consumed,
that is, decoding must end exactly at its beginning position with _all_ bits consumed.
Otherwise, the decoding process is considered faulty.
				688
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	689
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	690	### Sequences section
				691
				692	A compressed block is a succession of _sequences_ .
				693	A sequence is a literal copy command, followed by a match copy command.
				694	A literal copy command specifies a length.
				695	It is the number of bytes to be copied (or extracted) from the literal section.
				696	A match copy command specifies an offset and a length.
				697	The offset gives the position to copy from,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	698	which can be within a previous block.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	699
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	700	There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	701	which are encoded together, interleaved in a single _bitstream_.
				702
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	703	Each symbol is a _code_ in its own context,
				704	which specifies a baseline and a number of bits to add.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	705	_Codes_ are FSE compressed,
				706	and interleaved with raw additional bits in the same bitstream.
				707
The Sequences section starts with a header,
				709	followed by optional Probability tables for each symbol type,
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	710	followed by the bitstream.
				711
Yann Collet	6fa05a2	2016-07-20 14:58:49 +0200	[diff] [blame]	712	\| Header \| [LitLengthTable] \| [OffsetTable] \| [MatchLengthTable] \| bitStream \|
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	713	\| ------ \| ---------------- \| ------------- \| ------------------ \| --------- \|
				714
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	715	To decode the Sequence section, it's required to know its size.
This size is deduced from `blockSize - literalSectionSize`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	717
				718
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	719	#### Sequences section header
				720
Consists of 2 items :
				722	- Nb of Sequences
				723	- Flags providing Symbol compression types
				724
				725	__Nb of Sequences__
				726
				727	This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
				728	Let's call its first byte `byte0`.
				729	- `if (byte0 == 0)` : there are no sequences.
				730	The sequence section stops there.
				731	Regenerated content is defined entirely by literals section.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	732	- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
				733	- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
				734	- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
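
The four cases above are tested in order, each one applying only if the previous ones did not. A minimal sketch follows. The function name and the error convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes `nbSeqs` from the start of the Sequences section.
 * Returns the number of bytes used by the field (1 to 3),
 * or -1 if `srcSize` is too small to hold it. */
int readNbSeqs(const uint8_t* src, size_t srcSize, uint32_t* nbSeqs)
{
    assert(src != 0 && nbSeqs != 0);
    if (srcSize < 1) return -1;
    uint8_t const byte0 = src[0];
    if (byte0 < 128) {                 /* includes byte0 == 0 : no sequences */
        *nbSeqs = byte0;
        return 1;
    }
    if (byte0 < 255) {
        if (srcSize < 2) return -1;
        *nbSeqs = ((uint32_t)(byte0 - 128) << 8) + src[1];
        return 2;
    }
    if (srcSize < 3) return -1;
    *nbSeqs = (uint32_t)src[1] + ((uint32_t)src[2] << 8) + 0x7F00;
    return 3;
}
```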
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	735
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	736	__Symbol encoding modes__
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	737
				738	This is a single byte, defining the compression mode of each symbol type.
				739
				740	\| BitNb \| 7-6 \| 5-4 \| 3-2 \| 1-0 \|
				741	\| ------- \| ------ \| ------ \| ------ \| -------- \|
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	742	\|FieldName\| LLType \| OFType \| MLType \| Reserved \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	743
				744	The last field, `Reserved`, must be all-zeroes.
				745
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	746	`LLType`, `OFType` and `MLType` define the compression mode of
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	747	Literal Lengths, Offsets and Match Lengths respectively.
				748
				749	They follow the same enumeration :
				750
Yann Collet	f8e7b53	2016-07-23 16:31:49 +0200	[diff] [blame]	751	\| Value \| 0 \| 1 \| 2 \| 3 \|
				752	\| ---------------- \| ------ \| --- \| ---------- \| ------ \|
				753	\| Compression Mode \| predef \| RLE \| Compressed \| Repeat \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	754
				755	- "predef" : uses a pre-defined distribution table.
				756	- "RLE" : it's a single code, repeated `nbSeqs` times.
				757	- "Repeat" : re-use distribution table from previous compressed block.
Yann Collet	f8e7b53	2016-07-23 16:31:49 +0200	[diff] [blame]	758	- "Compressed" : standard FSE compression.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	759	A distribution table will be present.
				760	It will be described in [next part](#distribution-tables).
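
Splitting the symbol encoding modes byte into its fields can be sketched as follows. The enum, struct and function names are hypothetical. Only the bit positions and the enumeration values come from the tables above.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { MODE_PREDEF = 0, MODE_RLE = 1, MODE_COMPRESSED = 2, MODE_REPEAT = 3 } SymbolMode;

typedef struct { SymbolMode llType, ofType, mlType; } SymbolModes;  /* hypothetical names */

/* Splits the symbol encoding modes byte into its three 2-bit fields.
 * Returns 0 on success, -1 if the Reserved bits are not all zero. */
int parseSymbolModes(uint8_t modesByte, SymbolModes* out)
{
    assert(out != 0);
    if (modesByte & 3) return -1;                     /* bits 1-0 : Reserved */
    out->llType = (SymbolMode)((modesByte >> 6) & 3); /* bits 7-6 */
    out->ofType = (SymbolMode)((modesByte >> 4) & 3); /* bits 5-4 */
    out->mlType = (SymbolMode)((modesByte >> 2) & 3); /* bits 3-2 */
    return 0;
}
```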
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	761
				762	#### Symbols decoding
				763
				764	##### Literal Lengths codes
				765
				766	Literal lengths codes are values ranging from `0` to `35` included.
				767	They define lengths from 0 to 131071 bytes.
				768
				769	\| Code \| 0-15 \|
				770	\| ------ \| ---- \|
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	771	\| length \| Code \|
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	772	\| nbBits \| 0 \|
				773
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	774
				775	\| Code \| 16 \| 17 \| 18 \| 19 \| 20 \| 21 \| 22 \| 23 \|
				776	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				777	\| Baseline \| 16 \| 18 \| 20 \| 22 \| 24 \| 28 \| 32 \| 40 \|
				778	\| nb Bits \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				779
				780	\| Code \| 24 \| 25 \| 26 \| 27 \| 28 \| 29 \| 30 \| 31 \|
				781	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				782	\| Baseline \| 48 \| 64 \| 128 \| 256 \| 512 \| 1024 \| 2048 \| 4096 \|
				783	\| nb Bits \| 4 \| 6 \| 7 \| 8 \| 9 \| 10 \| 11 \| 12 \|
				784
				785	\| Code \| 32 \| 33 \| 34 \| 35 \|
				786	\| -------- \| ---- \| ---- \| ---- \| ---- \|
				787	\| Baseline \| 8192 \|16384 \|32768 \|65536 \|
				788	\| nb Bits \| 13 \| 14 \| 15 \| 16 \|
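
Turning a literal length code into a length then amounts to a table lookup plus the additional bits. This is a minimal sketch with hypothetical names. `extraBits` stands for the value of the additional bits, already read from the bitstream by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Baselines and additional bits for codes 16 to 35, copied from the tables above. */
static const uint32_t LL_baseline[20] = { 16, 18, 20, 22, 24, 28, 32, 40,
                                          48, 64, 128, 256, 512, 1024, 2048, 4096,
                                          8192, 16384, 32768, 65536 };
static const uint8_t  LL_nbBits[20]   = { 1, 1, 1, 1, 2, 2, 3, 3,
                                          4, 6, 7, 8, 9, 10, 11, 12,
                                          13, 14, 15, 16 };

/* Number of additional bits to read for a given code. */
unsigned literalLengthExtraBits(unsigned code)
{
    assert(code <= 35);
    return code < 16 ? 0 : LL_nbBits[code - 16];
}

/* Length represented by a code and its additional bits. */
uint32_t literalLengthValue(unsigned code, uint32_t extraBits)
{
    assert(code <= 35);
    if (code < 16) return code;                 /* codes 0-15 are the length itself */
    assert(extraBits < ((uint32_t)1 << LL_nbBits[code - 16]));
    return LL_baseline[code - 16] + extraBits;
}
```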
				789
				790	__Default distribution__
				791
When "compression mode" is "predef",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	793	a pre-defined distribution is used for FSE compression.
				794
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	795	Below is its definition. It uses an accuracy of 6 bits (64 states).
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	796	```
				797	short literalLengths_defaultDistribution[36] =
				798	{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
				799	2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
				800	-1,-1,-1,-1 };
				801	```
				802
				803	##### Match Lengths codes
				804
				805	Match lengths codes are values ranging from `0` to `52` included.
				806	They define lengths from 3 to 131074 bytes.
				807
				808	\| Code \| 0-31 \|
				809	\| ------ \| -------- \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	810	\| value \| Code + 3 \|
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	811	\| nbBits \| 0 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	812
				813	\| Code \| 32 \| 33 \| 34 \| 35 \| 36 \| 37 \| 38 \| 39 \|
				814	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				815	\| Baseline \| 35 \| 37 \| 39 \| 41 \| 43 \| 47 \| 51 \| 59 \|
				816	\| nb Bits \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				817
| Code     |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline |  67  |  83  |  99  |  131 |  259 |  515 | 1027 | 2051 |
| nb Bits  |   4  |   4  |   5  |   7  |   8  |   9  |  10  |  11  |

| Code     |  48  |  49  |  50  |  51  |  52  |
| -------- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 4099 | 8195 |16387 |32771 |65539 |
| nb Bits  |  12  |  13  |  14  |  15  |  16  |
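
Match length baselines can also be derived rather than stored. The sketch below uses hypothetical names and assumes that code ranges are contiguous, meaning each baseline equals the previous baseline plus `1 << nbBits` of the previous code. Under that assumption the last code reaches exactly 131074, the upper bound stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Additional bits for match length codes 32 to 52, from the tables above. */
static const uint8_t ML_extra[21] = { 1, 1, 1, 1, 2, 2, 3, 3,
                                      4, 4, 5, 7, 8, 9, 10, 11,
                                      12, 13, 14, 15, 16 };

unsigned matchLengthExtraBits(unsigned code)
{
    assert(code <= 52);
    return code < 32 ? 0 : ML_extra[code - 32];
}

/* Baseline of a code, derived by accumulating the ranges of all lower codes. */
uint32_t matchLengthBaseline(unsigned code)
{
    uint32_t base = 3;                       /* shortest match length */
    assert(code <= 52);
    for (unsigned c = 0; c < code; c++)
        base += (uint32_t)1 << matchLengthExtraBits(c);
    return base;
}

/* Length represented by a code and its additional bits. */
uint32_t matchLengthValue(unsigned code, uint32_t extraBits)
{
    assert(extraBits < ((uint32_t)1 << matchLengthExtraBits(code)));
    return matchLengthBaseline(code) + extraBits;
}
```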
				827
				828	__Default distribution__
				829
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	830	When "compression mode" is defined as "predef",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	831	a pre-defined distribution is used for FSE compression.
				832
				833	Here is its definition. It uses an accuracy of 6 bits (64 states).
				834	```
				835	short matchLengths_defaultDistribution[53] =
				836	{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				837	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
				838	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
				839	-1,-1,-1,-1,-1 };
				840	```
				841
				842	##### Offset codes
				843
				844	Offset codes are values ranging from `0` to `N`,
				845	with `N` being limited by maximum backreference distance.
				846
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	847	A decoder is free to limit its maximum `N` supported.
				848	Recommendation is to support at least up to `22`.
For information, at the time of this writing,
the reference decoder supports a maximum `N` value of `28` in 64-bits mode.

An offset code is also the number of additional bits to read,
and can be translated into an `OFValue` using the following formula :
				854
				855	```
				856	OFValue = (1 << offsetCode) + readNBits(offsetCode);
				857	if (OFValue > 3) offset = OFValue - 3;
				858	```
				859
				860	OFValue from 1 to 3 are special : they define "repeat codes",
				861	which means one of the previous offsets will be repeated.
				862	They are sorted in recency order, with 1 meaning the most recent one.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	863	See [Repeat offsets](#repeat-offsets) paragraph.
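
The translation from offset code to either a repeat code or an actual offset can be sketched as follows. The `OffsetInfo` type and function name are hypothetical. `extraBits` stands for the result of `readNBits(offsetCode)`, and the limit of `28` is the one quoted above for the reference decoder.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int      isRepeat;   /* 1 when OFValue is a repeat code (1 to 3) */
    uint32_t value;      /* repeat code index, or the actual offset */
} OffsetInfo;            /* hypothetical helper type */

OffsetInfo decodeOffset(unsigned offsetCode, uint32_t extraBits)
{
    OffsetInfo info;
    assert(offsetCode <= 28);
    assert(extraBits < ((uint32_t)1 << offsetCode));
    uint32_t const ofValue = ((uint32_t)1 << offsetCode) + extraBits;
    info.isRepeat = (ofValue <= 3);
    info.value    = info.isRepeat ? ofValue : ofValue - 3;
    return info;
}
```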
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	864
				865	__Default distribution__
				866
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	867	When "compression mode" is defined as "predef",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	868	a pre-defined distribution is used for FSE compression.
				869
				870	Here is its definition. It uses an accuracy of 5 bits (32 states),
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	871	and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	872
				873	If any sequence in the compressed block requires an offset larger than this,
				874	it's not possible to use the default distribution to represent it.
				875
				876	```
short offsetCodes_defaultDistribution[29] =
				878	{ 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				879	1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
				880	```
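
A quick consistency check on a default distribution is that its entries add up to the number of states, with each `-1` entry counting as one cell. The helper below is a hypothetical sketch of that check, not part of the format.

```c
#include <assert.h>
#include <stddef.h>

/* Sums a normalized distribution, counting each `-1` ("less than 1") entry as one. */
int distributionTotal(const short* distribution, size_t nbSymbols)
{
    int total = 0;
    for (size_t i = 0; i < nbSymbols; i++) {
        assert(distribution[i] >= -1);
        total += (distribution[i] == -1) ? 1 : distribution[i];
    }
    return total;
}
```

Applied to the three default distributions of this section, the totals are 64 for literal lengths, 64 for match lengths and 32 for offset codes, matching accuracies of 6, 6 and 5 bits.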
				881
				882	#### Distribution tables
				883
				884	Following the header, up to 3 distribution tables can be described.
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	885	When present, they are in this order :
- Literal lengths
- Offsets
- Match Lengths
				889
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	890	The content to decode depends on their respective encoding mode :
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	891	- Predef : no content. Use pre-defined distribution table.
				892	- RLE : 1 byte. This is the only code to use across the whole compressed block.
- Compressed : FSE compression. A distribution table is present.
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	894	- Repeat mode : no content. Re-use distribution from previous compressed block.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	895
				896	##### FSE distribution table : condensed format
				897
				898	An FSE distribution table describes the probabilities of all symbols
				899	from `0` to the last present one (included)
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	900	on a normalized scale of `1 << AccuracyLog` .
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	901
				902	It's a bitstream which is read forward, in little-endian fashion.
				903	It's not necessary to know its exact size,
				904	since it will be discovered and reported by the decoding process.
				905
				906	The bitstream starts by reporting on which scale it operates.
				907	`AccuracyLog = low4bits + 5;`
Note that maximum `AccuracyLog` for literal and match lengths is `9`,
and for offsets it is `8`. Higher values are considered errors.

Then follows each symbol value, from `0` to the last present one.
The number of bits used by each field is variable.
It depends on :
				914
- Remaining probabilities + 1 :
__example__ :
Presuming an AccuracyLog of 8,
and presuming 100 probability points have already been distributed,
the decoder may read any value from `0` to `256 - 100 + 1 == 157` (included).
Therefore, it must read `log2sup(157) == 8` bits.

- Value decoded : small values use 1 less bit :
__example__ :
Presuming values from 0 to 157 (included) are possible,
255-157 = 98 values are remaining in an 8-bit field.
They are used this way :
first 98 values (hence from 0 to 97) use only 7 bits,
values from 98 to 157 use 8 bits.
This is achieved through this scheme :

| Value read | Value decoded | nb Bits used |
| ---------- | ------------- | ------------ |
|   0 -  97  |   0 -  97     |  7           |
|  98 - 127  |  98 - 127     |  8           |
| 128 - 225  |   0 -  97     |  7           |
| 226 - 255  | 128 - 157     |  8           |
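
The scheme above generalizes to any upper bound. The sketch below is an illustration with hypothetical names, not the reference decoder: `maxValue` is the largest value that may currently be read, and `peeked` is assumed to already hold the next bits of the bitstream, lowest bit first.

```c
#include <assert.h>
#include <stdint.h>

/* Decodes one field, given the largest value that may currently be read.
 * Stores the number of bits actually consumed in *bitsUsed. */
unsigned decodeSmallValue(uint32_t peeked, unsigned maxValue, unsigned* bitsUsed)
{
    unsigned nbBits = 0;
    assert(maxValue >= 1);
    while ((maxValue >> nbBits) != 0) nbBits++;            /* bits needed to represent maxValue */
    uint32_t const half     = (uint32_t)1 << (nbBits - 1);
    uint32_t const lowCount = (2 * half - 1) - maxValue;   /* values that fit in nbBits-1 bits */

    uint32_t const low = peeked & (half - 1);
    if (low < lowCount) {                                  /* short form */
        *bitsUsed = nbBits - 1;
        return low;
    }
    uint32_t value = peeked & (2 * half - 1);              /* long form */
    if (value >= half) value -= lowCount;
    *bitsUsed = nbBits;
    return value;
}
```

The short form is recognized from the low `nbBits-1` bits alone, which is why the same small value can appear under two different raw readings in the table.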
				937
				938	Symbols probabilities are read one by one, in order.
				939
Probability is obtained from Value decoded by the following formula :
				941	`Proba = value - 1;`
				942
				943	It means value `0` becomes negative probability `-1`.
				944	`-1` is a special probability, which means `less than 1`.
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	945	Its effect on distribution table is described in [next paragraph].
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	946	For the purpose of calculating cumulated distribution, it counts as one.
				947
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	948	[next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables
				949
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	950	When a symbol has a probability of `zero`,
				951	it is followed by a 2-bits repeat flag.
				952	This repeat flag tells how many probabilities of zeroes follow the current one.
				953	It provides a number ranging from 0 to 3.
				954	If it is a 3, another 2-bits repeat flag follows, and so on.
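
The chaining of repeat flags can be sketched as follows. For simplicity, the sketch assumes the successive 2-bit flags have already been extracted from the bitstream into an array. That layout and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the additional zero-probability symbols announced by repeat flags.
 * Stores the number of flags consumed in *flagsUsed. */
unsigned countZeroRepeats(const unsigned char* flags, size_t nbFlags, size_t* flagsUsed)
{
    unsigned zeroes = 0;
    size_t i = 0;
    while (i < nbFlags) {
        assert(flags[i] <= 3);
        zeroes += flags[i];
        if (flags[i++] != 3) break;     /* only a value of 3 chains another flag */
    }
    *flagsUsed = i;
    return zeroes;
}
```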
				955
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	956	When last symbol reaches cumulated total of `1 << AccuracyLog`,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	957	decoding is complete.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	958	If the last symbol makes cumulated total go above `1 << AccuracyLog`,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	959	distribution is considered corrupted.
				960
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	961	Then the decoder can tell how many bytes were used in this process,
				962	and how many symbols are present.
				963	The bitstream consumes a round number of bytes.
				964	Any remaining bit within the last byte is just unused.
				965
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	966	##### FSE decoding : from normalized distribution to decoding tables
				967
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	968	The distribution of normalized probabilities is enough
				969	to create a unique decoding table.
				970
It is built according to the following rules :
				972
				973	The table has a size of `tableSize = 1 << AccuracyLog;`.
				974	Each cell describes the symbol decoded,
				975	and instructions to get the next state.
				976
				977	Symbols are first scanned in their natural order for `less than 1` probabilities.
				978	Each symbol with this probability is attributed a single cell,
				979	starting from the end of the table.
				980	These symbols define a full state reset, reading `AccuracyLog` bits.
				981
				982	All remaining symbols are sorted in their natural order.
				983	Starting from symbol `0` and table position `0`,
				984	each symbol is attributed as many cells as its probability.
				985	Cell allocation is spread, not linear :
				986	each successor position follows this rule :
				987
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	988	```
				989	position += (tableSize>>1) + (tableSize>>3) + 3;
				990	position &= tableSize-1;
				991	```
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	992
				993	A position is skipped if already occupied,
				994	typically by a "less than 1" probability symbol.
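
The build rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `spread_symbols` is hypothetical, a "less than 1" probability is assumed to be encoded as `-1` in the input list, and "occupied" positions are taken to be the cells reserved at the end of the table.

```python
def spread_symbols(norm_counts, accuracy_log):
    """Return a list mapping each table position (state) to a symbol.

    norm_counts[s] is the normalized probability of symbol s;
    -1 stands for "less than 1" (an encoding assumed by this sketch).
    """
    table_size = 1 << accuracy_log
    if sum(1 if c == -1 else c for c in norm_counts) != table_size:
        raise ValueError("probabilities must add up to 1 << accuracy_log")

    table = [None] * table_size

    # "Less than 1" symbols get a single cell each, starting from the end.
    high = table_size - 1
    for symbol, count in enumerate(norm_counts):
        if count == -1:
            table[high] = symbol
            high -= 1

    # Remaining symbols are spread using the step rule from the text.
    step = (table_size >> 1) + (table_size >> 3) + 3
    mask = table_size - 1
    position = 0
    for symbol, count in enumerate(norm_counts):
        for _ in range(max(count, 0)):
            table[position] = symbol
            position = (position + step) & mask
            # Skip cells already taken by "less than 1" symbols.
            while position > high:
                position = (position + step) & mask
    return table
```

With `AccuracyLog = 5` and the distribution `[16, 8, 4, 3, -1]`, the last cell goes to symbol 4 and the 31 remaining cells are shared among symbols 0 to 3.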
				995
				996	The result is a list of state values.
				997	Each state will decode the current symbol.
				998
				999	To get the number of bits and baseline required for the next state,
				1000	it's first necessary to sort all states in their natural order.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	1001	The lower states will need 1 more bit than the higher ones.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1002
				1003	__Example__ :
				1004	Assume a symbol has a probability of 5.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	1005	It receives 5 state values. States are sorted in natural order.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1006
				1007	Next power of 2 is 8.
				1008	Space of probabilities is divided into 8 equal parts.
				1009	Presuming the AccuracyLog is 7, it defines 128 states.
				1010	Divided by 8, each share is 16 large.
				1011
				1012	In order to reach 8, the 8-5=3 lowest states will count "double",
				1013	taking shares twice as large,
				1014	requiring one more bit in the process.
				1015
				1016	Numbering starts from the higher states, which use fewer bits.
				1017
				1018	| state order | 0 | 1 | 2 | 3 | 4 |
				1019	| ----------- | ----- | ----- | ------ | ---- | ----- |
				1020	| width | 32 | 32 | 32 | 16 | 16 |
				1021	| nb Bits | 5 | 5 | 5 | 4 | 4 |
				1022	| range nb | 2 | 4 | 6 | 0 | 1 |
				1023	| baseline | 32 | 64 | 96 | 0 | 16 |
				1024	| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
				1025
				1026	Next state is determined from current state
				1027	by reading the required number of bits, and adding the specified baseline.
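
The example table can be reproduced with a short sketch. The function name `state_transitions` is hypothetical, and the closed-form expressions for the number of bits and the baseline are not stated in the text: they are one formulation, assumed here, that yields the same values.

```python
def state_transitions(probability, accuracy_log):
    """Return (nb_bits, baseline) for each state of one symbol.

    States are taken in their natural (sorted) order, as in the example.
    """
    table_size = 1 << accuracy_log
    rows = []
    for order in range(probability):
        # Running counter: starts at the probability, +1 per state.
        next_state = probability + order
        # Position of the highest set bit of the counter.
        high_bit = next_state.bit_length() - 1
        nb_bits = accuracy_log - high_bit
        baseline = (next_state << nb_bits) - table_size
        rows.append((nb_bits, baseline))
    return rows


def next_state(baseline, bits_read):
    """Next state = baseline + value of the bits just read."""
    return baseline + bits_read
```

For a probability of 5 and an `AccuracyLog` of 7, this gives 5 bits with baselines 32, 64 and 96 for the three lowest states, then 4 bits with baselines 0 and 16.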
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	1028
				1029
				1030	#### Bitstream
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1031
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1032	All sequences are stored in a single bitstream, read _backward_.
				1033	It is therefore necessary to know the bitstream size,
				1034	which is deduced from the compressed block size.
				1035
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	1036	The last useful bit of the stream is followed by an end-bit-flag.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	1037	The highest set bit of the last byte is this flag.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1038	It does not belong to the useful part of the bitstream.
				1039	Therefore, the last byte has 0-7 useful bits.
				1040	Note that it also means that the last byte cannot be `0`.
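
As an illustration, the number of useful bits in the last byte can be derived from the position of the end-bit-flag. This sketch assumes the flag is the highest bit set to 1 in that byte; the function name is hypothetical.

```python
def useful_bits_in_last_byte(bitstream):
    """Return how many bits of the last byte belong to the useful bitstream."""
    if not bitstream:
        raise ValueError("empty bitstream")
    last = bitstream[-1]
    if last == 0:
        # No end-bit-flag present: the last byte cannot be 0.
        raise ValueError("corrupted bitstream: last byte is 0")
    # The flag is the highest set bit; everything below it is useful.
    return last.bit_length() - 1
```

A last byte of `0x01` carries no useful bit, while `0x80` carries 7.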
				1041
				1042	##### Starting states
				1043
				1044	The bitstream starts with initial state values,
				1045	each using the required number of bits in their respective _accuracy_,
				1046	decoded previously from their normalized distribution.
				1047
				1048	It starts with `Literal Length State`,
				1049	followed by `Offset State`,
				1050	and finally `Match Length State`.
				1051
				1052	Reminder : always keep in mind that all values are read _backward_.
				1053
				1054	##### Decoding a sequence
				1055
				1056	A state gives a code.
				1057	A code provides a baseline and number of bits to add.
				1058	See [Symbol Decoding] section for details on each symbol.
				1059
				1060	Decoding starts by reading the number of bits required to decode the offset.
				1061	It then does the same for match length,
				1062	and then for literal length.
				1063
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	1064	Offset / matchLength / litLength define a sequence.
				1065	Executing it starts by inserting the number of literals defined by `litLength`,
				1066	then continues by copying `matchLength` bytes from `currentPos - offset`.
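
Executing one decoded sequence can be sketched as below. The function name and its parameters are hypothetical; `output` is assumed to hold everything decoded so far, and `literals` the literals not yet consumed. The match is copied byte by byte so that an offset smaller than the match length (an overlapping copy) behaves as expected.

```python
def execute_sequence(output, literals, lit_length, match_length, offset):
    """Apply one sequence to `output` (a bytearray); return literals consumed."""
    if lit_length > len(literals):
        raise ValueError("not enough literals")
    # 1. Insert the literals.
    output += literals[:lit_length]
    # 2. Copy the match from currentPos - offset.
    start = len(output) - offset
    if match_length > 0 and (offset <= 0 or start < 0):
        raise ValueError("offset points outside of decoded data")
    for i in range(match_length):
        output.append(output[start + i])
    return lit_length
```

Starting from `ab`, a sequence with 2 literals `cd`, a match length of 6 and an offset of 2 produces `abcdcdcdcd`.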
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1067
				1068	The next operation is to update states.
				1069	Using rules pre-calculated in the decoding tables,
				1070	`Literal Length State` is updated,
				1071	followed by `Match Length State`,
				1072	and then `Offset State`.
				1073
				1074	This operation will be repeated `NbSeqs` times.
				1075	At the end, the bitstream shall be entirely consumed,
				1076	otherwise the bitstream is considered corrupted.
				1077
				1078	[Symbol Decoding]:#symbols-decoding
				1079
				1080	##### Repeat offsets
				1081
				1082	As seen in [Offset Codes], the first 3 values define a repeated offset.
				1083	They are sorted in recency order, with 1 meaning "most recent one".
				1084
				1085	There is an exception though, when the current sequence's literal length is `0`.
				1086	In that case, 1 would just make the previous match longer.
				1087	Therefore, in such a case, 1 in fact means 2, and 2 is impossible.
				1088	The meaning of 3 is unmodified.
				1089
				1090	Repeat offsets start with the following values : 1, 4 and 8 (in order).
				1091
				1092	Then each block receives its start values from the previous compressed block.
				1093	Note that non-compressed blocks are skipped :
				1094	they do not contribute to offset history.
				1095
				1096	[Offset Codes]: #offset-codes
				1097
				1098	###### Offset updates rules
				1099
				1100	When the new offset is a normal one,
				1101	offset history is simply shifted by one position,
				1102	with the new offset taking first spot.
				1103
				1104	- When repeat offset 1 (most recent) is used, history is unmodified.
				1105	- When repeat offset 2 is used, it's swapped with offset 1.
				1106	- When repeat offset 3 is used, it takes first spot,
				1107	pushing the other ones by one position.
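
The update rules above can be sketched as follows. Function names are hypothetical, the history is represented as a list of 3 offsets with the most recent first, and the literal-length-0 exception is assumed to have been resolved before calling `use_repeat_offset`.

```python
def use_normal_offset(history, offset):
    """A normal offset takes first spot; the others shift by one position."""
    return [offset, history[0], history[1]]


def use_repeat_offset(history, index):
    """Update the history after using repeat offset `index` (1, 2 or 3)."""
    if index == 1:
        # Most recent offset used again: history is unmodified.
        return list(history)
    if index == 2:
        # Swapped with offset 1.
        return [history[1], history[0], history[2]]
    if index == 3:
        # Takes first spot, pushing the other ones by one position.
        return [history[2], history[0], history[1]]
    raise ValueError("repeat offset index must be 1, 2 or 3")
```

Starting from the initial history `[1, 4, 8]`, using repeat offset 3 gives `[8, 1, 4]`.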
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1108
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1109
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1110	Dictionary format
				1111	-----------------
				1112
				1113	`zstd` is compatible with "pure content" dictionaries, free of any format restriction.
				1114	But dictionaries created by `zstd --train` follow a format, described here.
				1115
				1116	__Pre-requisites__ : a dictionary has a known length,
				1117	defined either by a buffer limit, or a file size.
				1118
				1119	| Header | DictID | Stats | Content |
				1120	| ------ | ------ | ----- | ------- |
				1121
				1122	__Header__ : 4 bytes ID, value 0xEC30A437, Little Endian format
				1123
				1124	__Dict_ID__ : 4 bytes, stored in Little Endian format.
				1125	DictID can be any value, except 0 (which means no DictID).
Yann Collet	722e14b	2016-07-08 19:22:16 +0200	[diff] [blame]	1126	It's used by decoders to check if they use the correct dictionary.
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	1127	_Reserved ranges :_
				1128	If the frame is going to be distributed in a private environment,
				1129	any dictionary ID can be used.
				1130	However, for public distribution of compressed frames,
				1131	some ranges are reserved for future use :
Yann Collet	6cacd34	2016-07-15 17:58:13 +0200	[diff] [blame]	1132
				1133	- low range : 1 - 32767 : reserved
				1134	- high range : >= (2^31) : reserved
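
Reading the first two fields can be sketched with the standard `struct` module. The function name `parse_dictionary_header` and the choice to raise on a DictID of 0 are assumptions of this sketch; a buffer without the expected header value would be a "pure content" dictionary and is simply rejected here.

```python
import struct

DICT_MAGIC = 0xEC30A437  # Header value, stored in Little Endian format


def parse_dictionary_header(data):
    """Return (dict_id, in_reserved_range) for a formatted dictionary."""
    if len(data) < 8:
        raise ValueError("too short to hold Header and DictID")
    # Two consecutive 4-byte little-endian unsigned integers.
    magic, dict_id = struct.unpack_from("<II", data, 0)
    if magic != DICT_MAGIC:
        raise ValueError("not a formatted dictionary")
    if dict_id == 0:
        raise ValueError("DictID 0 means no DictID")
    # Reserved ranges: 1 - 32767 and >= 2^31.
    in_reserved_range = dict_id <= 32767 or dict_id >= 2**31
    return dict_id, in_reserved_range
```

On disk, the header value appears as the byte sequence `37 A4 30 EC`.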
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1135
				1136	__Stats__ : Entropy tables, following the same format as a [compressed blocks].
				1137	They are stored in following order :
				1138	Huffman tables for literals, FSE table for offset,
Yann Collet	722e14b	2016-07-08 19:22:16 +0200	[diff] [blame]	1139	FSE table for matchLenth, and FSE table for litLength.
				1140	It's finally followed by 3 offset values, populating recent offsets,
				1141	stored in order, 4-bytes little endian each, for a total of 12 bytes.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1142
				1143	__Content__ : Where the actual dictionary content is.
Yann Collet	722e14b	2016-07-08 19:22:16 +0200	[diff] [blame]	1144	Content size depends on Dictionary size.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1145
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame^]	1146	[compressed blocks]: #the-format-of-compressed_block
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1147
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1148
				1149	Version changes
				1150	---------------
Yann Collet	6fa05a2	2016-07-20 14:58:49 +0200	[diff] [blame]	1151	- 0.2.0 : numerous format adjustments for zstd v0.8
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	1152	- 0.1.2 : limit huffman tree depth to 11 bits
				1153	- 0.1.1 : reserved dictID ranges
				1154	- 0.1.0 : initial release