Zstandard Compression Format
============================

### Notices

Copyright (c) 2016 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

0.2.0 (22/07/16)


Introduction
------------

The purpose of this document is to define a lossless compressed data format
that is independent of CPU type, operating system,
file system and character set, and suitable for
file compression, pipe and streaming compression,
using the [Zstandard algorithm](http://www.zstandard.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the Zstandard compression method,
and an optional [xxHash-64 checksum method](http://www.xxhash.org),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into Zstandard format and/or decompress data from Zstandard format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It doesn’t need to support all options though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore informative fields, such as the checksum.
Whenever it does not support a parameter defined in the compressed stream,
it must produce a non-ambiguous error code and associated error message
explaining which parameter is unsupported.


Overall conventions
-------------------
In this document, square brackets, i.e. `[` and `]`, are used to indicate optional fields or parameters.


Definitions
-----------
Content compressed by Zstandard is transformed into a Zstandard __frame__.
Multiple frames can be appended into a single file or stream.
A frame is totally independent, has a defined beginning and end,
and a set of parameters which tells the decoder how to decompress it.

A frame encapsulates one or multiple __blocks__.
Each block can be compressed or not,
and has a guaranteed maximum content size, which depends on frame parameters.
Unlike frames, each block depends on previous blocks for proper decoding.
However, each block can be decompressed without waiting for its successor,
allowing streaming operations.


Frame Concatenation
-------------------

In some circumstances, it may be required to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame brings its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file is left outside of this specification.
As an example, the reference `zstd` command line utility is able
to decode all concatenated frames in their sequential order,
delivering the final decompressed result as if it were a single content.


General Structure of Zstandard Frame format
-------------------------------------------
The structure of a single Zstandard frame is as follows:

| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] |
|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:|
| 4 bytes        |  2-14 bytes    | n bytes    |                    |   0-4 bytes          |

__`Magic_Number`__

4 Bytes, Little-endian format.
Value : 0xFD2FB527
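
As an illustration of the little-endian convention, the magic number can be checked as sketched below. This is a minimal sketch, not part of the format definition: the names `ZSTD_MAGIC_NUMBER`, `magic_read_le32` and `starts_with_zstd_magic` are hypothetical and chosen for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Magic number of this version of the format, as stated above. */
#define ZSTD_MAGIC_NUMBER 0xFD2FB527u

/* Read a 32-bit little-endian value, independently of host byte order. */
static uint32_t magic_read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Return 1 if the buffer starts with the frame magic number, 0 otherwise. */
static int starts_with_zstd_magic(const uint8_t *src, size_t size)
{
    if (size < 4) return 0;
    return magic_read_le32(src) == ZSTD_MAGIC_NUMBER;
}
```

Since the value is stored in little-endian order, a frame begins with the byte sequence `27 B5 2F FD`.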

__`Frame_Header`__

2 to 14 Bytes, detailed in [next part](#the-structure-of-frame_header).

__`Data_Block`__

Detailed in [next chapter](#the-structure-of-data_block).
That’s where compressed data is stored.

__`Content_Checksum`__

An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
The content checksum is the result
of [xxh64() hash function](https://www.xxHash.com)
digesting the original (decoded) data as input, and a seed of zero.
The low 4 bytes of the checksum are stored in little endian format.


The structure of `Frame_Header`
-------------------------------
The `Frame_Header` has a variable size, which uses a minimum of 2 bytes,
and up to 14 bytes depending on optional parameters.
The structure of `Frame_Header` is as follows:

| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
| ------------------------- | --------------------- | ----------------- | ---------------------- |
| 1 byte                    |        0-1 byte       |    0-4 bytes      |       0-8 bytes        |

### `Frame_Header_Descriptor`

The first byte of the header is called the `Frame_Header_Descriptor`.
It tells which other fields are present.
Decoding this byte is enough to tell the size of `Frame_Header`.

| Bit number | Field name                |
| ---------- | ------------------------- |
| 7-6        | `Frame_Content_Size_flag` |
| 5          | `Single_Segment_flag`     |
| 4          | `Unused_bit`              |
| 3          | `Reserved_bit`            |
| 2          | `Content_Checksum_flag`   |
| 1-0        | `Dictionary_ID_flag`      |

In this table, bit 7 is the highest bit, while bit 0 is the lowest.

__`Frame_Content_Size_flag`__

This is a 2-bit flag (`= Frame_Header_Descriptor >> 6`),
specifying if decompressed data size is provided within the header.
The `Flag_Value` can be converted into `Field_Size`,
which is the number of bytes used by `Frame_Content_Size`
according to the following table:

|`Flag_Value`|  0  |  1  |  2  |  3  |
| ---------- | --- | --- | --- | --- |
|`Field_Size`| 0-1 |  2  |  4  |  8  |

When `Flag_Value` is `0`, `Field_Size` depends on `Single_Segment_flag` :
if `Single_Segment_flag` is set, `Field_Size` is 1.
Otherwise, `Field_Size` is 0 (content size not provided).

__`Single_Segment_flag`__

If this flag is set,
data must be regenerated within a single continuous memory segment.

In this case, `Frame_Content_Size` is necessarily present,
but `Window_Descriptor` byte is skipped.
As a consequence, the decoder must allocate a memory segment
of size equal to or bigger than `Frame_Content_Size`.

In order to protect the decoder from unreasonable memory requirements,
a decoder can reject a compressed frame
which requests a memory size beyond the decoder's authorized range.

For broader compatibility, decoders are recommended to support
memory sizes of at least 8 MB.
This is just a recommendation:
each decoder is free to support higher or lower limits,
depending on local limitations.

__`Unused_bit`__

The value of this bit should be set to zero.
A decoder compliant with this specification version shall not interpret it.
It might be used in a future version,
to signal a property which is not mandatory to properly decode the frame.

__`Reserved_bit`__

This bit is reserved for some future feature.
Its value _must be zero_.
A decoder compliant with this specification version must ensure it is not set.
This bit may be used in a future revision,
to signal a feature that must be interpreted to decode the frame correctly.

__`Content_Checksum_flag`__

If this flag is set, a 32-bit `Content_Checksum` will be present at the frame's end.
See `Content_Checksum` paragraph.

__`Dictionary_ID_flag`__

This is a 2-bit flag (`= Frame_Header_Descriptor & 3`),
telling if a dictionary ID is provided within the header.
It also specifies the size of this field.

| Value    |  0  |  1  |  2  |  3  |
| -------- | --- | --- | --- | --- |
|Field size|  0  |  1  |  2  |  4  |

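The statement that the descriptor byte alone gives the size of `Frame_Header` can be sketched as follows. This is an illustrative sketch only: the function name `frame_header_size` and the choice to return 0 when `Reserved_bit` is set are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute the total Frame_Header size, in bytes, from its first byte.
 * Returns 0 if Reserved_bit is set (hypothetical error convention). */
static size_t frame_header_size(uint8_t descriptor)
{
    static const size_t fcs_size_table[4] = { 0, 2, 4, 8 };
    static const size_t did_size_table[4] = { 0, 1, 2, 4 };

    unsigned fcs_flag       = descriptor >> 6;        /* bits 7-6 */
    unsigned single_segment = (descriptor >> 5) & 1;  /* bit 5 */
    unsigned reserved       = (descriptor >> 3) & 1;  /* bit 3 */
    unsigned did_flag       = descriptor & 3;         /* bits 1-0 */

    if (reserved) return 0;

    size_t fcs_size = fcs_size_table[fcs_flag];
    /* Flag_Value 0 means 1 byte when Single_Segment_flag is set, else 0. */
    if (fcs_flag == 0 && single_segment) fcs_size = 1;

    /* Window_Descriptor is skipped when Single_Segment_flag is set. */
    size_t window_bytes = single_segment ? 0 : 1;

    /* 1 byte for the descriptor itself, plus the optional fields. */
    return 1 + window_bytes + did_size_table[did_flag] + fcs_size;
}
```

With these rules the result always falls within the 2 to 14 bytes stated for `Frame_Header`.
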
### `Window_Descriptor`

Provides guarantees on maximum back-reference distance
that will be used within compressed data.
This information is important for decoders to allocate enough memory.

The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag` is set.
In this case, the maximum back-reference distance is the content size itself,
which can be any value from 1 to 2^64-1 bytes (16 EB).

| Bit numbers |   7-3    |   2-0    |
| ----------- | -------- | -------- |
| Field name  | Exponent | Mantissa |

Maximum distance is given by the following formulae :
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
windowAdd = (windowBase / 8) * Mantissa;
windowSize = windowBase + windowAdd;
```
The minimum window size is 1 KB.
The maximum size is `15*(1<<38)` bytes, which is 3.75 TB.

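The formulae above translate directly into code, provided 64-bit arithmetic is used, since `windowLog` can reach 41. The following is a minimal sketch; the function name `window_size_from_descriptor` is hypothetical.

```c
#include <stdint.h>

/* Decode a Window_Descriptor byte into the window size, in bytes. */
static uint64_t window_size_from_descriptor(uint8_t window_descriptor)
{
    unsigned exponent = window_descriptor >> 3;   /* bits 7-3 */
    unsigned mantissa = window_descriptor & 7;    /* bits 2-0 */

    unsigned window_log  = 10 + exponent;
    /* A 32-bit shift would overflow for window_log >= 32. */
    uint64_t window_base = (uint64_t)1 << window_log;
    uint64_t window_add  = (window_base / 8) * mantissa;
    return window_base + window_add;
}
```

For example, a descriptor of `0x00` gives 1 KB, `0x68` (Exponent 13, Mantissa 0) gives 8 MB, and `0xFF` gives the maximum of `15*(1<<38)` bytes.
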
To properly decode compressed data,
a decoder will need to allocate a buffer of at least `windowSize` bytes.

In order to protect the decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond the decoder's authorized range.

For improved interoperability,
decoders are recommended to be compatible with window sizes of 8 MB,
and encoders are recommended to not request more than 8 MB.
It's merely a recommendation though:
decoders are free to support larger or lower limits,
depending on local limitations.

### `Dictionary_ID`

This is a variable size field, which contains
the ID of the dictionary required to properly decode the frame.
Note that this field is optional. When it's not present,
it's up to the caller to make sure it uses the correct dictionary.

Field size depends on `Dictionary_ID_flag`.
1 byte can represent an ID 0-255.
2 bytes can represent an ID 0-65535.
4 bytes can represent an ID 0-4294967295.

It's allowed to represent a small ID (for example `13`)
with a large 4-byte dictionary ID, losing some compactness in the process.

_Reserved ranges :_
If the frame is going to be distributed in a private environment,
any dictionary ID can be used.
However, for public distribution of compressed frames using a dictionary,
the following ranges are reserved for future use and should not be used :
- low range : 1 - 32767
- high range : >= (2^31)

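A tool that produces dictionaries for public distribution might check a candidate ID against the reserved ranges as sketched below. The function name `dictionary_id_is_reserved` is hypothetical, and the sketch only restates the two ranges listed above.

```c
#include <stdint.h>

/* Return 1 if dict_id falls in a range reserved for future use
 * (1 to 32767, or 2^31 and above), 0 otherwise. */
static int dictionary_id_is_reserved(uint32_t dict_id)
{
    int in_low_range  = (dict_id >= 1 && dict_id <= 32767);
    int in_high_range = (dict_id >= ((uint32_t)1 << 31));
    return in_low_range || in_high_range;
}
```
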

### `Frame_Content_Size`

This is the original (uncompressed) size. This information is optional.
The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
Format is Little-endian.

| `Field_Size` |    Range    |
| ------------ | ----------- |
|      1       | 0 - 255     |
|      2       | 256 - 65791 |
|      4       | 0 - 2^32-1  |
|      8       | 0 - 2^64-1  |

When `Field_Size` is 1, 4 or 8 bytes, the value is read directly.
When `Field_Size` is 2, _the offset of 256 is added_.
It's allowed to represent a small size (for example `18`) using any compatible variant.
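
The reading rules, including the offset applied to the 2-byte variant, can be sketched as follows. The function name `read_frame_content_size` and its return convention are assumptions for this example; a `Field_Size` of 0 (size not provided) is expected to be handled by the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode Frame_Content_Size from `field_size` little-endian bytes.
 * Returns 0 on success, -1 if field_size is not 1, 2, 4 or 8. */
static int read_frame_content_size(const uint8_t *p, size_t field_size,
                                   uint64_t *content_size)
{
    if (field_size != 1 && field_size != 2 &&
        field_size != 4 && field_size != 8) return -1;

    uint64_t value = 0;
    for (size_t i = 0; i < field_size; i++)
        value |= (uint64_t)p[i] << (8 * i);

    /* The 2-byte variant stores the size minus 256. */
    if (field_size == 2) value += 256;

    *content_size = value;
    return 0;
}
```
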


The structure of `Data_Block`
-----------------------------
The structure of `Data_Block` is as follows:

| `Last_Block` | `Block_Type` | `Block_Size` | `Block_Content` |
|:------------:|:------------:|:------------:|:---------------:|
|   1 bit      |  2 bits      |  21 bits     |  n bytes        |

The block header uses 3 bytes.

__`Last_Block`__

The lowest bit signals if this block is the last one.
The frame ends right after this block.
It may be followed by an optional `Content_Checksum`.

__`Block_Type` and `Block_Size`__

The next 2 bits represent the `Block_Type`,
while the remaining 21 bits represent the `Block_Size`.
Format is __little-endian__.

There are 4 block types :

| Value        |      0      |      1      |         2          |     3     |
| ------------ | ----------- | ----------- | ------------------ | --------- |
| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|

- `Raw_Block` - this is an uncompressed block.
  `Block_Size` is the number of bytes to read and copy.
- `RLE_Block` - this is a single byte, repeated N times.
  In this case, `Block_Size` is the size to regenerate,
  while the "compressed" block is just 1 byte (the byte to repeat).
- `Compressed_Block` - this is a [Zstandard compressed block](#the-format-of-compressed_block),
  detailed in another section of this specification.
  `Block_Size` is the compressed size.
  Decompressed size is unknown,
  but its maximum possible value is guaranteed (see below).
- `Reserved` - this is not a block.
  This value cannot be used with the current version of this specification.

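The 3-byte block header described above can be unpacked as in the following sketch. The type and function names (`block_type`, `block_header`, `parse_block_header`) are hypothetical; the bit positions follow the text: bit 0 is `Last_Block`, bits 1-2 are `Block_Type`, and the remaining 21 bits are `Block_Size`.

```c
#include <stdint.h>

enum block_type {
    RAW_BLOCK        = 0,
    RLE_BLOCK        = 1,
    COMPRESSED_BLOCK = 2,
    RESERVED_BLOCK   = 3   /* not valid in this version of the format */
};

struct block_header {
    int             last_block;  /* 1 if this is the last block of the frame */
    enum block_type type;
    uint32_t        block_size;  /* meaning depends on the block type */
};

/* Unpack the 3-byte little-endian block header. */
static struct block_header parse_block_header(const uint8_t h[3])
{
    uint32_t raw = (uint32_t)h[0]
                 | ((uint32_t)h[1] << 8)
                 | ((uint32_t)h[2] << 16);
    struct block_header bh;
    bh.last_block = (int)(raw & 1);
    bh.type       = (enum block_type)((raw >> 1) & 3);
    bh.block_size = raw >> 3;
    return bh;
}
```

For example, the bytes `25 03 00` describe a last block of type `Compressed_Block` with a `Block_Size` of 100.
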
Block sizes must respect a few rules :
- In compressed mode, compressed size is always strictly `< decompressed size`.
- Block decompressed size is always <= maximum back-reference distance.
- Block decompressed size is always <= 128 KB.


__`Block_Content`__

The `Block_Content` is where the actual data to decode stands.
It might be compressed or not, depending on previous field indications.
A data block is not necessarily "full" :
since an arbitrary “flush” may happen anytime,
block decompressed content can be any size,
up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
- Maximum back-reference distance
- 128 KB


Skippable Frames
----------------

| `Magic_Number` | `Frame_Size` | `User_Data` |
|:--------------:|:------------:|:-----------:|
|   4 bytes      |  4 bytes     |   n bytes   |

Skippable frames allow the insertion of user-defined data
into a flow of concatenated frames.
Their design is pretty straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.

Skippable frames defined in this specification are compatible with [LZ4] ones.

[LZ4]:http://www.lz4.org

__`Magic_Number`__

4 Bytes, Little-endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__`Frame_Size`__

This is the size, in bytes, of the following `User_Data`
(without including the magic number nor the size field itself).
This field is represented using 4 Bytes, Little-endian format, unsigned 32-bits.
This means `User_Data` can’t be bigger than (2^32-1) bytes.

__`User_Data`__

The `User_Data` can be anything. Data will just be skipped by the decoder.
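
A decoder walking through concatenated frames could recognize and step over a skippable frame as sketched below. The names `skippable_read_le32` and `skippable_frame_size`, and the convention of returning 0 for "not a complete skippable frame", are assumptions for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value, independently of host byte order. */
static uint32_t skippable_read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* If `src` starts with a complete skippable frame, return its total size
 * (8-byte header plus User_Data); otherwise return 0. */
static size_t skippable_frame_size(const uint8_t *src, size_t size)
{
    if (size < 8) return 0;
    /* Any magic from 0x184D2A50 to 0x184D2A5F identifies a skippable frame. */
    if ((skippable_read_le32(src) & 0xFFFFFFF0u) != 0x184D2A50u) return 0;

    uint32_t user_data_size = skippable_read_le32(src + 4);
    if (user_data_size > size - 8) return 0;   /* input is truncated */
    return 8 + (size_t)user_data_size;
}
```

The caller advances its read position by the returned size and resumes decoding at the next frame.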


The format of `Compressed_Block`
--------------------------------
The size of `Compressed_Block` must be provided using `Block_Size` field from `Data_Block`.
The `Compressed_Block` has a guaranteed maximum regenerated size,
in order to properly allocate destination buffer.
See [`Data_Block`](#the-structure-of-data_block) for more details.

A compressed block consists of 2 sections :
- [Literals section](#literals-section)
- [Sequences section](#sequences-section)

### Prerequisites
To decode a compressed block, the following elements are necessary :
- Previous decoded blocks, up to a distance of `windowSize`,
  or all previous blocks when `Single_Segment_flag` is set.
- List of "recent offsets" from previous compressed block.
- Decoding tables of previous compressed block for each symbol type
  (literals, litLength, matchLength, offset).


### Literals section

During sequence phase, literals will be interleaved with match copy operations.
All literals are regrouped in the first part of the block.
They can be decoded first, and then copied during sequence operations,
or they can be decoded on the fly, as needed by sequence commands.

| Literals section header | [Huffman Tree Description] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
| ----------------------- | -------------------------- | ------- | --------- | --------- | --------- |

Literals can be stored uncompressed or compressed using Huffman prefix codes.
When compressed, an optional tree description can be present,
followed by 1 or 4 streams.


#### Literals section header

Header is in charge of describing how literals are packed.
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
using little-endian convention.

| Literals Block Type | sizes format | regenerated size | [compressed size] |
| ------------------- | ------------ | ---------------- | ----------------- |
|   2 bits            |  1 - 2 bits  |    5 - 20 bits   |    0 - 18 bits    |

In this representation, bits on the left are the lowest bits.

__Literals Block Type__ :

This field uses the 2 lowest bits of the first byte, describing 4 different block types :

| Value               |  0  |  1  |     2      |      3      |
| ------------------- | --- | --- | ---------- | ----------- |
| Literals Block Type | Raw | RLE | Compressed | RepeatStats |

- Raw literals block - Literals are stored uncompressed.
- RLE literals block - Literals consist of a single byte value repeated N times.
- Compressed literals block - This is a standard Huffman-compressed block,
  starting with a Huffman tree description.
  See details below.
- Repeat Stats literals block - This is a Huffman-compressed block,
  using Huffman tree _from previous Huffman-compressed literals block_.
  Huffman tree description will be skipped.

				462	__Sizes format__ :
				463
				464	Sizes format are divided into 2 families :
				465
				466	- For compressed block, it requires to decode both the compressed size
				467	and the decompressed size. It will also decode the number of streams.
				468	- For Raw or RLE blocks, it's enough to decode the size to regenerate.
				469
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	470	For values spanning several bytes, convention is Little-endian.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	471
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	472	__Sizes format for Raw and RLE literals block__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	473
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	474	- Value : x0 : Regenerated size uses 5 bits (0-31).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	475	Total literal header size is 1 byte.
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	476	`size = h[0]>>3;`
				477	- Value : 01 : Regenerated size uses 12 bits (0-4095).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	478	Total literal header size is 2 bytes.
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	479	`size = (h[0]>>4) + (h[1]<<4);`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	480	- Value : 11 : Regenerated size uses 20 bits (0-1048575).
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	481	Total literal header size is 3 bytes.
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	482	`size = (h[0]>>4) + (h[1]<<4) + (h[2]<<12);`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	483
				484	Note : it's allowed to represent a short value (ex : `13`)
				485	using a long format, at the cost of a larger header.
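
The three cases above can be sketched as a small C helper. This is only an illustration: the function name `rawRleRegeneratedSize` is hypothetical, and it assumes that the block type occupies bits 0-1 of `h[0]` and that the size format starts at bit 2, which is what the shifts in the formulas above suggest. The `x0` case is read as "lowest bit of the size format is 0".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the regenerated size of a Raw or RLE
 * literals block and stores the header length (1 to 3 bytes) in *headerSize.
 * Assumption: bits 0-1 of h[0] are the block type, size format starts at bit 2. */
static uint32_t rawRleRegeneratedSize(const uint8_t *h, size_t *headerSize)
{
    unsigned const format = (h[0] >> 2) & 3;

    if ((format & 1) == 0) {      /* "x0" : 5-bit size, 1-byte header */
        *headerSize = 1;
        return (uint32_t)(h[0] >> 3);
    }
    if (format == 1) {            /* "01" : 12-bit size, 2-byte header */
        *headerSize = 2;
        return (uint32_t)(h[0] >> 4) + ((uint32_t)h[1] << 4);
    }
    /* "11" : 20-bit size, 3-byte header */
    *headerSize = 3;
    return (uint32_t)(h[0] >> 4) + ((uint32_t)h[1] << 4) + ((uint32_t)h[2] << 12);
}
```

With this sketch, the value `13` can be carried either by the 1-byte form or by the 3-byte form, and both decode to the same size.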
				486
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	487	__Sizes format for Compressed literals block and Repeat Stats literals block__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	488
Yann Collet	c2e1a68	2016-07-22 17:30:52 +0200	[diff] [blame]	489	- Value : 00 : _Single stream_.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	490	Compressed and regenerated sizes use 10 bits (0-1023).
				491	Total literal header size is 3 bytes.
Yann Collet	c2e1a68	2016-07-22 17:30:52 +0200	[diff] [blame]	492	- Value : 01 : 4 streams.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	493	Compressed and regenerated sizes use 10 bits (0-1023).
				494	Total literal header size is 3 bytes.
				495	- Value : 10 : 4 streams.
				496	Compressed and regenerated sizes use 14 bits (0-16383).
				497	Total literal header size is 4 bytes.
Yann Collet	d9cc442	2016-07-22 19:15:27 +0200	[diff] [blame]	498	- Value : 11 : 4 streams.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	499	Compressed and regenerated sizes use 18 bits (0-262143).
				500	Total literal header size is 5 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	501
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	502	Compressed and regenerated size fields follow little-endian convention.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	503
				504	#### Huffman Tree description
				505
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	506	This section is only present when literals block type is `Compressed` (`2`).
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	507
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	508	Prefix coding represents symbols from an a priori known alphabet
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	509	by bit sequences (codewords), one codeword for each symbol,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	510	in a manner such that different symbols may be represented
				511	by bit sequences of different lengths,
				512	but a parser can always parse an encoded string
				513	unambiguously symbol-by-symbol.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	514
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	515	Given an alphabet with known symbol frequencies,
				516	the Huffman algorithm allows the construction of an optimal prefix code
				517	using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	518
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	519	Prefix code must not exceed a maximum code length.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	520	More bits improve accuracy but cost more header size,
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	521	and require more memory or more complex decoding operations.
				522	This specification limits maximum code length to 11 bits.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	523
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	524
				525	##### Representation
				526
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	527	All literal values from zero (included) to last present one (excluded)
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	528	are represented by `weight` values, from 0 to `maxBits`.
				529	Transformation from `weight` to `nbBits` follows this formula :
				530	`nbBits = weight ? maxBits + 1 - weight : 0;` .
				531	The last symbol's weight is deduced from previously decoded ones,
				532	by completing to the nearest power of 2.
				533	This power of 2 gives `maxBits`, the depth of the current tree.
				534
				535	__Example__ :
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame^]	536	Let's presume the following Huffman tree must be described :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	537
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	538	\| literal \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
				539	\| ------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				540	\| nbBits \| 1 \| 2 \| 3 \| 0 \| 4 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	541
				542	The tree depth is 4, since its longest code uses 4 bits.
				543	Value `5` will not be listed, nor will values above `5`.
				544	Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
				545	Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
				546	It gives the following series of weights :
				547
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	548	\| weights \| 4 \| 3 \| 2 \| 0 \| 1 \|
				549	\| ------- \| --- \| --- \| --- \| --- \| --- \|
				550	\| literal \| 0 \| 1 \| 2 \| 3 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	551
				552	The decoder will do the inverse operation :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	553	having collected weights of literals from `0` to `4`,
				554	it knows the last literal, `5`, is present with a non-zero weight.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	555	The weight of `5` can be deduced by completing to the nearest power of 2.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	556	Sum of 2^(weight-1) (excluding 0) is :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	557	`8 + 4 + 2 + 0 + 1 = 15`
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	558	Nearest power of 2 is 16.
				559	Therefore, `maxBits = 4` and `weight[5] = 1`.
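
The deduction of the last weight can be sketched in C as follows. The names `completeWeights` and `weightToNbBits` are hypothetical, and the sketch assumes weights are small enough (well under 32) for the shifts to be safe. It treats "nearest power of 2" as the smallest power of 2 strictly above the sum, since the last symbol must have a non-zero weight.

```c
#include <stdint.h>

/* Hypothetical helper: given the weights of all symbols except the last one,
 * returns maxBits and stores the deduced weight of the last symbol.
 * Returns 0 when the weights cannot be completed to a power of 2. */
static unsigned completeWeights(const uint8_t *weights, unsigned count,
                                unsigned *lastWeight)
{
    uint32_t sum = 0;
    for (unsigned i = 0; i < count; i++)
        if (weights[i] > 0) sum += (uint32_t)1 << (weights[i] - 1);
    if (sum == 0) return 0;

    /* smallest power of 2 strictly greater than sum */
    unsigned maxBits = 0;
    while (((uint32_t)1 << maxBits) <= sum) maxBits++;

    uint32_t const rest = ((uint32_t)1 << maxBits) - sum;
    if (rest & (rest - 1)) return 0;   /* rest must itself be a power of 2 */

    unsigned w = 1;
    while (((uint32_t)1 << (w - 1)) < rest) w++;
    *lastWeight = w;
    return maxBits;
}

/* nbBits = weight ? maxBits + 1 - weight : 0 */
static unsigned weightToNbBits(unsigned weight, unsigned maxBits)
{
    return weight ? maxBits + 1 - weight : 0;
}
```

Applied to the weights `4, 3, 2, 0, 1` of the example, this sketch yields `maxBits = 4` and a last weight of `1`.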
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	560
				561	##### Huffman Tree header
				562
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	563	This is a single byte value (0-255),
				564	which tells how to decode the list of weights.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	565
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	566	- if headerByte >= 128 : this is a direct representation,
				567	where each weight is written directly as a 4 bits field (0-15).
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	568	The full representation occupies `((nbSymbols+1)/2)` bytes,
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	569	meaning it uses a last full byte even if nbSymbols is odd.
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	570	`nbSymbols = headerByte - 127;`.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	571	Note that maximum nbSymbols is 255-127 = 128.
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	572	A larger series must necessarily use FSE compression.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	573
				574	- if headerByte < 128 :
				575	the series of weights is compressed by FSE.
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	576	The length of the FSE-compressed series is `headerByte` (0-127).
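
As an illustration, the interpretation of the header byte can be written as below. The type `HufHeaderInfo` and the function `parseHufHeaderByte` are hypothetical names introduced for this sketch. It only classifies the header and does not read the weights themselves.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    int      isDirect;      /* 1 : 4-bit weights follow, 0 : FSE-compressed weights */
    unsigned nbSymbols;     /* number of weights (direct mode only, else 0) */
    unsigned payloadBytes;  /* bytes following the header byte */
} HufHeaderInfo;

static HufHeaderInfo parseHufHeaderByte(uint8_t headerByte)
{
    HufHeaderInfo info;
    if (headerByte >= 128) {
        info.isDirect = 1;
        info.nbSymbols = (unsigned)headerByte - 127;
        info.payloadBytes = (info.nbSymbols + 1) / 2;  /* rounds up to a full byte */
    } else {
        info.isDirect = 0;
        info.nbSymbols = 0;              /* only known after FSE decoding */
        info.payloadBytes = headerByte;  /* length of the FSE-compressed series */
    }
    return info;
}
```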
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	577
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame^]	578	##### FSE (Finite State Entropy) compression of Huffman weights
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	579
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	580	The series of weights is compressed using FSE compression.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	581	It's a single bitstream with 2 interleaved states,
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	582	sharing a single distribution table.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	583
				584	To decode an FSE bitstream, it is necessary to know its compressed size.
				585	Compressed size is provided by `headerByte`.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	586	It's also necessary to know its _maximum possible_ decompressed size,
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	587	which is `255`, since literal values span from `0` to `255`,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	588	and last symbol value is not represented.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	589
				590	An FSE bitstream starts with a header describing the probability distribution.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	591	This header is used to build a Decoding Table.
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	592	The table must be pre-allocated, which requires defining a maximum accuracy.
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame^]	593	For a list of Huffman weights, maximum accuracy is 7 bits.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	594
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	595	FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format),
				596	and so is [FSE bitstream](#bitstream).
				597	The main difference is that Huffman header compression uses 2 states,
				598	which share the same FSE distribution table.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	599	Bitstream contains only FSE symbols (no interleaved "raw bitfields").
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	600	The number of symbols to decode is discovered
				601	by tracking bitStream overflow condition.
				602	When both states have overflowed the bitstream, end is reached.
				603
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	604
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame^]	605	##### Conversion from weights to Huffman prefix codes
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	606
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	607	All present symbols shall now have a `weight` value.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	608	It is possible to transform weights into nbBits, using this formula :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	609	`nbBits = weight ? maxBits + 1 - weight : 0;` .
				610
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	611	Symbols are sorted by weight. Within same weight, symbols keep natural order.
				612	Symbols with a weight of zero are removed.
				613	Then, starting from lowest weight, prefix codes are distributed in order.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	614
				615	__Example__ :
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	616	Let's presume the following list of weights has been decoded :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	617
				618	\| Literal \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
				619	\| ------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				620	\| weight \| 4 \| 3 \| 2 \| 0 \| 1 \| 1 \|
				621
				622	Sorted by weight and then natural order,
				623	it gives the following distribution :
				624
				625	\| Literal \| 3 \| 4 \| 5 \| 2 \| 1 \| 0 \|
				626	\| ------------ \| --- \| --- \| --- \| --- \| --- \| ---- \|
				627	\| weight \| 0 \| 1 \| 1 \| 2 \| 3 \| 4 \|
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	628	\| nb bits \| 0 \| 4 \| 4 \| 3 \| 2 \| 1 \|
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	629	\| prefix codes \| N/A \| 0000\| 0001\| 001 \| 01 \| 1 \|
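
One possible way to reproduce the table above is sketched here in C. The type `PrefixCode` and the function `assignPrefixCodes` are hypothetical. The sketch walks weights from lowest to highest, keeps symbols in natural order within a weight, and hands out consecutive code values. Each `code` is the bit string of the table read as an integer, to be written on `nbBits` bits.

```c
#include <stdint.h>

/* Hypothetical type for this sketch : a prefix code and its length. */
typedef struct { uint16_t code; uint8_t nbBits; } PrefixCode;

/* Assigns prefix codes from weights. Returns 1 when the weights fill
 * exactly 1 << maxBits positions, 0 otherwise. */
static int assignPrefixCodes(const uint8_t *weights, unsigned nbSymbols,
                             unsigned maxBits, PrefixCode *out)
{
    uint32_t position = 0;   /* position within a range of 1 << maxBits */

    for (unsigned s = 0; s < nbSymbols; s++) {
        out[s].code = 0;
        out[s].nbBits = 0;   /* weight 0 : symbol absent, no code */
    }
    for (unsigned w = 1; w <= maxBits; w++) {
        for (unsigned s = 0; s < nbSymbols; s++) {
            if (weights[s] != w) continue;
            out[s].code = (uint16_t)(position >> (w - 1));
            out[s].nbBits = (uint8_t)(maxBits + 1 - w);
            position += (uint32_t)1 << (w - 1);
        }
    }
    return position == ((uint32_t)1 << maxBits);
}
```

For the weights `4, 3, 2, 0, 1, 1` and `maxBits = 4`, this gives literal `4` the code `0000`, literal `5` the code `0001`, literal `2` the code `001`, literal `1` the code `01` and literal `0` the code `1`.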
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	630
				631
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	632	#### Literals bitstreams
				633
				634	##### Bitstreams sizes
				635
				636	As seen in a previous paragraph,
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame^]	637	there are 2 flavors of Huffman-compressed literals :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	638	single stream, and 4-streams.
				639
				640	The 4-streams variant is useful for CPUs with multiple execution units and out-of-order execution.
				641	Since each stream can be decoded independently,
				642	it's possible to decode them up to 4x faster than a single stream,
				643	presuming the CPU has enough parallelism available.
				644
				645	For single stream, header provides both the compressed and regenerated size.
				646	For 4-streams though,
				647	header only provides compressed and regenerated size of all 4 streams combined.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	648	In order to properly decode the 4 streams,
				649	it's necessary to know the compressed and regenerated size of each stream.
				650
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	651	Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
				652	except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	653
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	654	Compressed size is provided explicitly : in the 4-streams variant,
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	655	bitstreams are preceded by 3 unsigned Little-Endian 16-bits values.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	656	Each value represents the compressed size of one stream, in order.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	657	The last stream size is deduced from total compressed size
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	658	and from previously decoded stream sizes :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	659	`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
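
The size computations for the 4-streams variant can be sketched as follows. `StreamSizes` and `splitFourStreams` are hypothetical names. The sketch assumes `totalCSize` includes the 6 bytes of stream sizes, as the formula above does, and it rejects inputs for which the sizes do not add up.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    uint32_t cSize[4];      /* compressed size of each stream */
    uint32_t regenSize[4];  /* regenerated size of each stream */
} StreamSizes;

/* sizesField points to the 3 little-endian 16-bit values preceding the streams.
 * Returns 1 on success, 0 when the sizes are inconsistent. */
static int splitFourStreams(const uint8_t *sizesField, uint32_t totalCSize,
                            uint32_t totalRegenSize, StreamSizes *out)
{
    uint32_t used = 6;   /* the 3 x 16-bit size fields themselves */
    for (int i = 0; i < 3; i++) {
        out->cSize[i] = (uint32_t)sizesField[2 * i]
                      | ((uint32_t)sizesField[2 * i + 1] << 8);
        used += out->cSize[i];
    }
    if (used > totalCSize) return 0;
    out->cSize[3] = totalCSize - used;

    uint32_t const segment = (totalRegenSize + 3) / 4;
    if (3 * segment > totalRegenSize) return 0;
    out->regenSize[0] = out->regenSize[1] = out->regenSize[2] = segment;
    out->regenSize[3] = totalRegenSize - 3 * segment;
    return 1;
}
```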
				660
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	661	##### Bitstreams read and decode
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	662
				663	Each bitstream must be read _backward_,
				664	that is starting from the end down to the beginning.
				665	Therefore it's necessary to know the size of each bitstream.
				666
				667	It's also necessary to know exactly which _bit_ is the last one.
				668	This is detected by a final bit flag :
				669	the highest set bit of the last byte is the final-bit-flag.
				670	Consequently, a last byte of `0` is not possible.
				671	And the final-bit-flag itself is not part of the useful bitstream.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	672	Hence, the last byte contains between 0 and 7 useful bits.
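
A minimal sketch of the final-bit-flag detection, assuming the flag is the highest bit set to 1 in the last byte (the function name `usefulBitsInLastByte` is hypothetical):

```c
#include <stdint.h>

/* Returns the number of useful bits (0 to 7) held by the last byte of a
 * backward bitstream, or -1 when the byte is 0, which is not a valid last byte. */
static int usefulBitsInLastByte(uint8_t lastByte)
{
    if (lastByte == 0) return -1;
    int highest = 7;
    while (((lastByte >> highest) & 1) == 0) highest--;
    /* bit `highest` is the flag; the bits below it are payload */
    return highest;
}
```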
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	673
				674	Starting from the end,
				675	it's possible to read the bitstream in a little-endian fashion,
				676	keeping track of already used bits.
				677
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	678	Reading the last `maxBits` bits,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	679	it's then possible to compare extracted value to decoding table,
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	680	determining the symbol to decode and number of bits to discard.
				681
				682	The process continues up to reading the required number of symbols per stream.
				683	A bitstream must be entirely and exactly consumed,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	684	reaching exactly its beginning position with _all_ bits consumed;
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	685	otherwise, the decoding process is considered faulty.
				686
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	687
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	688	### Sequences section
				689
				690	A compressed block is a succession of _sequences_ .
				691	A sequence is a literal copy command, followed by a match copy command.
				692	A literal copy command specifies a length.
				693	It is the number of bytes to be copied (or extracted) from the literal section.
				694	A match copy command specifies an offset and a length.
				695	The offset gives the position to copy from,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	696	which can be within a previous block.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	697
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	698	There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	699	which are encoded together, interleaved in a single _bitstream_.
				700
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	701	Each symbol is a _code_ in its own context,
				702	which specifies a baseline and a number of bits to add.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	703	_Codes_ are FSE compressed,
				704	and interleaved with raw additional bits in the same bitstream.
				705
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	706	The Sequences section starts with a header,
				707	followed by optional Probability tables for each symbol type,
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	708	followed by the bitstream.
				709
Yann Collet	6fa05a2	2016-07-20 14:58:49 +0200	[diff] [blame]	710	\| Header \| [LitLengthTable] \| [OffsetTable] \| [MatchLengthTable] \| bitStream \|
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	711	\| ------ \| ---------------- \| ------------- \| ------------------ \| --------- \|
				712
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	713	To decode the Sequence section, it's required to know its size.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	714	This size is deduced from `blockSize - literalSectionSize`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	715
				716
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	717	#### Sequences section header
				718
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	719	Consists of 2 items :
				720	- Nb of Sequences
				721	- Flags providing Symbol compression types
				722
				723	__Nb of Sequences__
				724
				725	This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
				726	Let's call its first byte `byte0`.
				727	- `if (byte0 == 0)` : there are no sequences.
				728	The sequence section stops there.
				729	Regenerated content is defined entirely by literals section.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	730	- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
				731	- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
				732	- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
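
The rules above, read as a chain of successive tests, can be sketched in C like this. The function name `readNbSeqs` and its bounds checking are additions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads the variable-size `nbSeqs` field.
 * Returns the number of bytes consumed (1 to 3), or 0 if src is too short. */
static size_t readNbSeqs(const uint8_t *src, size_t srcSize, uint32_t *nbSeqs)
{
    if (srcSize < 1) return 0;
    uint32_t const byte0 = src[0];

    if (byte0 < 128) {            /* also covers byte0 == 0 : no sequences */
        *nbSeqs = byte0;
        return 1;
    }
    if (byte0 < 255) {
        if (srcSize < 2) return 0;
        *nbSeqs = ((byte0 - 128) << 8) + src[1];
        return 2;
    }
    if (srcSize < 3) return 0;
    *nbSeqs = (uint32_t)src[1] + ((uint32_t)src[2] << 8) + 0x7F00;
    return 3;
}
```

Note that the largest 2-byte value, `(126 << 8) + 255 = 32511`, is immediately followed by the smallest 3-byte value, `0x7F00 = 32512`.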
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	733
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	734	__Symbol encoding modes__
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	735
				736	This is a single byte, defining the compression mode of each symbol type.
				737
				738	\| BitNb \| 7-6 \| 5-4 \| 3-2 \| 1-0 \|
				739	\| ------- \| ------ \| ------ \| ------ \| -------- \|
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	740	\|FieldName\| LLType \| OFType \| MLType \| Reserved \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	741
				742	The last field, `Reserved`, must be all-zeroes.
				743
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	744	`LLType`, `OFType` and `MLType` define the compression mode of
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	745	Literal Lengths, Offsets and Match Lengths respectively.
				746
				747	They follow the same enumeration :
				748
Yann Collet	f8e7b53	2016-07-23 16:31:49 +0200	[diff] [blame]	749	\| Value \| 0 \| 1 \| 2 \| 3 \|
				750	\| ---------------- \| ------ \| --- \| ---------- \| ------ \|
				751	\| Compression Mode \| predef \| RLE \| Compressed \| Repeat \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	752
				753	- "predef" : uses a pre-defined distribution table.
				754	- "RLE" : it's a single code, repeated `nbSeqs` times.
				755	- "Repeat" : re-use distribution table from previous compressed block.
Yann Collet	f8e7b53	2016-07-23 16:31:49 +0200	[diff] [blame]	756	- "Compressed" : standard FSE compression.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	757	A distribution table will be present.
				758	It will be described in [next part](#distribution-tables).
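
As a sketch, the symbol encoding modes byte can be split into its fields as shown below. The enum and struct names are hypothetical; the bit positions and values come from the two tables above.

```c
#include <stdint.h>

typedef enum {
    MODE_PREDEF = 0,
    MODE_RLE = 1,
    MODE_COMPRESSED = 2,
    MODE_REPEAT = 3
} SymbolMode;

typedef struct { SymbolMode llType, ofType, mlType; } SymbolModes;

/* Returns 1 on success, 0 when the Reserved bits are not all zero. */
static int parseSymbolModes(uint8_t modesByte, SymbolModes *modes)
{
    if (modesByte & 3) return 0;                        /* bits 1-0 : Reserved */
    modes->llType = (SymbolMode)(modesByte >> 6);        /* bits 7-6 */
    modes->ofType = (SymbolMode)((modesByte >> 4) & 3);  /* bits 5-4 */
    modes->mlType = (SymbolMode)((modesByte >> 2) & 3);  /* bits 3-2 */
    return 1;
}
```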
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	759
				760	#### Symbols decoding
				761
				762	##### Literal Lengths codes
				763
				764	Literal lengths codes are values ranging from `0` to `35` included.
				765	They define lengths from 0 to 131071 bytes.
				766
				767	\| Code \| 0-15 \|
				768	\| ------ \| ---- \|
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	769	\| length \| Code \|
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	770	\| nbBits \| 0 \|
				771
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	772
				773	\| Code \| 16 \| 17 \| 18 \| 19 \| 20 \| 21 \| 22 \| 23 \|
				774	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				775	\| Baseline \| 16 \| 18 \| 20 \| 22 \| 24 \| 28 \| 32 \| 40 \|
				776	\| nb Bits \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				777
				778	\| Code \| 24 \| 25 \| 26 \| 27 \| 28 \| 29 \| 30 \| 31 \|
				779	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				780	\| Baseline \| 48 \| 64 \| 128 \| 256 \| 512 \| 1024 \| 2048 \| 4096 \|
				781	\| nb Bits \| 4 \| 6 \| 7 \| 8 \| 9 \| 10 \| 11 \| 12 \|
				782
				783	\| Code \| 32 \| 33 \| 34 \| 35 \|
				784	\| -------- \| ---- \| ---- \| ---- \| ---- \|
				785	\| Baseline \| 8192 \|16384 \|32768 \|65536 \|
				786	\| nb Bits \| 13 \| 14 \| 15 \| 16 \|
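
The tables above translate directly into a lookup. The following C sketch is illustrative only: the array and function names are hypothetical, and `extraBits` stands for the raw additional bits already read from the bitstream.

```c
#include <stdint.h>

static const uint32_t LL_baseline[36] = {
        0,     1,     2,     3,     4,     5,     6,     7,
        8,     9,    10,    11,    12,    13,    14,    15,
       16,    18,    20,    22,    24,    28,    32,    40,
       48,    64,   128,   256,   512,  1024,  2048,  4096,
     8192, 16384, 32768, 65536 };

static const uint8_t LL_nbBits[36] = {
     0,  0,  0,  0,  0,  0,  0,  0,
     0,  0,  0,  0,  0,  0,  0,  0,
     1,  1,  1,  1,  2,  2,  3,  3,
     4,  6,  7,  8,  9, 10, 11, 12,
    13, 14, 15, 16 };

/* Converts a literal length code and its additional bits into a length.
 * Returns 1 on success, 0 when the code is out of range. */
static int literalLengthFromCode(unsigned code, uint32_t extraBits, uint32_t *length)
{
    if (code > 35) return 0;
    uint32_t const mask = ((uint32_t)1 << LL_nbBits[code]) - 1;
    *length = LL_baseline[code] + (extraBits & mask);
    return 1;
}
```

Code `35` with all 16 additional bits set gives `65536 + 65535 = 131071`, the maximum length stated above.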
				787
				788	__Default distribution__
				789
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	790	When "compression mode" is "predef",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	791	a pre-defined distribution is used for FSE compression.
				792
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	793	Below is its definition. It uses an accuracy of 6 bits (64 states).
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	794	```
				795	short literalLengths_defaultDistribution[36] =
				796	{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
				797	2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
				798	-1,-1,-1,-1 };
				799	```
				800
				801	##### Match Lengths codes
				802
				803	Match lengths codes are values ranging from `0` to `52` included.
				804	They define lengths from 3 to 131074 bytes.
				805
				806	\| Code \| 0-31 \|
				807	\| ------ \| -------- \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	808	\| value \| Code + 3 \|
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	809	\| nbBits \| 0 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	810
				811	\| Code \| 32 \| 33 \| 34 \| 35 \| 36 \| 37 \| 38 \| 39 \|
				812	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				813	\| Baseline \| 35 \| 37 \| 39 \| 41 \| 43 \| 47 \| 51 \| 59 \|
				814	\| nb Bits \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				815
				816	\| Code \| 40 \| 41 \| 42 \| 43 \| 44 \| 45 \| 46 \| 47 \|
				817	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				818	\| Baseline \| 67 \| 83 \| 99 \| 131 \| 259 \| 515 \| 1027 \| 2051 \|
				819	\| nb Bits \| 4 \| 4 \| 5 \| 7 \| 8 \| 9 \| 10 \| 11 \|
				820
				821	\| Code \| 48 \| 49 \| 50 \| 51 \| 52 \|
				822	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				823	\| Baseline \| 4099 \| 8195 \|16387 \|32771 \|65539 \|
				824	\| nb Bits \| 12 \| 13 \| 14 \| 15 \| 16 \|
				825
				826	__Default distribution__
				827
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	828	When "compression mode" is defined as "predef",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	829	a pre-defined distribution is used for FSE compression.
				830
				831	Here is its definition. It uses an accuracy of 6 bits (64 states).
				832	```
				833	short matchLengths_defaultDistribution[53] =
				834	{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				835	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
				836	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
				837	-1,-1,-1,-1,-1 };
				838	```
				839
				840	##### Offset codes
				841
				842	Offset codes are values ranging from `0` to `N`,
				843	with `N` being limited by maximum backreference distance.
				844
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	845	A decoder is free to limit its maximum `N` supported.
				846	Recommendation is to support at least up to `22`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	847	For information, at the time of this writing,
				848	the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
				849
				850	An offset code is also the nb of additional bits to read,
				851	and can be translated into an `OFValue` using the following formula :
				852
				853	```
				854	OFValue = (1 << offsetCode) + readNBits(offsetCode);
				855	if (OFValue > 3) offset = OFValue - 3;
				856	```
				857
				858	OFValue from 1 to 3 are special : they define "repeat codes",
				859	which means one of the previous offsets will be repeated.
				860	They are sorted in recency order, with 1 meaning the most recent one.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	861	See [Repeat offsets](#repeat-offsets) paragraph.
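
A self-contained C sketch of this translation follows. `DecodedOffset` and `decodeOffsetCode` are hypothetical names, `extraBits` stands for the result of `readNBits(offsetCode)`, and the sketch assumes `offsetCode` is at most `31` so the arithmetic fits in 32 bits.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    int      isRepeat;  /* 1 : value is a repeat code (1 to 3) */
    uint32_t value;     /* repeat code, or actual offset */
} DecodedOffset;

static DecodedOffset decodeOffsetCode(unsigned offsetCode, uint32_t extraBits)
{
    DecodedOffset result;
    uint32_t const mask = ((uint32_t)1 << offsetCode) - 1;
    uint32_t const ofValue = ((uint32_t)1 << offsetCode) + (extraBits & mask);

    if (ofValue > 3) {
        result.isRepeat = 0;
        result.value = ofValue - 3;
    } else {
        result.isRepeat = 1;    /* 1 means the most recent offset */
        result.value = ofValue;
    }
    return result;
}
```

With `offsetCode = 28` and all additional bits set, `OFValue` is `2^29 - 1`, which gives an offset of `536,870,908`.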
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	862
				863	__Default distribution__
				864
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	865	When "compression mode" is defined as "predef",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	866	a pre-defined distribution is used for FSE compression.
				867
				868	Here is its definition. It uses an accuracy of 5 bits (32 states),
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	869	and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	870
				871	If any sequence in the compressed block requires an offset larger than this,
				872	it's not possible to use the default distribution to represent it.
				873
				874	```
				875	short offsetCodes_defaultDistribution[29] =
				876	{ 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				877	1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
				878	```
				879
				880	#### Distribution tables
				881
				882	Following the header, up to 3 distribution tables can be described.
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	883	When present, they are in this order :
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	884	- Literal Lengths
				885	- Offsets
				886	- Match Lengths
				887
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	888	The content to decode depends on their respective encoding mode :
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	889	- Predef : no content. Use pre-defined distribution table.
				890	- RLE : 1 byte. This is the only code to use across the whole compressed block.
				891	- FSE : A distribution table is present.
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	892	- Repeat mode : no content. Re-use distribution from previous compressed block.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	893
				894	##### FSE distribution table : condensed format
				895
				896	An FSE distribution table describes the probabilities of all symbols
				897	from `0` to the last present one (included)
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	898	on a normalized scale of `1 << AccuracyLog` .
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	899
				900	It's a bitstream which is read forward, in little-endian fashion.
				901	It's not necessary to know its exact size,
				902	since it will be discovered and reported by the decoding process.
				903
				904	The bitstream starts by reporting on which scale it operates.
				905	`AccuracyLog = low4bits + 5;`
Yann Collet	9d6e949	2016-07-22 19:32:07 +0200	[diff] [blame]	906	Note that maximum `AccuracyLog` for literal and match lengths is `9`,
				907	and for offsets it is `8`. Higher values are considered errors.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	908
				909	Then follow each symbol value, from `0` to last present one.
				910	The nb of bits used by each field is variable.
				911	It depends on :
				912
				913	- Remaining probabilities + 1 :
				914	__example__ :
				915	Presuming an AccuracyLog of 8,
				916	and presuming 100 probabilities points have already been distributed,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	917	the decoder may read any value from `0` to `256 - 100 + 1 == 157` (included).
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	918	Therefore, it must read `log2sup(157) == 8` bits.
				919
				920	- Value decoded : small values use 1 less bit :
				921	__example__ :
				922	Presuming values from 0 to 157 (included) are possible,
				923	255-157 = 98 values are remaining in an 8-bits field.
				924	They are used this way :
				925	first 98 values (hence from 0 to 97) use only 7 bits,
				926	values from 98 to 157 use 8 bits.
				927	This is achieved through this scheme :
				928
				929	\| Value read \| Value decoded \| nb Bits used \|
				930	\| ---------- \| ------------- \| ------------ \|
				931	\| 0 - 97 \| 0 - 97 \| 7 \|
				932	\| 98 - 127 \| 98 - 127 \| 8 \|
				933	\| 128 - 225 \| 0 - 97 \| 7 \|
				934	\| 226 - 255 \| 128 - 157 \| 8 \|
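
The scheme in the table above is one instance of a general rule, which can be sketched in C as below. The names `FseHeaderValue`, `log2sup` (modelled on the notation used above) and `decodeFseHeaderValue` are hypothetical. `maxValue` is the largest value that may be read at this point (at least 1), and `peekedBits` holds the next bits of the forward bitstream, lowest bit first.

```c
/* Hypothetical result type for this sketch. */
typedef struct { unsigned value; unsigned nbBitsUsed; } FseHeaderValue;

/* Number of bits needed to represent v (v >= 1). */
static unsigned log2sup(unsigned v)
{
    unsigned n = 0;
    while (v) { n++; v >>= 1; }
    return n;
}

static FseHeaderValue decodeFseHeaderValue(unsigned maxValue, unsigned peekedBits)
{
    FseHeaderValue r;
    unsigned const nbBits  = log2sup(maxValue);
    unsigned const half    = 1u << (nbBits - 1);
    unsigned const nbShort = (2 * half - 1) - maxValue;  /* values coded on nbBits-1 bits */
    unsigned const low     = peekedBits & (half - 1);

    if (low < nbShort) {               /* small value : 1 less bit */
        r.value = low;
        r.nbBitsUsed = nbBits - 1;
        return r;
    }
    r.value = peekedBits & (2 * half - 1);
    if (r.value >= half) r.value -= nbShort;  /* skip the range taken by short codes */
    r.nbBitsUsed = nbBits;
    return r;
}
```

The first and third rows of the table correspond to the early return, and the last row to the subtraction of `nbShort`.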
				935
				936	Symbols probabilities are read one by one, in order.
				937
				938	Probability is obtained from Value decoded by the following formula :
				939	`Proba = value - 1;`
				940
				941	It means value `0` becomes negative probability `-1`.
				942	`-1` is a special probability, which means `less than 1`.
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	943	Its effect on distribution table is described in [next paragraph].
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	944	For the purpose of calculating cumulated distribution, it counts as one.
				945
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	946	[next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables
				947
When a symbol has a probability of `zero`,
it is followed by a 2-bits repeat flag.
This repeat flag tells how many zero probabilities follow the current one.
It provides a number ranging from 0 to 3.
If it is a 3, another 2-bits repeat flag follows, and so on.
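
The chaining of repeat flags can be sketched as follows. `count_repeated_zeroes` is a hypothetical helper, and it assumes the successive 2-bits flags have already been extracted from the bitstream into an array, one flag per element.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many additional zero-probability symbols the flags announce,
 * and stores in *consumed how many 2-bits flags were read. */
static unsigned count_repeated_zeroes(const unsigned char *flags, size_t nbFlags,
                                      size_t *consumed)
{
    unsigned zeroes = 0;
    size_t i = 0;
    assert(flags != NULL && consumed != NULL);
    while (i < nbFlags) {
        unsigned const flag = flags[i++] & 3;
        zeroes += flag;
        if (flag != 3) break;    /* only a value of 3 announces another flag */
    }
    *consumed = i;
    return zeroes;
}
```

For instance, the flags `3, 3, 1` announce 7 more zero probabilities and consume 6 bits.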
				953
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	954	When last symbol reaches cumulated total of `1 << AccuracyLog`,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	955	decoding is complete.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	956	If the last symbol makes cumulated total go above `1 << AccuracyLog`,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	957	distribution is considered corrupted.
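
A minimal sketch of this bookkeeping, using the hypothetical helper `add_probability` : it applies `Proba = value - 1`, counts a `-1` probability as one, and reports whether the distribution is complete, incomplete or corrupted.

```c
#include <assert.h>
#include <stddef.h>

/* Records one decoded value.
 * Returns 1 when the cumulated total reaches 1 << accuracyLog,
 * 0 when more symbols are expected,
 * -1 when the total goes above it (corrupted distribution). */
static int add_probability(unsigned value, unsigned accuracyLog,
                           unsigned *total, int *proba)
{
    assert(total != NULL && proba != NULL && accuracyLog < 16);
    unsigned const target = 1u << accuracyLog;
    *proba = (int)value - 1;                           /* Proba = value - 1 */
    *total += (*proba == -1) ? 1u : (unsigned)*proba;  /* -1 counts as one  */
    if (*total > target) return -1;
    return *total == target;
}
```

With an `AccuracyLog` of 5, the values `21`, `0` and `12` give probabilities `20`, `-1` and `11`, for a cumulated total of 32, which completes the distribution.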
				958
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	959	Then the decoder can tell how many bytes were used in this process,
				960	and how many symbols are present.
				961	The bitstream consumes a round number of bytes.
				962	Any remaining bit within the last byte is just unused.
				963
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	964	##### FSE decoding : from normalized distribution to decoding tables
				965
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	966	The distribution of normalized probabilities is enough
				967	to create a unique decoding table.
				968
It is built according to the following rules :
				970
				971	The table has a size of `tableSize = 1 << AccuracyLog;`.
				972	Each cell describes the symbol decoded,
				973	and instructions to get the next state.
				974
Symbols are scanned in their natural order for `less than 1` probabilities.
Each symbol with this probability is attributed a single cell,
starting from the end of the table.
These symbols define a full state reset, reading `AccuracyLog` bits.
				979
All remaining symbols are sorted in their natural order.
Starting from symbol `0` and table position `0`,
each symbol gets attributed as many cells as its probability.
Cell allocation is spread, not linear :
each successor position follows this rule :
				985
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	986	```
				987	position += (tableSize>>1) + (tableSize>>3) + 3;
				988	position &= tableSize-1;
				989	```
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	990
				991	A position is skipped if already occupied,
				992	typically by a "less than 1" probability symbol.
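
The two passes described above can be put together as in the sketch below. `spread_symbols` and its parameters are hypothetical names; the sketch assumes that the only occupied cells are the ones given to `less than 1` symbols at the end of the table, and that `AccuracyLog` is at least 4 so that the step stays odd.

```c
#include <assert.h>
#include <stddef.h>

/* Fills table[0 .. (1 << accuracyLog) - 1] with the symbol decoded by each state.
 * proba[s] is the normalized probability of symbol s (-1 means "less than 1").
 * Returns 0 on success, -1 if the probabilities do not sum to the table size. */
static int spread_symbols(const short *proba, size_t nbSymbols,
                          unsigned accuracyLog, unsigned char *table)
{
    assert(proba != NULL && table != NULL);
    assert(accuracyLog >= 4 && accuracyLog < 16 && nbSymbols <= 256);
    size_t const tableSize = (size_t)1 << accuracyLog;
    size_t const step = (tableSize >> 1) + (tableSize >> 3) + 3;
    size_t const mask = tableSize - 1;
    size_t high = tableSize;    /* cells at index >= high are occupied */
    size_t total = 0;
    size_t pos = 0;

    /* Validate first: a -1 probability counts as one cell. */
    for (size_t s = 0; s < nbSymbols; s++) {
        if (proba[s] < -1) return -1;
        total += (proba[s] == -1) ? 1 : (size_t)proba[s];
    }
    if (total != tableSize) return -1;

    /* "Less than 1" symbols: one cell each, starting from the end of the table. */
    for (size_t s = 0; s < nbSymbols; s++)
        if (proba[s] == -1) table[--high] = (unsigned char)s;

    /* Remaining symbols: as many cells as their probability, spread by step. */
    for (size_t s = 0; s < nbSymbols; s++) {
        for (int i = 0; i < proba[s]; i++) {
            table[pos] = (unsigned char)s;
            do { pos = (pos + step) & mask; } while (pos >= high);  /* skip occupied */
        }
    }
    return 0;
}
```

With an `AccuracyLog` of 5 and probabilities `{20, 11, -1}`, symbol 2 lands in cell 31, symbol 0 takes cells 0, 23, 14, 5, and so on, and symbol 1 starts at cell 12.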
				993
				994	The result is a list of state values.
				995	Each state will decode the current symbol.
				996
To get the number of bits and baseline required for the next state,
it's first necessary to sort all states in their natural order.
The lower states will need 1 more bit than higher ones.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1000
				1001	__Example__ :
Presuming a symbol has a probability of 5,
it receives 5 state values. States are sorted in natural order.

Next power of 2 is 8.
Space of probabilities is divided into 8 equal parts.
Presuming the AccuracyLog is 7, it defines 128 states.
Divided by 8, each share is 16 wide.

In order to reach 8, the 8-5=3 lowest states will count "double",
taking shares twice as large,
requiring one more bit in the process.

Numbering starts from higher states using fewer bits.
				1015
| state order |   0   |   1   |    2   |   3  |   4   |
| ----------- | ----- | ----- | ------ | ---- | ----- |
| width       |  32   |  32   |   32   |  16  |  16   |
| nb Bits     |   5   |   5   |    5   |   4  |   4   |
| range nb    |   2   |   4   |    6   |   0  |   1   |
| baseline    |  32   |  64   |   96   |   0  |  16   |
| range       | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
				1023
				1024	Next state is determined from current state
				1025	by reading the required number of bits, and adding the specified baseline.
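
One compact way to obtain the number of bits and the baseline, which reproduces the example table above, is sketched below. `next_state_rule` and `highest_bit` are hypothetical names, and the formula is offered as an illustration rather than as the normative definition.

```c
#include <assert.h>
#include <stddef.h>

/* Position of the highest bit set in v (v must be non-zero). */
static unsigned highest_bit(unsigned v)
{
    unsigned n = 0;
    while (v >>= 1) n++;
    return n;
}

/* proba : normalized probability of the symbol (number of states it owns).
 * rank  : position of the state among the states of that symbol,
 *         sorted in natural order (0 is the lowest state). */
static void next_state_rule(unsigned proba, unsigned rank, unsigned accuracyLog,
                            unsigned *nbBits, unsigned *baseline)
{
    assert(nbBits != NULL && baseline != NULL && accuracyLog < 16);
    assert(proba >= 1 && rank < proba && proba <= (1u << accuracyLog));
    unsigned const x = proba + rank;     /* runs from proba to 2*proba - 1 */
    *nbBits   = accuracyLog - highest_bit(x);
    *baseline = (x << *nbBits) - (1u << accuracyLog);
}
```

For a probability of 5 and an `AccuracyLog` of 7, ranks 0 to 4 give the pairs (5 bits, 32), (5, 64), (5, 96), (4, 0) and (4, 16), matching the columns of the table.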
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	1026
				1027
				1028	#### Bitstream
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1029
All sequences are stored in a single bitstream, read _backward_.
It is therefore necessary to know the bitstream size,
which is deduced from the compressed block size.
				1033
The last useful bit of the stream is followed by an end-bit-flag.
The highest bit set to 1 in the last byte is this flag.
It does not belong to the useful part of the bitstream.
Therefore, the last byte has 0-7 useful bits.
Note that it also means that the last byte cannot be `0`.
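
Locating the end-bit-flag can be sketched as follows, reading the flag as the highest bit set to 1 in the last byte. `useful_bits_in_last_byte` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of useful bits (0-7) in the last byte of the bitstream,
 * or -1 if that byte is 0, which the format does not allow. */
static int useful_bits_in_last_byte(const unsigned char *stream, size_t size)
{
    assert(stream != NULL && size >= 1);
    unsigned last = stream[size - 1];
    int n = -1;
    while (last != 0) {    /* after the loop, n is the index of the flag bit */
        last >>= 1;
        n++;
    }
    return n;
}
```

A last byte of `0x01` carries only the flag (0 useful bits), while `0x80` carries 7 useful bits below the flag.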
				1039
				1040	##### Starting states
				1041
				1042	The bitstream starts with initial state values,
				1043	each using the required number of bits in their respective _accuracy_,
				1044	decoded previously from their normalized distribution.
				1045
It starts with `Literal Length State`,
				1047	followed by `Offset State`,
				1048	and finally `Match Length State`.
				1049
				1050	Reminder : always keep in mind that all values are read _backward_.
				1051
				1052	##### Decoding a sequence
				1053
				1054	A state gives a code.
				1055	A code provides a baseline and number of bits to add.
				1056	See [Symbol Decoding] section for details on each symbol.
				1057
Decoding starts by reading the number of bits required to decode the offset.
It then does the same for match length,
and then for literal length.

Offset / matchLength / litLength define a sequence.
Executing it starts by inserting the number of literals defined by `litLength`,
then continues by copying `matchLength` bytes from `currentPos - offset`.
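
Executing one sequence can be sketched as below. `execute_sequence` is a hypothetical helper; it assumes, for simplicity, that all previously decoded bytes live in the same output buffer, and it leaves dictionary content out of scope.

```c
#include <assert.h>
#include <stddef.h>

/* Appends one sequence to out[], whose first pos bytes are already decoded.
 * Returns the new position, or (size_t)-1 on error
 * (output buffer too small, or offset pointing before the start of out[]). */
static size_t execute_sequence(unsigned char *out, size_t capacity, size_t pos,
                               const unsigned char *literals, size_t litLength,
                               size_t matchLength, size_t offset)
{
    assert(out != NULL && pos <= capacity);
    assert(literals != NULL || litLength == 0);

    /* Step 1 : insert litLength literals. */
    if (litLength > capacity - pos) return (size_t)-1;
    for (size_t i = 0; i < litLength; i++) out[pos++] = literals[i];

    /* Step 2 : copy matchLength bytes from currentPos - offset. */
    if (matchLength > capacity - pos) return (size_t)-1;
    if (matchLength > 0 && (offset == 0 || offset > pos)) return (size_t)-1;
    /* Byte by byte on purpose : the match may overlap the bytes being written. */
    for (size_t i = 0; i < matchLength; i++, pos++) out[pos] = out[pos - offset];

    return pos;
}
```

Starting from an empty buffer, the literals `abc` followed by a match of length 6 at offset 3 produce `abcabcabc`, which shows why the copy must tolerate overlap.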
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1065
				1066	The next operation is to update states.
				1067	Using rules pre-calculated in the decoding tables,
				1068	`Literal Length State` is updated,
				1069	followed by `Match Length State`,
				1070	and then `Offset State`.
				1071
				1072	This operation will be repeated `NbSeqs` times.
				1073	At the end, the bitstream shall be entirely consumed,
				1074	otherwise bitstream is considered corrupted.
				1075
				1076	[Symbol Decoding]:#symbols-decoding
				1077
				1078	##### Repeat offsets
				1079
As seen in [Offset Codes], the first 3 values define a repeated offset.
They are sorted in recency order, with 1 meaning "most recent one".

There is an exception though, when the current sequence's literal length is `0`.
In this case, repcodes are "pushed by one",
so 1 becomes 2, 2 becomes 3,
and 3 becomes "offset_1 - 1_byte".
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1087
Yann Collet	917fe18	2016-07-31 04:01:57 +0200	[diff] [blame]	1088	On first block, offset history is populated by the following values : 1, 4 and 8 (in order).
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1089
Then each block receives its start value from the previous compressed block.
Note that non-compressed blocks are skipped :
they do not contribute to offset history.
				1093
				1094	[Offset Codes]: #offset-codes
				1095
				1096	###### Offset updates rules
				1097
A new offset takes the lead in offset history,
shifting the other entries down, up to its previous place if it was already present.

It means that when repeat offset 1 (most recent) is used, history is unmodified.
When repeat offset 2 is used, it's swapped with offset 1.
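
The repeat offset rules, including the "pushed by one" exception and the history update, can be sketched as follows. `use_repeat_offset` is a hypothetical helper that only handles the 3 repeat codes; `rep[0]` holds the most recent offset. The text does not say what happens if "offset_1 - 1_byte" yields 0, so the sketch does not handle that case.

```c
#include <assert.h>
#include <stddef.h>

/* code      : repeat code, 1 to 3 (1 meaning "most recent one").
 * litLength : literal length of the current sequence.
 * rep       : offset history, updated in place.
 * Returns the offset to use for the match. */
static unsigned use_repeat_offset(unsigned code, unsigned litLength, unsigned rep[3])
{
    assert(rep != NULL && code >= 1 && code <= 3);
    /* When litLength is 0, codes are pushed by one : index 3 means rep[0] - 1. */
    unsigned const idx    = code - 1 + (litLength == 0);
    unsigned const offset = (idx == 3) ? rep[0] - 1 : rep[idx];
    unsigned const last   = (idx == 3) ? 2 : idx;   /* previous place, if any */

    /* The offset takes the lead; entries ahead of its previous place move down. */
    for (unsigned i = last; i > 0; i--) rep[i] = rep[i - 1];
    rep[0] = offset;
    return offset;
}
```

Starting from the initial history `{1, 4, 8}`, using repeat offset 2 with a non-zero literal length returns 4 and leaves `{4, 1, 8}`; using repeat offset 1 afterwards returns 4 and changes nothing.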
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1103
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1104
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1105	Dictionary format
				1106	-----------------
				1107
				1108	`zstd` is compatible with "pure content" dictionaries, free of any format restriction.
				1109	But dictionaries created by `zstd --train` follow a format, described here.
				1110
				1111	__Pre-requisites__ : a dictionary has a known length,
				1112	defined either by a buffer limit, or a file size.
				1113
| Header | DictID | Stats | Content |
| ------ | ------ | ----- | ------- |
				1116
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	1117	__Header__ : 4 bytes ID, value 0xEC30A437, Little-Endian format
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1118
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	1119	__Dict_ID__ : 4 bytes, stored in Little-Endian format.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1120	DictID can be any value, except 0 (which means no DictID).
Yann Collet	722e14b	2016-07-08 19:22:16 +0200	[diff] [blame]	1121	It's used by decoders to check if they use the correct dictionary.
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	1122	_Reserved ranges :_
				1123	If the frame is going to be distributed in a private environment,
				1124	any dictionary ID can be used.
				1125	However, for public distribution of compressed frames,
				1126	some ranges are reserved for future use :
Yann Collet	6cacd34	2016-07-15 17:58:13 +0200	[diff] [blame]	1127
				1128	- low range : 1 - 32767 : reserved
				1129	- high range : >= (2^31) : reserved
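
Reading the two fixed-size fields and checking the reserved ranges can be sketched as below. `read_dict_id`, `dict_id_is_reserved`, `read_le32` and `DICT_MAGIC` are hypothetical names; a buffer that does not start with the Header value is simply reported as not following this format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DICT_MAGIC 0xEC30A437u    /* Header value given above */

/* Reads a 4-bytes Little-Endian value. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and stores the DictID if buf starts with the dictionary Header,
 * 0 otherwise (the buffer may then be a "pure content" dictionary). */
static int read_dict_id(const unsigned char *buf, size_t size, uint32_t *dictID)
{
    assert(buf != NULL && dictID != NULL);
    if (size < 8 || read_le32(buf) != DICT_MAGIC) return 0;
    *dictID = read_le32(buf + 4);
    return 1;
}

/* 1 if the ID falls in a range reserved for public distribution. */
static int dict_id_is_reserved(uint32_t dictID)
{
    return (dictID >= 1 && dictID <= 32767) || dictID >= 0x80000000u;
}
```

A dictionary starting with the bytes `37 A4 30 EC 00 00 01 00` has a DictID of 65536, which lies outside both reserved ranges.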
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1130
__Stats__ : Entropy tables, following the same format as [compressed blocks].
They are stored in the following order :
Huffman tables for literals, FSE table for offset,
FSE table for matchLength, and FSE table for litLength.
They are finally followed by 3 offset values, populating recent offsets,
stored in order, 4-bytes little-endian each, for a total of 12 bytes.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1137
				1138	__Content__ : Where the actual dictionary content is.
Yann Collet	722e14b	2016-07-08 19:22:16 +0200	[diff] [blame]	1139	Content size depends on Dictionary size.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1140
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	1141	[compressed blocks]: #the-format-of-compressed_block
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1142
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1143
				1144	Version changes
				1145	---------------
Yann Collet	6fa05a2	2016-07-20 14:58:49 +0200	[diff] [blame]	1146	- 0.2.0 : numerous format adjustments for zstd v0.8
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame^]	1147	- 0.1.2 : limit Huffman tree depth to 11 bits
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	1148	- 0.1.1 : reserved dictID ranges
				1149	- 0.1.0 : initial release