Zstandard Compression Format
============================

### Notices

Copyright (c) 2016 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

0.2.0 (22/07/16)


Introduction
------------

The purpose of this document is to define a lossless compressed data format
that is independent of CPU type, operating system,
file system and character set, suitable for
file compression, pipe and streaming compression,
using the [Zstandard algorithm](http://www.zstandard.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the Zstandard compression method,
and an optional [xxHash-64 checksum method](http://www.xxhash.org),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into Zstandard format and/or decompress data from Zstandard format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It doesn’t need to support all options though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore informative fields, such as the checksum.
Whenever it does not support a parameter defined in the compressed stream,
it must produce a non-ambiguous error code and associated error message
explaining which parameter is unsupported.


Overall conventions
-------------------
In this document, square brackets (`[` and `]`) indicate optional fields or parameters.


Definitions
-----------
Content compressed by Zstandard is transformed into a Zstandard __frame__.
Multiple frames can be appended into a single file or stream.
A frame is totally independent, has a defined beginning and end,
and a set of parameters which tells the decoder how to decompress it.

A frame encapsulates one or multiple __blocks__.
Each block can be compressed or not,
and has a guaranteed maximum content size, which depends on frame parameters.
Unlike frames, each block depends on previous blocks for proper decoding.
However, each block can be decompressed without waiting for its successor,
allowing streaming operations.


Frame Concatenation
-------------------

In some circumstances, it may be required to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame brings its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file is left outside of this specification.
As an example, the reference `zstd` command line utility is able
to decode all concatenated frames in their sequential order,
delivering the final decompressed result as if it were a single content.


General Structure of Zstandard Frame format
-------------------------------------------
The structure of a single Zstandard frame is as follows:

| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] |
|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:|
| 4 bytes        |  2-14 bytes    | n bytes    |                    |   0-4 bytes          |

__`Magic_Number`__

4 Bytes, Little-endian format.
Value : 0xFD2FB527

__`Frame_Header`__

2 to 14 Bytes, detailed in the [next part](#the-structure-of-frame_header).

__`Data_Block`__

Detailed in the [next chapter](#the-structure-of-data_block).
That’s where compressed data is stored.

__`Content_Checksum`__

An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
The content checksum is the result
of the [xxh64() hash function](https://www.xxHash.com)
digesting the original (decoded) data as input, with a seed of zero.
The low 4 bytes of the checksum are stored in little-endian format.


The structure of `Frame_Header`
-------------------------------
The `Frame_Header` has a variable size, which uses a minimum of 2 bytes,
and up to 14 bytes depending on optional parameters.
The structure of `Frame_Header` is as follows:

| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
| ------------------------- | --------------------- | ----------------- | ---------------------- |
| 1 byte                    | 0-1 byte              | 0-4 bytes         | 0-8 bytes              |

### `Frame_Header_Descriptor`

The first byte of the header is called the `Frame_Header_Descriptor`.
It tells which other fields are present.
Decoding this byte is enough to tell the size of `Frame_Header`.

| Bit number | Field name                |
| ---------- | ------------------------- |
| 7-6        | `Frame_Content_Size_flag` |
| 5          | `Single_Segment_flag`     |
| 4          | `Unused_bit`              |
| 3          | `Reserved_bit`            |
| 2          | `Content_Checksum_flag`   |
| 1-0        | `Dictionary_ID_flag`      |

In this table, bit 7 is the highest bit, while bit 0 is the lowest.

__`Frame_Content_Size_flag`__

This is a 2-bit flag (`= Frame_Header_Descriptor >> 6`),
specifying whether the decompressed data size is provided within the header.
The `Flag_Value` can be converted into `Field_Size`,
which is the number of bytes used by `Frame_Content_Size`,
according to the following table:

|`Flag_Value`|  0  |  1  |  2  |  3  |
| ---------- | --- | --- | --- | --- |
|`Field_Size`| 0-1 |  2  |  4  |  8  |

When `Flag_Value` is `0`, `Field_Size` depends on `Single_Segment_flag` :
if `Single_Segment_flag` is set, `Field_Size` is 1.
Otherwise, `Field_Size` is 0 (content size not provided).

__`Single_Segment_flag`__

If this flag is set,
data must be regenerated within a single continuous memory segment.

In this case, `Frame_Content_Size` is necessarily present,
but the `Window_Descriptor` byte is skipped.
As a consequence, the decoder must allocate a memory segment
of size equal to or greater than `Frame_Content_Size`.

To protect itself from unreasonable memory requirements,
a decoder can reject a compressed frame
which requests a memory size beyond the decoder's authorized range.

For broader compatibility, decoders are recommended to support
memory sizes of at least 8 MB.
This is just a recommendation;
each decoder is free to support higher or lower limits,
depending on local limitations.

__`Unused_bit`__

The value of this bit should be set to zero.
A decoder compliant with this specification version shall not interpret it.
It might be used in a future version,
to signal a property which is not mandatory to properly decode the frame.

__`Reserved_bit`__

This bit is reserved for some future feature.
Its value _must be zero_.
A decoder compliant with this specification version must ensure it is not set.
This bit may be used in a future revision,
to signal a feature that must be interpreted to decode the frame correctly.

__`Content_Checksum_flag`__

If this flag is set, a 32-bit `Content_Checksum` will be present at the frame's end.
See the `Content_Checksum` paragraph.

__`Dictionary_ID_flag`__

This is a 2-bit flag (`= Frame_Header_Descriptor & 3`),
telling whether a dictionary ID is provided within the header.
It also specifies the size of this field.

| Value    |  0  |  1  |  2  |  3  |
| -------- | --- | --- | --- | --- |
|Field size|  0  |  1  |  2  |  4  |
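
As an illustration of how the descriptor byte determines the header layout, the following Python sketch extracts each field and derives the total `Frame_Header` size. The function name and the keys of the returned dictionary are assumptions made for this example, and raising an exception when `Reserved_bit` is set is only one possible way to honor the rule above.

```python
def parse_frame_header_descriptor(fhd: int) -> dict:
    """Split the descriptor byte into its fields and derive field sizes."""
    if not 0 <= fhd <= 0xFF:
        raise ValueError("descriptor must be a single byte")
    fcs_flag = fhd >> 6              # bits 7-6: Frame_Content_Size_flag
    single_segment = (fhd >> 5) & 1  # bit 5: Single_Segment_flag
    reserved = (fhd >> 3) & 1        # bit 3: Reserved_bit, must be zero
    checksum = (fhd >> 2) & 1        # bit 2: Content_Checksum_flag
    did_flag = fhd & 3               # bits 1-0: Dictionary_ID_flag
    # Bit 4 (Unused_bit) is deliberately not interpreted.
    if reserved:
        raise ValueError("Reserved_bit is set")
    # Flag value 0 means 1 byte when Single_Segment_flag is set, else 0 bytes.
    fcs_size = (single_segment, 2, 4, 8)[fcs_flag]
    did_size = (0, 1, 2, 4)[did_flag]
    # Window_Descriptor is skipped when Single_Segment_flag is set.
    window_size = 0 if single_segment else 1
    return {
        "fcs_field_size": fcs_size,
        "dictionary_id_size": did_size,
        "window_descriptor_size": window_size,
        "has_checksum": bool(checksum),
        "frame_header_size": 1 + window_size + did_size + fcs_size,
    }
```

With a descriptor of `0x00` the sketch reports the 2-byte minimum header, and with `0xC3` it reports the 14-byte maximum.
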

### `Window_Descriptor`

Provides guarantees on the maximum back-reference distance
that will be used within compressed data.
This information is important for decoders to allocate enough memory.

The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag` is set.
In this case, the maximum back-reference distance is the content size itself,
which can be any value from 1 to 2^64-1 bytes (16 EB).

| Bit numbers |   7-3    |   2-0    |
| ----------- | -------- | -------- |
| Field name  | Exponent | Mantissa |

Maximum distance is given by the following formulas :
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
windowAdd = (windowBase / 8) * Mantissa;
windowSize = windowBase + windowAdd;
```
The minimum window size is 1 KB.
The maximum size is `15*(1<<38)` bytes, which is 3.75 TB.

To properly decode compressed data,
a decoder will need to allocate a buffer of at least `windowSize` bytes.
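
The formulas above can be sketched in Python as shown below. The function name `window_size_from_descriptor` is illustrative only, and integer division is used so that the result stays an exact integer.

```python
def window_size_from_descriptor(descriptor: int) -> int:
    """Compute windowSize in bytes from the Window_Descriptor byte."""
    if not 0 <= descriptor <= 0xFF:
        raise ValueError("descriptor must be a single byte")
    exponent = descriptor >> 3   # bits 7-3
    mantissa = descriptor & 7    # bits 2-0
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add
```

For example, a descriptor of `0x00` yields the 1 KB minimum, and `0x68` (Exponent 13, Mantissa 0) yields 8 MB.
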

To protect itself from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond the decoder's authorized range.

For improved interoperability,
decoders are recommended to be compatible with window sizes of 8 MB,
and encoders are recommended to not request more than 8 MB.
It's merely a recommendation though;
decoders are free to support higher or lower limits,
depending on local limitations.

### `Dictionary_ID`

This is a variable size field, which contains
the ID of the dictionary required to properly decode the frame.
Note that this field is optional. When it's not present,
it's up to the caller to make sure it uses the correct dictionary.

Field size depends on `Dictionary_ID_flag`.
1 byte can represent an ID 0-255.
2 bytes can represent an ID 0-65535.
4 bytes can represent an ID 0-4294967295.

It's allowed to represent a small ID (for example `13`)
with a large 4-byte dictionary ID, losing some compactness in the process.

_Reserved ranges :_
If the frame is going to be distributed in a private environment,
any dictionary ID can be used.
However, for public distribution of compressed frames using a dictionary,
the following ranges are reserved for future use and should not be used :
- low range : 1 - 32767
- high range : >= (2^31)
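
A check against the reserved ranges might look like the following Python sketch. The function name is hypothetical, and ID `0` is treated as not reserved simply because it lies outside both listed ranges.

```python
def is_reserved_dictionary_id(dict_id: int) -> bool:
    """Return True if dict_id falls in a range reserved for future use."""
    if not 0 <= dict_id <= 0xFFFFFFFF:
        raise ValueError("dictionary ID must fit in 4 bytes")
    in_low_range = 1 <= dict_id <= 32767
    in_high_range = dict_id >= 2**31
    return in_low_range or in_high_range
```

An encoder producing frames for public distribution could call this before writing the `Dictionary_ID` field.
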


### `Frame_Content_Size`

This is the original (uncompressed) size. This information is optional.
The `Field_Size` is provided according to the value of `Frame_Content_Size_flag`.
The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
Format is Little-endian.

| `Field_Size` |    Range    |
| ------------ | ----------- |
|      1       | 0 - 255     |
|      2       | 256 - 65791 |
|      4       | 0 - 2^32-1  |
|      8       | 0 - 2^64-1  |

When `Field_Size` is 1, 4 or 8 bytes, the value is read directly.
When `Field_Size` is 2, _the offset of 256 is added_.
It's allowed to represent a small size (for example `18`) using any compatible variant.
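
A minimal Python sketch of this decoding rule follows. The function name is an assumption for this example, and the caller is expected to pass exactly the `Field_Size` bytes read from the header.

```python
def decode_frame_content_size(field: bytes) -> int:
    """Decode Frame_Content_Size from its 1, 2, 4 or 8 raw bytes."""
    if len(field) not in (1, 2, 4, 8):
        raise ValueError("Field_Size must be 1, 2, 4 or 8 bytes")
    value = int.from_bytes(field, "little")
    if len(field) == 2:
        value += 256  # the 2-byte variant stores the size minus 256
    return value
```

A size of `18` can therefore be carried as the single byte `12` (hexadecimal) or as a 4-byte or 8-byte field, but not as a 2-byte field, whose range starts at 256.
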


The structure of `Data_Block`
-----------------------------
The structure of `Data_Block` is as follows:

| `Last_Block` | `Block_Type` | `Block_Size` | `Block_Content` |
|:------------:|:------------:|:------------:|:---------------:|
|   1 bit      |  2 bits      |  21 bits     |  n bytes        |

The block header uses 3 bytes.

__`Last_Block`__

The lowest bit signals whether this block is the last one.
If it is, the frame ends right after this block.
It may be followed by an optional `Content_Checksum`.

__`Block_Type` and `Block_Size`__

The next 2 bits represent the `Block_Type`,
while the remaining 21 bits represent the `Block_Size`.
Format is __little-endian__.

There are 4 block types :

|    Value     |      0      |      1      |         2          |     3     |
| ------------ | ----------- | ----------- | ------------------ | --------- |
| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|

- `Raw_Block` - this is an uncompressed block.
  `Block_Size` is the number of bytes to read and copy.
- `RLE_Block` - this is a single byte, repeated N times.
  In this case, `Block_Size` is the size to regenerate,
  while the "compressed" block is just 1 byte (the byte to repeat).
- `Compressed_Block` - this is a [Zstandard compressed block](#the-format-of-compressed_block),
  detailed in another section of this specification.
  `Block_Size` is the compressed size.
  Decompressed size is unknown,
  but its maximum possible value is guaranteed (see below).
- `Reserved` - this is not a block.
  This value cannot be used with the current version of this specification.

Block sizes must respect a few rules :
- In compressed mode, compressed size is always strictly `< decompressed size`.
- Block decompressed size is always <= maximum back-reference distance.
- Block decompressed size is always <= 128 KB.
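
To illustrate, the 3-byte header can be unpacked as in the Python sketch below. It assumes, as described above, that the three bytes form one little-endian 24-bit value whose lowest bit is `Last_Block`. The function name and the layout of the returned tuple are hypothetical. The last element is the number of `Block_Content` bytes that follow in the stream, which is 1 for an `RLE_Block` regardless of `Block_Size`.

```python
BLOCK_TYPES = ("Raw_Block", "RLE_Block", "Compressed_Block", "Reserved")


def parse_block_header(header: bytes):
    """Return (last_block, block_type, block_size, content_bytes)."""
    if len(header) != 3:
        raise ValueError("block header is exactly 3 bytes")
    value = int.from_bytes(header, "little")
    last_block = bool(value & 1)                # lowest bit
    block_type = BLOCK_TYPES[(value >> 1) & 3]  # next 2 bits
    block_size = value >> 3                     # remaining 21 bits
    if block_type == "Reserved":
        raise ValueError("Reserved block type cannot be used")
    # An RLE block stores a single byte to repeat block_size times.
    content_bytes = 1 if block_type == "RLE_Block" else block_size
    return last_block, block_type, block_size, content_bytes
```

For instance, the header bytes `29 00 00` describe a final `Raw_Block` holding 5 bytes.
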


__`Block_Content`__

The `Block_Content` is where the actual data to decode is stored.
It might be compressed or not, depending on previous field indications.
A data block is not necessarily "full" :
since an arbitrary "flush" may happen anytime,
block decompressed content can be any size,
up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
- Maximum back-reference distance
- 128 KB


Skippable Frames
----------------

| `Magic_Number` | `Frame_Size` | `User_Data` |
|:--------------:|:------------:|:-----------:|
|   4 bytes      |  4 bytes     |   n bytes   |

Skippable frames allow the insertion of user-defined data
into a flow of concatenated frames.
Their design is pretty straightforward,
with the sole objective of allowing the decoder to quickly skip
over user-defined data and continue decoding.

Skippable frames defined in this specification are compatible with [LZ4] ones.

[LZ4]:http://www.lz4.org

__`Magic_Number`__

4 Bytes, Little-endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__`Frame_Size`__

This is the size, in bytes, of the following `User_Data`
(not including the magic number or the size field itself).
This field is represented using 4 Bytes, Little-endian format, as an unsigned 32-bit value.
This means `User_Data` can’t be bigger than (2^32-1) bytes.

__`User_Data`__

The `User_Data` can be anything. Data will just be skipped by the decoder.
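
A decoder can step over a skippable frame with very little work, as the following Python sketch suggests. The function name `skip_skippable_frame` and its error handling are assumptions made for this illustration.

```python
import struct


def skip_skippable_frame(data: bytes, pos: int = 0) -> int:
    """Return the offset just past the skippable frame starting at `pos`."""
    if len(data) - pos < 8:
        raise ValueError("truncated skippable frame header")
    # Both fields are unsigned 32-bit little-endian integers.
    magic, frame_size = struct.unpack_from("<II", data, pos)
    # Any of the 16 values 0x184D2A50..0x184D2A5F identifies a skippable frame.
    if (magic & 0xFFFFFFF0) != 0x184D2A50:
        raise ValueError("not a skippable frame")
    end = pos + 8 + frame_size
    if end > len(data):
        raise ValueError("truncated User_Data")
    return end
```

The returned offset is where the next frame, skippable or not, is expected to begin.
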


The format of `Compressed_Block`
--------------------------------
The size of `Compressed_Block` must be provided using the `Block_Size` field from `Data_Block`.
The `Compressed_Block` has a guaranteed maximum regenerated size,
so that the destination buffer can be properly allocated.
See [`Data_Block`](#the-structure-of-data_block) for more details.

A compressed block consists of 2 sections :
- [Literals_Section](#literals_section)
- [Sequences_Section](#sequences_section)

### Prerequisites
To decode a compressed block, the following elements are necessary :
- Previous decoded blocks, up to a distance of `windowSize`,
  or all previous blocks when `Single_Segment_flag` is set.
- List of "recent offsets" from the previous compressed block.
- Decoding tables of the previous compressed block for each symbol type
  (literals, litLength, matchLength, offset).


### `Literals_Section`

During the sequence phase, literals will be interleaved with match copy operations.
All literals are regrouped in the first part of the block.
They can be decoded first, and then copied during sequence operations,
or they can be decoded on the fly, as needed by sequence commands.

| `Literals_Section_Header` | [`Huffman_Tree_Description`] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
| ------------------------- | ---------------------------- | ------- | --------- | --------- | --------- |

Literals can be stored uncompressed or compressed using Huffman prefix codes.
When compressed, an optional tree description can be present,
followed by 1 or 4 streams.


#### `Literals_Section_Header`

The header is in charge of describing how literals are packed.
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
using little-endian convention.

| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
| --------------------- | ------------- | ------------------ | ------------------- |
|   2 bits              |  1 - 2 bits   |    5 - 20 bits     |    0 - 18 bits      |

In this representation, bits on the left are the least significant bits.

__`Literals_Block_Type`__

This field uses the 2 lowest bits of the first byte, describing 4 different block types :

| Value                 |          0           |          1           |              2              |               3               |
| --------------------- | -------------------- | -------------------- | --------------------------- | ----------------------------- |
| `Literals_Block_Type` | `Raw_Literals_Block` | `RLE_Literals_Block` | `Compressed_Literals_Block` | `Repeat_Stats_Literals_Block` |

- `Raw_Literals_Block` - Literals are stored uncompressed.
- `RLE_Literals_Block` - Literals consist of a single byte value repeated N times.
- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
  starting with a Huffman tree description.
  See details below.
- `Repeat_Stats_Literals_Block` - This is a Huffman-compressed block,
  using the Huffman tree _from the previous Huffman-compressed literals block_.
  The Huffman tree description will be skipped.

inikep	9d003c1	2016-08-04 10:41:49 +0200	[diff] [blame]	462	__`Size_Format`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	463
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	464	`Size_Format` is divided into 2 families :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	465
- For `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`,
  the decoder must read both `Compressed_Size` and `Regenerated_Size` (the decompressed size).
  `Size_Format` also gives the number of streams.
- For `Raw_Literals_Block` and `RLE_Literals_Block`, it's enough to decode `Regenerated_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	469
For values spanning several bytes, the convention is little-endian.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	471
inikep	9d003c1	2016-08-04 10:41:49 +0200	[diff] [blame]	472	__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	473
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	474	- Value : x0 : `Regenerated_Size` uses 5 bits (0-31).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	475	Total literal header size is 1 byte.
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	476	`size = h[0]>>3;`
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	477	- Value : 01 : `Regenerated_Size` uses 12 bits (0-4095).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	478	Total literal header size is 2 bytes.
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	479	`size = (h[0]>>4) + (h[1]<<4);`
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	480	- Value : 11 : `Regenerated_Size` uses 20 bits (0-1048575).
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	481	Total literal header size is 3 bytes.
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	482	`size = (h[0]>>4) + (h[1]<<4) + (h[2]<<12);`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	483
Note : it's allowed to represent a short value (for example `13`)
using a long format, at the cost of a larger header.
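
The three cases above can be sketched in C as follows. This is only an illustration: the function name is hypothetical, and it assumes that `Size_Format` occupies bits 2-3 of the first byte, directly above the 2-bit `Literals_Block_Type`, which is what the shifts in the formulas suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not defined by the format): stores Regenerated_Size
 * for a Raw or RLE literals block and returns the header size in bytes.
 * Assumption: Size_Format sits in bits 2-3 of h[0]. */
static size_t rawRleLiteralsHeader(const uint8_t *h, uint32_t *regeneratedSize)
{
    assert(h != NULL && regeneratedSize != NULL);
    unsigned const sizeFormat = (h[0] >> 2) & 3;
    if ((sizeFormat & 1) == 0) {            /* x0 : 5 bits, 1 byte */
        *regeneratedSize = h[0] >> 3;
        return 1;
    }
    if (sizeFormat == 1) {                  /* 01 : 12 bits, 2 bytes */
        *regeneratedSize = (uint32_t)(h[0] >> 4) + ((uint32_t)h[1] << 4);
        return 2;
    }
    /* 11 : 20 bits, 3 bytes */
    *regeneratedSize = (uint32_t)(h[0] >> 4) + ((uint32_t)h[1] << 4)
                     + ((uint32_t)h[2] << 12);
    return 3;
}
```

With this sketch, the value `13` decodes identically from the 1-byte header `0x68` and from the 2-byte header `0xD4 0x00`; only the header size differs.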
				486
inikep	9d003c1	2016-08-04 10:41:49 +0200	[diff] [blame]	487	__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	488
Yann Collet	c2e1a68	2016-07-22 17:30:52 +0200	[diff] [blame]	489	- Value : 00 : _Single stream_.
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	490	`Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	491	Total literal header size is 3 bytes.
Yann Collet	c2e1a68	2016-07-22 17:30:52 +0200	[diff] [blame]	492	- Value : 01 : 4 streams.
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	493	`Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	494	Total literal header size is 3 bytes.
				495	- Value : 10 : 4 streams.
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	496	`Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	497	Total literal header size is 4 bytes.
Yann Collet	d9cc442	2016-07-22 19:15:27 +0200	[diff] [blame]	498	- Value : 11 : 4 streams.
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	499	`Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	500	Total literal header size is 5 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	501
inikep	4f270ac	2016-08-04 11:25:52 +0200	[diff] [blame^]	502	`Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	503
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	504
inikep	4f270ac	2016-08-04 11:25:52 +0200	[diff] [blame^]	505	#### `Huffman_Tree_Description`
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	506
This section is only present when literals block type is `Compressed_Literals_Block` (`2`).
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	508
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	509	Prefix coding represents symbols from an a priori known alphabet
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	510	by bit sequences (codewords), one codeword for each symbol,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	511	in a manner such that different symbols may be represented
				512	by bit sequences of different lengths,
				513	but a parser can always parse an encoded string
				514	unambiguously symbol-by-symbol.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	515
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	516	Given an alphabet with known symbol frequencies,
				517	the Huffman algorithm allows the construction of an optimal prefix code
				518	using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	519
A prefix code must not exceed a maximum code length.
A larger maximum improves accuracy but costs more header size,
and requires more memory or more complex decoding operations.
This specification limits the maximum code length to 11 bits.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	524
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	525
				526	##### Representation
				527
All literal values from zero (included) to the last present one (excluded)
are represented by `weight` values, from 0 to `maxBits`.
The transformation from `weight` to `nbBits` follows this formula :
`nbBits = weight ? maxBits + 1 - weight : 0;` .
The last symbol's weight is deduced from the previously decoded ones,
by completing the sum of `2^(weight-1)` to the nearest power of 2.
The base-2 logarithm of this power of 2 gives `maxBits`, the depth of the current tree.
				535
				536	__Example__ :
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	537	Let's presume the following Huffman tree must be described :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	538
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	539	\| literal \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
				540	\| ------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				541	\| nbBits \| 1 \| 2 \| 3 \| 0 \| 4 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	542
The tree depth is 4, since its longest codes use 4 bits.
Value `5` will not be listed, nor will values above `5`.
Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
The weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
It gives the following series of weights :
				548
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	549	\| weights \| 4 \| 3 \| 2 \| 0 \| 1 \|
				550	\| ------- \| --- \| --- \| --- \| --- \| --- \|
				551	\| literal \| 0 \| 1 \| 2 \| 3 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	552
The decoder will do the inverse operation :
having collected the weights of literals from `0` to `4`,
it knows the last literal, `5`, is present with a non-zero weight.
The weight of `5` can be deduced by completing to the nearest power of 2.
The sum of `2^(weight-1)` (excluding weights of 0) is :
`8 + 4 + 2 + 0 + 1 = 15`
The nearest power of 2 is 16, so the missing share is `16 - 15 = 1 = 2^(1-1)`.
Therefore, `maxBits = 4` and `weight[5] = 1`.
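
The deduction above can be sketched as follows. The function names are hypothetical. The sketch assumes that the completion target is the smallest power of 2 strictly greater than the running sum, since the last symbol must have a non-zero weight, and that the missing share must itself be a power of 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Position of the highest set bit of a non-zero value (0 for 1, 3 for 15). */
static unsigned highBit(uint32_t v)
{
    unsigned n = 0;
    assert(v != 0);
    while (v >>= 1) n++;
    return n;
}

/* Hypothetical helper: deduces the weight of the last (unlisted) symbol
 * and stores the tree depth in *maxBits.
 * Returns 0 if the listed weights cannot be completed by a single symbol. */
static unsigned deduceLastWeight(const uint8_t *weights, size_t nbListed,
                                 unsigned *maxBits)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nbListed; i++)
        if (weights[i] > 0) sum += (uint32_t)1 << (weights[i] - 1);
    if (sum == 0) return 0;
    unsigned const depth = highBit(sum) + 1;   /* 2^depth is strictly above sum */
    uint32_t const rest = ((uint32_t)1 << depth) - sum;
    if (rest & (rest - 1)) return 0;           /* rest is not a power of 2 */
    *maxBits = depth;
    return highBit(rest) + 1;
}
```

For the weights `4, 3, 2, 0, 1` of the example, the sum is 15, the target is 16, and the function yields a last weight of 1 with `maxBits = 4`.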
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	561
				562	##### Huffman Tree header
				563
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	564	This is a single byte value (0-255),
				565	which tells how to decode the list of weights.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	566
- if headerByte >= 128 : this is a direct representation,
  where each weight is written directly as a 4-bit field (0-15).
  The full representation occupies `((nbSymbols+1)/2)` bytes,
  meaning it uses a last full byte even if nbSymbols is odd.
  `nbSymbols = headerByte - 127;`.
  Note that the maximum nbSymbols is 255-127 = 128.
  A longer series must necessarily use FSE compression.

- if headerByte < 128 :
  the series of weights is compressed by FSE.
  The length of the FSE-compressed series is `headerByte` (0-127).
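
As an illustration, the two cases can be told apart with a small helper. The type and function names below are hypothetical; the arithmetic is taken directly from the two rules above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      fseCompressed;  /* 1 if the weights are FSE-compressed */
    unsigned nbSymbols;      /* number of listed weights (direct mode only) */
    size_t   payloadSize;    /* bytes that follow headerByte */
} WeightsLayout;             /* hypothetical name */

static WeightsLayout parseHuffmanHeaderByte(uint8_t headerByte)
{
    WeightsLayout l;
    if (headerByte >= 128) {
        l.fseCompressed = 0;
        l.nbSymbols = headerByte - 127u;
        l.payloadSize = (l.nbSymbols + 1) / 2;  /* two 4-bit weights per byte */
    } else {
        l.fseCompressed = 1;
        l.nbSymbols = 0;         /* unknown until the FSE stream is decoded */
        l.payloadSize = headerByte;
    }
    assert(l.payloadSize <= 127);
    return l;
}
```

For instance, a header byte of `132` announces 5 directly coded weights stored in 3 bytes, while a header byte of `100` announces 100 bytes of FSE-compressed weights.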
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	578
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	579	##### FSE (Finite State Entropy) compression of Huffman weights
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	580
The series of weights is compressed using FSE compression.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	582	It's a single bitstream with 2 interleaved states,
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	583	sharing a single distribution table.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	584
				585	To decode an FSE bitstream, it is necessary to know its compressed size.
				586	Compressed size is provided by `headerByte`.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	587	It's also necessary to know its _maximum possible_ decompressed size,
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	588	which is `255`, since literal values span from `0` to `255`,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	589	and last symbol value is not represented.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	590
An FSE bitstream starts with a header, describing the probability distribution.
This header is used to build a decoding table.
The table must be pre-allocated, which requires supporting a maximum accuracy.
For a list of Huffman weights, the maximum accuracy is 7 bits.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	595
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	596	FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format),
				597	and so is [FSE bitstream](#bitstream).
				598	The main difference is that Huffman header compression uses 2 states,
				599	which share the same FSE distribution table.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	600	Bitstream contains only FSE symbols (no interleaved "raw bitfields").
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	601	The number of symbols to decode is discovered
				602	by tracking bitStream overflow condition.
				603	When both states have overflowed the bitstream, end is reached.
				604
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	605
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	606	##### Conversion from weights to Huffman prefix codes
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	607
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	608	All present symbols shall now have a `weight` value.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	609	It is possible to transform weights into nbBits, using this formula :
`nbBits = weight ? maxBits + 1 - weight : 0;` .
				611
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	612	Symbols are sorted by weight. Within same weight, symbols keep natural order.
				613	Symbols with a weight of zero are removed.
				614	Then, starting from lowest weight, prefix codes are distributed in order.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	615
				616	__Example__ :
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	617	Let's presume the following list of weights has been decoded :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	618
				619	\| Literal \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
				620	\| ------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				621	\| weight \| 4 \| 3 \| 2 \| 0 \| 1 \| 1 \|
				622
				623	Sorted by weight and then natural order,
				624	it gives the following distribution :
				625
				626	\| Literal \| 3 \| 4 \| 5 \| 2 \| 1 \| 0 \|
				627	\| ------------ \| --- \| --- \| --- \| --- \| --- \| ---- \|
				628	\| weight \| 0 \| 1 \| 1 \| 2 \| 3 \| 4 \|
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	629	\| nb bits \| 0 \| 4 \| 4 \| 3 \| 2 \| 1 \|
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	630	\| prefix codes \| N/A \| 0000\| 0001\| 001 \| 01 \| 1 \|
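
One possible way to obtain this assignment is sketched below. The function name is hypothetical, and the sketch assumes a complete tree, so that the code counter is always even when moving to the next weight. Codes are handed out in increasing order within a weight, and the counter is halved when moving to the next weight, whose codes are one bit shorter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fills nbBits[] and codes[] from weights[].
 * Symbols with a weight of 0 get nbBits == 0 and no usable code. */
static void weightsToPrefixCodes(const uint8_t *weights, size_t nbSymbols,
                                 unsigned maxBits,
                                 uint8_t *nbBits, uint16_t *codes)
{
    uint32_t next = 0;   /* next free code at the current code length */
    assert(maxBits >= 1 && maxBits <= 11);
    for (size_t s = 0; s < nbSymbols; s++) { nbBits[s] = 0; codes[s] = 0; }
    for (unsigned w = 1; w <= maxBits; w++) {
        /* Natural symbol order within the same weight. */
        for (size_t s = 0; s < nbSymbols; s++) {
            if (weights[s] != w) continue;
            nbBits[s] = (uint8_t)(maxBits + 1 - w);
            codes[s]  = (uint16_t)next++;
        }
        next >>= 1;      /* codes of the next weight are one bit shorter */
    }
}
```

Applied to the weights `4, 3, 2, 0, 1, 1` with `maxBits = 4`, it produces the codes `0000` and `0001` for literals `4` and `5`, then `001`, `01` and `1` for literals `2`, `1` and `0`.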
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	631
				632
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	633	#### Literals bitstreams
				634
				635	##### Bitstreams sizes
				636
				637	As seen in a previous paragraph,
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	638	there are 2 flavors of Huffman-compressed literals :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	639	single stream, and 4-streams.
				640
4-streams is useful for CPUs with multiple execution units and out-of-order execution.
				642	Since each stream can be decoded independently,
				643	it's possible to decode them up to 4x faster than a single stream,
				644	presuming the CPU has enough parallelism available.
				645
				646	For single stream, header provides both the compressed and regenerated size.
				647	For 4-streams though,
				648	header only provides compressed and regenerated size of all 4 streams combined.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	649	In order to properly decode the 4 streams,
				650	it's necessary to know the compressed and regenerated size of each stream.
				651
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	652	Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
				653	except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	654
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	655	Compressed size is provided explicitly : in the 4-streams variant,
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	656	bitstreams are preceded by 3 unsigned Little-Endian 16-bits values.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	657	Each value represents the compressed size of one stream, in order.
The last stream size is deduced from the total compressed size
and from the previously decoded stream sizes :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	660	`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
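
The size bookkeeping for the 4-streams variant can be sketched as follows. The names are hypothetical. The sketch assumes that `totalCSize` includes the 6 bytes holding the three 16-bit sizes, as the formula above implies, and it rejects inputs for which the last stream would end up with a negative size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t cSize[4];   /* compressed size of each stream */
    size_t rSize[4];   /* regenerated size of each stream */
} FourStreams;         /* hypothetical name */

/* src points at the three little-endian 16-bit sizes.
 * Returns 1 on success, 0 if the sizes are inconsistent. */
static int splitFourStreams(const uint8_t *src, size_t totalCSize,
                            size_t totalRSize, FourStreams *out)
{
    assert(src != NULL && out != NULL);
    if (totalCSize < 6) return 0;
    size_t used = 6;
    for (int i = 0; i < 3; i++) {
        out->cSize[i] = (size_t)src[2 * i] | ((size_t)src[2 * i + 1] << 8);
        used += out->cSize[i];
    }
    if (used > totalCSize) return 0;
    out->cSize[3] = totalCSize - used;

    size_t const perStream = (totalRSize + 3) / 4;
    if (3 * perStream > totalRSize) return 0;
    for (int i = 0; i < 3; i++) out->rSize[i] = perStream;
    out->rSize[3] = totalRSize - 3 * perStream;
    return 1;
}
```

For example, with stored sizes 5, 6 and 7, a total compressed size of 30 and a total regenerated size of 10, the fourth stream has a compressed size of 6, and the regenerated sizes are 3, 3, 3 and 1.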
				661
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	662	##### Bitstreams read and decode
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	663
				664	Each bitstream must be read _backward_,
				665	that is starting from the end down to the beginning.
				666	Therefore it's necessary to know the size of each bitstream.
				667
It's also necessary to know exactly which _bit_ is the last one.
This is detected by a final bit flag :
the highest set bit of the last byte is a final-bit-flag.
Consequently, a last byte of `0` is not possible.
The final-bit-flag itself is not part of the useful bitstream.
Hence, the last byte contains between 0 and 7 useful bits.
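
A minimal sketch of the flag detection is given below. The function name is hypothetical, and it assumes that the flag is the highest bit set to 1 in the last byte, with all bits below it belonging to the useful bitstream.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns the number of useful bits in the last byte
 * (0 to 7), or -1 if the byte is 0, which is not a valid last byte. */
static int usefulBitsInLastByte(uint8_t lastByte)
{
    if (lastByte == 0) return -1;
    int flagPos = 7;
    while (((lastByte >> flagPos) & 1) == 0) flagPos--;
    assert(flagPos >= 0);
    return flagPos;   /* the bits below the flag are payload */
}
```

A last byte of `0x01` therefore carries no useful bits, while `0x80` carries 7.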
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	674
				675	Starting from the end,
				676	it's possible to read the bitstream in a little-endian fashion,
				677	keeping track of already used bits.
				678
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	679	Reading the last `maxBits` bits,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	680	it's then possible to compare extracted value to decoding table,
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	681	determining the symbol to decode and number of bits to discard.
				682
The process continues until the required number of symbols per stream has been read.
If a bitstream is not entirely and exactly consumed,
that is, if decoding does not end exactly at its beginning position with _all_ bits consumed,
the decoding process is considered faulty.
				687
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	688
inikep	4f270ac	2016-08-04 11:25:52 +0200	[diff] [blame^]	689	### `Sequences_Section`
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	690
				691	A compressed block is a succession of _sequences_ .
				692	A sequence is a literal copy command, followed by a match copy command.
				693	A literal copy command specifies a length.
				694	It is the number of bytes to be copied (or extracted) from the literal section.
				695	A match copy command specifies an offset and a length.
				696	The offset gives the position to copy from,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	697	which can be within a previous block.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	698
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	699	There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	700	which are encoded together, interleaved in a single _bitstream_.
				701
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	702	Each symbol is a _code_ in its own context,
				703	which specifies a baseline and a number of bits to add.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	704	_Codes_ are FSE compressed,
				705	and interleaved with raw additional bits in the same bitstream.
				706
The Sequences section starts with a header,
				708	followed by optional Probability tables for each symbol type,
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	709	followed by the bitstream.
				710
Yann Collet	6fa05a2	2016-07-20 14:58:49 +0200	[diff] [blame]	711	\| Header \| [LitLengthTable] \| [OffsetTable] \| [MatchLengthTable] \| bitStream \|
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	712	\| ------ \| ---------------- \| ------------- \| ------------------ \| --------- \|
				713
To decode the Sequences section, it's required to know its size.
This size is deduced from `blockSize - literalSectionSize`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	716
				717
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	718	#### Sequences section header
				719
It consists of 2 items :
				721	- Nb of Sequences
				722	- Flags providing Symbol compression types
				723
				724	__Nb of Sequences__
				725
				726	This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
				727	Let's call its first byte `byte0`.
				728	- `if (byte0 == 0)` : there are no sequences.
				729	The sequence section stops there.
				730	Regenerated content is defined entirely by literals section.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	731	- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
				732	- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
				733	- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
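
The four rules translate into a short reader. This is a sketch with a hypothetical function name; it relies on the rules being tested in the order given, so that `byte0 == 0` falls under the 1-byte case with `nbSeqs = 0`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: stores nbSeqs and returns the field size in bytes. */
static size_t readNbSeqs(const uint8_t *p, uint32_t *nbSeqs)
{
    assert(p != NULL && nbSeqs != NULL);
    uint8_t const byte0 = p[0];
    if (byte0 < 128) {                       /* 1 byte, includes byte0 == 0 */
        *nbSeqs = byte0;
        return 1;
    }
    if (byte0 < 255) {                       /* 2 bytes */
        *nbSeqs = ((uint32_t)(byte0 - 128) << 8) + p[1];
        return 2;
    }
    /* byte0 == 255 : 3 bytes */
    *nbSeqs = (uint32_t)p[1] + ((uint32_t)p[2] << 8) + 0x7F00;
    return 3;
}
```

For example, the bytes `200, 1` give `nbSeqs = (72 << 8) + 1 = 18433`, and the bytes `255, 1, 2` give `1 + 512 + 0x7F00 = 33025`.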
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	734
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	735	__Symbol encoding modes__
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	736
				737	This is a single byte, defining the compression mode of each symbol type.
				738
				739	\| BitNb \| 7-6 \| 5-4 \| 3-2 \| 1-0 \|
				740	\| ------- \| ------ \| ------ \| ------ \| -------- \|
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	741	\|FieldName\| LLType \| OFType \| MLType \| Reserved \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	742
				743	The last field, `Reserved`, must be all-zeroes.
				744
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	745	`LLType`, `OFType` and `MLType` define the compression mode of
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	746	Literal Lengths, Offsets and Match Lengths respectively.
				747
				748	They follow the same enumeration :
				749
Yann Collet	f8e7b53	2016-07-23 16:31:49 +0200	[diff] [blame]	750	\| Value \| 0 \| 1 \| 2 \| 3 \|
				751	\| ---------------- \| ------ \| --- \| ---------- \| ------ \|
				752	\| Compression Mode \| predef \| RLE \| Compressed \| Repeat \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	753
				754	- "predef" : uses a pre-defined distribution table.
				755	- "RLE" : it's a single code, repeated `nbSeqs` times.
				756	- "Repeat" : re-use distribution table from previous compressed block.
Yann Collet	f8e7b53	2016-07-23 16:31:49 +0200	[diff] [blame]	757	- "Compressed" : standard FSE compression.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	758	A distribution table will be present.
				759	It will be described in [next part](#distribution-tables).
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	760
				761	#### Symbols decoding
				762
				763	##### Literal Lengths codes
				764
				765	Literal lengths codes are values ranging from `0` to `35` included.
				766	They define lengths from 0 to 131071 bytes.
				767
				768	\| Code \| 0-15 \|
				769	\| ------ \| ---- \|
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	770	\| length \| Code \|
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	771	\| nbBits \| 0 \|
				772
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	773
				774	\| Code \| 16 \| 17 \| 18 \| 19 \| 20 \| 21 \| 22 \| 23 \|
				775	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				776	\| Baseline \| 16 \| 18 \| 20 \| 22 \| 24 \| 28 \| 32 \| 40 \|
				777	\| nb Bits \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				778
				779	\| Code \| 24 \| 25 \| 26 \| 27 \| 28 \| 29 \| 30 \| 31 \|
				780	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				781	\| Baseline \| 48 \| 64 \| 128 \| 256 \| 512 \| 1024 \| 2048 \| 4096 \|
				782	\| nb Bits \| 4 \| 6 \| 7 \| 8 \| 9 \| 10 \| 11 \| 12 \|
				783
				784	\| Code \| 32 \| 33 \| 34 \| 35 \|
				785	\| -------- \| ---- \| ---- \| ---- \| ---- \|
				786	\| Baseline \| 8192 \|16384 \|32768 \|65536 \|
				787	\| nb Bits \| 13 \| 14 \| 15 \| 16 \|
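
The tables above can be transcribed into two lookup arrays. The array and function names are hypothetical, and the sketch assumes the additional bits have already been read from the bitstream.

```c
#include <assert.h>
#include <stdint.h>

/* Baseline per literal length code, transcribed from the tables above. */
static const uint32_t LL_baseline[36] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
    16, 18, 20, 22, 24, 28, 32, 40,
    48, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536 };

/* Number of additional bits per literal length code. */
static const uint8_t LL_nbBits[36] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 2, 2, 3, 3,
    4, 6, 7, 8, 9, 10, 11, 12,
    13, 14, 15, 16 };

/* Hypothetical helper: literal length = baseline + additional bits. */
static uint32_t literalLength(unsigned code, uint32_t extraBits)
{
    assert(code <= 35);
    assert(extraBits < ((uint32_t)1 << LL_nbBits[code]));
    return LL_baseline[code] + extraBits;
}
```

Code `35` with all 16 additional bits set gives `65536 + 65535 = 131071`, the maximum length stated above.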
				788
				789	__Default distribution__
				790
When "compression mode" is "predef",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	792	a pre-defined distribution is used for FSE compression.
				793
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	794	Below is its definition. It uses an accuracy of 6 bits (64 states).
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	795	```
				796	short literalLengths_defaultDistribution[36] =
				797	{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
				798	2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
				799	-1,-1,-1,-1 };
				800	```
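
As a sanity check, a normalized distribution must add up to `1 << AccuracyLog`, with each `-1` entry counting as one, as stated in the description of FSE distribution tables. The sketch below, with a hypothetical function name, verifies this for the table above; the same check should hold for the other default distributions with their own accuracy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum of a normalized distribution,
 * where a probability of -1 ("less than 1") counts as 1. */
static int distributionTotal(const short *dist, size_t nbSymbols)
{
    int total = 0;
    assert(dist != NULL);
    for (size_t i = 0; i < nbSymbols; i++)
        total += (dist[i] == -1) ? 1 : dist[i];
    return total;
}

static const short literalLengths_defaultDistribution[36] =
        { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
          2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
         -1,-1,-1,-1 };
```

Here `distributionTotal(literalLengths_defaultDistribution, 36)` is expected to equal `1 << 6`.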
				801
				802	##### Match Lengths codes
				803
				804	Match lengths codes are values ranging from `0` to `52` included.
				805	They define lengths from 3 to 131074 bytes.
				806
				807	\| Code \| 0-31 \|
				808	\| ------ \| -------- \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	809	\| value \| Code + 3 \|
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	810	\| nbBits \| 0 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	811
				812	\| Code \| 32 \| 33 \| 34 \| 35 \| 36 \| 37 \| 38 \| 39 \|
				813	\| -------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				814	\| Baseline \| 35 \| 37 \| 39 \| 41 \| 43 \| 47 \| 51 \| 59 \|
				815	\| nb Bits \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				816
| Code     |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline |  67  |  83  |  99  |  131 |  259 |  515 | 1027 | 2051 |
| nb Bits  |  4   |  4   |  5   |  7   |  8   |  9   |  10  |  11  |

| Code     |  48  |  49  |  50  |  51  |  52  |
| -------- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 4099 | 8195 |16387 |32771 |65539 |
| nb Bits  |  12  |  13  |  14  |  15  |  16  |
				826
				827	__Default distribution__
				828
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	829	When "compression mode" is defined as "predef",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	830	a pre-defined distribution is used for FSE compression.
				831
				832	Here is its definition. It uses an accuracy of 6 bits (64 states).
				833	```
				834	short matchLengths_defaultDistribution[53] =
				835	{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				836	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
				837	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
				838	-1,-1,-1,-1,-1 };
				839	```
				840
				841	##### Offset codes
				842
				843	Offset codes are values ranging from `0` to `N`,
				844	with `N` being limited by maximum backreference distance.
				845
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	846	A decoder is free to limit its maximum `N` supported.
				847	Recommendation is to support at least up to `22`.
For information, at the time of this writing,
the reference decoder supports a maximum `N` value of `28` in 64-bit mode.
				850
An offset code is also the number of additional bits to read,
and can be translated into an `OFValue` using the following formula :
				853
				854	```
				855	OFValue = (1 << offsetCode) + readNBits(offsetCode);
				856	if (OFValue > 3) offset = OFValue - 3;
				857	```
				858
				859	OFValue from 1 to 3 are special : they define "repeat codes",
				860	which means one of the previous offsets will be repeated.
				861	They are sorted in recency order, with 1 meaning the most recent one.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	862	See [Repeat offsets](#repeat-offsets) paragraph.
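
The translation, including the repeat-code case, can be sketched as follows. The type and function names are hypothetical, the additional bits are assumed to be already read, and the limit of `28` is the reference decoder's limit mentioned above, not a requirement of the format.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int      isRepeat;  /* 1 if OFValue is a repeat code (1 to 3) */
    uint32_t value;     /* repeat code index, or the actual offset */
} DecodedOffset;        /* hypothetical name */

static DecodedOffset decodeOffset(unsigned offsetCode, uint32_t extraBits)
{
    DecodedOffset d;
    assert(offsetCode <= 28);                        /* assumed decoder limit */
    assert(extraBits < ((uint32_t)1 << offsetCode));
    uint32_t const ofValue = ((uint32_t)1 << offsetCode) + extraBits;
    if (ofValue > 3) {
        d.isRepeat = 0;
        d.value = ofValue - 3;
    } else {
        d.isRepeat = 1;
        d.value = ofValue;   /* 1 is the most recent offset */
    }
    return d;
}
```

Offset codes `0` and `1` can only produce the `OFValue`s 1, 2 and 3, so they always designate repeat codes; code `2` with no bit set is the first one to give a real offset, namely 1.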
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	863
				864	__Default distribution__
				865
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	866	When "compression mode" is defined as "predef",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	867	a pre-defined distribution is used for FSE compression.
				868
				869	Here is its definition. It uses an accuracy of 5 bits (32 states),
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	870	and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	871
				872	If any sequence in the compressed block requires an offset larger than this,
				873	it's not possible to use the default distribution to represent it.
				874
```
short offsetCodes_defaultDistribution[29] =
        { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
```
				880
				881	#### Distribution tables
				882
				883	Following the header, up to 3 distribution tables can be described.
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	884	When present, they are in this order :
- Literal lengths
- Offsets
- Match lengths
				888
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	889	The content to decode depends on their respective encoding mode :
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	890	- Predef : no content. Use pre-defined distribution table.
				891	- RLE : 1 byte. This is the only code to use across the whole compressed block.
				892	- FSE : A distribution table is present.
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	893	- Repeat mode : no content. Re-use distribution from previous compressed block.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	894
				895	##### FSE distribution table : condensed format
				896
				897	An FSE distribution table describes the probabilities of all symbols
				898	from `0` to the last present one (included)
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	899	on a normalized scale of `1 << AccuracyLog` .
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	900
				901	It's a bitstream which is read forward, in little-endian fashion.
				902	It's not necessary to know its exact size,
				903	since it will be discovered and reported by the decoding process.
				904
				905	The bitstream starts by reporting on which scale it operates.
				906	`AccuracyLog = low4bits + 5;`
Note that the maximum `AccuracyLog` for literal and match lengths is `9`,
and for offsets it is `8`. Higher values are considered errors.
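
Reading and validating this field can be sketched as follows. The function name is hypothetical, and the sketch assumes that `low4bits` designates the 4 lowest bits of the first byte of the table, since the bitstream is read forward in little-endian fashion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns AccuracyLog, or -1 if it exceeds the
 * maximum allowed for the symbol type (9 for lengths, 8 for offsets). */
static int readAccuracyLog(uint8_t firstByte, int maxAccuracyLog)
{
    assert(maxAccuracyLog >= 5);
    int const accuracyLog = (firstByte & 0x0F) + 5;
    return (accuracyLog > maxAccuracyLog) ? -1 : accuracyLog;
}
```

A low nibble of `4` gives an `AccuracyLog` of 9, which is accepted for literal and match lengths but rejected for offsets.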
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	909
Then follows each symbol value, from `0` to the last present one.
The number of bits used by each field is variable.
It depends on :

- Remaining probabilities + 1 :
  __example__ :
  Presuming an AccuracyLog of 8,
  and presuming 100 probability points have already been distributed,
  the decoder may read any value from `0` to `256 - 100 + 1 == 157` (included).
  Therefore, it must read `log2sup(157) == 8` bits.

- Value decoded : small values use 1 less bit :
  __example__ :
  Presuming values from 0 to 157 (included) are possible,
  255-157 = 98 values are remaining in an 8-bit field.
  They are used this way :
  the first 98 values (hence from 0 to 97) use only 7 bits,
  values from 98 to 157 use 8 bits.
  This is achieved through this scheme :

| Value read | Value decoded | nb Bits used |
| ---------- | ------------- | ------------ |
|   0 -  97  |   0 -  97     |  7           |
|  98 - 127  |  98 - 127     |  8           |
| 128 - 225  |   0 -  97     |  7           |
| 226 - 255  | 128 - 157     |  8           |
				936
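The scheme in the table above can be expressed as a small rule on the peeked bits. The following is a minimal sketch, not the reference implementation; the name `decode_field` and its parameters are hypothetical, and it assumes `log2sup(x)` is the position of the highest set bit of `x` plus one.

```python
from typing import Tuple


def decode_field(peeked: int, max_value: int) -> Tuple[int, int]:
    """Decode one variable-width field (hypothetical helper).

    peeked    -- the next log2sup(max_value) bits, first bit read in bit 0
    max_value -- largest value the field may hold at this point (>= 1)
    Returns (value decoded, number of bits actually consumed).
    """
    nb_bits = max_value.bit_length()            # assumed meaning of log2sup
    threshold = 1 << (nb_bits - 1)
    peeked &= 2 * threshold - 1                 # keep only nb_bits bits
    # Number of small values that fit in nb_bits - 1 bits.
    nb_small = (2 * threshold - 1) - max_value
    low = peeked & (threshold - 1)
    if low < nb_small:
        return low, nb_bits - 1                 # short field, top bit not consumed
    if peeked >= threshold:
        return peeked - nb_small, nb_bits       # upper range is shifted down
    return peeked, nb_bits
```

For a tiny field where values `0` to `5` are possible, `decode_field(0b101, 5)` returns `(1, 2)` and `decode_field(0b110, 5)` returns `(4, 3)`.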
				937	Symbols probabilities are read one by one, in order.
				938
				939	Probability is obtained from Value decoded using the following formula :
				940	`Proba = value - 1;`
				941
				942	It means value `0` becomes negative probability `-1`.
				943	`-1` is a special probability, which means `less than 1`.
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	944	Its effect on distribution table is described in [next paragraph].
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	945	For the purpose of calculating cumulated distribution, it counts as one.
				946
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	947	[next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables
				948
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	949	When a symbol has a probability of `zero`,
				950	it is followed by a 2-bits repeat flag.
				951	This repeat flag tells how many zero probabilities follow the current one.
				952	It provides a number ranging from 0 to 3.
				953	If it is a 3, another 2-bits repeat flag follows, and so on.
				954
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	955	When last symbol reaches cumulated total of `1 << AccuracyLog`,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	956	decoding is complete.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	957	If the last symbol makes cumulated total go above `1 << AccuracyLog`,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	958	distribution is considered corrupted.
				959
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	960	Then the decoder can tell how many bytes were used in this process,
				961	and how many symbols are present.
				962	The bitstream consumes a round number of bytes.
				963	Any remaining bit within the last byte is just unused.
				964
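Putting the steps above together, reading a whole distribution could look like the sketch below. It is an illustration under stated assumptions, not the reference decoder: `ForwardBitReader` and `read_distribution` are hypothetical names, and the largest readable value is taken to be the remaining probability points plus one.

```python
class ForwardBitReader:
    """Little-endian bitstream read forward (hypothetical helper)."""

    def __init__(self, data: bytes):
        self.value = int.from_bytes(data, "little")
        self.size = 8 * len(data)
        self.pos = 0

    def peek(self, n: int) -> int:
        return (self.value >> self.pos) & ((1 << n) - 1)

    def skip(self, n: int) -> None:
        self.pos += n
        if self.pos > self.size:
            raise ValueError("bitstream too short")

    def read(self, n: int) -> int:
        v = self.peek(n)
        self.skip(n)
        return v


def read_distribution(data: bytes, max_accuracy_log: int):
    """Return (AccuracyLog, list of probabilities, bytes consumed)."""
    bits = ForwardBitReader(data)
    accuracy_log = bits.read(4) + 5
    if accuracy_log > max_accuracy_log:
        raise ValueError("AccuracyLog too large")
    remaining = 1 << accuracy_log
    probs = []
    while remaining > 0:
        max_value = remaining + 1
        nb_bits = max_value.bit_length()
        threshold = 1 << (nb_bits - 1)
        nb_small = (2 * threshold - 1) - max_value
        peeked = bits.peek(nb_bits)
        if (peeked & (threshold - 1)) < nb_small:
            value = peeked & (threshold - 1)
            bits.skip(nb_bits - 1)              # small values use 1 less bit
        else:
            value = peeked - nb_small if peeked >= threshold else peeked
            bits.skip(nb_bits)
        proba = value - 1
        probs.append(proba)
        remaining -= abs(proba)                 # -1 counts as one
        if remaining < 0:
            raise ValueError("corrupted distribution")
        if proba == 0:
            while True:                         # 2-bit repeat flags
                repeat = bits.read(2)
                probs.extend([0] * repeat)
                if repeat != 3:
                    break
    return accuracy_log, probs, (bits.pos + 7) // 8
```

The last element of the result is the rounded-up byte count, which tells the caller where the data following the table description starts.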
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	965	##### FSE decoding : from normalized distribution to decoding tables
				966
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	967	The distribution of normalized probabilities is enough
				968	to create a unique decoding table.
				969
				970	It is built according to the following rules :
				971
				972	The table has a size of `tableSize = 1 << AccuracyLog;`.
				973	Each cell describes the symbol decoded,
				974	and instructions to get the next state.
				975
				976	Symbols are scanned in their natural order for `less than 1` probabilities.
				977	Each symbol with this probability is attributed a single cell,
				978	starting from the end of the table.
				979	These symbols define a full state reset, reading `AccuracyLog` bits.
				980
				981	All remaining symbols are sorted in their natural order.
				982	Starting from symbol `0` and table position `0`,
				983	each symbol gets attributed as many cells as its probability.
				984	Cell allocation is spread, not linear :
				985	each successor position follows this rule :
				986
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	987	```
				988	position += (tableSize>>1) + (tableSize>>3) + 3;
				989	position &= tableSize-1;
				990	```
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	991
				992	A position is skipped if already occupied,
				993	typically by a "less than 1" probability symbol.
				994
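The allocation rules above can be sketched as follows. This is a minimal illustration with hypothetical names (`spread_symbols`, `probs`); it assumes the only cells that can already be occupied are the ones given to "less than 1" symbols at the end of the table, so "skipped if already occupied" is tested as "above the last free high position".

```python
def spread_symbols(probs, accuracy_log):
    """Return a list mapping each state (table position) to its symbol.

    probs -- normalized probabilities, -1 meaning "less than 1"
    """
    table_size = 1 << accuracy_log
    table = [None] * table_size

    # "Less than 1" symbols take single cells, starting from the end.
    high = table_size - 1
    for symbol, proba in enumerate(probs):
        if proba == -1:
            table[high] = symbol
            high -= 1

    # Remaining symbols are spread with a fixed step.
    step = (table_size >> 1) + (table_size >> 3) + 3
    mask = table_size - 1
    position = 0
    for symbol, proba in enumerate(probs):
        for _ in range(max(proba, 0)):
            table[position] = symbol
            position = (position + step) & mask
            while position > high:              # occupied cell: skip it
                position = (position + step) & mask
    return table
```

With `probs = [30, 0, -1, 1]` and an `AccuracyLog` of 5, symbol `2` lands in the last cell (position 31), symbol `0` receives 30 spread cells, and symbol `3` receives the single remaining cell.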
				995	The result is a list of state values.
				996	Each state will decode the current symbol.
				997
				998	To get the Number of bits and baseline required for next state,
				999	it's first necessary to sort all states in their natural order.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	1000	The lower states will need 1 more bit than higher ones.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1001
				1002	__Example__ :
				1003	Presuming a symbol has a probability of 5,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	1004	it receives 5 state values. States are sorted in natural order.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1005
				1006	Next power of 2 is 8.
				1007	Space of probabilities is divided into 8 equal parts.
				1008	Presuming the AccuracyLog is 7, it defines 128 states.
				1009	Divided by 8, each share is 16 large.
				1010
				1011	In order to reach 8, 8-5=3 lowest states will count "double",
				1012	taking shares twice larger,
				1013	requiring one more bit in the process.
				1014
				1015	Numbering starts from higher states using fewer bits.
				1016
				1017	\| state order \| 0 \| 1 \| 2 \| 3 \| 4 \|
				1018	\| ----------- \| ----- \| ----- \| ------ \| ---- \| ----- \|
				1019	\| width \| 32 \| 32 \| 32 \| 16 \| 16 \|
				1020	\| nb Bits \| 5 \| 5 \| 5 \| 4 \| 4 \|
				1021	\| range nb \| 2 \| 4 \| 6 \| 0 \| 1 \|
				1022	\| baseline \| 32 \| 64 \| 96 \| 0 \| 16 \|
				1023	\| range \| 32-63 \| 64-95 \| 96-127 \| 0-15 \| 16-31 \|
				1024
				1025	Next state is determined from current state
				1026	by reading the required number of bits, and adding the specified baseline.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	1027
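One way to obtain the number of bits and baseline of every state is to walk the table in state order while keeping a per-symbol counter. The sketch below is an assumption-labeled illustration (the names `next_state_rules`, `table` and `probs` are hypothetical); it reproduces the example table above for a symbol of probability 5 with an `AccuracyLog` of 7.

```python
def next_state_rules(table, probs, accuracy_log):
    """Return one (symbol, nb_bits, baseline) tuple per state.

    table -- symbol decoded by each state, in state order
    probs -- normalized probabilities, -1 meaning "less than 1"
    """
    table_size = 1 << accuracy_log
    # Counter per symbol: starts at its probability (1 for "less than 1").
    counter = [1 if p == -1 else p for p in probs]
    rules = []
    for symbol in table:
        x = counter[symbol]
        counter[symbol] += 1
        nb_bits = accuracy_log - (x.bit_length() - 1)
        baseline = (x << nb_bits) - table_size
        rules.append((symbol, nb_bits, baseline))
    return rules
```

A "less than 1" symbol gets `nb_bits == AccuracyLog` and `baseline == 0`, which is the full state reset described earlier.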
				1028
				1029	#### Bitstream
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1030
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1031	All sequences are stored in a single bitstream, read _backward_.
				1032	It is therefore necessary to know the bitstream size,
				1033	which is deduced from the compressed block size.
				1034
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	1035	The last useful bit of the stream is followed by an end-bit-flag.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	1036	Highest bit of last byte is this flag.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1037	It does not belong to the useful part of the bitstream.
				1038	Therefore, last byte has 0-7 useful bits.
				1039	Note that it also means that last byte cannot be `0`.
				1040
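As a rough sketch of the rules above, a backward reader can locate the end-bit-flag in the last byte and then hand out bits from the top down. `BackwardBitReader` is a hypothetical name, and "highest bit" is interpreted here as the highest set bit of the last byte, which is consistent with the last byte never being `0`.

```python
class BackwardBitReader:
    """Reads a bitstream backward, starting just below the end-bit-flag."""

    def __init__(self, stream: bytes):
        if not stream or stream[-1] == 0:
            raise ValueError("last byte cannot be 0")
        self.value = int.from_bytes(stream, "little")
        # Position of the end-bit-flag == number of useful bits.
        flag_in_last_byte = stream[-1].bit_length() - 1   # 0 to 7
        self.remaining = 8 * (len(stream) - 1) + flag_in_last_byte

    def read(self, n: int) -> int:
        if n > self.remaining:
            raise ValueError("bitstream exhausted")
        self.remaining -= n
        return (self.value >> self.remaining) & ((1 << n) - 1)
```

For the two bytes `0xB4 0x05`, the last byte `0b00000101` carries the flag in bit 2, so the stream holds 10 useful bits; `read(3)` then returns `3` and `read(7)` returns `52`.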
				1041	##### Starting states
				1042
				1043	The bitstream starts with initial state values,
				1044	each using the required number of bits in their respective _accuracy_,
				1045	decoded previously from their normalized distribution.
				1046
				1047	It starts by `Literal Length State`,
				1048	followed by `Offset State`,
				1049	and finally `Match Length State`.
				1050
				1051	Reminder : always keep in mind that all values are read _backward_.
				1052
				1053	##### Decoding a sequence
				1054
				1055	A state gives a code.
				1056	A code provides a baseline and number of bits to add.
				1057	See [Symbol Decoding] section for details on each symbol.
				1058
				1059	Decoding starts by reading the number of bits required to decode offset.
				1060	It then does the same for match length,
				1061	and then for literal length.
				1062
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	1063	Offset / matchLength / litLength define a sequence.
				1064	It starts by inserting the number of literals defined by `litLength`,
				1065	then continues by copying `matchLength` bytes from `currentPos - offset`.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1066
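Executing one decoded sequence can be sketched as below. The function name and parameters are hypothetical, and the sketch assumes the whole history is available in `output` (no dictionary or window handling). The match is copied byte by byte so that an offset smaller than `matchLength` repeats freshly written bytes.

```python
def execute_sequence(output: bytearray, literals: bytes, lit_pos: int,
                     lit_length: int, match_length: int, offset: int) -> int:
    """Append one sequence to `output`; return the new literal position."""
    # 1. Insert the literals.
    output += literals[lit_pos:lit_pos + lit_length]
    # 2. Copy the match from currentPos - offset.
    if offset < 1 or offset > len(output):
        raise ValueError("offset out of range")
    start = len(output) - offset
    for i in range(match_length):
        output.append(output[start + i])
    return lit_pos + lit_length
```

Starting from `bytearray(b"ab")`, a sequence with 2 literals `cd`, a match length of 6 and an offset of 2 yields `abcdcdcdcd`.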
				1067	The next operation is to update states.
				1068	Using rules pre-calculated in the decoding tables,
				1069	`Literal Length State` is updated,
				1070	followed by `Match Length State`,
				1071	and then `Offset State`.
				1072
				1073	This operation will be repeated `NbSeqs` times.
				1074	At the end, the bitstream shall be entirely consumed,
				1075	otherwise bitstream is considered corrupted.
				1076
				1077	[Symbol Decoding]:#symbols-decoding
				1078
				1079	##### Repeat offsets
				1080
				1081	As seen in [Offset Codes], the first 3 values define a repeated offset.
				1082	They are sorted in recency order, with 1 meaning "most recent one".
				1083
				1084	There is an exception though, when current sequence's literal length is `0`.
Yann Collet	917fe18	2016-07-31 04:01:57 +0200	[diff] [blame]	1085	In that case, repcodes are "pushed by one",
				1086	so 1 becomes 2, 2 becomes 3,
				1087	and 3 becomes "offset_1 - 1_byte".
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1088
Yann Collet	917fe18	2016-07-31 04:01:57 +0200	[diff] [blame]	1089	On first block, offset history is populated by the following values : 1, 4 and 8 (in order).
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1090
				1091	Then each block receives its start value from previous compressed block.
				1092	Note that non-compressed blocks are skipped,
				1093	they do not contribute to offset history.
				1094
				1095	[Offset Codes]: #offset-codes
				1096
				1097	###### Offset update rules
				1098
Yann Collet	917fe18	2016-07-31 04:01:57 +0200	[diff] [blame]	1099	A new offset takes the lead in offset history,
				1100	pushing older offsets back, up to its previous place if it was already present.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1101
Yann Collet	917fe18	2016-07-31 04:01:57 +0200	[diff] [blame]	1102	It means that when repeat offset 1 (most recent) is used, history is unmodified.
				1103	When repeat offset 2 is used, it's swapped with offset 1.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1104
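The repeat offset rules can be illustrated with the following sketch. The function names are hypothetical, and the history is modeled as a list `[offset_1, offset_2, offset_3]`, most recent first. Treating "offset_1 - 1_byte" as a brand new offset for the history update is an assumption of this sketch.

```python
def use_new_offset(history, offset):
    """A new offset takes the lead; the oldest entry is dropped."""
    return offset, [offset, history[0], history[1]]


def use_repeat_code(history, code, lit_length):
    """Resolve repeat code 1, 2 or 3; return (offset, updated history)."""
    index = code - 1
    if lit_length == 0:
        index += 1                      # repcodes are "pushed by one"
    if index == 0:
        return history[0], list(history)            # history unmodified
    if index == 1:
        return history[1], [history[1], history[0], history[2]]  # swap
    if index == 2:
        return history[2], [history[2], history[0], history[1]]
    # code 3 with no literals: offset_1 - 1 byte
    return use_new_offset(history, history[0] - 1)
```

With the first-block history `[1, 4, 8]`, repeat code 2 resolves to offset `4` and leaves `[4, 1, 8]`; the same code after a literal length of `0` resolves to `8` and leaves `[8, 1, 4]`.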
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1105
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1106	Dictionary format
				1107	-----------------
				1108
				1109	`zstd` is compatible with "pure content" dictionaries, free of any format restriction.
				1110	But dictionaries created by `zstd --train` follow a format, described here.
				1111
				1112	__Pre-requisites__ : a dictionary has a known length,
				1113	defined either by a buffer limit, or a file size.
				1114
				1115	\| Header \| DictID \| Stats \| Content \|
				1116	\| ------ \| ------ \| ----- \| ------- \|
				1117
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	1118	__Header__ : 4 bytes ID, value 0xEC30A437, Little-Endian format
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1119
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	1120	__Dict_ID__ : 4 bytes, stored in Little-Endian format.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1121	DictID can be any value, except 0 (which means no DictID).
Yann Collet	722e14b	2016-07-08 19:22:16 +0200	[diff] [blame]	1122	It's used by decoders to check if they use the correct dictionary.
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	1123	_Reserved ranges :_
				1124	If the frame is going to be distributed in a private environment,
				1125	any dictionary ID can be used.
				1126	However, for public distribution of compressed frames,
				1127	some ranges are reserved for future use :
Yann Collet	6cacd34	2016-07-15 17:58:13 +0200	[diff] [blame]	1128
				1129	- low range : 1 - 32767 : reserved
				1130	- high range : >= (2^31) : reserved
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1131
				1132	__Stats__ : Entropy tables, following the same format as [compressed blocks].
				1133	They are stored in the following order :
				1134	Huffman tables for literals, FSE table for offset,
Yann Collet	722e14b	2016-07-08 19:22:16 +0200	[diff] [blame]	1135	FSE table for matchLength, and FSE table for litLength.
				1136	It's finally followed by 3 offset values, populating recent offsets,
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	1137	stored in order, 4-bytes little-endian each, for a total of 12 bytes.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1138
				1139	__Content__ : Where the actual dictionary content is.
Yann Collet	722e14b	2016-07-08 19:22:16 +0200	[diff] [blame]	1140	Content size depends on Dictionary size.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1141
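Reading the fixed-size part of a dictionary can be sketched as follows. The function names are hypothetical; returning `None` when the magic number is absent reflects the assumption that such a buffer is a "pure content" dictionary.

```python
import struct

DICT_MAGIC = 0xEC30A437


def parse_dictionary_header(data: bytes):
    """Return the DictID, or None for a "pure content" dictionary."""
    if len(data) < 8:
        return None
    # Header and DictID: two 4-byte little-endian fields.
    magic, dict_id = struct.unpack_from("<II", data, 0)
    if magic != DICT_MAGIC:
        return None
    return dict_id


def is_reserved_dict_id(dict_id: int) -> bool:
    """True if the ID falls in a range reserved for public distribution."""
    return 1 <= dict_id <= 32767 or dict_id >= 2 ** 31
```

The Stats section starts right after these 8 bytes; its size is only known after decoding the entropy tables, so the Content position cannot be computed from the header alone.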
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	1142	[compressed blocks]: #the-format-of-compressed_block
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1143
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1144
				1145	Version changes
				1146	---------------
Yann Collet	6fa05a2	2016-07-20 14:58:49 +0200	[diff] [blame]	1147	- 0.2.0 : numerous format adjustments for zstd v0.8
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	1148	- 0.1.2 : limit Huffman tree depth to 11 bits
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	1149	- 0.1.1 : reserved dictID ranges
				1150	- 0.1.0 : initial release