Blame - doc/zstd_compression_format.md - external_zstd

Yann Collet	5cc1882	2016-07-03 19:03:13 +0200	[diff] [blame]	1	Zstandard Compression Format
				2	============================
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	3
				4	### Notices
				5
W. Felix Handte	5d693cc	2022-12-20 12:49:47 -0500	[diff] [blame]	6	Copyright (c) Meta Platforms, Inc. and affiliates.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	7
				8	Permission is granted to copy and distribute this document
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	9	for any purpose and without charge,
				10	including translations into other languages
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	11	and incorporation into compilations,
				12	provided that the copyright notice and this notice are preserved,
				13	and that any substantive changes or deletions from the original
				14	are clearly marked.
				15	Distribution of this document is unlimited.
				16
				17	### Version
				18
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	19	0.3.9 (2023-03-08)
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	20
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	21
				22	Introduction
				23	------------
				24
				25	The purpose of this document is to define a lossless compressed data format,
				26	that is independent of CPU type, operating system,
				27	file system and character set, suitable for
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	28	file compression, pipe and streaming compression,
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	29	using the [Zstandard algorithm](https://facebook.github.io/zstd/).
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	30	The text of the specification assumes a basic background in programming
				31	at the level of bits and other primitive data representations.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	32
				33	The data can be produced or consumed,
				34	even for an arbitrarily long sequentially presented input data stream,
				35	using only an a priori bounded amount of intermediate storage,
				36	and hence can be used in data communications.
				37	The format uses the Zstandard compression method,
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	38	and optional [xxHash-64 checksum method](https://cyan4973.github.io/xxHash/),
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	39	for detection of data corruption.
				40
				41	The data format defined by this specification
				42	does not attempt to allow random access to compressed data.
				43
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	44	Unless otherwise indicated below,
				45	a compliant compressor must produce data sets
				46	that conform to the specifications presented here.
				47	It doesn’t need to support all options though.
				48
				49	A compliant decompressor must be able to decompress
				50	at least one working set of parameters
				51	that conforms to the specifications presented here.
				52	It may also ignore informative fields, such as checksum.
				53	Whenever it does not support a parameter defined in the compressed stream,
				54	it must produce a non-ambiguous error code and associated error message
				55	explaining which parameter is unsupported.
				56
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	57	This specification is intended for use by implementers of software
				58	to compress data into Zstandard format and/or decompress data from Zstandard format.
				59	The Zstandard format is supported by an open source reference implementation,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	60	written in portable C, and available at : https://github.com/facebook/zstd .
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	61
				62
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	63	### Overall conventions
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	64	In this document:
				65	- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	66	- the naming convention for identifiers is `Mixed_Case_With_Underscores`
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	67
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	68	### Definitions
				69	Content compressed by Zstandard is transformed into a Zstandard __frame__.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	70	Multiple frames can be appended into a single file or stream.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	71	A frame is completely independent, has a defined beginning and end,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	72	and a set of parameters which tells the decoder how to decompress it.
				73
				74	A frame encapsulates one or multiple __blocks__.
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	75	Each block contains arbitrary content, which is described by its header,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	76	and has a guaranteed maximum content size, which depends on frame parameters.
				77	Unlike frames, each block depends on previous blocks for proper decoding.
				78	However, each block can be decompressed without waiting for its successor,
				79	allowing streaming operations.
				80
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	81	Overview
				82	---------
				83	- [Frames](#frames)
				84	- [Zstandard frames](#zstandard-frames)
				85	- [Blocks](#blocks)
				86	- [Literals Section](#literals-section)
				87	- [Sequences Section](#sequences-section)
				88	- [Sequence Execution](#sequence-execution)
				89	- [Skippable frames](#skippable-frames)
				90	- [Entropy Encoding](#entropy-encoding)
				91	- [FSE](#fse)
				92	- [Huffman Coding](#huffman-coding)
				93	- [Dictionary Format](#dictionary-format)
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	94
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	95	Frames
				96	------
Yann Collet	fccb46f	2017-11-18 11:28:00 -0800	[diff] [blame]	97	Zstandard compressed data is made of one or more __frames__.
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	98	Each frame is independent and can be decompressed independently of other frames.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	99	The decompressed content of multiple concatenated frames is the concatenation of
Yann Collet	fccb46f	2017-11-18 11:28:00 -0800	[diff] [blame]	100	each frame decompressed content.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	101
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	102	There are two frame formats defined by Zstandard:
				103	Zstandard frames and Skippable frames.
				104	Zstandard frames contain compressed data, while
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	105	skippable frames contain custom user metadata.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	106
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	107	## Zstandard frames
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	108	The structure of a single Zstandard frame is as follows:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	109
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	110	\| `Magic_Number` \| `Frame_Header` \|`Data_Block`\| [More data blocks] \| [`Content_Checksum`] \|
				111	\|:--------------:\|:--------------:\|:----------:\| ------------------ \|:--------------------:\|
Yann Collet	8b12812	2017-08-19 12:17:57 -0700	[diff] [blame]	112	\| 4 bytes \| 2-14 bytes \| n bytes \| \| 0-4 bytes \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	113
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	114	__`Magic_Number`__
				115
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	116	4 Bytes, __little-endian__ format.
Yann Collet	7bdfcea	2016-09-05 17:43:31 +0200	[diff] [blame]	117	Value : 0xFD2FB528
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	118	Note: This value was selected because it is unlikely to appear at the beginning of an arbitrary file.
				119	It avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.),
				120	contains byte values outside of ASCII range,
				121	and doesn't map into UTF8 space.
				122	This reduces the chance that a text file starts with this value by accident.
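
As an illustration of the little-endian storage described above, the following minimal sketch checks whether a buffer starts with `Magic_Number`. The names `ZSTD_MAGIC_NUMBER`, `has_zstd_magic` and `magic_bytes` are hypothetical and chosen for this example only; they are not part of the specification or of the reference implementation.

```python
import struct

# Value given by the specification; the constant name is an assumption.
ZSTD_MAGIC_NUMBER = 0xFD2FB528


def has_zstd_magic(data: bytes) -> bool:
    """Return True if `data` starts with the little-endian Magic_Number."""
    if len(data) < 4:
        return False
    # "<I" reads one unsigned 32-bit integer in little-endian order.
    (value,) = struct.unpack_from("<I", data, 0)
    return value == ZSTD_MAGIC_NUMBER


# The four bytes as they appear on disk, lowest-order byte first.
magic_bytes = struct.pack("<I", ZSTD_MAGIC_NUMBER)
```

Because the value is stored in little-endian order, the first byte of a frame is `0x28`, not `0xFD`.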
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	123
				124	__`Frame_Header`__
				125
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	126	2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header).
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	127
				128	__`Data_Block`__
				129
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	130	Detailed in [`Blocks`](#blocks).
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	131	That’s where compressed data is stored.
				132
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	133	__`Content_Checksum`__
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	134
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	135	An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	136	The content checksum is the result
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	137	of [xxh64() hash function](https://cyan4973.github.io/xxHash/)
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	138	digesting the original (decoded) data as input, and a seed of zero.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	139	The low 4 bytes of the checksum are stored in __little-endian__ format.
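
The truncation and byte order described above can be sketched as follows. This example assumes the 64-bit XXH64 digest of the decoded data (seed zero) has already been computed by some XXH64 implementation; the function name `content_checksum_field` is hypothetical.

```python
import struct


def content_checksum_field(xxh64_digest: int) -> bytes:
    """Return the 4 bytes stored as Content_Checksum for a 64-bit digest.

    `xxh64_digest` is assumed to be the XXH64 hash (seed 0) of the
    original (decoded) data, provided by an external implementation.
    """
    if not 0 <= xxh64_digest < (1 << 64):
        raise ValueError("digest must fit in 64 bits")
    low32 = xxh64_digest & 0xFFFFFFFF  # keep only the low 4 bytes
    return struct.pack("<I", low32)    # store them in little-endian order
```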
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	140
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	141	### `Frame_Header`
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	142
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	143	The `Frame_Header` has a variable size, with a minimum of 2 bytes,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	144	and up to 14 bytes depending on optional parameters.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	145	The structure of `Frame_Header` is as follows:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	146
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	147	\| `Frame_Header_Descriptor` \| [`Window_Descriptor`] \| [`Dictionary_ID`] \| [`Frame_Content_Size`] \|
				148	\| ------------------------- \| --------------------- \| ----------------- \| ---------------------- \|
				149	\| 1 byte \| 0-1 byte \| 0-4 bytes \| 0-8 bytes \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	150
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	151	#### `Frame_Header_Descriptor`
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	152
				153	The first byte of the header is called the `Frame_Header_Descriptor`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	154	It describes which other fields are present.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	155	Decoding this byte is enough to tell the size of `Frame_Header`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	156
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	157	\| Bit number \| Field name \|
				158	\| ---------- \| ---------- \|
				159	\| 7-6 \| `Frame_Content_Size_flag` \|
				160	\| 5 \| `Single_Segment_flag` \|
				161	\| 4 \| `Unused_bit` \|
				162	\| 3 \| `Reserved_bit` \|
				163	\| 2 \| `Content_Checksum_flag` \|
				164	\| 1-0 \| `Dictionary_ID_flag` \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	165
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	166	In this table, bit 7 is the highest bit, while bit 0 is the lowest one.
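
The bit layout in the table above can be unpacked with shifts and masks. The sketch below is only illustrative: the `FrameHeaderDescriptor` type and the `parse_frame_header_descriptor` function are hypothetical names, not taken from the reference implementation.

```python
from typing import NamedTuple


class FrameHeaderDescriptor(NamedTuple):
    frame_content_size_flag: int   # bits 7-6
    single_segment_flag: bool      # bit 5
    unused_bit: bool               # bit 4
    reserved_bit: bool             # bit 3
    content_checksum_flag: bool    # bit 2
    dictionary_id_flag: int        # bits 1-0


def parse_frame_header_descriptor(byte: int) -> FrameHeaderDescriptor:
    """Split the Frame_Header_Descriptor byte into its fields."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("descriptor must be a single byte")
    return FrameHeaderDescriptor(
        frame_content_size_flag=byte >> 6,
        single_segment_flag=bool((byte >> 5) & 1),
        unused_bit=bool((byte >> 4) & 1),
        reserved_bit=bool((byte >> 3) & 1),
        content_checksum_flag=bool((byte >> 2) & 1),
        dictionary_id_flag=byte & 3,
    )
```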
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	167
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	168	__`Frame_Content_Size_flag`__
				169
				170	This is a 2-bit flag (`= Frame_Header_Descriptor >> 6`),
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	171	specifying if `Frame_Content_Size` (the decompressed data size)
				172	is provided within the header.
				173	`Flag_Value` provides `FCS_Field_Size`,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	174	which is the number of bytes used by `Frame_Content_Size`
				175	according to the following table:
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	176
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	177	\| `Flag_Value` \| 0 \| 1 \| 2 \| 3 \|
				178	\| -------------- \| ------ \| --- \| --- \| --- \|
				179	\|`FCS_Field_Size`\| 0 or 1 \| 2 \| 4 \| 8 \|
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	180
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	181	When `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` :
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	182	if `Single_Segment_flag` is set, `FCS_Field_Size` is 1.
				183	Otherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided.
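
The table and the special case for `Flag_Value` 0 can be expressed as a small function. This is a sketch under the rules stated above; the function name `fcs_field_size` is hypothetical.

```python
def fcs_field_size(flag_value: int, single_segment: bool) -> int:
    """Return FCS_Field_Size in bytes for a given Frame_Content_Size_flag."""
    if flag_value not in (0, 1, 2, 3):
        raise ValueError("Frame_Content_Size_flag is a 2-bit value")
    if flag_value == 0:
        # Only case that depends on Single_Segment_flag.
        return 1 if single_segment else 0
    # Flag values 1, 2, 3 map to 2, 4, 8 bytes.
    return 1 << flag_value
```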
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	184
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	185	__`Single_Segment_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	186
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	187	If this flag is set,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	188	data must be regenerated within a single continuous memory segment.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	189
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	190	In this case, `Window_Descriptor` byte is skipped,
				191	but `Frame_Content_Size` is necessarily present.
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	192	As a consequence, the decoder must allocate a memory segment
Yann Collet	fccb46f	2017-11-18 11:28:00 -0800	[diff] [blame]	193	of size equal to or larger than `Frame_Content_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	194
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	195	In order to protect the decoder from unreasonable memory requirements,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	196	a decoder is allowed to reject a compressed frame
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	197	which requests a memory size beyond decoder's authorized range.
				198
				199	For broader compatibility, decoders are recommended to support
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	200	memory sizes of at least 8 MB.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	201	This is only a recommendation:
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	202	each decoder is free to support higher or lower limits,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	203	depending on local limitations.
				204
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	205	__`Unused_bit`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	206
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	207	A decoder compliant with this specification version shall not interpret this bit.
				208	It might be used in a future version
				209	to signal a property that does not affect how the frame is decoded.
				210	An encoder compliant with this specification version must set this bit to zero.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	211
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	212	__`Reserved_bit`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	213
				214	This bit is reserved for some future feature.
				215	Its value _must be zero_.
				216	A decoder compliant with this specification version must ensure it is not set.
				217	This bit may be used in a future revision,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	218	to signal a feature that must be interpreted to decode the frame correctly.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	219
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	220	__`Content_Checksum_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	221
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	222	If this flag is set, a 32-bit `Content_Checksum` will be present at the end of the frame.
				223	See `Content_Checksum` paragraph.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	224
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	225	__`Dictionary_ID_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	226
				227	This is a 2-bit flag (`= Frame_Header_Descriptor & 3`),
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	228	telling if a dictionary ID is provided within the header.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	229	It also specifies the size of this field as `DID_Field_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	230
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	231	\|`Flag_Value` \| 0 \| 1 \| 2 \| 3 \|
				232	\| -------------- \| --- \| --- \| --- \| --- \|
				233	\|`DID_Field_Size`\| 0 \| 1 \| 2 \| 4 \|
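
Since the descriptor byte alone determines the size of `Frame_Header`, the total can be computed from the flags described above. The following is a minimal sketch, with hypothetical names (`DID_FIELD_SIZES`, `frame_header_size`), that combines the `FCS_Field_Size` and `DID_Field_Size` tables with the rule that `Window_Descriptor` is absent when `Single_Segment_flag` is set.

```python
# DID_Field_Size indexed by Dictionary_ID_flag, per the table above.
DID_FIELD_SIZES = (0, 1, 2, 4)


def frame_header_size(descriptor: int) -> int:
    """Return the size in bytes of Frame_Header, descriptor byte included."""
    if not 0 <= descriptor <= 0xFF:
        raise ValueError("descriptor must be a single byte")
    fcs_flag = descriptor >> 6
    single_segment = bool((descriptor >> 5) & 1)
    did_flag = descriptor & 3

    if fcs_flag == 0:
        fcs_size = 1 if single_segment else 0
    else:
        fcs_size = 1 << fcs_flag  # 2, 4 or 8 bytes

    # Window_Descriptor is skipped when Single_Segment_flag is set.
    window_descriptor_size = 0 if single_segment else 1
    return 1 + window_descriptor_size + DID_FIELD_SIZES[did_flag] + fcs_size
```

Over all possible descriptor bytes, the result ranges from 2 to 14 bytes, which matches the bounds given for `Frame_Header`.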
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	234
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	235	#### `Window_Descriptor`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	236
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	237	Provides a guarantee on the minimum memory buffer size required to decompress a frame.
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	238	This information is important for decoders to allocate enough memory.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	239
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	240	The `Window_Descriptor` byte is optional.
				241	When `Single_Segment_flag` is set, `Window_Descriptor` is not present.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	242	In this case, `Window_Size` is `Frame_Content_Size`,
				243	which can be any value from 0 to 2^64-1 bytes (16 ExaBytes).
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	244
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	245	\| Bit numbers \| 7-3 \| 2-0 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	246	\| ----------- \| ---------- \| ---------- \|
				247	\| Field name \| `Exponent` \| `Mantissa` \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	248
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	249	The minimum memory buffer size is called `Window_Size`.
				250	It is described by the following formulas :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	251	```
				252	windowLog = 10 + Exponent;
				253	windowBase = 1 << windowLog;
				254	windowAdd = (windowBase / 8) * Mantissa;
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	255	Window_Size = windowBase + windowAdd;
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	256	```
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	257	The minimum `Window_Size` is 1 KB.
				258	The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
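
The formulas above translate directly into code. The sketch below takes the raw `Window_Descriptor` byte, splits it into `Exponent` (bits 7-3) and `Mantissa` (bits 2-0), and applies the same arithmetic; the function name `window_size` is hypothetical.

```python
def window_size(window_descriptor: int) -> int:
    """Compute Window_Size in bytes from the Window_Descriptor byte."""
    if not 0 <= window_descriptor <= 0xFF:
        raise ValueError("Window_Descriptor must be a single byte")
    exponent = window_descriptor >> 3   # bits 7-3
    mantissa = window_descriptor & 7    # bits 2-0
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add
```

A descriptor of `0x00` gives the 1 KB minimum, and `0xFF` gives the maximum of `(1<<41) + 7*(1<<38)` bytes.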
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	259
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	260	In general, larger `Window_Size` values tend to improve the compression ratio,
				261	but at the cost of memory usage.
				262
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	263	To properly decode compressed data,
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	264	a decoder will need to allocate a buffer of at least `Window_Size` bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	265
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	266	In order to protect the decoder from unreasonable memory requirements,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	267	a decoder is allowed to reject a compressed frame
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	268	which requests a memory size beyond decoder's authorized range.
				269
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	270	For improved interoperability,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	271	it's recommended for decoders to support `Window_Size` of up to 8 MB,
				272	and it's recommended for encoders to not generate frames requiring a `Window_Size` larger than 8 MB.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	273	This is merely a recommendation, though:
				274	decoders are free to support higher or lower limits,
				275	depending on local limitations.
				276
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	277	#### `Dictionary_ID`
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	278
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	279	This is a variable size field, which contains
				280	the ID of the dictionary required to properly decode the frame.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	281	`Dictionary_ID` field is optional. When it's not present,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	282	it's up to the decoder to know which dictionary to use.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	283
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	284	`Dictionary_ID` field size is provided by `DID_Field_Size`.
				285	`DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	286	1 byte can represent an ID 0-255.
				287	2 bytes can represent an ID 0-65535.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	288	4 bytes can represent an ID 0-4294967295.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	289	Format is __little-endian__.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	290
				291	It's allowed to represent a small ID (for example `13`)
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	292	with a large 4-byte dictionary ID, even if it is less efficient.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	293
Yann Collet	bb3c9bf	2020-05-25 08:15:09 -0700	[diff] [blame]	294	A value of `0` has the same meaning as no `Dictionary_ID`,
				295	in which case the frame may or may not need a dictionary to be decoded,
				296	and the ID of such a dictionary is not specified.
				297	The decoder must know this information by other means.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	298
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	299	#### `Frame_Content_Size`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	300
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	301	This is the original (uncompressed) size. This information is optional.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	302	`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`.
				303	`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`.
				304	`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	305
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	306	\| `FCS_Field_Size` \| Range \|
				307	\| ---------------- \| ---------- \|
				308	\| 0 \| unknown \|
				309	\| 1 \| 0 - 255 \|
				310	\| 2 \| 256 - 65791\|
				311	\| 4 \| 0 - 2^32-1 \|
				312	\| 8 \| 0 - 2^64-1 \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	313
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	314	`Frame_Content_Size` format is __little-endian__.
				315	When `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly.
				316	When `FCS_Field_Size` is 2, _the offset of 256 is added_.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	317	It's allowed to represent a small size (for example `18`) using any compatible variant.
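
The decoding rules above, including the 256 offset of the 2-byte variant, can be sketched as follows. The function name `read_frame_content_size` is hypothetical, and `field` is assumed to hold exactly the `FCS_Field_Size` bytes taken from the header.

```python
from typing import Optional


def read_frame_content_size(field: bytes) -> Optional[int]:
    """Decode Frame_Content_Size from its raw little-endian bytes.

    Returns None when the field is absent (FCS_Field_Size == 0).
    """
    size = len(field)
    if size == 0:
        return None  # content size is unknown
    if size not in (1, 2, 4, 8):
        raise ValueError("FCS_Field_Size must be 0, 1, 2, 4 or 8")
    value = int.from_bytes(field, "little")
    if size == 2:
        value += 256  # the 2-byte variant covers 256 - 65791
    return value
```

For instance, a size of `18` can be stored in the 1-, 4- or 8-byte variants, but not in the 2-byte variant, whose smallest representable value is 256.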
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	318
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	319
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	320	Blocks
				321	-------
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	322
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	323	After `Magic_Number` and `Frame_Header`, there are some number of blocks.
				324	Each frame must have at least one block,
				325	but there is no upper limit on the number of blocks per frame.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	326
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	327	The structure of a block is as follows:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	328
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	329	\| `Block_Header` \| `Block_Content` \|
				330	\|:--------------:\|:---------------:\|
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	331	\| 3 bytes \| n bytes \|
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	332
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	333	__`Block_Header`__
				334
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	335	`Block_Header` uses 3 bytes, written using __little-endian__ convention.
				336	It contains 3 fields :
				337
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	338	\| `Last_Block` \| `Block_Type` \| `Block_Size` \|
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	339	\|:------------:\|:------------:\|:------------:\|
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	340	\| bit 0 \| bits 1-2 \| bits 3-23 \|
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	341
				342	__`Last_Block`__
				343
				344	The lowest bit signals if this block is the last one.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	345	The frame will end after this last block.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	346	It may be followed by an optional `Content_Checksum`
				347	(see [Zstandard Frames](#zstandard-frames)).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	348
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	349	__`Block_Type`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	350
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	351	The next 2 bits represent the `Block_Type`.
Yann Collet	1e07eb4	2019-08-16 15:13:42 +0200	[diff] [blame]	352	`Block_Type` influences the meaning of `Block_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	353	There are 4 block types :
				354
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	355	\| Value \| 0 \| 1 \| 2 \| 3 \|
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	356	\| ------------ \| ----------- \| ----------- \| ------------------ \| --------- \|
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	357	\| `Block_Type` \| `Raw_Block` \| `RLE_Block` \| `Compressed_Block` \| `Reserved`\|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	358
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	359	- `Raw_Block` - this is an uncompressed block.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	360	`Block_Content` contains `Block_Size` bytes.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	361
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	362	- `RLE_Block` - this is a single byte, repeated `Block_Size` times.
				363	`Block_Content` consists of a single byte.
				364	On the decompression side, this byte must be repeated `Block_Size` times.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	365
				366	- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks),
				367	explained later on.
				368	`Block_Size` is the length of `Block_Content`, the compressed data.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	369	The decompressed size is not known,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	370	but its maximum possible value is guaranteed (see below).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	371
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	372	- `Reserved` - this is not a block.
				373	This value cannot be used with current version of this specification.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	374	If such a value is present, it is considered corrupted data.
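
The two simple block types above can be regenerated without any entropy decoding. The sketch below handles only `Raw_Block` and `RLE_Block`, rejects the `Reserved` value, and leaves `Compressed_Block` unimplemented; the constants and the function name `regenerate_simple_block` are hypothetical.

```python
# Block_Type values, per the table above.
RAW_BLOCK, RLE_BLOCK, COMPRESSED_BLOCK, RESERVED = range(4)


def regenerate_simple_block(block_type: int, block_size: int,
                            block_content: bytes) -> bytes:
    """Regenerate the decoded bytes of a Raw_Block or RLE_Block."""
    if block_type == RAW_BLOCK:
        # Block_Content holds exactly Block_Size uncompressed bytes.
        if len(block_content) != block_size:
            raise ValueError("Raw_Block content must be Block_Size bytes")
        return bytes(block_content)
    if block_type == RLE_BLOCK:
        # Block_Content is a single byte, repeated Block_Size times.
        if len(block_content) != 1:
            raise ValueError("RLE_Block content must be a single byte")
        return bytes(block_content) * block_size
    if block_type == RESERVED:
        raise ValueError("reserved Block_Type: corrupted data")
    raise NotImplementedError("Compressed_Block decoding is not covered here")
```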
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	375
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	376	__`Block_Size`__
				377
				378	The upper 21 bits of `Block_Header` represent the `Block_Size`.
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	379
Yann Collet	1e07eb4	2019-08-16 15:13:42 +0200	[diff] [blame]	380	When `Block_Type` is `Compressed_Block` or `Raw_Block`,
Yann Collet	bb3c9bf	2020-05-25 08:15:09 -0700	[diff] [blame]	381	`Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`).
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	382
				383	When `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1,
				384	`Block_Size` represents the number of times this byte must be repeated.
				385
				386	`Block_Size` is limited by `Block_Maximum_Size` (see below).
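
The three `Block_Header` fields can be extracted by reading the 3 bytes as one little-endian integer and then shifting and masking. This is a minimal sketch; the function name `parse_block_header` is hypothetical.

```python
from typing import Tuple


def parse_block_header(header: bytes) -> Tuple[bool, int, int]:
    """Return (Last_Block, Block_Type, Block_Size) from a 3-byte header."""
    if len(header) != 3:
        raise ValueError("Block_Header is exactly 3 bytes")
    value = int.from_bytes(header, "little")
    last_block = bool(value & 1)    # bit 0
    block_type = (value >> 1) & 3   # bits 1-2
    block_size = value >> 3         # bits 3-23
    return last_block, block_type, block_size
```

For example, the bytes `45 1F 00` describe a last block of type `Compressed_Block` with a `Block_Size` of 1000.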
				387
				388	__`Block_Content`__ and __`Block_Maximum_Size`__
				389
				390	The size of `Block_Content` is limited by `Block_Maximum_Size`,
				391	which is the smallest of:
				392	- `Window_Size`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	393	- 128 KB
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	394
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	395	`Block_Maximum_Size` is constant for a given frame.
				396	This maximum is applicable to both the decompressed size
				397	and the compressed size of any block in the frame.
				398
				399	The reasoning for this limit is that a decoder can read this information
				400	at the beginning of a frame and use it to allocate buffers.
				401	The guarantees on the size of blocks ensure that
				402	the buffers will be large enough for any following block of the valid frame.
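
Since `Block_Maximum_Size` depends only on `Window_Size`, a decoder can compute it once per frame and size its buffers accordingly. The sketch below assumes 1 KB means 1024 bytes, consistent with the minimum `Window_Size` of `1 << 10`; the names are hypothetical.

```python
BLOCK_SIZE_LIMIT = 128 * 1024  # 128 KB, assuming 1 KB = 1024 bytes


def block_maximum_size(window_size: int) -> int:
    """Return Block_Maximum_Size: the smaller of Window_Size and 128 KB."""
    if window_size < 0:
        raise ValueError("Window_Size cannot be negative")
    return min(window_size, BLOCK_SIZE_LIMIT)
```

With the minimum 1 KB window, blocks are limited to 1 KB; with any window of 128 KB or more, the limit is 128 KB.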
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	403
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	404
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	405	Compressed Blocks
				406	-----------------
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	407	To decompress a compressed block, the compressed size must be provided
				408	from `Block_Size` field within `Block_Header`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	409
				410	A compressed block consists of 2 sections :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	411	- [Literals Section](#literals-section)
				412	- [Sequences Section](#sequences-section)
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	413
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	414	The results of the two sections are then combined to produce the decompressed
				415	data in [Sequence Execution](#sequence-execution).
				416
				417	#### Prerequisites
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	418	To decode a compressed block, the following elements are necessary :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	419	- Previous decoded data, up to a distance of `Window_Size`,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	420	or beginning of the Frame, whichever is smaller.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	421	- List of "recent offsets" from previous `Compressed_Block`.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	422	- The previous Huffman tree, required by `Treeless_Literals_Block` type
				423	- Previous FSE decoding tables, required by `Repeat_Mode`
				424	for each symbol type (literals lengths, match lengths, offsets)
				425
				426	Note that decoding tables aren't always from the previous `Compressed_Block`.
				427
				428	- Every decoding table can come from a dictionary.
				429	- The Huffman tree comes from the previous `Compressed_Literals_Block`.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	430
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	431	Literals Section
				432	----------------
All literals are grouped in the first part of the block.
They can be decoded first, and then copied during [Sequence Execution],
or they can be decoded on the fly during [Sequence Execution].
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	436
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	437	Literals can be stored uncompressed or compressed using Huffman prefix codes.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	438	When compressed, a tree description may optionally be present,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	439	followed by 1 or 4 streams.
				440
| `Literals_Section_Header` | [`Huffman_Tree_Description`] | [jumpTable] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
| ------------------------- | ---------------------------- | ----------- | ------- | --------- | --------- | --------- |
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	443
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	444
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	445	### `Literals_Section_Header`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	446
The header describes how literals are packed.
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
using __little-endian__ convention.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	450
| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
| --------------------- | ------------- | ------------------ | ------------------- |
|       2 bits          |  1 - 2 bits   |    5 - 20 bits     |     0 - 18 bits     |
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	454
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	455	In this representation, bits on the left are the lowest bits.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	456
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	457	__`Literals_Block_Type`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	458
This field uses the 2 lowest bits of the first byte, describing 4 different block types :

| `Literals_Block_Type`       | Value |
| --------------------------- | ----- |
| `Raw_Literals_Block`        |   0   |
| `RLE_Literals_Block`        |   1   |
| `Compressed_Literals_Block` |   2   |
| `Treeless_Literals_Block`   |   3   |
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	467
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	468	- `Raw_Literals_Block` - Literals are stored uncompressed.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	469	- `RLE_Literals_Block` - Literals consist of a single byte value
				470	repeated `Regenerated_Size` times.
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	471	- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	472	starting with a Huffman tree description.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	473	In this mode, there are at least 2 different literals represented in the Huffman tree description.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	474	See details below.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	475	- `Treeless_Literals_Block` - This is a Huffman-compressed block,
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	476	using Huffman tree _from previous Huffman-compressed literals block_.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	477	`Huffman_Tree_Description` will be skipped.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	478	Note: If this mode is triggered without any previous Huffman-table in the frame
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	479	(or [dictionary](#dictionary-format)), this should be treated as data corruption.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	480
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	481	__`Size_Format`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	482
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	483	`Size_Format` is divided into 2 families :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	484
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	485	- For `Raw_Literals_Block` and `RLE_Literals_Block`,
				486	it's only necessary to decode `Regenerated_Size`.
				487	There is no `Compressed_Size` field.
- For `Compressed_Literals_Block` and `Treeless_Literals_Block`,
				489	it's required to decode both `Compressed_Size`
				490	and `Regenerated_Size` (the decompressed size).
				491	It's also necessary to decode the number of streams (1 or 4).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	492
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	493	For values spanning several bytes, convention is __little-endian__.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	494
inikep	9d003c1	2016-08-04 10:41:49 +0200	[diff] [blame]	495	__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	496
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	497	`Size_Format` uses 1 _or_ 2 bits.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	498	Its value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	499
				500	- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit.
Sean Purcell	d86153d	2017-01-26 16:58:25 -0800	[diff] [blame]	501	`Regenerated_Size` uses 5 bits (0-31).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	502	`Literals_Section_Header` uses 1 byte.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	503	`Regenerated_Size = Literals_Section_Header[0]>>3`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	504	- `Size_Format` == 01 : `Size_Format` uses 2 bits.
Sean Purcell	d86153d	2017-01-26 16:58:25 -0800	[diff] [blame]	505	`Regenerated_Size` uses 12 bits (0-4095).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	506	`Literals_Section_Header` uses 2 bytes.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	507	`Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	508	- `Size_Format` == 11 : `Size_Format` uses 2 bits.
Sean Purcell	d86153d	2017-01-26 16:58:25 -0800	[diff] [blame]	509	`Regenerated_Size` uses 20 bits (0-1048575).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	510	`Literals_Section_Header` uses 3 bytes.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	511	`Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	512
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	513	Only Stream1 is present for these cases.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	514	Note : it's allowed to represent a short value (for example `27`)
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	515	using a long format, even if it's less efficient.
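
The three Raw/RLE layouts above can be sketched in C as follows. This is a minimal illustration, not the reference decoder: the type `raw_rle_header` and the function `parse_raw_rle_header` are hypothetical names, and a returned `header_size` of 0 is an assumed convention for "not enough input".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type; the field names are illustrative only. */
typedef struct {
    uint32_t regenerated_size;
    size_t   header_size;   /* bytes consumed by the header, 0 on error */
} raw_rle_header;

/* Parses the header of a Raw_Literals_Block or RLE_Literals_Block. */
static raw_rle_header parse_raw_rle_header(const uint8_t *h, size_t avail)
{
    raw_rle_header r = { 0, 0 };
    if (avail < 1) return r;

    unsigned size_format = (h[0] >> 2) & 3;
    switch (size_format) {
    case 0:
    case 2:   /* binary 00 or 10: 1-bit Size_Format, 5-bit size, 1 byte */
        r.regenerated_size = h[0] >> 3;
        r.header_size = 1;
        break;
    case 1:   /* binary 01: 12-bit size, 2 bytes */
        if (avail < 2) return r;
        r.regenerated_size = (uint32_t)(h[0] >> 4) + ((uint32_t)h[1] << 4);
        r.header_size = 2;
        break;
    default:  /* binary 11: 20-bit size, 3 bytes */
        if (avail < 3) return r;
        r.regenerated_size = (uint32_t)(h[0] >> 4) + ((uint32_t)h[1] << 4)
                           + ((uint32_t)h[2] << 12);
        r.header_size = 3;
        break;
    }
    return r;
}
```

For instance, the value `27` can be carried either by the single byte `0xD8` or, less efficiently, by the two bytes `0xB4 0x01`.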
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	516
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	517	__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	518
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	519	`Size_Format` always uses 2 bits.
				520
				521	- `Size_Format` == 00 : _A single stream_.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	522	Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	523	`Literals_Section_Header` uses 3 bytes.
				524	- `Size_Format` == 01 : 4 streams.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	525	Both `Regenerated_Size` and `Compressed_Size` use 10 bits (6-1023).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	526	`Literals_Section_Header` uses 3 bytes.
				527	- `Size_Format` == 10 : 4 streams.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	528	Both `Regenerated_Size` and `Compressed_Size` use 14 bits (6-16383).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	529	`Literals_Section_Header` uses 4 bytes.
				530	- `Size_Format` == 11 : 4 streams.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	531	Both `Regenerated_Size` and `Compressed_Size` use 18 bits (6-262143).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	532	`Literals_Section_Header` uses 5 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	533
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	534	Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
				535	Note: `Compressed_Size` __includes__ the size of the Huffman Tree description
				536	_when_ it is present.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	537	Note 2: `Compressed_Size` can never be `==0`.
				538	Even in single-stream scenario, assuming an empty content, it must be `>=1`,
				539	since it contains at least the final end bit flag.
				540	In 4-streams scenario, a valid `Compressed_Size` is necessarily `>= 10`
				541	(6 bytes for the jump table, + 4x1 bytes for the 4 streams).
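
A possible way to extract both sizes is to load the whole header as one little-endian integer and shift fields out of it. The sketch below assumes the field order shown in the header table (type, `Size_Format`, `Regenerated_Size`, then `Compressed_Size`, lowest bits first). The names `compressed_literals_header` and `parse_compressed_literals_header` are hypothetical, and `header_size == 0` is an assumed error convention.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type; the field names are illustrative only. */
typedef struct {
    uint32_t regenerated_size;
    uint32_t compressed_size;
    unsigned num_streams;   /* 1 or 4 */
    size_t   header_size;   /* 3, 4 or 5 bytes; 0 on error */
} compressed_literals_header;

/* Parses the header of a Compressed_Literals_Block or Treeless_Literals_Block. */
static compressed_literals_header
parse_compressed_literals_header(const uint8_t *h, size_t avail)
{
    compressed_literals_header r = { 0, 0, 0, 0 };
    if (avail < 1) return r;

    unsigned size_format = (h[0] >> 2) & 3;
    unsigned bits = (size_format <= 1) ? 10 : (size_format == 2) ? 14 : 18;
    size_t   n    = (size_format <= 1) ? 3  : (size_format == 2) ? 4  : 5;
    if (avail < n) return r;

    /* Assemble the header as a little-endian value (at most 40 bits). */
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++) v |= (uint64_t)h[i] << (8 * i);

    uint32_t mask = (1u << bits) - 1;
    r.regenerated_size = (uint32_t)(v >> 4) & mask;           /* after type + Size_Format */
    r.compressed_size  = (uint32_t)(v >> (4 + bits)) & mask;  /* right after Regenerated_Size */
    r.num_streams = (size_format == 0) ? 1 : 4;
    r.header_size = n;

    /* Compressed_Size can never be 0, and 4 streams need at least 10 bytes. */
    if (r.compressed_size == 0 || (r.num_streams == 4 && r.compressed_size < 10))
        r.header_size = 0;
    return r;
}
```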
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	542
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	543	4 streams is faster than 1 stream in decompression speed,
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	544	by exploiting instruction level parallelism.
				545	But it's also more expensive,
				546	costing on average ~7.3 bytes more than the 1 stream mode, mostly from the jump table.
				547
				548	In general, use the 4 streams mode when there are more literals to decode,
				549	to favor higher decompression speeds.
Note that above 1023 bytes of literals, the 4 streams mode is compulsory,
since the single-stream format can only encode sizes up to 1023.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	551
Note that a minimum of 6 bytes is required for the 4 streams mode.
That's a technical minimum, but employing the 4 streams mode
for such a small quantity is not recommended, as it would be wasteful.
A more practical lower bound would be around 256 bytes.
				556
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	557	#### Raw Literals Block
The data in Stream1 is `Regenerated_Size` bytes long.
It contains the raw literals data to be used during [Sequence Execution].
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	560
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	561	#### RLE Literals Block
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	562	Stream1 consists of a single byte which should be repeated `Regenerated_Size` times
				563	to generate the decoded literals.
				564
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	565	#### Compressed Literals Block and Treeless Literals Block
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	566	Both of these modes contain Huffman encoded data.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	567
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	568	For `Treeless_Literals_Block`,
				569	the Huffman table comes from previously compressed literals block,
				570	or from a dictionary.
				571
				572
				573	### `Huffman_Tree_Description`
This section is only present when `Literals_Block_Type` is `Compressed_Literals_Block` (`2`).
The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
The size of `Huffman_Tree_Description` is determined during the decoding process.
It must be used to determine where streams begin.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	579	`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	580
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	581
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	582	### Jump Table
				583	The Jump Table is only present when there are 4 Huffman-coded streams.
				584
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	585	Reminder : Huffman compressed data consists of either 1 or 4 streams.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	586
				587	If only one stream is present, it is a single bitstream occupying the entire
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	588	remaining portion of the literals block, encoded as described in
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	589	[Huffman-Coded Streams](#huffman-coded-streams).
				590
If there are four streams, `Literals_Section_Header` only provides
enough information to know the decompressed and compressed sizes
of all four streams _combined_.
				594	The decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	595	except for the last stream which may be up to 3 bytes smaller,
				596	to reach a total decompressed size as specified in `Regenerated_Size`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	597
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	598	The compressed size of each stream is provided explicitly in the Jump Table.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	599	Jump Table is 6 bytes long, and consists of three 2-byte __little-endian__ fields,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	600	describing the compressed sizes of the first three streams.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	601	`Stream4_Size` is computed from `Total_Streams_Size` minus sizes of other streams:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	602
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	603	`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	604
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	605	`Stream4_Size` is necessarily `>= 1`. Therefore,
				606	if `Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1`,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	607	data is considered corrupted.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	608
Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
as described in [Huffman-Coded Streams](#huffman-coded-streams).
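
The size bookkeeping for the 4 streams mode can be sketched as below. The type `four_streams` and the function `split_four_streams` are hypothetical names; the caller is assumed to have already checked that 6 bytes of Jump Table are readable. Rejecting inputs where three full streams would exceed `Regenerated_Size` is an extra sanity check of this sketch, not a rule stated above.

```c
#include <stdint.h>

/* Hypothetical result type; the field names are illustrative only. */
typedef struct {
    uint32_t compressed[4];    /* compressed size of each stream */
    uint32_t regenerated[4];   /* decompressed size of each stream */
    int      ok;               /* 0 when the sizes are inconsistent */
} four_streams;

static four_streams split_four_streams(const uint8_t jump_table[6],
                                       uint32_t total_streams_size,
                                       uint32_t regenerated_size)
{
    four_streams s = { { 0, 0, 0, 0 }, { 0, 0, 0, 0 }, 0 };

    /* Three 2-byte little-endian fields: sizes of streams 1 to 3. */
    uint32_t used = 6;   /* the Jump Table itself */
    for (int i = 0; i < 3; i++) {
        s.compressed[i] = (uint32_t)jump_table[2 * i]
                        | ((uint32_t)jump_table[2 * i + 1] << 8);
        used += s.compressed[i];
    }

    /* Stream4_Size must be >= 1, otherwise the data is corrupted. */
    if (total_streams_size < used + 1) return s;
    s.compressed[3] = total_streams_size - used;

    /* Streams 1 to 3 each regenerate (Regenerated_Size+3)/4 bytes,
     * the last stream takes whatever remains. */
    uint32_t per_stream = (regenerated_size + 3) / 4;
    if (3 * per_stream > regenerated_size) return s;
    for (int i = 0; i < 3; i++) s.regenerated[i] = per_stream;
    s.regenerated[3] = regenerated_size - 3 * per_stream;

    s.ok = 1;
    return s;
}
```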
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	611
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	612
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	613	Sequences Section
				614	-----------------
A compressed block is a succession of _sequences_.
				616	A sequence is a literal copy command, followed by a match copy command.
				617	A literal copy command specifies a length.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	618	It is the number of bytes to be copied (or extracted) from the Literals Section.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	619	A match copy command specifies an offset and a length.
				620
				621	When all _sequences_ are decoded,
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	622	if there are literals left in the _literals section_,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	623	these bytes are added at the end of the block.
				624
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	625	This is described in more detail in [Sequence Execution](#sequence-execution).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	626
The `Sequences_Section` groups all symbols required to decode commands.
There are 3 symbol types : literals lengths, offsets and match lengths.
They are encoded together, interleaved, in a single _bitstream_.

The `Sequences_Section` starts with a header,
followed by optional probability tables for each symbol type,
followed by the bitstream.
				634
| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
				637
				638	To decode the `Sequences_Section`, it's required to know its size.
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	639	Its size is deduced from the size of `Literals_Section`:
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	640	`Sequences_Section_Size = Block_Size - Literals_Section_Size`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	641
				642
				643	#### `Sequences_Section_Header`
				644
				645	Consists of 2 items:
				646	- `Number_of_Sequences`
				647	- Symbol compression modes
				648
				649	__`Number_of_Sequences`__
				650
				651	This is a variable size field using between 1 and 3 bytes.
				652	Let's call its first byte `byte0`.
				653	- `if (byte0 == 0)` : there are no sequences.
				654	The sequence section stops there.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	655	Decompressed content is defined entirely as Literals Section content.
Nick Terrell	73f4c89	2018-05-22 16:12:33 -0700	[diff] [blame]	656	The FSE tables used in `Repeat_Mode` aren't updated.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	657	- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
Yann Collet	1f83b7c	2023-06-05 09:51:52 -0700	[diff] [blame]	658	- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0 - 0x80) << 8) + byte1`. Uses 2 bytes.
				659	Note that the 2 bytes format fully overlaps the 1 byte format.
				660	- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00`. Uses 3 bytes.
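
The variable-size encoding above translates almost directly into C. In this sketch, `parse_number_of_sequences` is a hypothetical helper that returns the number of header bytes consumed, with 0 as an assumed convention for truncated input.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads Number_of_Sequences from p; returns bytes consumed (1 to 3), or 0
 * if fewer bytes are available than the encoding requires. */
static size_t parse_number_of_sequences(const uint8_t *p, size_t avail,
                                        uint32_t *number_of_sequences)
{
    if (avail < 1) return 0;
    uint8_t byte0 = p[0];

    if (byte0 < 128) {               /* also covers byte0 == 0: no sequences */
        *number_of_sequences = byte0;
        return 1;
    }
    if (byte0 < 255) {
        if (avail < 2) return 0;
        *number_of_sequences = ((uint32_t)(byte0 - 0x80) << 8) + p[1];
        return 2;
    }
    if (avail < 3) return 0;
    *number_of_sequences = (uint32_t)p[1] + ((uint32_t)p[2] << 8) + 0x7F00;
    return 3;
}
```

With this encoding the largest representable count is `0xFFFF + 0x7F00`, that is 98047 sequences.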
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	661
				662	__Symbol compression modes__
				663
				664	This is a single byte, defining the compression mode of each symbol type.
				665
|Bit number|          7-6            |      5-4       |        3-2           |     1-0    |
| -------- | ----------------------- | -------------- | -------------------- | ---------- |
|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |
				669
				670	The last field, `Reserved`, must be all-zeroes.
				671
				672	`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	673	literals lengths, offsets, and match lengths symbols respectively.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	674
				675	They follow the same enumeration :
				676
|        Value       |        0          |      1     |           2           |       3       |
| ------------------ | ----------------- | ---------- | --------------------- | ------------- |
| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` |
				680
				681	- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
				682	[default distributions](#default-distributions).
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	683	No distribution table will be present.
Yann Collet	c1e6347	2018-06-21 18:08:11 -0700	[diff] [blame]	684	- `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value.
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	685	This symbol will be used for all sequences.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	686	- `FSE_Compressed_Mode` : standard FSE compression.
				687	A distribution table will be present.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	688	The format of this distribution table is described in [FSE Table Description](#fse-table-description).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	689	Note that the maximum allowed accuracy log for literals length and match length tables is 9,
				690	and the maximum accuracy log for the offsets table is 8.
`FSE_Compressed_Mode` must not be used when only one symbol is present.
`RLE_Mode` should be used instead (although any other mode will work).
- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
or if this is the first block, the table in the dictionary will be used.
Note that this includes `RLE_Mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have the same outcome as `Predefined_Mode`.
No distribution table will be present.
If this mode is used without any previous sequence table in the frame
(nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
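
Splitting the symbol compression modes byte into its three fields could look like the sketch below. The enum, the `symbol_modes` type and `parse_symbol_modes` are hypothetical names chosen for illustration; only the bit positions and the all-zero `Reserved` rule come from the text above.

```c
#include <stdint.h>

/* Values follow the Compression_Mode enumeration above. */
enum compression_mode {
    PREDEFINED_MODE     = 0,
    RLE_MODE            = 1,
    FSE_COMPRESSED_MODE = 2,
    REPEAT_MODE         = 3
};

/* Hypothetical result type; the field names are illustrative only. */
typedef struct {
    enum compression_mode literals_lengths;
    enum compression_mode offsets;
    enum compression_mode match_lengths;
    int ok;   /* 0 when the Reserved bits are not all zero */
} symbol_modes;

static symbol_modes parse_symbol_modes(uint8_t b)
{
    symbol_modes m;
    m.literals_lengths = (enum compression_mode)(b >> 6);         /* bits 7-6 */
    m.offsets          = (enum compression_mode)((b >> 4) & 3);   /* bits 5-4 */
    m.match_lengths    = (enum compression_mode)((b >> 2) & 3);   /* bits 3-2 */
    m.ok               = ((b & 3) == 0);                          /* bits 1-0 */
    return m;
}
```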
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	700
				701	#### The codes for literals lengths, match lengths, and offsets.
				702
				703	Each symbol is a _code_ in its own context,
				704	which specifies `Baseline` and `Number_of_Bits` to add.
				705	_Codes_ are FSE compressed,
				706	and interleaved with raw additional bits in the same bitstream.
				707
				708	##### Literals length codes
				709
				710	Literals length codes are values ranging from `0` to `35` included.
				711	They define lengths from 0 to 131071 bytes.
				712	The literals length is equal to the decoded `Baseline` plus
				713	the result of reading `Number_of_Bits` bits from the bitstream,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	714	as a __little-endian__ value.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	715
| `Literals_Length_Code` |         0-15           |
| ---------------------- | ---------------------- |
| length                 | `Literals_Length_Code` |
| `Number_of_Bits`       |          0             |

| `Literals_Length_Code` |  16  |  17  |  18  |  19  |  20  |  21  |  22  |  23  |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`             |  16  |  18  |  20  |  22  |  24  |  28  |  32  |  40  |
| `Number_of_Bits`       |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |

| `Literals_Length_Code` |  24  |  25  |  26  |  27  |  28  |  29  |  30  |  31  |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`             |  48  |  64  |  128 |  256 |  512 | 1024 | 2048 | 4096 |
| `Number_of_Bits`       |   4  |   6  |   7  |   8  |   9  |  10  |  11  |  12  |

| `Literals_Length_Code` |  32  |  33  |  34  |  35  |
| ---------------------- | ---- | ---- | ---- | ---- |
| `Baseline`             | 8192 |16384 |32768 |65536 |
| `Number_of_Bits`       |  13  |  14  |  15  |  16  |
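
As an illustration, the literals length tables can be folded into two small lookup helpers. The names `ll_baseline`, `ll_number_of_bits` and `literals_length` are hypothetical; `extra` stands for the value already read from the bitstream, and the caller is assumed to pass a code in the range 0 to 35.

```c
#include <stdint.h>

/* Baseline and Number_of_Bits for Literals_Length_Code 16 to 35. */
static const uint32_t LL_BASELINE_16_35[20] = {
    16, 18, 20, 22, 24, 28, 32, 40, 48, 64,
    128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
};
static const uint8_t LL_BITS_16_35[20] = {
    1, 1, 1, 1, 2, 2, 3, 3, 4, 6,
    7, 8, 9, 10, 11, 12, 13, 14, 15, 16
};

/* Codes 0 to 15 encode the length directly, with no extra bits. */
static uint32_t ll_baseline(unsigned code)
{
    return (code < 16) ? code : LL_BASELINE_16_35[code - 16];
}

static unsigned ll_number_of_bits(unsigned code)
{
    return (code < 16) ? 0 : LL_BITS_16_35[code - 16];
}

/* extra: the Number_of_Bits bits read from the bitstream for this code. */
static uint32_t literals_length(unsigned code, uint32_t extra)
{
    return ll_baseline(code) + extra;
}
```

Code `35` with all 16 extra bits set gives `65536 + 65535`, the stated maximum of 131071.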
				735
				736
				737	##### Match length codes
				738
				739	Match length codes are values ranging from `0` to `52` included.
				740	They define lengths from 3 to 131074 bytes.
				741	The match length is equal to the decoded `Baseline` plus
				742	the result of reading `Number_of_Bits` bits from the bitstream,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	743	as a __little-endian__ value.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	744
| `Match_Length_Code` |         0-31            |
| ------------------- | ----------------------- |
| value               | `Match_Length_Code` + 3 |
| `Number_of_Bits`    |          0              |

| `Match_Length_Code` |  32  |  33  |  34  |  35  |  36  |  37  |  38  |  39  |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`          |  35  |  37  |  39  |  41  |  43  |  47  |  51  |  59  |
| `Number_of_Bits`    |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |

| `Match_Length_Code` |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`          |  67  |  83  |  99  |  131 |  259 |  515 | 1027 | 2051 |
| `Number_of_Bits`    |   4  |   4  |   5  |   7  |   8  |   9  |  10  |  11  |

| `Match_Length_Code` |  48  |  49  |  50  |  51  |  52  |
| ------------------- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`          | 4099 | 8195 |16387 |32771 |65539 |
| `Number_of_Bits`    |  12  |  13  |  14  |  15  |  16  |
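
The match length tables lend themselves to the same treatment, with the minimum match of 3 built into the baselines. A compact sketch, in which the `ml_entry` type and the `match_length` function are hypothetical names and the code is assumed to be in the range 0 to 52:

```c
#include <stdint.h>

/* One row of the tables above: Baseline and Number_of_Bits. */
typedef struct { uint32_t baseline; uint8_t number_of_bits; } ml_entry;

/* Entries for Match_Length_Code 32 to 52. */
static const ml_entry ML_TABLE_32_52[21] = {
    { 35, 1 },  { 37, 1 },  { 39, 1 },   { 41, 1 },   { 43, 2 },   { 47, 2 },
    { 51, 3 },  { 59, 3 },  { 67, 4 },   { 83, 4 },   { 99, 5 },   { 131, 7 },
    { 259, 8 }, { 515, 9 }, { 1027, 10 }, { 2051, 11 }, { 4099, 12 },
    { 8195, 13 }, { 16387, 14 }, { 32771, 15 }, { 65539, 16 }
};

/* extra: the Number_of_Bits bits read from the bitstream for this code. */
static uint32_t match_length(unsigned code, uint32_t extra)
{
    if (code < 32) return code + 3;   /* codes 0 to 31 carry no extra bits */
    return ML_TABLE_32_52[code - 32].baseline + extra;
}
```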
				764
				765	##### Offset codes
				766
				767	Offset codes are values ranging from `0` to `N`.
				768
A decoder is free to limit its maximum `N` supported.
Recommendation is to support at least up to `22`.
For information, at the time of this writing,
the reference decoder supports a maximum `N` value of `31`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	773
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	774	An offset code is also the number of additional bits to read in __little-endian__ fashion,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	775	and can be translated into an `Offset_Value` using the following formulas :
				776
				777	```
				778	Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
				779	if (Offset_Value > 3) offset = Offset_Value - 3;
				780	```
This means that the maximum `Offset_Value` is `(2^(N+1))-1`,
supporting back-reference distances up to `(2^(N+1))-4`,
but it is limited by the [maximum back-reference distance](#window_descriptor).
				784
				785	`Offset_Value` from 1 to 3 are special : they define "repeat codes".
Yann Collet	c1e6347	2018-06-21 18:08:11 -0700	[diff] [blame]	786	This is described in more detail in [Repeat Offsets](#repeat-offsets).
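
A small C sketch of the formulas above, separating repeat codes from real distances. The `decoded_offset` type and `decode_offset` function are hypothetical names, `extra` stands for the result of `readNBits(offsetCode)`, and the code is assumed to be at most 31.

```c
#include <stdint.h>

/* Hypothetical result type; the field names are illustrative only. */
typedef struct {
    uint64_t offset_value;    /* (1 << offsetCode) + extra bits */
    int      is_repeat_code;  /* 1 when Offset_Value is 1, 2 or 3 */
    uint64_t distance;        /* Offset_Value - 3, valid when not a repeat code */
} decoded_offset;

/* offset_code must be <= 31; extra holds the offset_code bits already read. */
static decoded_offset decode_offset(unsigned offset_code, uint32_t extra)
{
    decoded_offset o;
    o.offset_value   = ((uint64_t)1 << offset_code) + extra;
    o.is_repeat_code = (o.offset_value <= 3);
    o.distance       = o.is_repeat_code ? 0 : o.offset_value - 3;
    return o;
}
```

For example, code `28` with all 28 extra bits set yields a distance of `(2^29)-4`, that is 536870908.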
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	787
				788	#### Decoding Sequences
FSE bitstreams are read in the reverse direction from which they are written. In zstd,
the compressor writes bits forward into a block and the decompressor
must read the bitstream _backwards_.
				792
				793	To find the start of the bitstream it is therefore necessary to
				794	know the offset of the last byte of the block which can be found
				795	by counting `Block_Size` bytes after the block header.
				796
				797	After writing the last bit containing information, the compressor
				798	writes a single `1`-bit and then fills the byte with 0-7 `0` bits of
				799	padding. The last byte of the compressed bitstream cannot be `0` for
				800	that reason.
				801
When decompressing, the last byte containing the padding is the first
byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
the first `1`-bit it encounters. Afterwards, the useful part of the bitstream
begins.
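
The padding rule can be sketched as a scan of the last byte. This assumes that reading backwards starts at the most significant bit of that byte, so the padding `0`-bits are its highest bits. The function name `padding_bits_to_skip` is hypothetical.

```c
#include <stdint.h>

/* Returns how many bits to skip in the last byte before useful data starts:
 * the 0-7 padding zeros plus the single 1-bit marker, so a value from 1 to 8.
 * Returns 0 when the last byte is 0, which cannot happen in a valid stream. */
static unsigned padding_bits_to_skip(uint8_t last_byte)
{
    if (last_byte == 0) return 0;   /* no marker bit: corrupted */

    unsigned skipped = 1;           /* counts the marker bit itself */
    while ((last_byte & 0x80) == 0) {
        last_byte = (uint8_t)(last_byte << 1);
        skipped++;
    }
    return skipped;
}
```

A last byte of `0x01` therefore holds no useful bits at all, while `0x80` leaves 7 useful bits below the marker.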
				806
				807	FSE decoding requires a 'state' to be carried from symbol to symbol.
				808	For more explanation on FSE decoding, see the [FSE section](#fse).
				809
For sequence decoding, a separate state is kept for each symbol type:
literals lengths, offsets, and match lengths.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	812	Some FSE primitives are also used.
				813	For more details on the operation of these primitives, see the [FSE section](#fse).
				814
				815	##### Starting states
				816	The bitstream starts with initial FSE state values,
				817	each using the required number of bits in their respective _accuracy_,
				818	decoded previously from their normalized distribution.
				819
				820	It starts by `Literals_Length_State`,
				821	followed by `Offset_State`,
				822	and finally `Match_Length_State`.
				823
				824	Reminder : always keep in mind that all values are read _backward_,
				825	so the 'start' of the bitstream is at the highest position in memory,
				826	immediately before the last `1`-bit for padding.
				827
After decoding the starting states, a single sequence is decoded
`Number_of_Sequences` times.
				830	These sequences are decoded in order from first to last.
				831	Since the compressor writes the bitstream in the forward direction,
				832	this means the compressor must encode the sequences starting with the last
				833	one and ending with the first.
				834
				835	##### Decoding a sequence
				836	For each of the symbol types, the FSE state can be used to determine the appropriate code.
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	837	The code then defines the `Baseline` and `Number_of_Bits` to read for each type.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	838	See the [description of the codes] for how to determine these values.
				839
				840	[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets
				841
				842	Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	843	It then does the same for `Match_Length`, and then for `Literals_Length`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	844	This sequence is then used for [sequence execution](#sequence-execution).
				845
				846	If it is not the last sequence in the block,
				847	the next operation is to update states.
				848	Using the rules pre-calculated in the decoding tables,
				849	`Literals_Length_State` is updated,
				850	followed by `Match_Length_State`,
				851	and then `Offset_State`.
				852	See the [FSE section](#fse) for details on how to update states from the bitstream.
				853
				854	This operation will be repeated `Number_of_Sequences` times.
				855	At the end, the bitstream shall be entirely consumed,
				856	otherwise the bitstream is considered corrupted.
				857
				858	#### Default Distributions
				859	If `Predefined_Mode` is selected for a symbol type,
				860	its FSE decoding table is generated from a predefined distribution table defined here.
				861	For details on how to convert this distribution into a decoding table, see the [FSE section].
				862
				863	[FSE section]: #from-normalized-distribution-to-decoding-tables
				864
Sean Purcell	3bee41a	2017-02-21 10:20:36 -0800	[diff] [blame]	865	##### Literals Length
				866	The decoding table uses an accuracy log of 6 bits (64 states).
				867	```
				868	short literalsLength_defaultDistribution[36] =
				869	{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
				870	2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
				871	-1,-1,-1,-1 };
				872	```
				873
				874	##### Match Length
				875	The decoding table uses an accuracy log of 6 bits (64 states).
				876	```
				877	short matchLengths_defaultDistribution[53] =
				878	{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				879	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
				880	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
				881	-1,-1,-1,-1,-1 };
				882	```
				883
				884	##### Offset Codes
				885	The decoding table uses an accuracy log of 5 bits (32 states),
				886	and supports a maximum `N` value of 28, allowing offset values up to 536,870,908 .
				887
				888	If any sequence in the compressed block requires a larger offset than this,
				889	it's not possible to use the default distribution to represent it.
				890	```
				891	short offsetCodes_defaultDistribution[29] =
				892	{ 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				893	1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
				894	```
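
As a sanity check on the three tables above, the points of each default distribution should total `1 << Accuracy_Log`, with each `-1` entry counting as one point. The sketch below copies the arrays from this section; the helper name `total_points` is hypothetical and not part of any reference implementation.

```c
#include <stddef.h>

/* Default distributions, copied from the tables above. */
const short literalsLength_defaultDistribution[36] =
    { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
     -1,-1,-1,-1 };

const short matchLengths_defaultDistribution[53] =
    { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
     -1,-1,-1,-1,-1 };

const short offsetCodes_defaultDistribution[29] =
    { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };

/* Sum the probability points of a normalized distribution.
 * A "less than 1" probability (-1) counts as one point. */
int total_points(const short *dist, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += (dist[i] == -1) ? 1 : dist[i];
    return total;
}
```

With accuracy logs of 6, 6 and 5, the expected totals are 64, 64 and 32 respectively.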
				895
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	896
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	897	Sequence Execution
				898	------------------
				899	Once literals and sequences have been decoded,
				900	they are combined to produce the decoded content of a block.
				901
				902	Each sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`),
Sean Purcell	3bee41a	2017-02-21 10:20:36 -0800	[diff] [blame]	903	decoded as described in the [Sequences Section](#sequences-section).
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	904	To execute a sequence, first copy `literals_length` bytes
				905	from the decoded literals to the output.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	906
				907	Then `match_length` bytes are copied from previous decoded data.
				908	The offset to copy from is determined by `offset_value`:
				909	if `offset_value > 3`, then the offset is `offset_value - 3`.
				910	If `offset_value` is from 1-3, the offset is a special repeat offset value.
				911	See the [repeat offset](#repeat-offsets) section for how the offset is determined
				912	in this case.
				913
				914	The offset is measured backward from the current position, so an offset of 6
				915	and a match length of 3 mean that 3 bytes should be copied from 6 bytes back.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	916	Note that all offsets leading to previously decoded data
				917	must be smaller than `Window_Size` defined in `Frame_Header_Descriptor`.
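
The two copy steps can be sketched as follows. This is a minimal illustration, assuming the offset has already been resolved to an actual distance (repeat offsets handled elsewhere) and ignoring dictionaries and the `Window_Size` check. The function name and signature are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Execute one sequence into `out`, starting at *out_pos.
 * `offset` is the actual distance back from the current position.
 * Returns 0 on success, -1 if the sequence is invalid or does not fit. */
int execute_sequence(uint8_t *out, size_t *out_pos, size_t out_capacity,
                     const uint8_t *literals, size_t literals_length,
                     size_t offset, size_t match_length)
{
    size_t pos = *out_pos;

    /* Step 1: copy literals_length bytes from the decoded literals. */
    if (literals_length > out_capacity - pos) return -1;
    memcpy(out + pos, literals, literals_length);
    pos += literals_length;

    /* Step 2: copy match_length bytes from `offset` bytes back. */
    if (offset == 0 || offset > pos) return -1;   /* points before the output */
    if (match_length > out_capacity - pos) return -1;
    /* Byte by byte: source and destination overlap when offset < match_length. */
    for (size_t i = 0; i < match_length; i++, pos++)
        out[pos] = out[pos - offset];

    *out_pos = pos;
    return 0;
}
```

The byte-by-byte loop is a deliberate choice: when `offset` is smaller than `match_length`, the match reads bytes that the same copy has just written, which a plain `memcpy` does not support.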
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	918
				919	#### Repeat offsets
				920	As seen in [Sequence Execution](#sequence-execution),
				921	the first 3 values define a repeated offset and we will call them
				922	`Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`.
				923	They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one".
				924
				925	If `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc.
				926
				927	There is an exception though, when current sequence's `literals_length = 0`.
				928	In this case, repeated offsets are shifted by one,
				929	so an `offset_value` of 1 means `Repeated_Offset2`,
				930	an `offset_value` of 2 means `Repeated_Offset3`,
				931	and an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`.
				932
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	933	For the first block, the starting offset history is populated with the following values :
				934	`Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	935	unless a dictionary is used, in which case they come from the dictionary.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	936
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	937	Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
				938	Note that blocks which are not `Compressed_Block` are skipped: they do not contribute to the offset history.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	939
				940	[Offset Codes]: #offset-codes
				941
				942	###### Offset updates rules
				943
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	944	During the execution of the sequences of a `Compressed_Block`, the
				945	`Repeated_Offsets`' values are kept up to date, so that they always represent
				946	the three most-recently used offsets. In order to achieve that, they are
				947	updated after executing each sequence in the following way:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	948
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	949	When the sequence's `offset_value` does not refer to one of the
				950	`Repeated_Offsets`--when it has value greater than 3, or when it has value 3
				951	and the sequence's `literals_length` is zero--the `Repeated_Offsets`' values
				952	are shifted back one, and `Repeated_Offset1` takes on the value of the
				953	just-used offset.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	954
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	955	Otherwise, when the sequence's `offset_value` refers to one of the
				956	`Repeated_Offsets`--when it has value 1 or 2, or when it has value 3 and the
				957	sequence's `literals_length` is non-zero--the `Repeated_Offsets` are re-ordered
				958	so that `Repeated_Offset1` takes on the value of the used Repeated_Offset, and
				959	the existing values are pushed back from the first `Repeated_Offset` through to
				960	the `Repeated_Offset` selected by the `offset_value`. This effectively performs
				961	a single-stepped wrapping rotation of the values of these offsets, so that
				962	their order again reflects the recency of their use.
Yann Collet	9bf0070	2018-10-26 15:51:51 -0700	[diff] [blame]	963
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	964	The following table shows the values of the `Repeated_Offsets` as a series of
				965	sequences are applied to them:
Yann Collet	9bf0070	2018-10-26 15:51:51 -0700	[diff] [blame]	966
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	967	\| `offset_value` \| `literals_length` \| `Repeated_Offset1` \| `Repeated_Offset2` \| `Repeated_Offset3` \| Comment \|
				968	\|:--------------:\|:-----------------:\|:------------------:\|:------------------:\|:------------------:\|:-----------------------:\|
				969	\| \| \| 1 \| 4 \| 8 \| starting values \|
				970	\| 1114 \| 11 \| 1111 \| 1 \| 4 \| non-repeat \|
Yann Collet	f33ccd2	2022-05-24 04:47:49 -0700	[diff] [blame]	971	\| 1 \| 22 \| 1111 \| 1 \| 4 \| repeat 1: no change \|
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	972	\| 2225 \| 22 \| 2222 \| 1111 \| 1 \| non-repeat \|
				973	\| 1114 \| 111 \| 1111 \| 2222 \| 1111 \| non-repeat \|
				974	\| 3336 \| 33 \| 3333 \| 1111 \| 2222 \| non-repeat \|
Yann Collet	f33ccd2	2022-05-24 04:47:49 -0700	[diff] [blame]	975	\| 2 \| 22 \| 1111 \| 3333 \| 2222 \| repeat 2: swap 1 & 2 \|
				976	\| 3 \| 33 \| 2222 \| 1111 \| 3333 \| repeat 3: rotate 3 to 1 \|
				977	\| 3 \| 0 \| 2221 \| 2222 \| 1111 \| special case : insert `repeat1 - 1` \|
				978	\| 1 \| 0 \| 2222 \| 2221 \| 1111 \| == repeat 2 \|
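
The resolution and update rules can be sketched in a single function. This is an illustrative sketch only: the type `offset_history` and the function `resolve_offset` are hypothetical names, `offset_value` is assumed to be at least 1, and a resulting offset of 0 (possible in the `Repeated_Offset1 - 1` case on corrupted input) is not checked.

```c
#include <stdint.h>

/* Offset history; rep[0] is Repeated_Offset1 (the most recent). */
typedef struct {
    uint32_t rep[3];
} offset_history;

/* Turn offset_value into an actual offset and update the history. */
uint32_t resolve_offset(offset_history *h, uint32_t offset_value,
                        uint32_t literals_length)
{
    uint32_t offset;

    if (offset_value > 3) {
        /* Non-repeat: new offset, history shifts back by one. */
        offset = offset_value - 3;
        h->rep[2] = h->rep[1];
        h->rep[1] = h->rep[0];
    } else {
        /* Repeat codes are shifted by one when literals_length is 0. */
        uint32_t idx = offset_value - 1 + (literals_length == 0 ? 1u : 0u);

        if (idx == 0)
            return h->rep[0];            /* repeat 1: no change */

        if (idx == 3) {
            /* Special case: Repeated_Offset1 - 1, treated as a new offset. */
            offset = h->rep[0] - 1;
            h->rep[2] = h->rep[1];
        } else {
            offset = h->rep[idx];
            if (idx == 2)
                h->rep[2] = h->rep[1];   /* repeat 3: rotate 3 to 1 */
        }
        h->rep[1] = h->rep[0];
    }
    h->rep[0] = offset;
    return offset;
}
```

Starting from the history `{1, 4, 8}` and feeding the (`offset_value`, `literals_length`) pairs of the table above in order should reproduce each row of the table.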
Yann Collet	9bf0070	2018-10-26 15:51:51 -0700	[diff] [blame]	979
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	980
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	981	Skippable Frames
				982	----------------
				983
				984	\| `Magic_Number` \| `Frame_Size` \| `User_Data` \|
				985	\|:--------------:\|:------------:\|:-----------:\|
				986	\| 4 bytes \| 4 bytes \| n bytes \|
				987
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	988	Skippable frames allow the insertion of user-defined metadata
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	989	into a flow of concatenated frames.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	990
				991	Skippable frames defined in this specification are compatible with [LZ4] ones.
				992
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	993	[LZ4]:https://lz4.github.io/lz4/
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	994
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	995	From a compliant decoder perspective, skippable frames need just be skipped,
				996	and their content ignored, resuming decoding after the skippable frame.
				997
				998	It can be noted that a skippable frame
				999	can be used to watermark a stream of concatenated frames
Dominique Pelle	b772f53	2022-03-12 08:52:40 +0100	[diff] [blame]	1000	embedding any kind of tracking information (even just a UUID).
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1001	Users wary of such a possibility should scan the stream of concatenated frames
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	1002	in an attempt to detect such frames for analysis or removal.
				1003
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1004	__`Magic_Number`__
				1005
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1006	4 Bytes, __little-endian__ format.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1007	Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
				1008	All 16 values are valid to identify a skippable frame.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1009	This specification doesn't detail any specific tagging for skippable frames.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1010
				1011	__`Frame_Size`__
				1012
				1013	This is the size, in bytes, of the following `User_Data`
				1014	(without including the magic number nor the size field itself).
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1015	This field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1016	This means `User_Data` can’t be bigger than (2^32-1) bytes.
				1017
				1018	__`User_Data`__
				1019
				1020	The `User_Data` can be anything. Data will just be skipped by the decoder.
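
Recognizing and skipping such a frame takes only a few lines. The sketch below assumes the whole input is available in memory; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* If `src` starts with a complete skippable frame, return its total size
 * (8 header bytes + User_Data) so the caller can jump over it.
 * Return 0 if it is not a skippable frame or if the input is truncated. */
size_t skippable_frame_size(const uint8_t *src, size_t src_size)
{
    if (src_size < 8) return 0;

    /* Any of the 16 values 0x184D2A50..0x184D2A5F is accepted. */
    if ((read_le32(src) & 0xFFFFFFF0u) != 0x184D2A50u) return 0;

    uint32_t frame_size = read_le32(src + 4);   /* size of User_Data only */
    if (frame_size > src_size - 8) return 0;

    return 8 + (size_t)frame_size;
}
```

A decoder walking a stream of concatenated frames would advance by the returned size and resume decoding at the next frame.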
				1021
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1022
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1023
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1024	Entropy Encoding
				1025	----------------
				1026	Two types of entropy encoding are used by the Zstandard format:
				1027	FSE, and Huffman coding.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1028	Huffman is used to compress literals,
				1029	while FSE is used for all other symbols
				1030	(`Literals_Length_Code`, `Match_Length_Code`, offset codes)
				1031	and to compress Huffman headers.
				1032
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1033
				1034	FSE
				1035	---
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1036	FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1037	FSE encoding/decoding involves a state that is carried over between symbols,
				1038	so decoding must be done in the opposite direction as encoding.
				1039	Therefore, all FSE bitstreams are read from end to beginning.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1040	Note that the order of the bits in the stream is not reversed;
				1041	we just read the elements in the reverse of the order in which they were written.
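
To make the backward reading concrete, here is a minimal sketch of a reader that consumes fields from the end of a buffer while keeping the bits of each field in their normal little-endian order. The type and function names are hypothetical, and the caller is assumed to have already excluded the final padding bits when setting `bit_pos`.

```c
#include <stddef.h>
#include <stdint.h>

/* Backward reader: bit_pos is the number of bits not yet consumed,
 * counted from bit 0 of data[0]. */
typedef struct {
    const uint8_t *data;
    size_t bit_pos;
} backward_reader;

/* Read the n bits (n <= 32) located just below the current position.
 * The field itself is little-endian: its lowest bit is at the lowest
 * position in memory. Returns 0 on success, -1 if too few bits remain. */
int read_bits_backward(backward_reader *r, unsigned n, uint32_t *value)
{
    if (n > 32 || n > r->bit_pos) return -1;
    r->bit_pos -= n;

    uint32_t v = 0;
    for (unsigned i = 0; i < n; i++) {
        size_t bit = r->bit_pos + i;
        v |= (uint32_t)((r->data[bit >> 3] >> (bit & 7)) & 1u) << i;
    }
    *value = v;
    return 0;
}
```

For example, suppose an encoder writes the 3-bit value 5, then the 4-bit value 9, then the final `1`-bit, producing the single byte `0xCD`. A reader starting at `bit_pos = 7` gets 9 first and 5 second: the fields come back in reverse order, but each field's bits are unchanged.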
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1042
				1043	For additional details on FSE, see [Finite State Entropy].
				1044
				1045	[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
				1046
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1047	FSE decoding involves a decoding table whose size is a power of 2, and whose cells each contain three elements:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1048	`Symbol`, `Num_Bits`, and `Baseline`.
				1049	The `log2` of the table size is its `Accuracy_Log`.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1050	An FSE state value represents an index in this table.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1051
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1052	To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
				1053	The next symbol in the stream is the `Symbol` indicated in the table for that state.
				1054	To obtain the next state value,
				1055	the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1056
				1057	[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
				1058
				1059	### FSE Table Description
				1060	To decode FSE streams, it is necessary to construct the decoding table.
				1061	The Zstandard format encodes FSE table descriptions as follows:
				1062
				1063	An FSE distribution table describes the probabilities of all symbols
				1064	from `0` to the last present one (included)
				1065	on a normalized scale of `1 << Accuracy_Log` .
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1066	Note that there must be two or more symbols with nonzero probability.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1067
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1068	It's a bitstream which is read forward, in __little-endian__ fashion.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1069	It's not necessary to know the exact size of the bitstream;
				1070	it will be discovered and reported by the decoding process.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1071
				1072	The bitstream starts by reporting on which scale it operates.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1073	Let `low4Bits` designate the lowest 4 bits of the first byte :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1074	`Accuracy_Log = low4Bits + 5`.
				1075
				1076	Then follows each symbol value, from `0` to last present one.
				1077	The number of bits used by each field is variable.
				1078	It depends on :
				1079
				1080	- Remaining probabilities + 1 :
				1081	__example__ :
				1082	Presuming an `Accuracy_Log` of 8,
				1083	and presuming 100 probabilities points have already been distributed,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1084	the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
				1085	Therefore, it must read `log2sup(157) == 8` bits.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1086
				1087	- Value decoded : small values use 1 less bit :
				1088	__example__ :
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1089	Presuming values from 0 to 157 (inclusive) are possible,
				1090	255-157 = 98 values are remaining in an 8-bits field.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1091	They are used this way :
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1092	first 98 values (hence from 0 to 97) use only 7 bits,
				1093	values from 98 to 157 use 8 bits.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1094	This is achieved through this scheme :
				1095
				1096	\| Value read \| Value decoded \| Number of bits used \|
				1097	\| ---------- \| ------------- \| ------------------- \|
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1098	\| 0 - 97 \| 0 - 97 \| 7 \|
				1099	\| 98 - 127 \| 98 - 127 \| 8 \|
				1100	\| 128 - 225 \| 0 - 97 \| 7 \|
				1101	\| 226 - 255 \| 128 - 157 \| 8 \|
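
The mapping in the table above can be expressed as a small function. This is a sketch under stated assumptions: `remaining` is the number of probability points not yet distributed (156 in the example), and `raw` holds the next `bits` bits of the stream as a little-endian value, where `bits` is the full field width. Function names are hypothetical.

```c
#include <stdint.h>

/* Position of the highest set bit; v must be non-zero. */
int highest_set_bit(uint32_t v)
{
    int pos = -1;
    while (v) { v >>= 1; pos++; }
    return pos;
}

/* Decode one value from `raw`. Stores in *bits_used how many bits the
 * field really occupies, so the caller can advance by that amount. */
uint32_t decode_proba_value(uint32_t remaining, uint32_t raw, int *bits_used)
{
    uint32_t max_value  = remaining + 1;                  /* 157 in the example */
    int      bits       = highest_set_bit(max_value) + 1; /* 8 */
    uint32_t lower_mask = (1u << (bits - 1)) - 1;         /* 127 */
    uint32_t threshold  = (1u << bits) - 1 - max_value;   /* 98 */

    raw &= (1u << bits) - 1;
    if ((raw & lower_mask) < threshold) {
        /* Small value: the top bit belongs to the next field. */
        *bits_used = bits - 1;
        return raw & lower_mask;
    }
    *bits_used = bits;
    return (raw > lower_mask) ? raw - threshold : raw;
}
```

With `remaining = 156`, a raw value of 178 decodes to 50 using 7 bits, and a raw value of 226 decodes to 128 using 8 bits, matching the third and fourth rows of the table.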
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1102
				1103	Symbols probabilities are read one by one, in order.
				1104
				1105	Probability is obtained from Value decoded by following formula :
				1106	`Proba = value - 1`
				1107
				1108	It means value `0` becomes negative probability `-1`.
				1109	`-1` is a special probability, which means "less than 1".
				1110	Its effect on distribution table is described in the [next section].
				1111	For the purpose of calculating total allocated probability points, it counts as one.
				1112
				1113	[next section]:#from-normalized-distribution-to-decoding-tables
				1114
				1115	When a symbol has a __probability__ of `zero`,
				1116	it is followed by a 2-bits repeat flag.
				1117	This repeat flag tells how many additional zero-probability symbols follow the current one.
				1118	It provides a number ranging from 0 to 3.
				1119	If it is a 3, another 2-bits repeat flag follows, and so on.
				1120
				1121	When the last symbol brings the cumulative total to `1 << Accuracy_Log`,
				1122	decoding is complete.
				1123	If the last symbol makes the cumulative total exceed `1 << Accuracy_Log`,
				1124	the distribution is considered corrupted.
				1125
				1126	Then the decoder can tell how many bytes were used in this process,
				1127	and how many symbols are present.
				1128	The bitstream consumes a round number of bytes.
				1129	Any remaining bit within the last byte is just unused.
				1130
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1131	#### From normalized distribution to decoding tables
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1132
				1133	The distribution of normalized probabilities is enough
				1134	to create a unique decoding table.
				1135
				1136	The table is built according to the following rule :
				1137
				1138	The table has a size of `Table_Size = 1 << Accuracy_Log`.
				1139	Each cell describes the symbol decoded,
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1140	and instructions to get the next state (`Number_of_Bits` and `Baseline`).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1141
				1142	Symbols are scanned in their natural order for "less than 1" probabilities.
				1143	Each symbol with this probability is attributed a single cell,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1144	starting from the end of the table and retreating.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1145	These symbols define a full state reset, reading `Accuracy_Log` bits.
				1146
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1147	Then, all remaining symbols, sorted in natural order, are allocated cells.
				1148	Starting from symbol `0` (if it exists), and table position `0`,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1149	each symbol gets allocated as many cells as its probability.
Dimitris Apostolou	ebbd675	2021-11-13 10:04:04 +0200	[diff] [blame]	1150	Cell allocation is spread, not linear :
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1151	each successor position follows this rule :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1152
				1153	```
				1154	position += (tableSize>>1) + (tableSize>>3) + 3;
				1155	position &= tableSize-1;
				1156	```
				1157
				1158	A position is skipped if already occupied by a "less than 1" probability symbol.
				1159	`position` does not reset between symbols; it simply iterates through
				1160	each position in the table, switching to the next symbol when enough
				1161	states have been allocated to the current one.
				1162
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1163	The process guarantees that the table is entirely filled.
				1164	Each cell corresponds to a state value, which contains the symbol being decoded.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1165
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1166	To add the `Number_of_Bits` and `Baseline` required to retrieve next state,
				1167	it's first necessary to sort all occurrences of each symbol in state order.
				1168	Lower states will need 1 more bit than higher ones.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1169	The process is repeated for each symbol.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1170
				1171	__Example__ :
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1172	Presuming a symbol has a probability of 5,
				1173	it receives 5 cells, corresponding to 5 state values.
				1174	These state values are then sorted in natural order.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1175
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1176	Next power of 2 after 5 is 8.
				1177	Space of probabilities must be divided into 8 equal parts.
				1178	Presuming the `Accuracy_Log` is 7, it defines a space of 128 states.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1179	Divided by 8, each share is 16 large.
				1180
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1181	In order to reach 8 shares, 8-5=3 lowest states will count "double",
				1182	doubling their shares (32 in width), hence requiring one more bit.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1183
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1184	`Baseline` is assigned starting from the higher states, which use fewer bits,
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1185	and increases at each state by that state's width, then resumes at the first state.
				1186	Each state thus covers a range of `width` values starting at its `Baseline`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1187
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1188	\| state value \| 1 \| 39 \| 77 \| 84 \| 122 \|
				1189	\| ---------------- \| ----- \| ----- \| ------ \| ---- \| ------ \|
				1190	\| state order \| 0 \| 1 \| 2 \| 3 \| 4 \|
				1191	\| width \| 32 \| 32 \| 32 \| 16 \| 16 \|
				1192	\| `Number_of_Bits` \| 5 \| 5 \| 5 \| 4 \| 4 \|
				1193	\| range number \| 2 \| 4 \| 6 \| 0 \| 1 \|
				1194	\| `Baseline` \| 32 \| 64 \| 96 \| 0 \| 16 \|
				1195	\| range \| 32-63 \| 64-95 \| 96-127 \| 0-15 \| 16-31 \|
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1196
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1197	During decoding, the next state value is determined from current state value,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1198	by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
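
The whole build rule (placing "less than 1" symbols, spreading the others, then assigning `Number_of_Bits` and `Baseline`) can be sketched as below. This is an illustrative sketch, not the reference implementation: the names `fse_cell`, `fse_highest_set_bit` and `build_decoding_table` are hypothetical, the caller must supply a table of `1 << accuracy_log` cells, and the limits of 256 symbols and an accuracy log of 15 are assumptions made to keep the types small.

```c
#include <stdint.h>

typedef struct {
    uint8_t  symbol;     /* Symbol decoded in this state */
    uint8_t  num_bits;   /* Number_of_Bits to read for the next state */
    uint16_t baseline;   /* Baseline added to the bits read */
} fse_cell;

/* Position of the highest set bit; v must be non-zero. */
int fse_highest_set_bit(uint32_t v)
{
    int pos = -1;
    while (v) { v >>= 1; pos++; }
    return pos;
}

/* Returns 0 on success, -1 if the distribution is not usable. */
int build_decoding_table(fse_cell *table, const short *dist,
                         int num_symbols, int accuracy_log)
{
    uint16_t next_desc[256];   /* per-symbol counter, starts at its cell count */

    if (accuracy_log < 0 || accuracy_log > 15) return -1;
    if (num_symbols < 1 || num_symbols > 256) return -1;
    const int size = 1 << accuracy_log;

    /* Points must total the table size; -1 counts as one point. */
    int total = 0;
    for (int s = 0; s < num_symbols; s++) {
        if (dist[s] < -1) return -1;
        total += (dist[s] == -1) ? 1 : dist[s];
    }
    if (total != size) return -1;

    /* "Less than 1" symbols: one cell each, from the end of the table. */
    int high = size;
    for (int s = 0; s < num_symbols; s++) {
        if (dist[s] == -1) {
            table[--high].symbol = (uint8_t)s;
            next_desc[s] = 1;
        }
    }

    /* Remaining symbols: spread over the free cells. */
    const int step = (size >> 1) + (size >> 3) + 3;
    int pos = 0;
    for (int s = 0; s < num_symbols; s++) {
        if (dist[s] <= 0) continue;
        next_desc[s] = (uint16_t)dist[s];
        for (int i = 0; i < dist[s]; i++) {
            table[pos].symbol = (uint8_t)s;
            do {
                pos = (pos + step) & (size - 1);
            } while (pos >= high);   /* skip "less than 1" cells */
        }
    }

    /* Visit states in order; each symbol's counter runs from p to 2p-1. */
    for (int state = 0; state < size; state++) {
        uint16_t desc = next_desc[table[state].symbol]++;
        int bits = accuracy_log - fse_highest_set_bit(desc);
        table[state].num_bits = (uint8_t)bits;
        table[state].baseline = (uint16_t)(((uint32_t)desc << bits) - (uint32_t)size);
    }
    return 0;
}
```

The per-symbol counter is one compact way to obtain the widths described above: for a symbol with probability 5 and an `Accuracy_Log` of 7, its states in increasing order see the counter values 5, 6, 7, 8, 9, which yield 5, 5, 5, 4, 4 bits and baselines 32, 64, 96, 0, 16, as in the example table.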
				1199
				1200	See [Appendix A] for the results of this process applied to the default distributions.
				1201
				1202	[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
				1203
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1204
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1205	Huffman Coding
				1206	--------------
				1207	Zstandard Huffman-coded streams are read backwards,
				1208	similar to the FSE bitstreams.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1209	Therefore, to find the start of the bitstream, it is required to
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1210	know the offset of the last byte of the Huffman-coded stream.
				1211
				1212	After writing the last bit containing information, the compressor
				1213	writes a single `1`-bit and then fills the byte with 0-7 `0` bits of
				1214	padding. The last byte of the compressed bitstream cannot be `0` for
				1215	that reason.
				1216
				1217	When decompressing, the last byte containing the padding is the first
				1218	byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
				1219	the first `1`-bit it encounters. Afterwards, the useful part of the bitstream
				1220	begins.
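
Locating the start of the useful bits can be sketched as follows; the function name `padding_bits` is hypothetical.

```c
#include <stdint.h>

/* Number of bits to skip at the top of the last byte of the stream:
 * the 0-7 padding `0` bits plus the single `1`-bit marker.
 * Returns -1 if the byte is 0, which is not a valid last byte. */
int padding_bits(uint8_t last_byte)
{
    if (last_byte == 0) return -1;

    int skipped = 1;                          /* the marker bit itself */
    for (uint8_t mask = 0x80; !(last_byte & mask); mask >>= 1)
        skipped++;                            /* one padding zero */
    return skipped;
}
```

For a stream of `size` bytes, the number of useful bits is then `size * 8 - padding_bits(last_byte)`.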
				1221
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1222	The bitstream contains Huffman-coded symbols in __little-endian__ order,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1223	with the codes defined by the method below.
				1224
				1225	### Huffman Tree Description
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1226
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1227	Prefix coding represents symbols from an a priori known alphabet
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	1228	by bit sequences (codewords), one codeword for each symbol,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1229	in a manner such that different symbols may be represented
				1230	by bit sequences of different lengths,
				1231	but a parser can always parse an encoded string
				1232	unambiguously symbol-by-symbol.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1233
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1234	Given an alphabet with known symbol frequencies,
				1235	the Huffman algorithm allows the construction of an optimal prefix code
				1236	using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1237
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1238	Prefix code must not exceed a maximum code length.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1239	More bits improve accuracy but cost more header size,
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	1240	and require more memory or more complex decoding operations.
				1241	This specification limits maximum code length to 11 bits.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1242
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1243	#### Representation
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1244
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1245	All literal values from zero (included) to last present one (excluded)
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1246	are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
				1247	Transformation from `Weight` to `Number_of_Bits` follows this formula :
				1248	```
				1249	Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
				1250	```
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1251	When a literal value is not present, it receives a `Weight` of 0.
				1252	The least frequent symbol receives a `Weight` of 1.
				1253	Consequently, the `Weight` 1 is necessarily present.
				1254	The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
				1255	The last symbol's `Weight` is deduced from previously retrieved Weights,
				1256	by completing to the nearest power of 2. It's necessarily non 0.
				1257	If it's not possible to reach a clean power of 2 with a single `Weight` value,
				1258	the Huffman Tree Description is considered invalid.
				1259	This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
Yann Collet	55a8f84	2018-09-05 12:25:35 -0700	[diff] [blame]	1260	`Max_Number_of_Bits` must be <= 11,
				1261	otherwise the representation is considered corrupted.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1262
				1263	__Example__ :
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	1264	Let's presume the following Huffman tree must be described :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1265
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1266	\| literal value \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1267	\| ---------------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				1268	\| `Number_of_Bits` \| 1 \| 2 \| 3 \| 0 \| 4 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1269
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1270	The tree depth is 4, since its longest elements use 4 bits
				1271	(the longest elements are the ones with the smallest frequency).
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1272	Literal value `5` will not be listed, as it can be determined from previous values 0-4,
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1273	nor will values above `5` as they are all 0.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1274	Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1275	Weight formula is :
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1276	```
				1277	Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
				1278	```
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1279	It gives the following series of weights :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1280
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1281	\| literal value \| 0 \| 1 \| 2 \| 3 \| 4 \|
				1282	\| ------------- \| --- \| --- \| --- \| --- \| --- \|
				1283	\| `Weight` \| 4 \| 3 \| 2 \| 0 \| 1 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1284
				1285	The decoder will do the inverse operation :
Yann Collet	55a8f84	2018-09-05 12:25:35 -0700	[diff] [blame]	1286	having collected weights of literal symbols from `0` to `4`,
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	1287	it knows the last literal, `5`, is present with a non-zero `Weight`.
				1288	The `Weight` of `5` can be determined by advancing to the next power of 2.
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1289	The sum of `2^(Weight-1)` (excluding 0's) is :
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1290	`8 + 4 + 2 + 0 + 1 = 15`.
Yann Collet	55a8f84	2018-09-05 12:25:35 -0700	[diff] [blame]	1291	Nearest larger power of 2 value is 16.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1292	Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.
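
The deduction of the last `Weight` and of `Max_Number_of_Bits` can be sketched as below. Function names are hypothetical, and rejecting a series in which every listed weight is 0 is an assumption of this sketch.

```c
#include <stdint.h>

/* From the listed weights, deduce the weight of the last (unlisted) symbol
 * and Max_Number_of_Bits. Returns 0 on success, -1 if the series is invalid. */
int complete_weights(const uint8_t *weights, int count,
                     int *last_weight, int *max_bits)
{
    uint32_t sum = 0;
    for (int i = 0; i < count; i++) {
        if (weights[i] > 11) return -1;
        if (weights[i] > 0) sum += 1u << (weights[i] - 1);
    }
    if (sum == 0) return -1;        /* assumption: treated as invalid here */

    /* Smallest power of 2 strictly greater than the sum. */
    int bits = 0;
    while ((1u << bits) <= sum) bits++;
    if (bits > 11) return -1;

    /* The gap must be filled by exactly one weight, so it must be a power of 2. */
    uint32_t rest = (1u << bits) - sum;
    if (rest & (rest - 1)) return -1;

    int log2_rest = 0;
    while ((1u << log2_rest) < rest) log2_rest++;

    *last_weight = log2_rest + 1;
    *max_bits = bits;
    return 0;
}

/* Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0 */
int number_of_bits(int weight, int max_bits)
{
    return weight ? (max_bits + 1 - weight) : 0;
}
```

Applied to the weights `4, 3, 2, 0, 1` of the example, this gives a sum of 15, a `Max_Number_of_Bits` of 4, and a last `Weight` of 1.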

#### Huffman Tree header

This is a single byte value (0-255),
which describes how the series of weights is encoded.

- if `headerByte` < 128 :
  the series of weights is compressed using FSE (see below).
  The length of the FSE-compressed series is equal to `headerByte` (0-127).

- if `headerByte` >= 128 :
  + the series of weights uses a direct representation,
    where each `Weight` is encoded directly as a 4 bits field (0-15).
  + They are encoded forward, 2 weights to a byte,
    first weight taking the top four bits and second one taking the bottom four.
    * e.g. the following operations could be used to read the weights:
      `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.
  + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes,
    meaning it uses only full bytes even if `Number_of_Weights` is odd.
  + `Number_of_Weights = headerByte - 127`.
    * Note that maximum `Number_of_Weights` is 255-127 = 128,
      therefore, only up to 128 `Weight` can be encoded using direct representation.
    * Since the last non-zero `Weight` is _not_ encoded,
      this scheme is compatible with alphabet sizes of up to 129 symbols,
      hence including literal symbol 128.
    * If any literal symbol > 128 has a non-zero `Weight`,
      direct representation is not possible.
      In such case, it's necessary to use FSE compression.
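As an illustration of the direct representation, the following Python sketch unpacks the weights that follow a `headerByte` of 128 or more. The function name `read_direct_weights` and its return shape are assumptions made for this example.

```python
def read_direct_weights(header_byte, payload):
    """Unpack 4-bit weights stored two per byte, top nibble first.

    Returns (weights, bytes_consumed). `payload` starts right after headerByte.
    """
    if header_byte < 128:
        raise ValueError("headerByte < 128 means the weights are FSE-compressed")
    number_of_weights = header_byte - 127
    # Ceiling(Number_of_Weights / 2): only full bytes are used.
    bytes_consumed = (number_of_weights + 1) // 2
    if len(payload) < bytes_consumed:
        raise ValueError("truncated weight description")
    weights = []
    for index in range(number_of_weights):
        byte = payload[index // 2]
        # Even positions take the top four bits, odd positions the bottom four.
        weights.append(byte >> 4 if index % 2 == 0 else byte & 0x0F)
    return weights, bytes_consumed
```

For instance, a `headerByte` of 132 announces 5 weights stored in 3 bytes; the unused low nibble of the third byte is ignored by this sketch.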


#### Finite State Entropy (FSE) compression of Huffman weights

In this case, the series of Huffman weights is compressed using FSE compression.
It's a single bitstream with 2 interleaved states,
sharing a single distribution table.

To decode an FSE bitstream, it is necessary to know its compressed size.
Compressed size is provided by `headerByte`.
It's also necessary to know its _maximum possible_ decompressed size,
which is `255`, since literal values span from `0` to `255`,
and last symbol's `Weight` is not represented.

An FSE bitstream starts with a header describing the probability distribution,
from which the decoder builds a Decoding Table.
For a list of Huffman weights, the maximum accuracy log is 6 bits.
For more details, see the [FSE header description](#fse-table-description).

The Huffman header compression uses 2 states,
which share the same FSE distribution table.
The first state (`State1`) encodes the even indexed symbols,
and the second (`State2`) encodes the odd indexed symbols.
`State1` is initialized first, and then `State2`, and they take turns
decoding a single symbol and updating their state.
For more details on these FSE operations, see the [FSE section](#fse).

The number of symbols to decode is determined
by tracking the bitstream overflow condition:
if updating a state after decoding a symbol would require more bits than
remain in the stream, the extra bits are assumed to be 0. Then,
symbols for each of the final states are decoded and the process is complete.
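The alternation between the two states can be sketched as below. This is a simplified illustration under stated assumptions, not the reference decoder: the decoding table is assumed to be already built, with `table[state]` holding a hypothetical `(symbol, number_of_bits, baseline)` tuple, and the stream is read backward from its final-bit-flag as described for Huffman-coded streams. After the overflow is detected, the sketch emits the symbol still held by the other state, which is one reading of the termination rule.

```python
def decode_interleaved_fse(data, table, accuracy_log):
    """Decode symbols from a backward-read bitstream using two FSE states."""
    if not data or data[-1] == 0:
        raise ValueError("last byte must contain the final-bit-flag")
    value = int.from_bytes(data, "little")
    offset = value.bit_length() - 1  # bit position of the final-bit-flag

    def read_bits(count):
        nonlocal offset
        offset -= count
        mask = (1 << count) - 1
        if offset >= 0:
            return (value >> offset) & mask
        # Bits requested past the start of the stream are treated as 0.
        return (value << -offset) & mask

    # State1 is initialized first, then State2.
    states = [read_bits(accuracy_log), read_bits(accuracy_log)]
    if offset < 0:
        raise ValueError("stream too short to initialize both states")
    out = []
    current = 0
    while True:
        symbol, number_of_bits, baseline = table[states[current]]
        out.append(symbol)
        states[current] = baseline + read_bits(number_of_bits)
        if offset < 0:
            # Overflow: flush the symbol held by the other state and stop.
            out.append(table[states[1 - current]][0])
            return out
        current = 1 - current
```

Using Python's arbitrary-precision integers keeps the bit reader short; a production decoder would instead work on fixed-size words and validate the table and the output size (at most 255 weights).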

#### Conversion from weights to Huffman prefix codes

All present symbols shall now have a `Weight` value.
It is possible to transform weights into `Number_of_Bits`, using this formula:
```
Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
```
Symbols are sorted by `Weight`.
Within same `Weight`, symbols keep natural sequential order.
Symbols with a `Weight` of zero are removed.
Then, starting from lowest `Weight`, prefix codes are distributed in sequential order.

__Example__ :
Let's presume the following list of weights has been decoded :

| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
| -------- | --- | --- | --- | --- | --- | --- |
| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |

Sorted by weight and then natural sequential order,
it gives the following distribution :

| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
| ---------------- | --- | --- | --- | --- | --- | ---- |
| `Weight` | 0 | 1 | 1 | 2 | 3 | 4 |
| `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 |
| prefix codes | N/A | 0000 | 0001 | 001 | 01 | 1 |
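The conversion can be sketched in Python as follows. The function name `build_prefix_codes` is hypothetical, and codes are returned as bit strings purely for readability.

```python
def build_prefix_codes(weights):
    """Map each present symbol to its prefix code, given the full weight list."""
    total = sum(1 << (w - 1) for w in weights if w > 0)
    # With all weights known, the sum of 2^(Weight-1) must be a power of 2.
    if total < 2 or total & (total - 1):
        raise ValueError("weights do not describe a complete Huffman tree")
    max_number_of_bits = total.bit_length() - 1
    # Drop zero weights, then sort by weight and natural symbol order.
    present = [s for s, w in enumerate(weights) if w > 0]
    present.sort(key=lambda s: (weights[s], s))
    codes = {}
    code = 0
    previous_bits = None
    for symbol in present:
        number_of_bits = max_number_of_bits + 1 - weights[symbol]
        if previous_bits is not None:
            # Moving to a shorter code length drops the low-order bits.
            code >>= previous_bits - number_of_bits
        codes[symbol] = format(code, "0{}b".format(number_of_bits))
        code += 1
        previous_bits = number_of_bits
    return codes
```

Calling `build_prefix_codes([4, 3, 2, 0, 1, 1])` reproduces the prefix codes of the example table: `0000` and `0001` for literals 4 and 5, `001` for 2, `01` for 1, and `1` for 0.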

### Huffman-coded Streams

Given a Huffman decoding table,
it's possible to decode a Huffman-coded stream.

Each bitstream must be read _backward_,
that is starting from the end down to the beginning.
Therefore it's necessary to know the size of each bitstream.

It's also necessary to know exactly which _bit_ is the last one.
This is detected by a final bit flag :
the highest set bit of the last byte is a final-bit-flag.
Consequently, a last byte of `0` is not possible.
And the final-bit-flag itself is not part of the useful bitstream.
Hence, the last byte contains between 0 and 7 useful bits.

Starting from the end,
it's possible to read the bitstream in a __little-endian__ fashion,
keeping track of already used bits. Since the bitstream is encoded in reverse
order, reading it from the end yields the symbols in forward order.

For example, if the literal sequence "0145" was encoded using above prefix code,
it would be encoded (in reverse order) as:

|Symbol  |   5  |   4  |  1 | 0 | Padding |
|--------|------|------|----|---|---------|
|Encoding|`0001`|`0000`|`01`|`1`| `00001` |

Resulting in following 2-bytes bitstream :
```
00000001 00001101
```

Here is an alternative representation with the symbol codes separated by underscore:
```
0000_0001 00001_1_01
```

Reading the highest `Max_Number_of_Bits` bits,
it's possible to compare the extracted value to the decoding table,
determining the symbol to decode and the number of bits to discard.

The process continues up to reading the required number of symbols per stream.
A bitstream must be entirely and exactly consumed,
reaching exactly its beginning position with _all_ bits consumed;
otherwise, the decoding process is considered faulty.
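The backward reading procedure can be illustrated with a small round trip in Python. This is a sketch under stated assumptions: `PREFIX_CODES` copies the codes of the prefix-code table from the weights example (written most significant bit first), the helper names are hypothetical, and a real decoder would read machine words rather than one large integer.

```python
PREFIX_CODES = {0: "1", 1: "01", 2: "001", 4: "0000", 5: "0001"}

def encode_stream(symbols, codes):
    """Write symbols in reverse order, then append the final-bit-flag."""
    value, used = 0, 0
    for symbol in reversed(symbols):
        code = codes[symbol]
        value |= int(code, 2) << used
        used += len(code)
    value |= 1 << used  # final-bit-flag, just above the last useful bit
    return value.to_bytes(used // 8 + 1, "little")

def decode_stream(data, codes, count):
    """Read `count` symbols backward from the final-bit-flag."""
    if not data or data[-1] == 0:
        raise ValueError("last byte must contain the final-bit-flag")
    max_bits = max(len(code) for code in codes.values())
    mask = (1 << max_bits) - 1
    # Decoding table indexed by the next max_bits bits of the stream.
    table = [None] * (1 << max_bits)
    for symbol, code in codes.items():
        pad = max_bits - len(code)
        first = int(code, 2) << pad
        for index in range(first, first + (1 << pad)):
            table[index] = (symbol, len(code))
    value = int.from_bytes(data, "little")
    offset = value.bit_length() - 1  # number of useful bits remaining
    out = []
    for _ in range(count):
        if offset >= max_bits:
            peek = (value >> (offset - max_bits)) & mask
        else:
            peek = (value << (max_bits - offset)) & mask  # pad with zeros
        symbol, length = table[peek]
        offset -= length
        if offset < 0:
            raise ValueError("bitstream exhausted before all symbols were read")
        out.append(symbol)
    if offset != 0:
        raise ValueError("bitstream not exactly consumed")
    return out
```

For example, `decode_stream(encode_stream([0, 1, 4, 5], PREFIX_CODES), PREFIX_CODES, 4)` returns `[0, 1, 4, 5]`, and decoding the same bytes with a count of 3 raises an error because the stream is not exactly consumed.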


Dictionary Format
-----------------

Zstandard is compatible with "raw content" dictionaries,
free of any format restriction, except that they must be at least 8 bytes.
These dictionaries function as if they were just the `Content` part
of a formatted dictionary.

But dictionaries created by `zstd --train` follow a format, described here.

__Pre-requisites__ : a dictionary has a size,
defined either by a buffer limit, or a file size.

| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
| -------------- | --------------- | ---------------- | --------- |

__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format

__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format.
`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
It's used by decoders to check if they use the correct dictionary.

_Reserved ranges :_
If the dictionary is going to be distributed in a public environment,
the following ranges of `Dictionary_ID` are reserved for some future registrar
and shall not be used :

- low range : <= 32767
- high range : >= (2^31)

Outside of these ranges, any value of `Dictionary_ID`
which is both `>= 32768` and `< (1<<31)` can be used freely,
even in public environment.
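A minimal Python sketch of reading the first two fields is given below. The function names are hypothetical, and treating any buffer that does not start with the magic number as a "raw content" dictionary is an assumption of this sketch rather than a rule stated here.

```python
import struct

DICTIONARY_MAGIC_NUMBER = 0xEC30A437

def read_dictionary_id(buffer):
    """Return the Dictionary_ID, or None for a raw content dictionary."""
    if len(buffer) < 8:
        raise ValueError("a dictionary must be at least 8 bytes")
    # Both fields are 4 bytes, little-endian.
    magic, dictionary_id = struct.unpack_from("<II", buffer, 0)
    if magic != DICTIONARY_MAGIC_NUMBER:
        return None  # assumption: no magic number means raw content
    if dictionary_id == 0:
        raise ValueError("Dictionary_ID 0 means no Dictionary_ID")
    return dictionary_id

def is_reserved_dictionary_id(dictionary_id):
    """True when the ID falls in a range reserved for a future registrar."""
    return dictionary_id <= 32767 or dictionary_id >= (1 << 31)
```

A tool that generates dictionaries for public distribution could use `is_reserved_dictionary_id` to reject IDs outside the freely usable range `32768` to `(1<<31) - 1`.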


__`Entropy_Tables`__ : follow the same format as tables in [compressed blocks].
See the relevant [FSE](#fse-table-description)
and [Huffman](#huffman-tree-description) sections for how to decode these tables.
They are stored in the following order :
Huffman tables for literals, FSE table for offsets,
FSE table for match lengths, and FSE table for literals lengths.
These tables populate the Repeat Stats literals mode and
Repeat distribution mode for sequence decoding.
The tables are finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes.
Each recent offset must have a value <= dictionary content size, and cannot equal 0.
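The recent offsets can be read and validated as in the sketch below. It assumes the caller has already decoded the four entropy tables and knows the byte position `position` at which they end; that parameter and the function name are hypothetical.

```python
import struct

def read_recent_offsets(buffer, position):
    """Read the 3 recent offsets stored after the entropy tables.

    Returns (offsets, content_start).
    """
    if len(buffer) < position + 12:
        raise ValueError("dictionary too short for the 3 recent offsets")
    # Three 4-byte little-endian values, stored in order.
    offsets = struct.unpack_from("<III", buffer, position)
    content_start = position + 12
    content_size = len(buffer) - content_start
    for offset in offsets:
        if offset == 0 or offset > content_size:
            raise ValueError("recent offset must be in 1..content size")
    return list(offsets), content_start
```

Everything from `content_start` to the end of the buffer is then the `Content` of the dictionary.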

__`Content`__ : The rest of the dictionary is its content.
The content acts as a "past" in front of data to compress or decompress,
so it can be referenced in sequence commands.
As long as the amount of data decoded from this frame is less than or
equal to `Window_Size`, sequence commands may specify offsets longer
than the total length of decoded output so far to reference back to the
dictionary, even parts of the dictionary with offsets larger than `Window_Size`.
After the total output has surpassed `Window_Size` however,
this is no longer allowed and the dictionary is no longer accessible.
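One way to express this rule is as an upper bound on match offsets. The sketch below is an interpretation, not normative text: the function name is hypothetical, and it assumes that once the dictionary is out of reach, offsets are limited to `Window_Size`.

```python
def max_allowed_offset(decoded_so_far, window_size, dictionary_content_size):
    """Largest offset a sequence may use at this point of the frame."""
    if decoded_so_far <= window_size:
        # The whole dictionary content is still reachable behind the output.
        return decoded_so_far + dictionary_content_size
    # Output has surpassed Window_Size: the dictionary is no longer accessible.
    return window_size
```

A decoder following this sketch would treat any sequence whose offset exceeds `max_allowed_offset(...)` as corrupted input.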

[compressed blocks]: #the-format-of-compressed_block

If a dictionary is provided by an external source,
it should be loaded with great care, its content considered untrusted.



Appendix A - Decoding tables for predefined codes
-------------------------------------------------

This appendix contains FSE decoding tables
for the predefined literal length, match length, and offset codes.
The tables have been constructed using the algorithm as given above in chapter
"from normalized distribution to decoding tables".
The tables here can be used as examples
to crosscheck that an implementation builds its decoding tables correctly.

#### Literal Length Code:

| State | Symbol | Number_Of_Bits | Base |
| ----- | ------ | -------------- | ---- |
| 0 | 0 | 4 | 0 |
| 1 | 0 | 4 | 16 |
| 2 | 1 | 5 | 32 |
| 3 | 3 | 5 | 0 |
| 4 | 4 | 5 | 0 |
| 5 | 6 | 5 | 0 |
| 6 | 7 | 5 | 0 |
| 7 | 9 | 5 | 0 |
| 8 | 10 | 5 | 0 |
| 9 | 12 | 5 | 0 |
| 10 | 14 | 6 | 0 |
| 11 | 16 | 5 | 0 |
| 12 | 18 | 5 | 0 |
| 13 | 19 | 5 | 0 |
| 14 | 21 | 5 | 0 |
| 15 | 22 | 5 | 0 |
| 16 | 24 | 5 | 0 |
| 17 | 25 | 5 | 32 |
| 18 | 26 | 5 | 0 |
| 19 | 27 | 6 | 0 |
| 20 | 29 | 6 | 0 |
| 21 | 31 | 6 | 0 |
| 22 | 0 | 4 | 32 |
| 23 | 1 | 4 | 0 |
| 24 | 2 | 5 | 0 |
| 25 | 4 | 5 | 32 |
| 26 | 5 | 5 | 0 |
| 27 | 7 | 5 | 32 |
| 28 | 8 | 5 | 0 |
| 29 | 10 | 5 | 32 |
| 30 | 11 | 5 | 0 |
| 31 | 13 | 6 | 0 |
| 32 | 16 | 5 | 32 |
| 33 | 17 | 5 | 0 |
| 34 | 19 | 5 | 32 |
| 35 | 20 | 5 | 0 |
| 36 | 22 | 5 | 32 |
| 37 | 23 | 5 | 0 |
| 38 | 25 | 4 | 0 |
| 39 | 25 | 4 | 16 |
| 40 | 26 | 5 | 32 |
| 41 | 28 | 6 | 0 |
| 42 | 30 | 6 | 0 |
| 43 | 0 | 4 | 48 |
| 44 | 1 | 4 | 16 |
| 45 | 2 | 5 | 32 |
| 46 | 3 | 5 | 32 |
| 47 | 5 | 5 | 32 |
| 48 | 6 | 5 | 32 |
| 49 | 8 | 5 | 32 |
| 50 | 9 | 5 | 32 |
| 51 | 11 | 5 | 32 |
| 52 | 12 | 5 | 32 |
| 53 | 15 | 6 | 0 |
| 54 | 17 | 5 | 32 |
| 55 | 18 | 5 | 32 |
| 56 | 20 | 5 | 32 |
| 57 | 21 | 5 | 32 |
| 58 | 23 | 5 | 32 |
| 59 | 24 | 5 | 32 |
| 60 | 35 | 6 | 0 |
| 61 | 34 | 6 | 0 |
| 62 | 33 | 6 | 0 |
| 63 | 32 | 6 | 0 |

#### Match Length Code:

| State | Symbol | Number_Of_Bits | Base |
| ----- | ------ | -------------- | ---- |
| 0 | 0 | 6 | 0 |
| 1 | 1 | 4 | 0 |
| 2 | 2 | 5 | 32 |
| 3 | 3 | 5 | 0 |
| 4 | 5 | 5 | 0 |
| 5 | 6 | 5 | 0 |
| 6 | 8 | 5 | 0 |
| 7 | 10 | 6 | 0 |
| 8 | 13 | 6 | 0 |
| 9 | 16 | 6 | 0 |
| 10 | 19 | 6 | 0 |
| 11 | 22 | 6 | 0 |
| 12 | 25 | 6 | 0 |
| 13 | 28 | 6 | 0 |
| 14 | 31 | 6 | 0 |
| 15 | 33 | 6 | 0 |
| 16 | 35 | 6 | 0 |
| 17 | 37 | 6 | 0 |
| 18 | 39 | 6 | 0 |
| 19 | 41 | 6 | 0 |
| 20 | 43 | 6 | 0 |
| 21 | 45 | 6 | 0 |
| 22 | 1 | 4 | 16 |
| 23 | 2 | 4 | 0 |
| 24 | 3 | 5 | 32 |
| 25 | 4 | 5 | 0 |
| 26 | 6 | 5 | 32 |
| 27 | 7 | 5 | 0 |
| 28 | 9 | 6 | 0 |
| 29 | 12 | 6 | 0 |
| 30 | 15 | 6 | 0 |
| 31 | 18 | 6 | 0 |
| 32 | 21 | 6 | 0 |
| 33 | 24 | 6 | 0 |
| 34 | 27 | 6 | 0 |
| 35 | 30 | 6 | 0 |
| 36 | 32 | 6 | 0 |
| 37 | 34 | 6 | 0 |
| 38 | 36 | 6 | 0 |
| 39 | 38 | 6 | 0 |
| 40 | 40 | 6 | 0 |
| 41 | 42 | 6 | 0 |
| 42 | 44 | 6 | 0 |
| 43 | 1 | 4 | 32 |
| 44 | 1 | 4 | 48 |
| 45 | 2 | 4 | 16 |
| 46 | 4 | 5 | 32 |
| 47 | 5 | 5 | 32 |
| 48 | 7 | 5 | 32 |
| 49 | 8 | 5 | 32 |
| 50 | 11 | 6 | 0 |
| 51 | 14 | 6 | 0 |
| 52 | 17 | 6 | 0 |
| 53 | 20 | 6 | 0 |
| 54 | 23 | 6 | 0 |
| 55 | 26 | 6 | 0 |
| 56 | 29 | 6 | 0 |
| 57 | 52 | 6 | 0 |
| 58 | 51 | 6 | 0 |
| 59 | 50 | 6 | 0 |
| 60 | 49 | 6 | 0 |
| 61 | 48 | 6 | 0 |
| 62 | 47 | 6 | 0 |
| 63 | 46 | 6 | 0 |

#### Offset Code:

| State | Symbol | Number_Of_Bits | Base |
| ----- | ------ | -------------- | ---- |
| 0 | 0 | 5 | 0 |
| 1 | 6 | 4 | 0 |
| 2 | 9 | 5 | 0 |
| 3 | 15 | 5 | 0 |
| 4 | 21 | 5 | 0 |
| 5 | 3 | 5 | 0 |
| 6 | 7 | 4 | 0 |
| 7 | 12 | 5 | 0 |
| 8 | 18 | 5 | 0 |
| 9 | 23 | 5 | 0 |
| 10 | 5 | 5 | 0 |
| 11 | 8 | 4 | 0 |
| 12 | 14 | 5 | 0 |
| 13 | 20 | 5 | 0 |
| 14 | 2 | 5 | 0 |
| 15 | 7 | 4 | 16 |
| 16 | 11 | 5 | 0 |
| 17 | 17 | 5 | 0 |
| 18 | 22 | 5 | 0 |
| 19 | 4 | 5 | 0 |
| 20 | 8 | 4 | 16 |
| 21 | 13 | 5 | 0 |
| 22 | 19 | 5 | 0 |
| 23 | 1 | 5 | 0 |
| 24 | 6 | 4 | 16 |
| 25 | 10 | 5 | 0 |
| 26 | 16 | 5 | 0 |
| 27 | 28 | 5 | 0 |
| 28 | 27 | 5 | 0 |
| 29 | 26 | 5 | 0 |
| 30 | 25 | 5 | 0 |
| 31 | 24 | 5 | 0 |
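As an example of such a crosscheck, the Python sketch below rebuilds the offset code table from its normalized distribution. It is an illustration, not the reference implementation: the function name is hypothetical, and `OFFSET_DEFAULT_DISTRIBUTION` (accuracy log 5) is taken from the default distributions defined earlier in the specification, which are not reproduced in this appendix.

```python
OFFSET_DEFAULT_DISTRIBUTION = [1, 1, 1, 1, 1, 1, 2, 2, 2] + [1] * 15 + [-1] * 5

def build_fse_decoding_table(distribution, accuracy_log):
    """Return a list of (symbol, number_of_bits, baseline), indexed by state."""
    size = 1 << accuracy_log
    symbols = [0] * size
    next_state = []
    high_threshold = size
    # "Less than 1" probabilities (-1) take single cells from the table end.
    for symbol, probability in enumerate(distribution):
        if probability == -1:
            high_threshold -= 1
            symbols[high_threshold] = symbol
            next_state.append(1)
        else:
            next_state.append(probability)
    # Spread the remaining symbols across the table with a fixed step.
    step = (size >> 1) + (size >> 3) + 3
    mask = size - 1
    position = 0
    for symbol, probability in enumerate(distribution):
        for _ in range(max(probability, 0)):
            symbols[position] = symbol
            position = (position + step) & mask
            while position >= high_threshold:
                position = (position + step) & mask
    # Assign the number of bits and baseline of each state, in state order.
    table = []
    for symbol in symbols:
        x = next_state[symbol]
        next_state[symbol] += 1
        number_of_bits = accuracy_log - (x.bit_length() - 1)
        table.append((symbol, number_of_bits, (x << number_of_bits) - size))
    return table
```

Comparing `build_fse_decoding_table(OFFSET_DEFAULT_DISTRIBUTION, 5)` row by row with the offset code table above gives, for instance, `(6, 4, 16)` for state 24 and `(24, 5, 0)` for state 31.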



Appendix B - Resources for implementers
-------------------------------------------------

An open source reference implementation is available on :
https://github.com/facebook/zstd

The project contains a frame generator, called [decodeCorpus],
which can be used by any 3rd-party implementation
to verify that a tested decoder is compliant with the specification.

[decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing

`decodeCorpus` generates random valid frames.
A compliant decoder should be able to decode them all,
or at least provide a meaningful error code explaining for which reason it cannot
(memory limit restrictions for example).


Version changes
---------------
- 0.3.9 : clarifications for Huffman-compressed literal sizes.
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
- 0.3.6 : clarifications for Dictionary_ID
- 0.3.5 : clarifications for Block_Maximum_Size
- 0.3.4 : clarifications for FSE decoding table
- 0.3.3 : clarifications for field Block_Size
- 0.3.2 : remove additional block size restriction on compressed blocks
- 0.3.1 : minor clarification regarding offset history update rules
- 0.3.0 : minor edits to match RFC8478
- 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz
- 0.2.8 : clarifications for IETF RFC discuss
- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
- 0.2.5 : minor typos and clarifications
- 0.2.4 : section restructuring, by Sean Purcell
- 0.2.3 : clarified several details, by Sean Purcell
- 0.2.2 : added predefined codes, by Johannes Rudolph
- 0.2.1 : clarify field names, by Przemyslaw Skibinski
- 0.2.0 : numerous format adjustments for zstd v0.8+
- 0.1.2 : limit Huffman tree depth to 11 bits
- 0.1.1 : reserved dictID ranges
- 0.1.0 : initial release