Zstandard Compression Format
============================

### Notices

Copyright (c) Meta Platforms, Inc. and affiliates.

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

0.4.0 (2023-06-05)


Introduction
------------

The purpose of this document is to define a lossless compressed data format
that is independent of CPU type, operating system,
file system and character set, suitable for
file compression, pipe and streaming compression,
using the [Zstandard algorithm](https://facebook.github.io/zstd/).
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the Zstandard compression method,
and an optional [xxHash-64 checksum method](https://cyan4973.github.io/xxHash/),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It doesn’t need to support all options though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore informative fields, such as the checksum.
Whenever it does not support a parameter defined in the compressed stream,
it must produce a non-ambiguous error code and associated error message
explaining which parameter is unsupported.

This specification is intended for use by implementers of software
to compress data into Zstandard format and/or decompress data from Zstandard format.
The Zstandard format is supported by an open source reference implementation,
written in portable C, and available at <https://github.com/facebook/zstd>.


### Overall conventions
In this document:
- square brackets, i.e. `[` and `]`, are used to indicate optional fields or parameters.
- the naming convention for identifiers is `Mixed_Case_With_Underscores`

### Definitions
Content compressed by Zstandard is transformed into a Zstandard __frame__.
Multiple frames can be appended into a single file or stream.
A frame is completely independent, has a defined beginning and end,
and a set of parameters which tells the decoder how to decompress it.

A frame encapsulates one or multiple __blocks__.
Each block contains arbitrary content, which is described by its header,
and has a guaranteed maximum content size, which depends on frame parameters.
Unlike frames, each block depends on previous blocks for proper decoding.
However, each block can be decompressed without waiting for its successor,
allowing streaming operations.

Overview
---------
- [Frames](#frames)
- [Zstandard frames](#zstandard-frames)
- [Blocks](#blocks)
- [Literals Section](#literals-section)
- [Sequences Section](#sequences-section)
- [Sequence Execution](#sequence-execution)
- [Skippable frames](#skippable-frames)
- [Entropy Encoding](#entropy-encoding)
- [FSE](#fse)
- [Huffman Coding](#huffman-coding)
- [Dictionary Format](#dictionary-format)

Frames
------
Zstandard compressed data is made of one or more __frames__.
Each frame is independent and can be decompressed independently of other frames.
The decompressed content of multiple concatenated frames is the concatenation of
each frame's decompressed content.

There are two frame formats defined by Zstandard:
Zstandard frames and Skippable frames.
Zstandard frames contain compressed data, while
skippable frames contain custom user metadata.

## Zstandard frames
The structure of a single Zstandard frame is as follows:

| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] |
|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:|
|  4 bytes       |  2-14 bytes    |  n bytes   |                    |     0-4 bytes        |

__`Magic_Number`__

4 Bytes, __little-endian__ format.
Value: 0xFD2FB528
Note: This value was selected to be unlikely to appear at the beginning of an arbitrary file.
It avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.),
contains byte values outside of ASCII range,
and doesn't map into UTF8 space.
It reduces the chances that a text file represents this value by accident.
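
As an illustration of the little-endian convention, the following Python sketch checks whether a buffer starts with the magic number. The function and constant names are hypothetical and not part of this specification; only the value `0xFD2FB528` and its byte order come from the text above.

```python
import struct

# Value given in this specification; the constant name is illustrative.
ZSTD_MAGIC_NUMBER = 0xFD2FB528


def starts_with_zstd_magic(data: bytes) -> bool:
    """Return True if `data` begins with the Zstandard frame magic number."""
    if len(data) < 4:
        return False
    # "<I" reads an unsigned 32-bit integer in little-endian byte order.
    (value,) = struct.unpack_from("<I", data, 0)
    return value == ZSTD_MAGIC_NUMBER


# Because the field is little-endian, the bytes appear in the stream
# as 28 B5 2F FD, not FD 2F B5 28.
print(struct.pack("<I", ZSTD_MAGIC_NUMBER).hex())  # 28b52ffd
```

A buffer holding the bytes in big-endian order (`FD 2F B5 28`) is therefore not recognized as a Zstandard frame by this check.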

__`Frame_Header`__

2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header).

__`Data_Block`__

Detailed in [`Blocks`](#blocks).
That’s where compressed data is stored.

__`Content_Checksum`__

An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
The content checksum is the result
of the [xxh64() hash function](https://cyan4973.github.io/xxHash/)
digesting the original (decoded) data as input, and a seed of zero.
The low 4 bytes of the checksum are stored in __little-endian__ format.
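
The truncation and storage step can be sketched in Python as follows. This sketch assumes the 64-bit XXH64 digest (seed 0) of the decoded data has already been computed by some XXH64 implementation, which is not shown here; the function names and the sample digest value are purely illustrative.

```python
import struct


def encode_content_checksum(xxh64_digest: int) -> bytes:
    """Return the 4-byte Content_Checksum field for a 64-bit XXH64 digest."""
    low32 = xxh64_digest & 0xFFFFFFFF   # keep only the low 4 bytes
    return struct.pack("<I", low32)     # store them in little-endian order


def checksum_matches(xxh64_digest: int, stored_field: bytes) -> bool:
    """Compare a digest computed by the decoder with the stored 4-byte field."""
    return encode_content_checksum(xxh64_digest) == stored_field


# Arbitrary example digest (not the hash of any particular input).
example_digest = 0x1122334455667788
print(encode_content_checksum(example_digest).hex())  # 88776655
```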

### `Frame_Header`

The `Frame_Header` has a variable size, with a minimum of 2 bytes,
and up to 14 bytes depending on optional parameters.
The structure of `Frame_Header` is as follows:

| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
| ------------------------- | --------------------- | ----------------- | ---------------------- |
| 1 byte                    | 0-1 byte              | 0-4 bytes         | 0-8 bytes              |

#### `Frame_Header_Descriptor`

The first byte of the header is called the `Frame_Header_Descriptor`.
It describes which other fields are present.
Decoding this byte is enough to tell the size of `Frame_Header`.

| Bit number | Field name                |
| ---------- | ------------------------- |
| 7-6        | `Frame_Content_Size_flag` |
| 5          | `Single_Segment_flag`     |
| 4          | `Unused_bit`              |
| 3          | `Reserved_bit`            |
| 2          | `Content_Checksum_flag`   |
| 1-0        | `Dictionary_ID_flag`      |

In this table, bit 7 is the highest bit, while bit 0 is the lowest one.

__`Frame_Content_Size_flag`__

This is a 2-bit flag (`= Frame_Header_Descriptor >> 6`),
specifying if `Frame_Content_Size` (the decompressed data size)
is provided within the header.
`Flag_Value` provides `FCS_Field_Size`,
which is the number of bytes used by `Frame_Content_Size`
according to the following table:

| `Flag_Value`   |    0   |  1  |  2  |  3  |
| -------------- | ------ | --- | --- | --- |
|`FCS_Field_Size`| 0 or 1 |  2  |  4  |  8  |

When `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag`:
if `Single_Segment_flag` is set, `FCS_Field_Size` is 1.
Otherwise, `FCS_Field_Size` is 0: `Frame_Content_Size` is not provided.
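
The table and the special case for `Flag_Value` 0 can be expressed as a short Python sketch. The function name is hypothetical; the bit positions and sizes are those given above.

```python
def fcs_field_size(frame_header_descriptor: int) -> int:
    """Return the number of bytes used by Frame_Content_Size (0, 1, 2, 4 or 8)."""
    flag_value = (frame_header_descriptor >> 6) & 0x3      # bits 7-6
    single_segment = (frame_header_descriptor >> 5) & 0x1  # bit 5
    if flag_value == 0:
        # Special case: the size depends on Single_Segment_flag.
        return 1 if single_segment else 0
    return {1: 2, 2: 4, 3: 8}[flag_value]
```

For example, a descriptor of `0x20` (only `Single_Segment_flag` set) yields a 1-byte field, while `0x00` means the content size is not provided.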

__`Single_Segment_flag`__

If this flag is set,
data must be regenerated within a single continuous memory segment.

In this case, `Window_Descriptor` byte is skipped,
but `Frame_Content_Size` is necessarily present.
As a consequence, the decoder must allocate a memory segment
of size equal to or larger than `Frame_Content_Size`.

In order to protect the decoder from unreasonable memory requirements,
a decoder is allowed to reject a compressed frame
which requests a memory size beyond the decoder's authorized range.

For broader compatibility, decoders are recommended to support
memory sizes of at least 8 MB.
This is only a recommendation:
each decoder is free to support higher or lower limits,
depending on local limitations.

__`Unused_bit`__

A decoder compliant with this specification version shall not interpret this bit.
It might be used in any future version,
to signal a property that is not needed to properly decode the frame.
An encoder compliant with this specification version must set this bit to zero.

__`Reserved_bit`__

This bit is reserved for some future feature.
Its value _must be zero_.
A decoder compliant with this specification version must ensure it is not set.
This bit may be used in a future revision,
to signal a feature that must be interpreted to decode the frame correctly.
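
The different treatment of the two bits on the decoder side can be sketched as follows. The function name and the choice of `ValueError` are assumptions made for illustration; the specification only requires that a set `Reserved_bit` be detected and that `Unused_bit` not be interpreted.

```python
UNUSED_BIT_MASK = 0x10    # bit 4
RESERVED_BIT_MASK = 0x08  # bit 3


def check_frame_header_descriptor(frame_header_descriptor: int) -> None:
    """Raise ValueError if Reserved_bit is set; deliberately ignore Unused_bit."""
    if frame_header_descriptor & RESERVED_BIT_MASK:
        raise ValueError("Reserved_bit is set: frame uses an unsupported feature")
    # Unused_bit (UNUSED_BIT_MASK) is intentionally not inspected:
    # a decoder must not interpret it, whatever its value.
```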

__`Content_Checksum_flag`__

If this flag is set, a 32-bit `Content_Checksum` will be present at the frame's end.
See `Content_Checksum` paragraph.

__`Dictionary_ID_flag`__

This is a 2-bit flag (`= Frame_Header_Descriptor & 3`),
telling if a dictionary ID is provided within the header.
It also specifies the size of this field as `DID_Field_Size`.

|`Flag_Value`    |  0  |  1  |  2  |  3  |
| -------------- | --- | --- | --- | --- |
|`DID_Field_Size`|  0  |  1  |  2  |  4  |
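
Since the descriptor byte alone determines the size of `Frame_Header`, the total can be computed as in the Python sketch below. The names are illustrative; the sketch simply adds the descriptor byte, the optional `Window_Descriptor` byte, `DID_Field_Size` and `FCS_Field_Size` as defined in the tables above.

```python
DID_FIELD_SIZE = (0, 1, 2, 4)  # indexed by Dictionary_ID_flag
FCS_FIELD_SIZE = (0, 2, 4, 8)  # indexed by Frame_Content_Size_flag


def frame_header_size(frame_header_descriptor: int) -> int:
    """Return the size in bytes of Frame_Header, derived from its first byte."""
    single_segment = bool(frame_header_descriptor & 0x20)   # bit 5
    did_size = DID_FIELD_SIZE[frame_header_descriptor & 0x3]  # bits 1-0
    fcs_flag = (frame_header_descriptor >> 6) & 0x3           # bits 7-6
    fcs_size = FCS_FIELD_SIZE[fcs_flag]
    if fcs_flag == 0 and single_segment:
        fcs_size = 1  # Frame_Content_Size is mandatory in single-segment mode
    # Window_Descriptor is skipped when Single_Segment_flag is set.
    window_descriptor_size = 0 if single_segment else 1
    return 1 + window_descriptor_size + did_size + fcs_size
```

The results stay within the documented bounds: `0x00` gives the minimum of 2 bytes, and `0xC3` gives the maximum of 14 bytes.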

#### `Window_Descriptor`

Provides guarantees on minimum memory buffer required to decompress a frame.
This information is important for decoders to allocate enough memory.

The `Window_Descriptor` byte is optional.
When `Single_Segment_flag` is set, `Window_Descriptor` is not present.
In this case, `Window_Size` is `Frame_Content_Size`,
which can be any value from 0 to 2^64-1 bytes (16 ExaBytes).

| Bit numbers |     7-3    |     2-0    |
| ----------- | ---------- | ---------- |
| Field name  | `Exponent` | `Mantissa` |

The minimum memory buffer size is called `Window_Size`.
It is described by the following formulas:
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
windowAdd = (windowBase / 8) * Mantissa;
Window_Size = windowBase + windowAdd;
```
The minimum `Window_Size` is 1 KB.
The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
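
The formulas translate directly into Python. The sketch below uses a hypothetical function name and integer division for `windowBase / 8`, which is exact because `windowBase` is always a power of two of at least 1024.

```python
def window_size(window_descriptor: int) -> int:
    """Compute Window_Size in bytes from the Window_Descriptor byte."""
    exponent = (window_descriptor >> 3) & 0x1F  # bits 7-3
    mantissa = window_descriptor & 0x7          # bits 2-0
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add


print(window_size(0x00))  # 1024, the minimum (1 KB)
print(window_size(0xFF) == (1 << 41) + 7 * (1 << 38))  # True, the maximum
```

With `Exponent` 13 and `Mantissa` 0 (descriptor `0x68`), the result is exactly 8 MB, counting 1 MB as 2^20 bytes.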

In general, larger `Window_Size` values tend to improve compression ratio,
but at the cost of memory usage.

To properly decode compressed data,
a decoder will need to allocate a buffer of at least `Window_Size` bytes.

In order to protect the decoder from unreasonable memory requirements,
a decoder is allowed to reject a compressed frame
which requests a memory size beyond the decoder's authorized range.

For improved interoperability,
it's recommended for decoders to support `Window_Size` of up to 8 MB,
and it's recommended for encoders to not generate frames requiring `Window_Size` larger than 8 MB.
It's merely a recommendation though:
decoders are free to support higher or lower limits,
depending on local limitations.

#### `Dictionary_ID`

This is a variable size field, which contains
the ID of the dictionary required to properly decode the frame.
`Dictionary_ID` field is optional. When it's not present,
it's up to the decoder to know which dictionary to use.

`Dictionary_ID` field size is provided by `DID_Field_Size`.
`DID_Field_Size` is directly derived from the value of `Dictionary_ID_flag`.
1 byte can represent an ID 0-255.
2 bytes can represent an ID 0-65535.
4 bytes can represent an ID 0-4294967295.
Format is __little-endian__.

It's allowed to represent a small ID (for example `13`)
with a large 4-byte dictionary ID, even if it is less efficient.

A value of `0` has the same meaning as no `Dictionary_ID`,
in which case the frame may or may not need a dictionary to be decoded,
and the ID of such a dictionary is not specified.
The decoder must know this information by other means.
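
A decoder might read this field as in the following Python sketch. The function name and the use of `None` to mean "no dictionary ID specified" are illustrative choices, not part of the format.

```python
from typing import Optional


def read_dictionary_id(field: bytes) -> Optional[int]:
    """Decode a Dictionary_ID field of 0, 1, 2 or 4 bytes.

    Returns None when no ID is specified, i.e. when the field is
    absent or when its value is 0.
    """
    if len(field) not in (0, 1, 2, 4):
        raise ValueError("DID_Field_Size must be 0, 1, 2 or 4 bytes")
    value = int.from_bytes(field, "little")  # an empty field decodes to 0
    return value if value != 0 else None


# The same small ID can be stored in 1 byte or, less efficiently, in 4 bytes.
print(read_dictionary_id(b"\x0d"))              # 13
print(read_dictionary_id(b"\x0d\x00\x00\x00"))  # 13
```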

#### `Frame_Content_Size`

This is the original (uncompressed) size. This information is optional.
`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`.
`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`.
`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.

| `FCS_Field_Size` |    Range    |
| ---------------- | ----------- |
|        0         |   unknown   |
|        1         |   0 - 255   |
|        2         | 256 - 65791 |
|        4         | 0 - 2^32-1  |
|        8         | 0 - 2^64-1  |

`Frame_Content_Size` format is __little-endian__.
When `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly.
When `FCS_Field_Size` is 2, _the offset of 256 is added_.
It's allowed to represent a small size (for example `18`) using any compatible variant.
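
The decoding rule, including the 256 offset for the 2-byte variant, can be sketched in Python as follows. The function name and the use of `None` for an unknown size are assumptions made for illustration.

```python
from typing import Optional


def decode_frame_content_size(field: bytes) -> Optional[int]:
    """Decode Frame_Content_Size from a field of 0, 1, 2, 4 or 8 bytes."""
    size = len(field)
    if size == 0:
        return None  # size not provided: unknown
    if size not in (1, 2, 4, 8):
        raise ValueError("FCS_Field_Size must be 0, 1, 2, 4 or 8 bytes")
    value = int.from_bytes(field, "little")
    if size == 2:
        value += 256  # the 2-byte variant covers 256 - 65791
    return value


print(decode_frame_content_size(b"\x12"))              # 18
print(decode_frame_content_size(b"\x12\x00\x00\x00"))  # 18
print(decode_frame_content_size(b"\xff\xff"))          # 65791
```

Note that the value `18` has no 2-byte representation, since that variant starts at 256; the 1-, 4- and 8-byte variants are the compatible ones.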


Blocks
-------

After `Magic_Number` and `Frame_Header`, there are some number of blocks.
Each frame must have at least one block,
but there is no upper limit on the number of blocks per frame.

The structure of a block is as follows:

| `Block_Header` | `Block_Content` |
|:--------------:|:---------------:|
|    3 bytes     |     n bytes     |

__`Block_Header`__

`Block_Header` uses 3 bytes, written using __little-endian__ convention.
It contains 3 fields:

| `Last_Block` | `Block_Type` | `Block_Size` |
|:------------:|:------------:|:------------:|
|    bit 0     |  bits 1-2    |  bits 3-23   |

__`Last_Block`__

The lowest bit signals if this block is the last one.
The frame will end after this last block.
It may be followed by an optional `Content_Checksum`
(see [Zstandard Frames](#zstandard-frames)).

__`Block_Type`__

The next 2 bits represent the `Block_Type`.
`Block_Type` influences the meaning of `Block_Size`.
There are 4 block types:

|    Value     |      0      |      1      |         2          |     3      |
| ------------ | ----------- | ----------- | ------------------ | ---------- |
| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved` |

- `Raw_Block` - this is an uncompressed block.
  `Block_Content` contains `Block_Size` bytes.

- `RLE_Block` - this is a single byte, repeated `Block_Size` times.
  `Block_Content` consists of a single byte.
  On the decompression side, this byte must be repeated `Block_Size` times.

- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks),
  explained later on.
  `Block_Size` is the length of `Block_Content`, the compressed data.
  The decompressed size is not known,
  but its maximum possible value is guaranteed (see below).

- `Reserved` - this is not a block.
  This value cannot be used with the current version of this specification.
  If such a value is present, it is considered corrupted data.
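
The two simple block types can be regenerated without any entropy decoding, as the Python sketch below illustrates. The constants and function name are hypothetical; `Compressed_Block` decoding is outside the scope of this sketch and is only signalled with `NotImplementedError`.

```python
RAW_BLOCK, RLE_BLOCK, COMPRESSED_BLOCK, RESERVED = range(4)


def regenerate_simple_block(block_type: int, block_size: int, content: bytes) -> bytes:
    """Regenerate the decoded bytes of a Raw_Block or an RLE_Block."""
    if block_type == RAW_BLOCK:
        # Block_Content holds exactly Block_Size uncompressed bytes.
        if len(content) != block_size:
            raise ValueError("Raw_Block content must be Block_Size bytes long")
        return bytes(content)
    if block_type == RLE_BLOCK:
        # Block_Content is a single byte, repeated Block_Size times.
        if len(content) != 1:
            raise ValueError("RLE_Block content must be exactly 1 byte")
        return bytes(content) * block_size
    if block_type == RESERVED:
        raise ValueError("Reserved block type: corrupted data")
    raise NotImplementedError("Compressed_Block decoding is not covered here")


print(regenerate_simple_block(RLE_BLOCK, 5, b"a"))  # b'aaaaa'
```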

__`Block_Size`__

The upper 21 bits of `Block_Header` represent the `Block_Size`.

When `Block_Type` is `Compressed_Block` or `Raw_Block`,
`Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`).

When `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1,
`Block_Size` represents the number of times this byte must be repeated.

`Block_Size` is limited by `Block_Maximum_Size` (see below).
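
Putting the three fields together, a `Block_Header` can be split as in this Python sketch. The `BlockHeader` tuple and function name are illustrative; the bit layout is the one described above (bit 0, bits 1-2, bits 3-23 of the 24-bit little-endian value).

```python
from typing import NamedTuple


class BlockHeader(NamedTuple):
    last_block: bool
    block_type: int   # 0 = Raw, 1 = RLE, 2 = Compressed, 3 = Reserved
    block_size: int


def parse_block_header(header: bytes) -> BlockHeader:
    """Split the 3-byte little-endian Block_Header into its three fields."""
    if len(header) != 3:
        raise ValueError("Block_Header is exactly 3 bytes")
    value = int.from_bytes(header, "little")
    last_block = bool(value & 0x1)    # bit 0
    block_type = (value >> 1) & 0x3   # bits 1-2
    block_size = value >> 3           # bits 3-23 (upper 21 bits)
    return BlockHeader(last_block, block_type, block_size)


# Bytes 42 1F 00 encode 0x001F42: not last, RLE_Block, Block_Size 1000.
print(parse_block_header(bytes.fromhex("421f00")))
```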

__`Block_Content`__ and __`Block_Maximum_Size`__

The size of `Block_Content` is limited by `Block_Maximum_Size`,
which is the smaller of:
- `Window_Size`
- 128 KB

`Block_Maximum_Size` is constant for a given frame.
This maximum is applicable to both the decompressed size
and the compressed size of any block in the frame.

The reasoning for this limit is that a decoder can read this information
at the beginning of a frame and use it to allocate buffers.
The guarantees on the size of blocks ensure that
the buffers will be large enough for any following block of the valid frame.
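
As a small illustration, the limit can be computed once per frame and reused to size buffers and validate every block. The names below are hypothetical, and the sketch assumes 128 KB means 128 * 1024 bytes, consistent with the 1 KB minimum `Window_Size` being 2^10 bytes.

```python
BLOCK_SIZE_CAP = 128 * 1024  # 128 KB, assumed to mean 131072 bytes


def block_maximum_size(window_size: int) -> int:
    """Return Block_Maximum_Size for a frame with the given Window_Size."""
    return min(window_size, BLOCK_SIZE_CAP)


def check_block_size(block_size: int, window_size: int) -> None:
    """Raise ValueError if a block exceeds the frame's Block_Maximum_Size."""
    limit = block_maximum_size(window_size)
    if block_size > limit:
        raise ValueError(f"Block_Size {block_size} exceeds Block_Maximum_Size {limit}")


print(block_maximum_size(1024))             # 1024
print(block_maximum_size(8 * 1024 * 1024))  # 131072
```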


Compressed Blocks
-----------------
To decompress a compressed block, the compressed size must be provided
from the `Block_Size` field within `Block_Header`.

A compressed block consists of 2 sections:
- [Literals Section](#literals-section)
- [Sequences Section](#sequences-section)

The results of the two sections are then combined to produce the decompressed
data in [Sequence Execution](#sequence-execution).

#### Prerequisites
To decode a compressed block, the following elements are necessary:
- Previous decoded data, up to a distance of `Window_Size`,
  or beginning of the Frame, whichever is smaller.
- List of "recent offsets" from previous `Compressed_Block`.
- The previous Huffman tree, required by `Treeless_Literals_Block` type
- Previous FSE decoding tables, required by `Repeat_Mode`
  for each symbol type (literals lengths, match lengths, offsets)

Note that decoding tables aren't always from the previous `Compressed_Block`.

- Every decoding table can come from a dictionary.
- The Huffman tree comes from the previous `Compressed_Literals_Block`.

Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	431	Literals Section
				432	----------------
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	433	All literals are regrouped in the first part of the block.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	434	They can be decoded first, and then copied during [Sequence Execution],
				435	or they can be decoded on the flow during [Sequence Execution].
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	436
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	437	Literals can be stored uncompressed or compressed using Huffman prefix codes.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	438	When compressed, a tree description may optionally be present,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	439	followed by 1 or 4 streams.
				440
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	441	\| `Literals_Section_Header` \| [`Huffman_Tree_Description`] \| [jumpTable] \| Stream1 \| [Stream2] \| [Stream3] \| [Stream4] \|
				442	\| ------------------------- \| ---------------------------- \| ----------- \| ------- \| --------- \| --------- \| --------- \|


### `Literals_Section_Header`

Header is in charge of describing how literals are packed.
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
using __little-endian__ convention.

| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
| --------------------- | ------------- | ------------------ | ------------------- |
|       2 bits          |  1 - 2 bits   |    5 - 20 bits     |     0 - 18 bits     |

In this representation, bits on the left are the lowest bits.

__`Literals_Block_Type`__

This field uses the 2 lowest bits of the first byte, describing 4 different block types :

| `Literals_Block_Type`       | Value |
| --------------------------- | ----- |
| `Raw_Literals_Block`        |   0   |
| `RLE_Literals_Block`        |   1   |
| `Compressed_Literals_Block` |   2   |
| `Treeless_Literals_Block`   |   3   |

- `Raw_Literals_Block` - Literals are stored uncompressed.
- `RLE_Literals_Block` - Literals consist of a single byte value
  repeated `Regenerated_Size` times.
- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
  starting with a Huffman tree description.
  In this mode, there are at least 2 different literals represented in the Huffman tree description.
  See details below.
- `Treeless_Literals_Block` - This is a Huffman-compressed block,
  using the Huffman tree _from the previous Huffman-compressed literals block_.
  `Huffman_Tree_Description` will be skipped.
  Note: If this mode is triggered without any previous Huffman-table in the frame
  (or [dictionary](#dictionary-format)), this should be treated as data corruption.

__`Size_Format`__

`Size_Format` is divided into 2 families :

- For `Raw_Literals_Block` and `RLE_Literals_Block`,
  it's only necessary to decode `Regenerated_Size`.
  There is no `Compressed_Size` field.
- For `Compressed_Literals_Block` and `Treeless_Literals_Block`,
  it's required to decode both `Compressed_Size`
  and `Regenerated_Size` (the decompressed size).
  It's also necessary to decode the number of streams (1 or 4).

For values spanning several bytes, convention is __little-endian__.

__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :

`Size_Format` uses 1 _or_ 2 bits.
Its value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3`

- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit.
  `Regenerated_Size` uses 5 bits (0-31).
  `Literals_Section_Header` uses 1 byte.
  `Regenerated_Size = Literals_Section_Header[0]>>3`
- `Size_Format` == 01 : `Size_Format` uses 2 bits.
  `Regenerated_Size` uses 12 bits (0-4095).
  `Literals_Section_Header` uses 2 bytes.
  `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)`
- `Size_Format` == 11 : `Size_Format` uses 2 bits.
  `Regenerated_Size` uses 20 bits (0-1048575).
  `Literals_Section_Header` uses 3 bytes.
  `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)`

Only Stream1 is present for these cases.
Note : it's allowed to represent a short value (for example `27`)
using a long format, even if it's less efficient.
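
The three size formats above can be sketched as a small C parser. This is only an illustration, not the reference decoder: the struct and function names are hypothetical, and the error convention (returning `-1`) is an assumption. With these rules, the value `27` can be stored in a Raw block either as the single byte `0xD8` or, less efficiently, as the two bytes `0xB4 0x01`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: these names are illustrative only. */
typedef struct {
    unsigned block_type;       /* 0 = Raw_Literals_Block, 1 = RLE_Literals_Block */
    uint32_t regenerated_size; /* Regenerated_Size */
    size_t   header_size;      /* bytes used by Literals_Section_Header */
} raw_rle_header;

/* Returns 0 on success, -1 if the block type is not Raw/RLE
 * or if `avail` is too small to hold the header. */
int parse_raw_rle_header(const uint8_t *h, size_t avail, raw_rle_header *out)
{
    if (avail < 1) return -1;
    unsigned type = h[0] & 3;            /* Literals_Block_Type */
    if (type > 1) return -1;

    switch ((h[0] >> 2) & 3) {           /* Size_Format */
    case 0:
    case 2:                              /* 1-bit Size_Format, 5-bit size */
        out->header_size = 1;
        out->regenerated_size = (uint32_t)h[0] >> 3;
        break;
    case 1:                              /* 2-bit Size_Format, 12-bit size */
        if (avail < 2) return -1;
        out->header_size = 2;
        out->regenerated_size = ((uint32_t)h[0] >> 4) + ((uint32_t)h[1] << 4);
        break;
    default:                             /* 3: 2-bit Size_Format, 20-bit size */
        if (avail < 3) return -1;
        out->header_size = 3;
        out->regenerated_size = ((uint32_t)h[0] >> 4) + ((uint32_t)h[1] << 4)
                              + ((uint32_t)h[2] << 12);
        break;
    }
    out->block_type = type;
    return 0;
}
```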

__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :

`Size_Format` always uses 2 bits.

- `Size_Format` == 00 : _A single stream_.
  Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
  `Literals_Section_Header` uses 3 bytes.
- `Size_Format` == 01 : 4 streams.
  Both `Regenerated_Size` and `Compressed_Size` use 10 bits (6-1023).
  `Literals_Section_Header` uses 3 bytes.
- `Size_Format` == 10 : 4 streams.
  Both `Regenerated_Size` and `Compressed_Size` use 14 bits (6-16383).
  `Literals_Section_Header` uses 4 bytes.
- `Size_Format` == 11 : 4 streams.
  Both `Regenerated_Size` and `Compressed_Size` use 18 bits (6-262143).
  `Literals_Section_Header` uses 5 bytes.

Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
Note: `Compressed_Size` __includes__ the size of the Huffman Tree description
_when_ it is present.
Note 2: `Compressed_Size` can never be `==0`.
Even in single-stream scenario, assuming an empty content, it must be `>=1`,
since it contains at least the final end bit flag.
In 4-streams scenario, a valid `Compressed_Size` is necessarily `>= 10`
(6 bytes for the jump table, + 4x1 bytes for the 4 streams).

4 streams is faster than 1 stream in decompression speed,
by exploiting instruction level parallelism.
But it's also more expensive,
costing on average ~7.3 bytes more than the 1 stream mode, mostly from the jump table.

In general, use the 4 streams mode when there are more literals to decode,
to favor higher decompression speeds.
Note that beyond 1KB of literals, the 4 streams mode is compulsory,
since the single stream format cannot express sizes above 1023.

Note that a minimum of 6 bytes is required for the 4 streams mode.
That's a technical minimum, but it's not recommended to employ the 4 streams mode
for such a small quantity, as that would be wasteful.
A more practical lower bound would be around 256 bytes.
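
The `Size_Format` rules for `Compressed_Literals_Block` and `Treeless_Literals_Block` can be sketched as follows. The field order (type, `Size_Format`, `Regenerated_Size`, `Compressed_Size`, from the lowest bits upward) follows the header table and the little-endian convention stated above; the struct, the function name and the `-1` error convention are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: these names are illustrative only. */
typedef struct {
    unsigned block_type;       /* 2 = Compressed, 3 = Treeless */
    unsigned num_streams;      /* 1 or 4 */
    uint32_t regenerated_size; /* Regenerated_Size */
    uint32_t compressed_size;  /* Compressed_Size */
    size_t   header_size;      /* 3, 4 or 5 bytes */
} huf_literals_header;

/* Returns 0 on success, -1 on a truncated or invalid header. */
int parse_huf_literals_header(const uint8_t *h, size_t avail,
                              huf_literals_header *out)
{
    if (avail < 3) return -1;
    unsigned type = h[0] & 3;
    if (type < 2) return -1;                  /* Raw/RLE use the other family */
    unsigned size_format = (h[0] >> 2) & 3;
    size_t   header_size = (size_format < 2) ? 3 : (size_format == 2) ? 4 : 5;
    unsigned size_bits   = (size_format < 2) ? 10 : (size_format == 2) ? 14 : 18;
    if (avail < header_size) return -1;

    /* Gather the header bytes into one little-endian integer. */
    uint64_t bits = 0;
    for (size_t i = 0; i < header_size; i++)
        bits |= (uint64_t)h[i] << (8 * i);

    uint64_t mask = ((uint64_t)1 << size_bits) - 1;
    out->block_type       = type;
    out->num_streams      = (size_format == 0) ? 1 : 4;
    out->regenerated_size = (uint32_t)((bits >> 4) & mask);
    out->compressed_size  = (uint32_t)((bits >> (4 + size_bits)) & mask);
    out->header_size      = header_size;

    /* Limits stated in the notes above. */
    if (out->compressed_size == 0) return -1;
    if (out->num_streams == 4 &&
        (out->compressed_size < 10 || out->regenerated_size < 6)) return -1;
    return 0;
}
```

Reading the whole header into a 64-bit integer keeps the three layouts in one code path, since only the field width changes between them.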

#### Raw Literals Block
The data in Stream1 is `Regenerated_Size` bytes long.
It contains the raw literals data to be used during [Sequence Execution].

#### RLE Literals Block
Stream1 consists of a single byte which should be repeated `Regenerated_Size` times
to generate the decoded literals.

#### Compressed Literals Block and Treeless Literals Block
Both of these modes contain Huffman encoded data.

For `Treeless_Literals_Block`,
the Huffman table comes from the previous compressed literals block,
or from a dictionary.


### `Huffman_Tree_Description`
This section is only present when `Literals_Block_Type` is `Compressed_Literals_Block` (`2`).
The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
The size of `Huffman_Tree_Description` is determined during the decoding process.
It must be used to determine where streams begin.
`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.


### Jump Table
The Jump Table is only present when there are 4 Huffman-coded streams.

Reminder : Huffman compressed data consists of either 1 or 4 streams.

If only one stream is present, it is a single bitstream occupying the entire
remaining portion of the literals block, encoded as described in
[Huffman-Coded Streams](#huffman-coded-streams).

If there are four streams, `Literals_Section_Header` only provides
enough information to know the decompressed and compressed sizes
of all four streams _combined_.
The decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`,
except for the last stream which may be up to 3 bytes smaller,
to reach a total decompressed size as specified in `Regenerated_Size`.

The compressed sizes of the first three streams are provided explicitly in the Jump Table.
The Jump Table is 6 bytes long, and consists of three 2-byte __little-endian__ fields,
one for each of the first three streams.
`Stream4_Size` is computed from `Total_Streams_Size` minus the Jump Table and the sizes of the other streams:

`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`.

`Stream4_Size` is necessarily `>= 1`. Therefore,
if `Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1`,
data is considered corrupted.
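
The stream size rules above can be sketched in C as shown below. The type and function names are hypothetical, and `jump_table` is assumed to point at the 6 bytes that follow the optional `Huffman_Tree_Description`. The last check is an extra guard added for this sketch: it rejects a `Regenerated_Size` for which the fourth stream would come out negative.

```c
#include <stdint.h>

/* Hypothetical result type: per-stream sizes, in stream order. */
typedef struct {
    uint32_t compressed[4];   /* Stream1_Size .. Stream4_Size */
    uint32_t regenerated[4];  /* decompressed size of each stream */
} four_streams_layout;

/* Returns 0 on success, -1 if the sizes are inconsistent (corrupted data). */
int split_four_streams(const uint8_t *jump_table, uint32_t total_streams_size,
                       uint32_t regenerated_size, four_streams_layout *out)
{
    if (total_streams_size < 6) return -1;        /* no room for the Jump Table */

    uint32_t used = 6;                            /* the Jump Table itself */
    for (int i = 0; i < 3; i++) {
        /* Three 2-byte little-endian fields. */
        out->compressed[i] = (uint32_t)jump_table[2 * i]
                           | ((uint32_t)jump_table[2 * i + 1] << 8);
        used += out->compressed[i];
    }
    if (total_streams_size < used + 1) return -1; /* Stream4_Size must be >= 1 */
    out->compressed[3] = total_streams_size - used;

    uint32_t per_stream = (regenerated_size + 3) / 4;
    if (3 * per_stream > regenerated_size) return -1; /* guard added by this sketch */
    for (int i = 0; i < 3; i++)
        out->regenerated[i] = per_stream;
    out->regenerated[3] = regenerated_size - 3 * per_stream;
    return 0;
}
```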
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	608
				609	Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	610	as described in [Huffman-Coded Streams](#huffman-coded-streams)
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	611
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	612
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	613	Sequences Section
				614	-----------------
				615	A compressed block is a succession of _sequences_ .
				616	A sequence is a literal copy command, followed by a match copy command.
				617	A literal copy command specifies a length.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	618	It is the number of bytes to be copied (or extracted) from the Literals Section.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	619	A match copy command specifies an offset and a length.
				620
				621	When all _sequences_ are decoded,
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	622	if there are literals left in the _literals section_,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	623	these bytes are added at the end of the block.
				624
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	625	This is described in more detail in [Sequence Execution](#sequence-execution).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	626
				627	The `Sequences_Section` regroup all symbols required to decode commands.
				628	There are 3 symbol types : literals lengths, offsets and match lengths.
				629	They are encoded together, interleaved, in a single _bitstream_.
				630
				631	The `Sequences_Section` starts by a header,
				632	followed by optional probability tables for each symbol type,
				633	followed by the bitstream.
				634
				635	\| `Sequences_Section_Header` \| [`Literals_Length_Table`] \| [`Offset_Table`] \| [`Match_Length_Table`] \| bitStream \|
				636	\| -------------------------- \| ------------------------- \| ---------------- \| ---------------------- \| --------- \|
				637
				638	To decode the `Sequences_Section`, it's required to know its size.
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	639	Its size is deduced from the size of `Literals_Section`:
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	640	`Sequences_Section_Size = Block_Size - Literals_Section_Size`.


#### `Sequences_Section_Header`

Consists of 2 items:
- `Number_of_Sequences`
- Symbol compression modes

__`Number_of_Sequences`__

This is a variable size field using between 1 and 3 bytes.
Let's call its first byte `byte0`. The conditions below are tested in order.
- `if (byte0 < 128)` : `Number_of_Sequences = byte0`. Uses 1 byte.
- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0 - 0x80) << 8) + byte1`. Uses 2 bytes.
  Note that the 2 bytes format fully overlaps the 1 byte format.
- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00`. Uses 3 bytes.

`if (Number_of_Sequences == 0)` : there are no sequences.
The sequence section stops immediately,
and FSE tables used in `Repeat_Mode` aren't updated.
Block's decompressed content is defined solely by the Literals Section content.
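
The three cases for `Number_of_Sequences` can be sketched as one C function. The function name, its parameters and the `-1` return on truncated input are assumptions made for this illustration. Note that the 3-byte form starts at `0x7F00`, right after the largest value of the 2-byte form (`0x7EFF`).

```c
#include <stddef.h>
#include <stdint.h>

/* Reads Number_of_Sequences from `p`.
 * On success returns 0, stores the value in *nb_seq and the number of
 * bytes consumed (1 to 3) in *used. Returns -1 if the input is truncated. */
int read_number_of_sequences(const uint8_t *p, size_t avail,
                             uint32_t *nb_seq, size_t *used)
{
    if (avail < 1) return -1;
    uint32_t byte0 = p[0];

    if (byte0 < 128) {                 /* 1-byte format */
        *nb_seq = byte0;
        *used = 1;
    } else if (byte0 < 255) {          /* 2-byte format */
        if (avail < 2) return -1;
        *nb_seq = ((byte0 - 0x80) << 8) + p[1];
        *used = 2;
    } else {                           /* byte0 == 255: 3-byte format */
        if (avail < 3) return -1;
        *nb_seq = (uint32_t)p[1] + ((uint32_t)p[2] << 8) + 0x7F00;
        *used = 3;
    }
    return 0;
}
```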

__Symbol compression modes__

This is a single byte, defining the compression mode of each symbol type.

|Bit number|          7-6            |      5-4       |        3-2           |     1-0    |
| -------- | ----------------------- | -------------- | -------------------- | ---------- |
|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |

The last field, `Reserved`, must be all-zeroes.

`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
literals lengths, offsets, and match lengths symbols respectively.

They follow the same enumeration :

|        Value       |         0         |      1     |           2           |       3       |
| ------------------ | ----------------- | ---------- | --------------------- | ------------- |
| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` |

- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
  [default distributions](#default-distributions).
  No distribution table will be present.
- `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value.
  This symbol will be used for all sequences.
- `FSE_Compressed_Mode` : standard FSE compression.
  A distribution table will be present.
  The format of this distribution table is described in [FSE Table Description](#fse-table-description).
  Note that the maximum allowed accuracy log for literals length and match length tables is 9,
  and the maximum accuracy log for the offsets table is 8.
  `FSE_Compressed_Mode` must not be used when only one symbol is present;
  `RLE_Mode` should be used instead (although any other mode will work).
- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
  or if this is the first block, the table in the dictionary will be used.
  Note that this includes `RLE_Mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
  It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have the same outcome as `Predefined_Mode`.
  No distribution table will be present.
  If this mode is used without any previous sequence table in the frame
  (nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
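
As an illustration, the symbol compression modes byte can be split into its three fields as sketched below. The enum and struct names are hypothetical, and rejecting a non-zero `Reserved` field with `-1` is this sketch's interpretation of "must be all-zeroes".

```c
#include <stdint.h>

/* Values follow the Compression_Mode enumeration in the table above. */
typedef enum {
    PREDEFINED_MODE     = 0,
    RLE_MODE            = 1,
    FSE_COMPRESSED_MODE = 2,
    REPEAT_MODE         = 3
} compression_mode;

/* Hypothetical container for the three decoded modes. */
typedef struct {
    compression_mode literals_lengths;
    compression_mode offsets;
    compression_mode match_lengths;
} symbol_modes;

/* Returns 0 on success, -1 if the Reserved bits (1-0) are not zero. */
int parse_symbol_modes(uint8_t byte, symbol_modes *out)
{
    if ((byte & 3) != 0) return -1;
    out->literals_lengths = (compression_mode)((byte >> 6) & 3);  /* bits 7-6 */
    out->offsets          = (compression_mode)((byte >> 4) & 3);  /* bits 5-4 */
    out->match_lengths    = (compression_mode)((byte >> 2) & 3);  /* bits 3-2 */
    return 0;
}
```

For example, the byte `0x9C` (binary `10 01 11 00`) selects `FSE_Compressed_Mode` for literals lengths, `RLE_Mode` for offsets and `Repeat_Mode` for match lengths.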

#### The codes for literals lengths, match lengths, and offsets

Each symbol is a _code_ in its own context,
which specifies `Baseline` and `Number_of_Bits` to add.
_Codes_ are FSE compressed,
and interleaved with raw additional bits in the same bitstream.

##### Literals length codes

Literals length codes are values ranging from `0` to `35` included.
They define lengths from 0 to 131071 bytes.
The literals length is equal to the decoded `Baseline` plus
the result of reading `Number_of_Bits` bits from the bitstream,
as a __little-endian__ value.

| `Literals_Length_Code` |         0-15           |
| ---------------------- | ---------------------- |
| length                 | `Literals_Length_Code` |
| `Number_of_Bits`       |          0             |

| `Literals_Length_Code` |  16  |  17  |  18  |  19  |  20  |  21  |  22  |  23  |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`             |  16  |  18  |  20  |  22  |  24  |  28  |  32  |  40  |
| `Number_of_Bits`       |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |

| `Literals_Length_Code` |  24  |  25  |  26  |  27  |  28  |  29  |  30  |  31  |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`             |  48  |  64  |  128 |  256 |  512 | 1024 | 2048 | 4096 |
| `Number_of_Bits`       |   4  |   6  |   7  |   8  |   9  |  10  |  11  |  12  |

| `Literals_Length_Code` |  32  |  33  |  34  |  35  |
| ---------------------- | ---- | ---- | ---- | ---- |
| `Baseline`             | 8192 |16384 |32768 |65536 |
| `Number_of_Bits`       |  13  |  14  |  15  |  16  |
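
The literals length tables above can be transcribed into two lookup arrays, as in the following sketch. The array and function names are hypothetical; `extra_bits` stands for the value already read from the bitstream using `Number_of_Bits` bits, and the bitstream reader itself is not shown.

```c
#include <stdint.h>

/* Baseline for each Literals_Length_Code, transcribed from the tables above. */
const uint32_t LL_BASELINE[36] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
    16, 18, 20, 22, 24, 28, 32, 40,
    48, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536
};

/* Number_of_Bits for each Literals_Length_Code. */
const uint8_t LL_BITS[36] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 2, 2, 3, 3,
    4, 6, 7, 8, 9, 10, 11, 12,
    13, 14, 15, 16
};

/* Combines a code with its extra bits.
 * Returns 0 on success, -1 if the code is out of range or if
 * `extra_bits` does not fit in Number_of_Bits bits. */
int literals_length_from_code(unsigned code, uint32_t extra_bits,
                              uint32_t *length)
{
    if (code > 35) return -1;
    if ((extra_bits >> LL_BITS[code]) != 0) return -1;
    *length = LL_BASELINE[code] + extra_bits;
    return 0;
}
```

Each code covers the range from its `Baseline` up to the next code's `Baseline` minus one, so the largest value is `65536 + 65535 = 131071`.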


##### Match length codes

Match length codes are values ranging from `0` to `52` included.
They define lengths from 3 to 131074 bytes.
The match length is equal to the decoded `Baseline` plus
the result of reading `Number_of_Bits` bits from the bitstream,
as a __little-endian__ value.

| `Match_Length_Code` |         0-31            |
| ------------------- | ----------------------- |
| value               | `Match_Length_Code` + 3 |
| `Number_of_Bits`    |          0              |

| `Match_Length_Code` |  32  |  33  |  34  |  35  |  36  |  37  |  38  |  39  |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`          |  35  |  37  |  39  |  41  |  43  |  47  |  51  |  59  |
| `Number_of_Bits`    |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |

| `Match_Length_Code` |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`          |  67  |  83  |  99  |  131 |  259 |  515 | 1027 | 2051 |
| `Number_of_Bits`    |   4  |   4  |   5  |   7  |   8  |   9  |  10  |  11  |

| `Match_Length_Code` |  48  |  49  |  50  |  51  |  52  |
| ------------------- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`          | 4099 | 8195 |16387 |32771 |65539 |
| `Number_of_Bits`    |  12  |  13  |  14  |  15  |  16  |
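
The match length tables can be handled the same way. In this sketch (hypothetical names again) codes `0` to `31` are computed directly as `code + 3`, and only codes `32` to `52` use lookup arrays; `extra_bits` is assumed to have been read from the bitstream beforehand.

```c
#include <stdint.h>

/* Baseline for Match_Length_Code 32..52, transcribed from the tables above. */
const uint32_t ML_BASELINE_HIGH[21] = {
    35, 37, 39, 41, 43, 47, 51, 59,
    67, 83, 99, 131, 259, 515, 1027, 2051,
    4099, 8195, 16387, 32771, 65539
};

/* Number_of_Bits for Match_Length_Code 32..52. */
const uint8_t ML_BITS_HIGH[21] = {
    1, 1, 1, 1, 2, 2, 3, 3,
    4, 4, 5, 7, 8, 9, 10, 11,
    12, 13, 14, 15, 16
};

/* Returns 0 on success, -1 if the code is out of range or if
 * `extra_bits` does not fit in Number_of_Bits bits. */
int match_length_from_code(unsigned code, uint32_t extra_bits,
                           uint32_t *length)
{
    if (code > 52) return -1;
    if (code < 32) {                       /* no extra bits for codes 0..31 */
        if (extra_bits != 0) return -1;
        *length = code + 3;
        return 0;
    }
    if ((extra_bits >> ML_BITS_HIGH[code - 32]) != 0) return -1;
    *length = ML_BASELINE_HIGH[code - 32] + extra_bits;
    return 0;
}
```

The largest value is reached with code `52` and all 16 extra bits set: `65539 + 65535 = 131074`.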

##### Offset codes

Offset codes are values ranging from `0` to `N`.

A decoder is free to limit its maximum `N` supported.
Recommendation is to support at least up to `22`.
For information, at the time of this writing,
the reference decoder supports a maximum `N` value of `31`.

An offset code is also the number of additional bits to read in __little-endian__ fashion,
and can be translated into an `Offset_Value` using the following formulas :

```
Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
if (Offset_Value > 3) offset = Offset_Value - 3;
```
It means that the maximum `Offset_Value` is `(2^(N+1))-1`,
supporting back-reference distances up to `(2^(N+1))-4`,
but it is limited by the [maximum back-reference distance](#window_descriptor).

`Offset_Value` from 1 to 3 are special : they define "repeat codes".
This is described in more detail in [Repeat Offsets](#repeat-offsets).
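
The offset formulas above can be sketched as follows. The function names and the `MAX_OFFSET_CODE` constant are hypothetical; the limit of `31` is simply the reference decoder value quoted in the text, and a decoder may choose a lower one. `extra_bits` stands for the result of `readNBits(offsetCode)`.

```c
#include <stdint.h>

#define MAX_OFFSET_CODE 31  /* assumed limit N for this sketch */

/* Returns 0 on success, -1 if the code exceeds the supported maximum
 * or if `extra_bits` does not fit in `offset_code` bits. */
int offset_value_from_code(unsigned offset_code, uint32_t extra_bits,
                           uint32_t *offset_value)
{
    if (offset_code > MAX_OFFSET_CODE) return -1;
    if ((extra_bits >> offset_code) != 0) return -1;
    *offset_value = ((uint32_t)1 << offset_code) + extra_bits;
    return 0;
}

/* Offset_Value 1 to 3 are "repeat codes", not direct distances. */
int is_repeat_code(uint32_t offset_value)
{
    return offset_value <= 3;
}

/* Only meaningful when Offset_Value > 3. */
uint32_t offset_from_value(uint32_t offset_value)
{
    return offset_value - 3;
}
```

With `N = 22`, the largest `Offset_Value` is `2^23 - 1`, which corresponds to a back-reference distance of `2^23 - 4`.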

#### Decoding Sequences
FSE bitstreams are read in the reverse direction from which they are written. In zstd,
the compressor writes bits forward into a block and the decompressor
must read the bitstream _backwards_.

To find the start of the bitstream it is therefore necessary to
know the offset of the last byte of the block which can be found
by counting `Block_Size` bytes after the block header.

After writing the last bit containing information, the compressor
writes a single `1`-bit and then fills the byte with 0-7 `0` bits of
padding. The last byte of the compressed bitstream cannot be `0` for
that reason.

When decompressing, the last byte containing the padding is the first
byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
the first `1`-bit it encounters. Afterwards, the useful part of the bitstream
begins.
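
A minimal sketch of the padding rule, assuming the padding occupies the most significant bits of the last byte (so the final `1`-bit is the highest set bit of that byte). The function name is hypothetical; it returns how many useful bits remain below the marker, which is where backward reading would begin.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of useful bits in the bitstream (everything below the
 * final 1-bit marker), or -1 if the stream is empty or its last byte is 0. */
int64_t bitstream_payload_bits(const uint8_t *stream, size_t size)
{
    if (size == 0) return -1;
    uint8_t last = stream[size - 1];
    if (last == 0) return -1;            /* the last byte cannot be 0 */

    unsigned marker = 7;                 /* bit position of the final 1-bit */
    while (((last >> marker) & 1) == 0)  /* skip the 0-7 padding 0-bits */
        marker--;

    /* All bits of the earlier bytes, plus the bits below the marker. */
    return (int64_t)(size - 1) * 8 + marker;
}
```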

FSE decoding requires a 'state' to be carried from symbol to symbol.
For more explanation on FSE decoding, see the [FSE section](#fse).

For sequence decoding, a separate state keeps track of each
symbol type : literals lengths, offsets, and match lengths.
Some FSE primitives are also used.
For more details on the operation of these primitives, see the [FSE section](#fse).

##### Starting states
The bitstream starts with initial FSE state values,
each using the required number of bits in their respective _accuracy_,
decoded previously from their normalized distribution.

It starts with `Literals_Length_State`,
followed by `Offset_State`,
and finally `Match_Length_State`.

Reminder : always keep in mind that all values are read _backward_,
so the 'start' of the bitstream is at the highest position in memory,
immediately before the last `1`-bit for padding.

After decoding the starting states, a single sequence is decoded
`Number_of_Sequences` times.
These sequences are decoded in order from first to last.
Since the compressor writes the bitstream in the forward direction,
this means the compressor must encode the sequences starting with the last
one and ending with the first.

##### Decoding a sequence
For each of the symbol types, the FSE state can be used to determine the appropriate code.
The code then defines the `Baseline` and `Number_of_Bits` to read for each type.
See the [description of the codes] for how to determine these values.

[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets

Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
It then does the same for `Match_Length`, and then for `Literals_Length`.
This sequence is then used for [sequence execution](#sequence-execution).

If it is not the last sequence in the block,
the next operation is to update states.
Using the rules pre-calculated in the decoding tables,
`Literals_Length_State` is updated,
followed by `Match_Length_State`,
and then `Offset_State`.
See the [FSE section](#fse) for details on how to update states from the bitstream.

This operation will be repeated `Number_of_Sequences` times.
At the end, the bitstream shall be entirely consumed,
otherwise the bitstream is considered corrupted.
				858
				859	#### Default Distributions
				860	If `Predefined_Mode` is selected for a symbol type,
				861	its FSE decoding table is generated from a predefined distribution table defined here.
				862	For details on how to convert this distribution into a decoding table, see the [FSE section].
				863
				864	[FSE section]: #from-normalized-distribution-to-decoding-tables
				865
Sean Purcell	3bee41a	2017-02-21 10:20:36 -0800	[diff] [blame]	866	##### Literals Length
				867	The decoding table uses an accuracy log of 6 bits (64 states).
				868	```
				869	short literalsLength_defaultDistribution[36] =
				870	{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
				871	2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
				872	-1,-1,-1,-1 };
				873	```
				874
				875	##### Match Length
				876	The decoding table uses an accuracy log of 6 bits (64 states).
				877	```
				878	short matchLengths_defaultDistribution[53] =
				879	{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				880	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
				881	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
				882	-1,-1,-1,-1,-1 };
				883	```
				884
				885	##### Offset Codes
				886	The decoding table uses an accuracy log of 5 bits (32 states),
				887	and supports a maximum `N` value of 28, allowing offset values up to 536,870,908.
				888
				889	If any sequence in the compressed block requires a larger offset than this,
				890	it's not possible to use the default distribution to represent it.
				891	```
				892	short offsetCodes_defaultDistribution[29] =
				893	{ 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				894	1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
				895	```
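
As a sanity check on the three tables above, the cells they occupy should add up to exactly `1 << Accuracy_Log`, with each `-1` ("less than 1") entry counting as one cell. The following is a minimal sketch of that check; the helper name `distribution_cells` is hypothetical and not part of the format or of any library.

```c
#include <assert.h>
#include <stddef.h>

static const short literalsLength_defaultDistribution[36] =
    { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
     -1,-1,-1,-1 };

static const short matchLengths_defaultDistribution[53] =
    { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
     -1,-1,-1,-1,-1 };

static const short offsetCodes_defaultDistribution[29] =
    { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };

/* Hypothetical helper: number of table cells a distribution occupies.
 * A probability of -1 ("less than 1") counts as a single cell. */
static int distribution_cells(const short *dist, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++) {
        assert(dist[i] >= -1);
        total += (dist[i] == -1) ? 1 : dist[i];
    }
    return total;
}
```

With accuracy logs of 6, 6, and 5, the expected totals are 64, 64, and 32 respectively.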
				896
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	897
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	898	Sequence Execution
				899	------------------
				900	Once literals and sequences have been decoded,
				901	they are combined to produce the decoded content of a block.
				902
				903	Each sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`),
Sean Purcell	3bee41a	2017-02-21 10:20:36 -0800	[diff] [blame]	904	decoded as described in the [Sequences Section](#sequences-section).
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	905	To execute a sequence, first copy `literals_length` bytes
				906	from the decoded literals to the output.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	907
				908	Then `match_length` bytes are copied from previous decoded data.
				909	The offset to copy from is determined by `offset_value`:
				910	if `offset_value > 3`, then the offset is `offset_value - 3`.
				911	If `offset_value` is from 1-3, the offset is a special repeat offset value.
				912	See the [repeat offset](#repeat-offsets) section for how the offset is determined
				913	in this case.
				914
				915	The offset is measured backward from the current position, so an offset of 6
				916	and a match length of 3 means that 3 bytes should be copied from 6 bytes back.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	917	Note that all offsets leading to previously decoded data
				918	must be smaller than `Window_Size` defined in `Frame_Header_Descriptor`.
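
The two copy steps can be sketched as follows. This is an illustrative sketch only: the `seq_exec_ctx` structure and `execute_sequence` function are hypothetical names, the `offset` argument is assumed to be already resolved (repeat offsets handled by the caller), and dictionary content and the `Window_Size` limit are deliberately ignored. The match is copied byte by byte because the source and destination may overlap when `offset < match_length`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoding context: an output buffer and the decoded literals. */
typedef struct {
    uint8_t *out;            size_t out_pos; size_t out_cap;
    const uint8_t *literals; size_t lit_pos; size_t lit_size;
} seq_exec_ctx;

/* Executes one sequence. Returns 0 on success, -1 if the sequence
 * cannot be executed with the available literals, history, or space. */
static int execute_sequence(seq_exec_ctx *ctx, size_t literals_length,
                            size_t offset, size_t match_length)
{
    assert(ctx != NULL);
    if (literals_length > ctx->lit_size - ctx->lit_pos) return -1;
    if (literals_length > ctx->out_cap - ctx->out_pos) return -1;
    if (match_length > ctx->out_cap - ctx->out_pos - literals_length) return -1;
    /* The offset must point into data already produced (no dictionary here). */
    if (offset == 0 || offset > ctx->out_pos + literals_length) return -1;

    /* Step 1: copy literals_length bytes from the literals to the output. */
    memcpy(ctx->out + ctx->out_pos, ctx->literals + ctx->lit_pos, literals_length);
    ctx->out_pos += literals_length;
    ctx->lit_pos += literals_length;

    /* Step 2: copy match_length bytes from `offset` bytes back.
     * Byte-wise so that overlapping matches repeat recent output. */
    for (size_t i = 0; i < match_length; i++) {
        ctx->out[ctx->out_pos] = ctx->out[ctx->out_pos - offset];
        ctx->out_pos++;
    }
    return 0;
}
```

For instance, with literals `abcdef`, the sequence (6, offset 6, 3) produces `abcdefabc`, matching the "offset of 6 and match length of 3" case described above.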
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	919
				920	#### Repeat offsets
				921	As seen in [Sequence Execution](#sequence-execution),
				922	the first 3 values define a repeated offset and we will call them
				923	`Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`.
				924	They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one".
				925
				926	If `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc.
				927
				928	There is an exception, though, when the current sequence's `literals_length` is 0.
				929	In this case, repeated offsets are shifted by one,
				930	so an `offset_value` of 1 means `Repeated_Offset2`,
				931	an `offset_value` of 2 means `Repeated_Offset3`,
				932	and an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`.
				933
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	934	For the first block, the starting offset history is populated with the following values:
				935	`Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	936	unless a dictionary is used, in which case they come from the dictionary.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	937
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	938	Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
				939	Note that blocks which are not `Compressed_Block` are skipped: they do not contribute to offset history.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	940
				941	[Offset Codes]: #offset-codes
				942
				943	###### Offset updates rules
				944
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	945	During the execution of the sequences of a `Compressed_Block`, the
				946	`Repeated_Offsets`' values are kept up to date, so that they always represent
				947	the three most-recently used offsets. In order to achieve that, they are
				948	updated after executing each sequence in the following way:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	949
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	950	When the sequence's `offset_value` does not refer to one of the
				951	`Repeated_Offsets`--when it has value greater than 3, or when it has value 3
				952	and the sequence's `literals_length` is zero--the `Repeated_Offsets`' values
				953	are shifted back one, and `Repeated_Offset1` takes on the value of the
				954	just-used offset.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	955
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	956	Otherwise, when the sequence's `offset_value` refers to one of the
				957	`Repeated_Offsets`--when it has value 1 or 2, or when it has value 3 and the
				958	sequence's `literals_length` is non-zero--the `Repeated_Offsets` are re-ordered
				959	so that `Repeated_Offset1` takes on the value of the used `Repeated_Offset`, and
				960	the existing values are pushed back from the first `Repeated_Offset` through to
				961	the `Repeated_Offset` selected by the `offset_value`. This effectively performs
				962	a single-stepped wrapping rotation of the values of these offsets, so that
				963	their order again reflects the recency of their use.
Yann Collet	9bf0070	2018-10-26 15:51:51 -0700	[diff] [blame]	964
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	965	The following table shows the values of the `Repeated_Offsets` as a series of
				966	sequences are applied to them:
Yann Collet	9bf0070	2018-10-26 15:51:51 -0700	[diff] [blame]	967
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	968	\| `offset_value` \| `literals_length` \| `Repeated_Offset1` \| `Repeated_Offset2` \| `Repeated_Offset3` \| Comment \|
				969	\|:--------------:\|:-----------------:\|:------------------:\|:------------------:\|:------------------:\|:-----------------------:\|
				970	\| \| \| 1 \| 4 \| 8 \| starting values \|
				971	\| 1114 \| 11 \| 1111 \| 1 \| 4 \| non-repeat \|
Yann Collet	f33ccd2	2022-05-24 04:47:49 -0700	[diff] [blame]	972	\| 1 \| 22 \| 1111 \| 1 \| 4 \| repeat 1: no change \|
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	973	\| 2225 \| 22 \| 2222 \| 1111 \| 1 \| non-repeat \|
				974	\| 1114 \| 111 \| 1111 \| 2222 \| 1111 \| non-repeat \|
				975	\| 3336 \| 33 \| 3333 \| 1111 \| 2222 \| non-repeat \|
Yann Collet	f33ccd2	2022-05-24 04:47:49 -0700	[diff] [blame]	976	\| 2 \| 22 \| 1111 \| 3333 \| 2222 \| repeat 2: swap 1 & 2 \|
				977	\| 3 \| 33 \| 2222 \| 1111 \| 3333 \| repeat 3: rotate 3 to 1 \|
				978	\| 3 \| 0 \| 2221 \| 2222 \| 1111 \| special case : insert `repeat1 - 1` \|
				979	\| 1 \| 0 \| 2222 \| 2221 \| 1111 \| == repeat 2 \|
Yann Collet	9bf0070	2018-10-26 15:51:51 -0700	[diff] [blame]	980
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	981
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	982	Skippable Frames
				983	----------------
				984
				985	\| `Magic_Number` \| `Frame_Size` \| `User_Data` \|
				986	\|:--------------:\|:------------:\|:-----------:\|
				987	\| 4 bytes \| 4 bytes \| n bytes \|
				988
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	989	Skippable frames allow the insertion of user-defined metadata
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	990	into a flow of concatenated frames.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	991
				992	Skippable frames defined in this specification are compatible with [LZ4] ones.
				993
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	994	[LZ4]:https://lz4.github.io/lz4/
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	995
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	996	From a compliant decoder perspective, skippable frames need just be skipped,
				997	and their content ignored, resuming decoding after the skippable frame.
				998
				999	It can be noted that a skippable frame
				1000	can be used to watermark a stream of concatenated frames
Dominique Pelle	b772f53	2022-03-12 08:52:40 +0100	[diff] [blame]	1001	embedding any kind of tracking information (even just a UUID).
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1002	Users wary of such a possibility should scan the stream of concatenated frames
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	1003	to detect such frames for analysis or removal.
				1004
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1005	__`Magic_Number`__
				1006
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1007	4 Bytes, __little-endian__ format.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1008	Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
				1009	All 16 values are valid to identify a skippable frame.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1010	This specification doesn't detail any specific tagging for skippable frames.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1011
				1012	__`Frame_Size`__
				1013
				1014	This is the size, in bytes, of the following `User_Data`
				1015	(without including the magic number nor the size field itself).
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1016	This field is represented using 4 Bytes, in __little-endian__ format, as an unsigned 32-bit value.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1017	This means `User_Data` can’t be bigger than (2^32-1) bytes.
				1018
				1019	__`User_Data`__
				1020
				1021	The `User_Data` can be anything. Data will just be skipped by the decoder.
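
A decoder can recognize and step over a skippable frame using only the two header fields. The sketch below illustrates this under simple assumptions: the whole frame is available in one memory buffer, and the function names (`read_le32`, `is_skippable_magic`, `skippable_frame_size`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a 4-byte little-endian unsigned value. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* True for any of the 16 values 0x184D2A50 to 0x184D2A5F. */
static int is_skippable_magic(uint32_t magic)
{
    return (magic & 0xFFFFFFF0u) == 0x184D2A50u;
}

/* Returns the total number of bytes to skip (8-byte header + User_Data),
 * or 0 if `src` does not start with a complete skippable frame. */
static size_t skippable_frame_size(const uint8_t *src, size_t src_size)
{
    assert(src != NULL);
    if (src_size < 8) return 0;
    if (!is_skippable_magic(read_le32(src))) return 0;
    const uint32_t frame_size = read_le32(src + 4);
    if (frame_size > src_size - 8) return 0;   /* truncated User_Data */
    return 8 + (size_t)frame_size;
}
```

Decoding then resumes at `src + skippable_frame_size(src, src_size)` whenever the returned value is non-zero.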
				1022
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1023
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1024
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1025	Entropy Encoding
				1026	----------------
				1027	Two types of entropy encoding are used by the Zstandard format:
				1028	FSE, and Huffman coding.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1029	Huffman is used to compress literals,
				1030	while FSE is used for all other symbols
				1031	(`Literals_Length_Code`, `Match_Length_Code`, offset codes)
				1032	and to compress Huffman headers.
				1033
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1034
				1035	FSE
				1036	---
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1037	FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1038	FSE encoding/decoding involves a state that is carried over between symbols,
				1039	so decoding must be done in the opposite direction as encoding.
				1040	Therefore, all FSE bitstreams are read from end to beginning.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1041	Note that the order of the bits in the stream is not reversed;
				1042	the elements are simply read in the reverse of the order in which they were written.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1043
				1044	For additional details on FSE, see [Finite State Entropy].
				1045
				1046	[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
				1047
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1048	FSE decoding involves a decoding table whose size is a power of 2, and whose entries each contain three elements:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1049	`Symbol`, `Num_Bits`, and `Baseline`.
				1050	The `log2` of the table size is its `Accuracy_Log`.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1051	An FSE state value represents an index in this table.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1052
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1053	To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
				1054	The next symbol in the stream is the `Symbol` indicated in the table for that state.
				1055	To obtain the next state value,
				1056	the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1057
				1058	[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
				1059
				1060	### FSE Table Description
				1061	To decode FSE streams, it is necessary to construct the decoding table.
				1062	The Zstandard format encodes FSE table descriptions as follows:
				1063
				1064	An FSE distribution table describes the probabilities of all symbols
				1065	from `0` to the last present one (included)
				1066	on a normalized scale of `1 << Accuracy_Log` .
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1067	Note that there must be two or more symbols with nonzero probability.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1068
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1069	It's a bitstream which is read forward, in __little-endian__ fashion.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1070	It's not necessary to know the exact size of the bitstream:
				1071	it will be discovered and reported by the decoding process.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1072
				1073	The bitstream starts by reporting on which scale it operates.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1074	Let `low4Bits` designate the lowest 4 bits of the first byte:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1075	`Accuracy_Log = low4Bits + 5`.
				1076
				1077	Then follows each symbol value, from `0` to last present one.
				1078	The number of bits used by each field is variable.
				1079	It depends on :
				1080
				1081	- Remaining probabilities + 1 :
				1082	__example__ :
				1083	Presuming an `Accuracy_Log` of 8,
				1084	and presuming 100 probabilities points have already been distributed,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1085	the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
				1086	Therefore, it must read `log2sup(157) == 8` bits.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1087
				1088	- Value decoded : small values use 1 less bit :
				1089	__example__ :
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1090	Presuming values from 0 to 157 (inclusive) are possible,
				1091	255-157 = 98 values are remaining in an 8-bits field.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1092	They are used this way :
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1093	first 98 values (hence from 0 to 97) use only 7 bits,
				1094	values from 98 to 157 use 8 bits.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1095	This is achieved through this scheme :
				1096
				1097	\| Value read \| Value decoded \| Number of bits used \|
				1098	\| ---------- \| ------------- \| ------------------- \|
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1099	\| 0 - 97 \| 0 - 97 \| 7 \|
				1100	\| 98 - 127 \| 98 - 127 \| 8 \|
				1101	\| 128 - 225 \| 0 - 97 \| 7 \|
				1102	\| 226 - 255 \| 128 - 157 \| 8 \|
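
The variable-width read can be sketched as a pure function of the next bits in the stream. In this sketch, `peeked` is assumed to hold at least the next `log2sup(max_value)` bits of the forward bitstream in its low bits, and `max_value` is the largest value that may be read (157 in the example above). The names `fse_field` and `fse_decode_field` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t value; int bits_used; } fse_field;

/* Index of the highest set bit of a non-zero value. */
static int highest_set_bit(uint32_t v)
{
    assert(v != 0);
    int n = 0;
    while (v >>= 1) n++;
    return n;
}

/* Decodes one field: small values use one bit less than large ones. */
static fse_field fse_decode_field(uint32_t peeked, uint32_t max_value)
{
    const int bits = highest_set_bit(max_value) + 1;   /* 8 for 157 */
    const uint32_t full_mask = (1u << bits) - 1;       /* 255 */
    const uint32_t low_mask = full_mask >> 1;          /* 127 */
    const uint32_t threshold = full_mask - max_value;  /* 98 short codes */
    fse_field f;

    if ((peeked & low_mask) < threshold) {
        /* Short form: the value fits in bits-1 bits. */
        f.value = peeked & low_mask;
        f.bits_used = bits - 1;
        return f;
    }
    /* Long form: the extra top bit selects the upper range. */
    f.value = peeked & full_mask;
    if (f.value > low_mask) f.value -= threshold;
    f.bits_used = bits;
    return f;
}
```

With `max_value = 157`, a read of 226 decodes to 128 using 8 bits, while a read of 178 decodes to 50 using only 7 bits, in line with the table above.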
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1103
				1104	Symbols probabilities are read one by one, in order.
				1105
				1106	Probability is obtained from Value decoded by the following formula:
				1107	`Proba = value - 1`
				1108
				1109	It means value `0` becomes negative probability `-1`.
				1110	`-1` is a special probability, which means "less than 1".
				1111	Its effect on distribution table is described in the [next section].
				1112	For the purpose of calculating total allocated probability points, it counts as one.
				1113
				1114	[next section]:#from-normalized-distribution-to-decoding-tables
				1115
				1116	When a symbol has a __probability__ of `zero`,
				1117	it is followed by a 2-bits repeat flag.
				1118	This repeat flag tells how many zero probabilities follow the current one.
				1119	It provides a number ranging from 0 to 3.
				1120	If it is a 3, another 2-bits repeat flag follows, and so on.
				1121
				1122	When the last symbol brings the cumulated total to `1 << Accuracy_Log`,
				1123	decoding is complete.
				1124	If the last symbol makes cumulated total go above `1 << Accuracy_Log`,
				1125	distribution is considered corrupted.
				1126
				1127	Then the decoder can tell how many bytes were used in this process,
				1128	and how many symbols are present.
				1129	The bitstream consumes a round number of bytes.
				1130	Any remaining bit within the last byte is just unused.
				1131
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1132	#### From normalized distribution to decoding tables
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1133
				1134	The distribution of normalized probabilities is enough
				1135	to create a unique decoding table.
				1136
				1137	The table is built according to the following rule:
				1138
				1139	The table has a size of `Table_Size = 1 << Accuracy_Log`.
				1140	Each cell describes the symbol decoded,
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1141	and instructions to get the next state (`Number_of_Bits` and `Baseline`).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1142
				1143	Symbols are scanned in their natural order for "less than 1" probabilities.
				1144	Symbols with this probability are being attributed a single cell,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1145	starting from the end of the table and retreating.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1146	These symbols define a full state reset, reading `Accuracy_Log` bits.
				1147
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1148	Then, all remaining symbols, sorted in natural order, are allocated cells.
				1149	Starting from symbol `0` (if it exists), and table position `0`,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1150	each symbol gets allocated as many cells as its probability.
Dimitris Apostolou	ebbd675	2021-11-13 10:04:04 +0200	[diff] [blame]	1151	Cell allocation is spread, not linear :
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1152	each successor position follows this rule :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1153
				1154	```
				1155	position += (tableSize>>1) + (tableSize>>3) + 3;
				1156	position &= tableSize-1;
				1157	```
				1158
				1159	A position is skipped if already occupied by a "less than 1" probability symbol.
				1160	`position` does not reset between symbols; it simply iterates through
				1161	each position in the table, switching to the next symbol when enough
				1162	states have been allocated to the current one.
				1163
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1164	The process guarantees that the table is entirely filled.
				1165	Each cell corresponds to a state value, which contains the symbol being decoded.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1166
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1167	To add the `Number_of_Bits` and `Baseline` required to retrieve next state,
				1168	it's first necessary to sort all occurrences of each symbol in state order.
				1169	Lower states will need 1 more bit than higher ones.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1170	The process is repeated for each symbol.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1171
				1172	__Example__ :
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1173	Presuming a symbol has a probability of 5,
				1174	it receives 5 cells, corresponding to 5 state values.
				1175	These state values are then sorted in natural order.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1176
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1177	Next power of 2 after 5 is 8.
				1178	Space of probabilities must be divided into 8 equal parts.
				1179	Presuming the `Accuracy_Log` is 7, it defines a space of 128 states.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1180	Divided by 8, each share is 16 large.
				1181
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1182	In order to reach 8 shares, 8-5=3 lowest states will count "double",
				1183	doubling their shares (32 in width), hence requiring one more bit.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1184
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1185	`Baseline` is assigned starting from the higher states using fewer bits,
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1186	increasing at each state, then resuming at the first state;
				1187	each state takes its allocated width from `Baseline`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1188
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1189	\| state value \| 1 \| 39 \| 77 \| 84 \| 122 \|
				1190	\| state order \| 0 \| 1 \| 2 \| 3 \| 4 \|
				1191	\| ---------------- \| ----- \| ----- \| ------ \| ---- \| ------ \|
				1192	\| width \| 32 \| 32 \| 32 \| 16 \| 16 \|
				1193	\| `Number_of_Bits` \| 5 \| 5 \| 5 \| 4 \| 4 \|
				1194	\| range number \| 2 \| 4 \| 6 \| 0 \| 1 \|
				1195	\| `Baseline` \| 32 \| 64 \| 96 \| 0 \| 16 \|
				1196	\| range \| 32-63 \| 64-95 \| 96-127 \| 0-15 \| 16-31 \|
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1197
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1198	During decoding, the next state value is determined from current state value,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1199	by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
				1200
				1201	See [Appendix A] for the results of this process applied to the default distributions.
				1202
				1203	[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
				1204
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1205
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1206	Huffman Coding
				1207	--------------
				1208	Zstandard Huffman-coded streams are read backwards,
				1209	similar to the FSE bitstreams.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1210	Therefore, to find the start of the bitstream, it is required to
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1211	know the offset of the last byte of the Huffman-coded stream.
				1212
				1213	After writing the last bit containing information, the compressor
				1214	writes a single `1`-bit and then fills the byte with 0-7 `0` bits of
				1215	padding. The last byte of the compressed bitstream cannot be `0` for
				1216	that reason.
				1217
				1218	When decompressing, the last byte containing the padding is the first
				1219	byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
				1220	the first `1`-bit it encounters. Afterwards, the useful part of the bitstream
				1221	begins.
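
Locating the start of the useful bits only requires inspecting the last byte. A minimal sketch, with the hypothetical helper name `padding_bits`:

```c
#include <assert.h>
#include <stdint.h>

/* Returns how many bits of the last byte are padding: the 0-7 leading
 * `0` bits plus the single `1` marker bit, so a value from 1 to 8.
 * Returns -1 if the byte is 0, which is not a valid last byte. */
static int padding_bits(uint8_t last_byte)
{
    if (last_byte == 0) return -1;
    int highest = 7;
    while (((last_byte >> highest) & 1) == 0) highest--;
    assert(highest >= 0 && highest <= 7);
    return 8 - highest;
}
```

For example, a last byte of `0x80` carries only the marker bit (1 bit of padding), whereas `0x01` is entirely padding (8 bits), so the useful data starts in the preceding byte.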
				1222
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1223	The bitstream contains Huffman-coded symbols in __little-endian__ order,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1224	with the codes defined by the method below.
				1225
				1226	### Huffman Tree Description
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1227
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1228	Prefix coding represents symbols from an a priori known alphabet
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	1229	by bit sequences (codewords), one codeword for each symbol,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1230	in a manner such that different symbols may be represented
				1231	by bit sequences of different lengths,
				1232	but a parser can always parse an encoded string
				1233	unambiguously symbol-by-symbol.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1234
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1235	Given an alphabet with known symbol frequencies,
				1236	the Huffman algorithm allows the construction of an optimal prefix code
				1237	using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1238
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1239	Prefix code must not exceed a maximum code length.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1240	More bits improve accuracy but cost more header size,
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	1241	and require more memory or more complex decoding operations.
				1242	This specification limits maximum code length to 11 bits.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1243
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1244	#### Representation
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1245
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1246	All literal values from zero (included) to last present one (excluded)
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1247	are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
				1248	Transformation from `Weight` to `Number_of_Bits` follows this formula :
				1249	```
				1250	Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
				1251	```
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1252	When a literal value is not present, it receives a `Weight` of 0.
				1253	The least frequent symbol receives a `Weight` of 1.
				1254	Consequently, the `Weight` 1 is necessarily present.
				1255	The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
				1256	The last symbol's `Weight` is deduced from previously retrieved Weights,
				1257	by completing to the nearest power of 2. It's necessarily non 0.
				1258	If it's not possible to reach a clean power of 2 with a single `Weight` value,
				1259	the Huffman Tree Description is considered invalid.
				1260	This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
Yann Collet	55a8f84	2018-09-05 12:25:35 -0700	[diff] [blame]	1261	`Max_Number_of_Bits` must be <= 11,
				1262	otherwise the representation is considered corrupted.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1263
				1264	__Example__ :
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	1265	Let's presume the following Huffman tree must be described :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1266
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1267	\| literal value \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1268	\| ---------------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				1269	\| `Number_of_Bits` \| 1 \| 2 \| 3 \| 0 \| 4 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1270
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1271	The tree depth is 4, since its longest elements use 4 bits
				1272	(the longest elements are the ones with the smallest frequency).
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1273	Literal value `5` will not be listed, as it can be determined from previous values 0-4,
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1274	nor will values above `5` as they are all 0.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1275	Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1276	Weight formula is :
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1277	```
				1278	Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
				1279	```
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1280	It gives the following series of weights :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1281
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1282	\| literal value \| 0 \| 1 \| 2 \| 3 \| 4 \|
				1283	\| ------------- \| --- \| --- \| --- \| --- \| --- \|
				1284	\| `Weight` \| 4 \| 3 \| 2 \| 0 \| 1 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1285
				1286	The decoder will do the inverse operation :
Yann Collet	55a8f84	2018-09-05 12:25:35 -0700	[diff] [blame]	1287	having collected weights of literal symbols from `0` to `4`,
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	1288	it knows the last literal, `5`, is present with a non-zero `Weight`.
				1289	The `Weight` of `5` can be determined by advancing to the next power of 2.
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1290	The sum of `2^(Weight-1)` (excluding 0's) is :
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1291	`8 + 4 + 2 + 0 + 1 = 15`.
Yann Collet	55a8f84	2018-09-05 12:25:35 -0700	[diff] [blame]	1292	Nearest larger power of 2 value is 16.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1293	Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.

#### Huffman Tree header

This is a single byte value (0-255),
which describes how the series of weights is encoded.

- if `headerByte` < 128 :
  the series of weights is compressed using FSE (see below).
  The length of the FSE-compressed series is equal to `headerByte` (0-127).

- if `headerByte` >= 128 :
  + the series of weights uses a direct representation,
    where each `Weight` is encoded directly as a 4-bit field (0-15).
  + They are encoded forward, 2 weights to a byte,
    the first weight taking the top four bits and the second one taking the bottom four.
    * e.g. the following operations could be used to read the weights:
      `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.
  + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes,
    meaning it uses only full bytes even if `Number_of_Weights` is odd.
  + `Number_of_Weights = headerByte - 127`.
    * Note that the maximum `Number_of_Weights` is 255-127 = 128,
      therefore only up to 128 `Weight` values can be encoded using direct representation.
    * Since the last non-zero `Weight` is _not_ encoded,
      this scheme is compatible with alphabet sizes of up to 129 symbols,
      hence including literal symbol 128.
    * If any literal symbol > 128 has a non-zero `Weight`,
      direct representation is not possible.
      In such a case, it's necessary to use FSE compression.
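
The direct representation can be read with a few lines of code. The sketch below is illustrative only; `read_direct_weights` is a hypothetical helper, and it assumes `data` starts at `headerByte`.

```python
def read_direct_weights(data):
    """Parse a direct-representation weight list; returns (weights, bytes_consumed)."""
    header_byte = data[0]
    if header_byte < 128:
        raise ValueError("headerByte < 128: weights are FSE-compressed")
    number_of_weights = header_byte - 127
    nb_bytes = (number_of_weights + 1) // 2  # Ceiling(Number_of_Weights / 2)
    payload = data[1:1 + nb_bytes]
    if len(payload) < nb_bytes:
        raise ValueError("truncated weight list")
    weights = []
    for byte in payload:
        weights.append(byte >> 4)    # first weight: top four bits
        weights.append(byte & 0x0F)  # second weight: bottom four bits
    # With an odd count, the low nibble of the last byte is unused.
    return weights[:number_of_weights], 1 + nb_bytes


weights, used = read_direct_weights(bytes([132, 0x43, 0x20, 0x10]))
print(weights, used)  # [4, 3, 2, 0, 1] 4
```

Here `headerByte = 132` announces `132 - 127 = 5` weights, stored in 3 bytes.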

#### Finite State Entropy (FSE) compression of Huffman weights

In this case, the series of Huffman weights is compressed using FSE compression.
It's a single bitstream with 2 interleaved states,
sharing a single distribution table.

To decode an FSE bitstream, it is necessary to know its compressed size.
Compressed size is provided by `headerByte`.
It's also necessary to know its _maximum possible_ decompressed size,
which is `255`, since literal values span from `0` to `255`,
and the last symbol's `Weight` is not represented.

An FSE bitstream starts with a header describing the probability distribution.
This header is used to build a Decoding Table.
For a list of Huffman weights, the maximum accuracy log is 6 bits.
For more details, see the [FSE header description](#fse-table-description).

The Huffman header compression uses 2 states,
which share the same FSE distribution table.
The first state (`State1`) encodes the even indexed symbols,
and the second (`State2`) encodes the odd indexed symbols.
`State1` is initialized first, and then `State2`, and they take turns
decoding a single symbol and updating their state.
For more details on these FSE operations, see the [FSE section](#fse).

The number of symbols to decode is determined
by tracking the bitstream overflow condition :
if updating a state after decoding a symbol would require more bits than
remain in the stream, the missing bits are assumed to be 0. Then,
the symbols for each of the final states are decoded and the process is complete.

#### Conversion from weights to Huffman prefix codes

All present symbols shall now have a `Weight` value.
It is possible to transform weights into `Number_of_Bits`, using this formula:
```
Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
```
Symbols are sorted by `Weight`.
Within the same `Weight`, symbols keep natural sequential order.
Symbols with a `Weight` of zero are removed.
Then, starting from the lowest `Weight`, prefix codes are distributed in sequential order.

__Example__ :
Let's presume the following list of weights has been decoded :

| Literal  |  0  |  1  |  2  |  3  |  4  |  5  |
| -------- | --- | --- | --- | --- | --- | --- |
| `Weight` |  4  |  3  |  2  |  0  |  1  |  1  |

Sorted by weight and then natural sequential order,
it gives the following distribution :

| Literal          |  3  |  4  |  5  |  2  |  1  |   0  |
| ---------------- | --- | --- | --- | --- | --- | ---- |
| `Weight`         |  0  |  1  |  1  |  2  |  3  |   4  |
| `Number_of_Bits` |  0  |  4  |  4  |  3  |  2  |   1  |
| prefix codes     | N/A | 0000| 0001| 001 | 01  |   1  |
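
The conversion can be sketched in a few lines. This is one possible way to apply the rules above, not the reference implementation; `build_prefix_codes` is a hypothetical name, and the input is assumed to be the complete weight list, including the deduced last `Weight`.

```python
def build_prefix_codes(weights):
    """Map each present symbol to its prefix code, given all weights."""
    total = sum(1 << (w - 1) for w in weights if w > 0)
    if total == 0 or total & (total - 1):
        raise ValueError("sum of 2^(Weight-1) must be a power of 2")
    max_number_of_bits = total.bit_length() - 1
    # Sort by Weight, then by natural symbol order; zero weights are removed.
    present = sorted((w, symbol) for symbol, w in enumerate(weights) if w > 0)
    codes = {}
    position = 0  # next free slot in a table of 2^Max_Number_of_Bits entries
    for w, symbol in present:
        number_of_bits = max_number_of_bits + 1 - w
        # A symbol of Weight w covers 2^(w-1) slots; its code is the
        # Number_of_Bits highest bits of its first slot index.
        code = position >> (w - 1)
        codes[symbol] = format(code, "0{}b".format(number_of_bits))
        position += 1 << (w - 1)
    return codes


print(build_prefix_codes([4, 3, 2, 0, 1, 1]))
# {4: '0000', 5: '0001', 2: '001', 1: '01', 0: '1'}
```

The output matches the prefix codes listed in the example table.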

### Huffman-coded Streams

Given a Huffman decoding table,
it's possible to decode a Huffman-coded stream.

Each bitstream must be read _backward_,
that is starting from the end down to the beginning.
Therefore it's necessary to know the size of each bitstream.

It's also necessary to know exactly which _bit_ is the last one.
This is detected by a final bit flag :
the highest set bit of the last byte is a final-bit-flag.
Consequently, a last byte of `0` is not possible.
And the final-bit-flag itself is not part of the useful bitstream.
Hence, the last byte contains between 0 and 7 useful bits.

Starting from the end,
it's possible to read the bitstream in a __little-endian__ fashion,
keeping track of already used bits. Since the bitstream is encoded in reverse
order, reading it from the end yields the symbols in forward order.

For example, if the literal sequence "0145" was encoded using the above prefix code,
it would be encoded (in reverse order) as:

|Symbol  |   5  |   4  |  1 | 0 | Padding |
|--------|------|------|----|---|---------|
|Encoding|`0001`|`0000`|`01`|`1`| `00001` |

Resulting in the following 2-byte bitstream :
```
00000001 00001101
```

Here is an alternative representation with the symbol codes separated by underscore:
```
0000_0001 00001_1_01
```

Reading the highest `Max_Number_of_Bits` bits,
it's possible to compare the extracted value to the decoding table,
determining the symbol to decode and the number of bits to discard.

The process continues until the required number of symbols per stream has been read.
If a bitstream is not entirely and exactly consumed,
that is, if decoding does not end exactly at its beginning position with _all_ bits consumed,
the decoding process is considered faulty.
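
The backward reading rule can be illustrated with a small round trip. This is a sketch under simplifying assumptions: `CODES`, `encode_stream` and `decode_stream` are hypothetical names, the codes are those of the example prefix table, and the decoder matches codes bit by bit instead of looking up `Max_Number_of_Bits` bits in a decoding table as a real decoder would.

```python
CODES = {0: "1", 1: "01", 2: "001", 4: "0000", 5: "0001"}


def encode_stream(literals, codes):
    """Pack literals so that a backward reader yields them in forward order."""
    # The first literal sits just below the final-bit-flag, the last one in bit 0.
    bits = "1" + "".join(codes[s] for s in literals)
    bits = bits.zfill(-(-len(bits) // 8) * 8)  # zero padding above the flag
    return int(bits, 2).to_bytes(len(bits) // 8, "little")


def decode_stream(stream, codes, count):
    """Read a stream from its end and return exactly `count` literals."""
    if not stream or stream[-1] == 0:
        raise ValueError("last byte must contain the final-bit-flag")
    value = int.from_bytes(stream, "little")
    useful_bits = value.bit_length() - 1  # everything below the flag
    by_code = {code: symbol for symbol, code in codes.items()}
    out, current = [], ""
    for i in range(useful_bits - 1, -1, -1):
        current += str((value >> i) & 1)
        if current in by_code:
            out.append(by_code[current])
            current = ""
    # The stream must be consumed entirely and exactly.
    if current or len(out) != count:
        raise ValueError("bitstream not exactly consumed")
    return out


stream = encode_stream([0, 1, 4, 5], CODES)
print(decode_stream(stream, CODES, 4))  # [0, 1, 4, 5]
```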

Dictionary Format
-----------------

Zstandard is compatible with "raw content" dictionaries,
free of any format restriction, except that they must be at least 8 bytes.
These dictionaries function as if they were just the `Content` part
of a formatted dictionary.

But dictionaries created by `zstd --train` follow a format, described here.

__Pre-requisites__ : a dictionary has a size,
defined either by a buffer limit, or a file size.

| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
| -------------- | --------------- | ---------------- | --------- |

__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format

__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format.
`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
It's used by decoders to check if they use the correct dictionary.

_Reserved ranges :_
If the dictionary is going to be distributed in a public environment,
the following ranges of `Dictionary_ID` are reserved for some future registrar
and shall not be used :

- low range  : <= 32767
- high range : >= (2^31)

Outside of these ranges, any value of `Dictionary_ID`
which is both `>= 32768` and `< (1<<31)` can be used freely,
even in a public environment.
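
As an illustration, the first 8 bytes of a dictionary could be inspected as below. This is a hedged sketch: `read_dictionary_header` and the class labels are hypothetical, and rejecting a `Dictionary_ID` of 0 inside a formatted dictionary is an assumption of this example.

```python
import struct

DICT_MAGIC = 0xEC30A437


def read_dictionary_header(dictionary):
    """Return (Dictionary_ID, id_class), or None for a raw content dictionary."""
    if len(dictionary) < 8:
        raise ValueError("a dictionary must be at least 8 bytes")
    # Both fields are 4 bytes, little-endian.
    magic, dictionary_id = struct.unpack_from("<II", dictionary, 0)
    if magic != DICT_MAGIC:
        return None  # no magic number: treated as raw content
    if dictionary_id == 0:
        raise ValueError("Dictionary_ID 0 is not a valid ID")  # assumption
    if dictionary_id <= 32767:
        id_class = "reserved-low"
    elif dictionary_id >= (1 << 31):
        id_class = "reserved-high"
    else:
        id_class = "free"
    return dictionary_id, id_class


header = struct.pack("<II", DICT_MAGIC, 40000) + b"rest of the dictionary"
print(read_dictionary_header(header))  # (40000, 'free')
```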

__`Entropy_Tables`__ : follow the same format as tables in [compressed blocks].
See the relevant [FSE](#fse-table-description)
and [Huffman](#huffman-tree-description) sections for how to decode these tables.
They are stored in the following order :
Huffman tables for literals, FSE table for offsets,
FSE table for match lengths, and FSE table for literals lengths.
These tables populate the Repeat Stats literals mode and
Repeat distribution mode for sequence decoding.
They are finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes.
Each recent offset must have a value <= dictionary content size, and cannot equal 0.

__`Content`__ : The rest of the dictionary is its content.
The content acts as a "past" in front of data to compress or decompress,
so it can be referenced in sequence commands.
As long as the amount of data decoded from this frame is less than or
equal to `Window_Size`, sequence commands may specify offsets longer
than the total length of decoded output so far to reference back to the
dictionary, even parts of the dictionary with offsets larger than `Window_Size`.
After the total output has surpassed `Window_Size` however,
this is no longer allowed and the dictionary is no longer accessible.

[compressed blocks]: #the-format-of-compressed_block

If a dictionary is provided by an external source,
it should be loaded with great care, its content considered untrusted.

Appendix A - Decoding tables for predefined codes
-------------------------------------------------

This appendix contains FSE decoding tables
for the predefined literal length, match length, and offset codes.
The tables have been constructed using the algorithm as given above in chapter
"from normalized distribution to decoding tables".
The tables here can be used as examples
to crosscheck that an implementation builds its decoding tables correctly.
Johannes Rudolph	6fb4d67	2016-09-14 19:01:04 +0200	[diff] [blame]	1504
				1505	#### Literal Length Code:
				1506
				1507	\| State \| Symbol \| Number_Of_Bits \| Base \|
				1508	\| ----- \| ------ \| -------------- \| ---- \|
				1509	\| 0 \| 0 \| 4 \| 0 \|
				1510	\| 1 \| 0 \| 4 \| 16 \|
				1511	\| 2 \| 1 \| 5 \| 32 \|
				1512	\| 3 \| 3 \| 5 \| 0 \|
				1513	\| 4 \| 4 \| 5 \| 0 \|
				1514	\| 5 \| 6 \| 5 \| 0 \|
				1515	\| 6 \| 7 \| 5 \| 0 \|
				1516	\| 7 \| 9 \| 5 \| 0 \|
				1517	\| 8 \| 10 \| 5 \| 0 \|
				1518	\| 9 \| 12 \| 5 \| 0 \|
				1519	\| 10 \| 14 \| 6 \| 0 \|
				1520	\| 11 \| 16 \| 5 \| 0 \|
				1521	\| 12 \| 18 \| 5 \| 0 \|
				1522	\| 13 \| 19 \| 5 \| 0 \|
				1523	\| 14 \| 21 \| 5 \| 0 \|
				1524	\| 15 \| 22 \| 5 \| 0 \|
				1525	\| 16 \| 24 \| 5 \| 0 \|
				1526	\| 17 \| 25 \| 5 \| 32 \|
				1527	\| 18 \| 26 \| 5 \| 0 \|
				1528	\| 19 \| 27 \| 6 \| 0 \|
				1529	\| 20 \| 29 \| 6 \| 0 \|
				1530	\| 21 \| 31 \| 6 \| 0 \|
				1531	\| 22 \| 0 \| 4 \| 32 \|
				1532	\| 23 \| 1 \| 4 \| 0 \|
				1533	\| 24 \| 2 \| 5 \| 0 \|
				1534	\| 25 \| 4 \| 5 \| 32 \|
				1535	\| 26 \| 5 \| 5 \| 0 \|
				1536	\| 27 \| 7 \| 5 \| 32 \|
				1537	\| 28 \| 8 \| 5 \| 0 \|
				1538	\| 29 \| 10 \| 5 \| 32 \|
				1539	\| 30 \| 11 \| 5 \| 0 \|
				1540	\| 31 \| 13 \| 6 \| 0 \|
				1541	\| 32 \| 16 \| 5 \| 32 \|
				1542	\| 33 \| 17 \| 5 \| 0 \|
				1543	\| 34 \| 19 \| 5 \| 32 \|
				1544	\| 35 \| 20 \| 5 \| 0 \|
				1545	\| 36 \| 22 \| 5 \| 32 \|
				1546	\| 37 \| 23 \| 5 \| 0 \|
				1547	\| 38 \| 25 \| 4 \| 0 \|
				1548	\| 39 \| 25 \| 4 \| 16 \|
				1549	\| 40 \| 26 \| 5 \| 32 \|
				1550	\| 41 \| 28 \| 6 \| 0 \|
				1551	\| 42 \| 30 \| 6 \| 0 \|
				1552	\| 43 \| 0 \| 4 \| 48 \|
				1553	\| 44 \| 1 \| 4 \| 16 \|
				1554	\| 45 \| 2 \| 5 \| 32 \|
				1555	\| 46 \| 3 \| 5 \| 32 \|
				1556	\| 47 \| 5 \| 5 \| 32 \|
				1557	\| 48 \| 6 \| 5 \| 32 \|
				1558	\| 49 \| 8 \| 5 \| 32 \|
				1559	\| 50 \| 9 \| 5 \| 32 \|
				1560	\| 51 \| 11 \| 5 \| 32 \|
				1561	\| 52 \| 12 \| 5 \| 32 \|
				1562	\| 53 \| 15 \| 6 \| 0 \|
				1563	\| 54 \| 17 \| 5 \| 32 \|
				1564	\| 55 \| 18 \| 5 \| 32 \|
				1565	\| 56 \| 20 \| 5 \| 32 \|
				1566	\| 57 \| 21 \| 5 \| 32 \|
				1567	\| 58 \| 23 \| 5 \| 32 \|
				1568	\| 59 \| 24 \| 5 \| 32 \|
				1569	\| 60 \| 35 \| 6 \| 0 \|
				1570	\| 61 \| 34 \| 6 \| 0 \|
				1571	\| 62 \| 33 \| 6 \| 0 \|
				1572	\| 63 \| 32 \| 6 \| 0 \|

#### Match Length Code:

| State | Symbol | Number_Of_Bits | Base |
| ----- | ------ | -------------- | ---- |
| 0 | 0 | 6 | 0 |
| 1 | 1 | 4 | 0 |
| 2 | 2 | 5 | 32 |
| 3 | 3 | 5 | 0 |
| 4 | 5 | 5 | 0 |
| 5 | 6 | 5 | 0 |
| 6 | 8 | 5 | 0 |
| 7 | 10 | 6 | 0 |
| 8 | 13 | 6 | 0 |
| 9 | 16 | 6 | 0 |
| 10 | 19 | 6 | 0 |
| 11 | 22 | 6 | 0 |
| 12 | 25 | 6 | 0 |
| 13 | 28 | 6 | 0 |
| 14 | 31 | 6 | 0 |
| 15 | 33 | 6 | 0 |
| 16 | 35 | 6 | 0 |
| 17 | 37 | 6 | 0 |
| 18 | 39 | 6 | 0 |
| 19 | 41 | 6 | 0 |
| 20 | 43 | 6 | 0 |
| 21 | 45 | 6 | 0 |
| 22 | 1 | 4 | 16 |
| 23 | 2 | 4 | 0 |
| 24 | 3 | 5 | 32 |
| 25 | 4 | 5 | 0 |
| 26 | 6 | 5 | 32 |
| 27 | 7 | 5 | 0 |
| 28 | 9 | 6 | 0 |
| 29 | 12 | 6 | 0 |
| 30 | 15 | 6 | 0 |
| 31 | 18 | 6 | 0 |
| 32 | 21 | 6 | 0 |
| 33 | 24 | 6 | 0 |
| 34 | 27 | 6 | 0 |
| 35 | 30 | 6 | 0 |
| 36 | 32 | 6 | 0 |
| 37 | 34 | 6 | 0 |
| 38 | 36 | 6 | 0 |
| 39 | 38 | 6 | 0 |
| 40 | 40 | 6 | 0 |
| 41 | 42 | 6 | 0 |
| 42 | 44 | 6 | 0 |
| 43 | 1 | 4 | 32 |
| 44 | 1 | 4 | 48 |
| 45 | 2 | 4 | 16 |
| 46 | 4 | 5 | 32 |
| 47 | 5 | 5 | 32 |
| 48 | 7 | 5 | 32 |
| 49 | 8 | 5 | 32 |
| 50 | 11 | 6 | 0 |
| 51 | 14 | 6 | 0 |
| 52 | 17 | 6 | 0 |
| 53 | 20 | 6 | 0 |
| 54 | 23 | 6 | 0 |
| 55 | 26 | 6 | 0 |
| 56 | 29 | 6 | 0 |
| 57 | 52 | 6 | 0 |
| 58 | 51 | 6 | 0 |
| 59 | 50 | 6 | 0 |
| 60 | 49 | 6 | 0 |
| 61 | 48 | 6 | 0 |
| 62 | 47 | 6 | 0 |
| 63 | 46 | 6 | 0 |

#### Offset Code:

| State | Symbol | Number_Of_Bits | Base |
| ----- | ------ | -------------- | ---- |
| 0 | 0 | 5 | 0 |
| 1 | 6 | 4 | 0 |
| 2 | 9 | 5 | 0 |
| 3 | 15 | 5 | 0 |
| 4 | 21 | 5 | 0 |
| 5 | 3 | 5 | 0 |
| 6 | 7 | 4 | 0 |
| 7 | 12 | 5 | 0 |
| 8 | 18 | 5 | 0 |
| 9 | 23 | 5 | 0 |
| 10 | 5 | 5 | 0 |
| 11 | 8 | 4 | 0 |
| 12 | 14 | 5 | 0 |
| 13 | 20 | 5 | 0 |
| 14 | 2 | 5 | 0 |
| 15 | 7 | 4 | 16 |
| 16 | 11 | 5 | 0 |
| 17 | 17 | 5 | 0 |
| 18 | 22 | 5 | 0 |
| 19 | 4 | 5 | 0 |
| 20 | 8 | 4 | 16 |
| 21 | 13 | 5 | 0 |
| 22 | 19 | 5 | 0 |
| 23 | 1 | 5 | 0 |
| 24 | 6 | 4 | 16 |
| 25 | 10 | 5 | 0 |
| 26 | 16 | 5 | 0 |
| 27 | 28 | 5 | 0 |
| 28 | 27 | 5 | 0 |
| 29 | 26 | 5 | 0 |
| 30 | 25 | 5 | 0 |
| 31 | 24 | 5 | 0 |
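
As a crosscheck, the Offset Code table can be rebuilt from its predefined distribution. The sketch below is not the reference implementation: `build_fse_decoding_table` is a hypothetical name, and the distribution and accuracy log are copied from the default-distribution section of this specification, so they should be re-checked against that section before relying on them.

```python
# Assumed from the default distribution section of the specification.
OFFSET_DEFAULT_DISTRIBUTION = [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
                               1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1]
OFFSET_ACCURACY_LOG = 5


def build_fse_decoding_table(distribution, accuracy_log):
    """Return a list of (Symbol, Number_Of_Bits, Base), indexed by state."""
    size = 1 << accuracy_log
    symbols = [None] * size
    # "Less than 1" probabilities (-1) take one cell each, from the table end.
    high = size - 1
    for symbol, prob in enumerate(distribution):
        if prob == -1:
            symbols[high] = symbol
            high -= 1
    # Spread the other symbols, skipping the cells reserved above.
    step = (size >> 1) + (size >> 3) + 3
    position = 0
    for symbol, prob in enumerate(distribution):
        for _ in range(max(prob, 0)):
            symbols[position] = symbol
            position = (position + step) & (size - 1)
            while position > high:
                position = (position + step) & (size - 1)
    if position != 0:
        raise ValueError("distribution does not fill the table exactly")
    # Assign Number_Of_Bits and Base to each state, in state order.
    counters = [1 if prob == -1 else prob for prob in distribution]
    table = []
    for symbol in symbols:
        x = counters[symbol]
        counters[symbol] += 1
        number_of_bits = accuracy_log - (x.bit_length() - 1)
        base = (x << number_of_bits) - size
        table.append((symbol, number_of_bits, base))
    return table


table = build_fse_decoding_table(OFFSET_DEFAULT_DISTRIBUTION, OFFSET_ACCURACY_LOG)
for state in (0, 1, 24, 31):
    print(state, *table[state])
# 0 0 5 0
# 1 6 4 0
# 24 6 4 16
# 31 24 5 0
```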

Appendix B - Resources for implementers
-------------------------------------------------

An open source reference implementation is available at :
https://github.com/facebook/zstd

The project contains a frame generator, called [decodeCorpus],
which can be used by any 3rd-party implementation
to verify that a tested decoder is compliant with the specification.

[decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing

`decodeCorpus` generates random valid frames.
A compliant decoder should be able to decode them all,
or at least provide a meaningful error code explaining why it cannot
(memory limit restrictions for example).

Version changes
---------------
- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
- 0.3.9 : clarifications for Huffman-compressed literal sizes.
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
- 0.3.6 : clarifications for Dictionary_ID
- 0.3.5 : clarifications for Block_Maximum_Size
- 0.3.4 : clarifications for FSE decoding table
- 0.3.3 : clarifications for field Block_Size
- 0.3.2 : remove additional block size restriction on compressed blocks
- 0.3.1 : minor clarification regarding offset history update rules
- 0.3.0 : minor edits to match RFC8478
- 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz
- 0.2.8 : clarifications for IETF RFC discuss
- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
- 0.2.5 : minor typos and clarifications
- 0.2.4 : section restructuring, by Sean Purcell
- 0.2.3 : clarified several details, by Sean Purcell
- 0.2.2 : added predefined codes, by Johannes Rudolph
- 0.2.1 : clarify field names, by Przemyslaw Skibinski
- 0.2.0 : numerous format adjustments for zstd v0.8+
- 0.1.2 : limit Huffman tree depth to 11 bits
- 0.1.1 : reserved dictID ranges
- 0.1.0 : initial release