Blame - doc/zstd_compression_format.md - external_zstd

blob: a8d4a0f35f54ec774ec86abedce22528420a956d

Yann Collet	5cc1882	2016-07-03 19:03:13 +0200	[diff] [blame]	1	Zstandard Compression Format
				2	============================
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	3
				4	### Notices
				5
W. Felix Handte	5d693cc	2022-12-20 12:49:47 -0500	[diff] [blame]	6	Copyright (c) Meta Platforms, Inc. and affiliates.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	7
				8	Permission is granted to copy and distribute this document
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	9	for any purpose and without charge,
				10	including translations into other languages
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	11	and incorporation into compilations,
				12	provided that the copyright notice and this notice are preserved,
				13	and that any substantive changes or deletions from the original
				14	are clearly marked.
				15	Distribution of this document is unlimited.
				16
				17	### Version
				18
Yann Collet	3732a08	2023-06-05 16:03:00 -0700	[diff] [blame]	19	0.4.0 (2023-06-05)
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	20
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	21
				22	Introduction
				23	------------
				24
				25	The purpose of this document is to define a lossless compressed data format,
				26	that is independent of CPU type, operating system,
				27	file system and character set, suitable for
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	28	file compression, pipe and streaming compression,
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	29	using the [Zstandard algorithm](https://facebook.github.io/zstd/).
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	30	The text of the specification assumes a basic background in programming
				31	at the level of bits and other primitive data representations.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	32
				33	The data can be produced or consumed,
				34	even for an arbitrarily long sequentially presented input data stream,
				35	using only an a priori bounded amount of intermediate storage,
				36	and hence can be used in data communications.
				37	The format uses the Zstandard compression method,
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	38	and optional [xxHash-64 checksum method](https://cyan4973.github.io/xxHash/),
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	39	for detection of data corruption.
				40
				41	The data format defined by this specification
				42	does not attempt to allow random access to compressed data.
				43
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	44	Unless otherwise indicated below,
				45	a compliant compressor must produce data sets
				46	that conform to the specifications presented here.
				47	It doesn’t need to support all options though.
				48
				49	A compliant decompressor must be able to decompress
				50	at least one working set of parameters
				51	that conforms to the specifications presented here.
				52	It may also ignore informative fields, such as checksum.
				53	Whenever it does not support a parameter defined in the compressed stream,
				54	it must produce a non-ambiguous error code and associated error message
				55	explaining which parameter is unsupported.
				56
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	57	This specification is intended for use by implementers of software
				58	to compress data into Zstandard format and/or decompress data from Zstandard format.
				59	The Zstandard format is supported by an open source reference implementation,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	60	written in portable C, and available at : https://github.com/facebook/zstd .
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	61
				62
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	63	### Overall conventions
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	64	In this document:
				65	- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	66	- the naming convention for identifiers is `Mixed_Case_With_Underscores`
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	67
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	68	### Definitions
				69	Content compressed by Zstandard is transformed into a Zstandard __frame__.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	70	Multiple frames can be appended into a single file or stream.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	71	A frame is completely independent, has a defined beginning and end,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	72	and a set of parameters which tells the decoder how to decompress it.
				73
				74	A frame encapsulates one or multiple __blocks__.
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	75	Each block contains arbitrary content, which is described by its header,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	76	and has a guaranteed maximum content size, which depends on frame parameters.
				77	Unlike frames, each block depends on previous blocks for proper decoding.
				78	However, each block can be decompressed without waiting for its successor,
				79	allowing streaming operations.
				80
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	81	Overview
				82	---------
				83	- [Frames](#frames)
				84	- [Zstandard frames](#zstandard-frames)
				85	- [Blocks](#blocks)
				86	- [Literals Section](#literals-section)
				87	- [Sequences Section](#sequences-section)
				88	- [Sequence Execution](#sequence-execution)
				89	- [Skippable frames](#skippable-frames)
				90	- [Entropy Encoding](#entropy-encoding)
				91	- [FSE](#fse)
				92	- [Huffman Coding](#huffman-coding)
				93	- [Dictionary Format](#dictionary-format)
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	94
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	95	Frames
				96	------
Yann Collet	fccb46f	2017-11-18 11:28:00 -0800	[diff] [blame]	97	Zstandard compressed data is made of one or more __frames__.
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	98	Each frame is independent and can be decompressed independently of other frames.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	99	The decompressed content of multiple concatenated frames is the concatenation of
Yann Collet	fccb46f	2017-11-18 11:28:00 -0800	[diff] [blame]	100	each frame's decompressed content.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	101
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	102	There are two frame formats defined by Zstandard:
				103	Zstandard frames and Skippable frames.
				104	Zstandard frames contain compressed data, while
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	105	skippable frames contain custom user metadata.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	106
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	107	## Zstandard frames
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	108	The structure of a single Zstandard frame is as follows:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	109
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	110	| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] |
				111	|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:|
Yann Collet	8b12812	2017-08-19 12:17:57 -0700	[diff] [blame]	112	| 4 bytes | 2-14 bytes | n bytes | | 0-4 bytes |
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	113
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	114	__`Magic_Number`__
				115
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	116	4 Bytes, __little-endian__ format.
Yann Collet	7bdfcea	2016-09-05 17:43:31 +0200	[diff] [blame]	117	Value : 0xFD2FB528
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	118	Note: This value was selected to be unlikely to appear at the beginning of an arbitrary file.
				119	It avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.),
				120	contains byte values outside of ASCII range,
				121	and doesn't map into UTF8 space.
				122	It reduces the chance that a text file represents this value by accident.
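
As an illustration of the little-endian convention, the check below is a minimal sketch in Python (the function name `starts_with_zstd_magic` is hypothetical and not part of this specification). It reads the first 4 bytes as a little-endian unsigned integer and compares the result with the value given above, so a frame is expected to start with the byte sequence `28 B5 2F FD`.

```python
import struct

# Magic number value stated in this specification.
ZSTD_MAGIC_NUMBER = 0xFD2FB528


def starts_with_zstd_magic(data: bytes) -> bool:
    """Return True if `data` begins with the Zstandard frame magic number."""
    if len(data) < 4:
        return False
    # "<I" decodes an unsigned 32-bit integer stored in little-endian order.
    (value,) = struct.unpack_from("<I", data, 0)
    return value == ZSTD_MAGIC_NUMBER
```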
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	123
				124	__`Frame_Header`__
				125
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	126	2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header).
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	127
				128	__`Data_Block`__
				129
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	130	Detailed in [`Blocks`](#blocks).
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	131	That’s where compressed data is stored.
				132
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	133	__`Content_Checksum`__
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	134
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	135	An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	136	The content checksum is the result
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	137	of [xxh64() hash function](https://cyan4973.github.io/xxHash/)
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	138	digesting the original (decoded) data as input, and a seed of zero.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	139	The low 4 bytes of the checksum are stored in __little-endian__ format.
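
The sketch below illustrates only the truncation and byte order of the stored field. It assumes the 64-bit xxHash digest of the decoded data (seed zero) has already been computed elsewhere, for example by a third-party xxHash implementation; the function name `content_checksum_bytes` is hypothetical.

```python
def content_checksum_bytes(xxh64_digest: int) -> bytes:
    """Return the 4 bytes stored as Content_Checksum for a 64-bit digest.

    `xxh64_digest` is assumed to be the XXH64 hash of the decoded data,
    computed with a seed of zero by some external implementation.
    """
    if not 0 <= xxh64_digest < (1 << 64):
        raise ValueError("digest must fit in 64 bits")
    low_32_bits = xxh64_digest & 0xFFFFFFFF  # keep the low 4 bytes only
    return low_32_bits.to_bytes(4, "little")  # stored little-endian
```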
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	140
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	141	### `Frame_Header`
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	142
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	143	The `Frame_Header` has a variable size, with a minimum of 2 bytes,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	144	and up to 14 bytes depending on optional parameters.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	145	The structure of `Frame_Header` is as follows:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	146
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	147	| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
				148	| ------------------------- | --------------------- | ----------------- | ---------------------- |
				149	| 1 byte | 0-1 byte | 0-4 bytes | 0-8 bytes |
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	150
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	151	#### `Frame_Header_Descriptor`
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	152
				153	The first byte of the header is called the `Frame_Header_Descriptor`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	154	It describes which other fields are present.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	155	Decoding this byte is enough to tell the size of `Frame_Header`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	156
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	157	| Bit number | Field name |
				158	| ---------- | ---------- |
				159	| 7-6 | `Frame_Content_Size_flag` |
				160	| 5 | `Single_Segment_flag` |
				161	| 4 | `Unused_bit` |
				162	| 3 | `Reserved_bit` |
				163	| 2 | `Content_Checksum_flag` |
				164	| 1-0 | `Dictionary_ID_flag` |
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	165
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	166	In this table, bit 7 is the highest bit, while bit 0 is the lowest one.
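
The bit layout in the table above can be sketched as a handful of shifts and masks. The example is illustrative only: the function name `parse_frame_header_descriptor` and the use of a dictionary as the return type are assumptions, while the keys reuse the field names of this document.

```python
def parse_frame_header_descriptor(fhd: int) -> dict:
    """Split the Frame_Header_Descriptor byte into its named fields."""
    if not 0 <= fhd <= 0xFF:
        raise ValueError("Frame_Header_Descriptor is a single byte")
    return {
        "Frame_Content_Size_flag": fhd >> 6,      # bits 7-6
        "Single_Segment_flag": (fhd >> 5) & 1,    # bit 5
        "Unused_bit": (fhd >> 4) & 1,             # bit 4
        "Reserved_bit": (fhd >> 3) & 1,           # bit 3
        "Content_Checksum_flag": (fhd >> 2) & 1,  # bit 2
        "Dictionary_ID_flag": fhd & 3,            # bits 1-0
    }
```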
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	167
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	168	__`Frame_Content_Size_flag`__
				169
				170	This is a 2-bit flag (`= Frame_Header_Descriptor >> 6`),
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	171	specifying if `Frame_Content_Size` (the decompressed data size)
				172	is provided within the header.
				173	`Flag_Value` provides `FCS_Field_Size`,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	174	which is the number of bytes used by `Frame_Content_Size`
				175	according to the following table:
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	176
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	177	| `Flag_Value` | 0 | 1 | 2 | 3 |
				178	| -------------- | ------ | --- | --- | --- |
				179	|`FCS_Field_Size`| 0 or 1 | 2 | 4 | 8 |
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	180
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	181	When `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` :
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	182	if `Single_Segment_flag` is set, `FCS_Field_Size` is 1.
				183	Otherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided.
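
The mapping from `Flag_Value` to `FCS_Field_Size`, including the special case for `Flag_Value` 0, can be sketched as follows. The function name `fcs_field_size` and its parameter names are hypothetical.

```python
def fcs_field_size(flag_value: int, single_segment: bool) -> int:
    """Return FCS_Field_Size in bytes for a given Frame_Content_Size_flag."""
    if flag_value not in (0, 1, 2, 3):
        raise ValueError("Frame_Content_Size_flag is a 2-bit value")
    if flag_value == 0:
        # With Single_Segment_flag set, a 1-byte size is present;
        # otherwise Frame_Content_Size is not provided at all.
        return 1 if single_segment else 0
    return {1: 2, 2: 4, 3: 8}[flag_value]
```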
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	184
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	185	__`Single_Segment_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	186
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	187	If this flag is set,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	188	data must be regenerated within a single continuous memory segment.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	189
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	190	In this case, `Window_Descriptor` byte is skipped,
				191	but `Frame_Content_Size` is necessarily present.
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	192	As a consequence, the decoder must allocate a memory segment
Yann Collet	fccb46f	2017-11-18 11:28:00 -0800	[diff] [blame]	193	of size equal to or larger than `Frame_Content_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	194
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	195	In order to protect the decoder from unreasonable memory requirements,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	196	a decoder is allowed to reject a compressed frame
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	197	which requests a memory size beyond decoder's authorized range.
				198
				199	For broader compatibility, decoders are recommended to support
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	200	memory sizes of at least 8 MB.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	201	This is only a recommendation;
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	202	each decoder is free to support higher or lower limits,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	203	depending on local limitations.
				204
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	205	__`Unused_bit`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	206
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	207	A decoder compliant with this specification version shall not interpret this bit.
				208	It might be used in any future version,
				209	to signal a property which is not mandatory to properly decode the frame.
				210	An encoder compliant with this specification version must set this bit to zero.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	211
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	212	__`Reserved_bit`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	213
				214	This bit is reserved for some future feature.
				215	Its value _must be zero_.
				216	A decoder compliant with this specification version must ensure it is not set.
				217	This bit may be used in a future revision,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	218	to signal a feature that must be interpreted to decode the frame correctly.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	219
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	220	__`Content_Checksum_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	221
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	222	If this flag is set, a 32-bit `Content_Checksum` will be present at frame's end.
				223	See `Content_Checksum` paragraph.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	224
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	225	__`Dictionary_ID_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	226
				227	This is a 2-bit flag (`= Frame_Header_Descriptor & 3`),
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	228	telling if a dictionary ID is provided within the header.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	229	It also specifies the size of this field as `DID_Field_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	230
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	231	|`Flag_Value` | 0 | 1 | 2 | 3 |
				232	| -------------- | --- | --- | --- | --- |
				233	|`DID_Field_Size`| 0 | 1 | 2 | 4 |
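
Since the descriptor byte alone determines the size of `Frame_Header`, the computation can be sketched as below. This is an illustrative example rather than a reference implementation; the function name `frame_header_size` and the two lookup tuples are hypothetical names.

```python
# Field sizes in bytes, indexed by the 2-bit flag value.
DID_FIELD_SIZES = (0, 1, 2, 4)
FCS_FIELD_SIZES = (0, 2, 4, 8)  # entry 0 becomes 1 when Single_Segment_flag is set


def frame_header_size(fhd: int) -> int:
    """Return the total Frame_Header size implied by Frame_Header_Descriptor."""
    if not 0 <= fhd <= 0xFF:
        raise ValueError("Frame_Header_Descriptor is a single byte")
    single_segment = (fhd >> 5) & 1
    fcs_flag = fhd >> 6
    did_size = DID_FIELD_SIZES[fhd & 3]
    fcs_size = FCS_FIELD_SIZES[fcs_flag]
    if fcs_flag == 0 and single_segment:
        fcs_size = 1
    # Window_Descriptor is skipped when Single_Segment_flag is set.
    window_descriptor_size = 0 if single_segment else 1
    return 1 + window_descriptor_size + did_size + fcs_size
```

With these rules, the smallest and largest results are 2 and 14 bytes, matching the range stated for `Frame_Header`.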
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	234
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	235	#### `Window_Descriptor`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	236
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	237	Provides guarantees on minimum memory buffer required to decompress a frame.
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	238	This information is important for decoders to allocate enough memory.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	239
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	240	The `Window_Descriptor` byte is optional.
				241	When `Single_Segment_flag` is set, `Window_Descriptor` is not present.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	242	In this case, `Window_Size` is `Frame_Content_Size`,
				243	which can be any value from 0 to 2^64-1 bytes (16 ExaBytes).
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	244
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	245	| Bit numbers | 7-3 | 2-0 |
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	246	| ----------- | ---------- | ---------- |
				247	| Field name | `Exponent` | `Mantissa` |
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	248
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	249	The minimum memory buffer size is called `Window_Size`.
				250	It is described by the following formulas :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	251	```
				252	windowLog = 10 + Exponent;
				253	windowBase = 1 << windowLog;
				254	windowAdd = (windowBase / 8) * Mantissa;
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	255	Window_Size = windowBase + windowAdd;
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	256	```
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	257	The minimum `Window_Size` is 1 KB.
				258	The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
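
The formulas above translate directly into the following sketch, which also extracts `Exponent` and `Mantissa` from the descriptor byte. The function name `window_size` is hypothetical; the variable names mirror those of the formulas.

```python
def window_size(window_descriptor: int) -> int:
    """Compute Window_Size in bytes from the Window_Descriptor byte."""
    if not 0 <= window_descriptor <= 0xFF:
        raise ValueError("Window_Descriptor is a single byte")
    exponent = window_descriptor >> 3  # bits 7-3
    mantissa = window_descriptor & 7   # bits 2-0
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add
```

For example, a descriptor of `0` gives the 1 KB minimum, and a descriptor of `0xFF` gives the stated maximum of `(1<<41) + 7*(1<<38)` bytes.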
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	259
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	260	In general, larger `Window_Size` values tend to improve compression ratio,
				261	but at the cost of memory usage.
				262
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	263	To properly decode compressed data,
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	264	a decoder will need to allocate a buffer of at least `Window_Size` bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	265
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	266	In order to protect the decoder from unreasonable memory requirements,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	267	a decoder is allowed to reject a compressed frame
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	268	which requests a memory size beyond decoder's authorized range.
				269
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	270	For improved interoperability,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	271	it's recommended for decoders to support `Window_Size` of up to 8 MB,
				272	and it's recommended for encoders to not generate frames requiring `Window_Size` larger than 8 MB.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	273	It's merely a recommendation though;
				274	decoders are free to support higher or lower limits,
				275	depending on local limitations.
				276
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	277	#### `Dictionary_ID`
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	278
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	279	This is a variable size field, which contains
				280	the ID of the dictionary required to properly decode the frame.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	281	`Dictionary_ID` field is optional. When it's not present,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	282	it's up to the decoder to know which dictionary to use.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	283
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	284	`Dictionary_ID` field size is provided by `DID_Field_Size`.
				285	`DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	286	1 byte can represent an ID 0-255.
				287	2 bytes can represent an ID 0-65535.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	288	4 bytes can represent an ID 0-4294967295.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	289	Format is __little-endian__.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	290
				291	It's allowed to represent a small ID (for example `13`)
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	292	with a large 4-byte dictionary ID, even if it is less efficient.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	293
Yann Collet	bb3c9bf	2020-05-25 08:15:09 -0700	[diff] [blame]	294	A value of `0` has the same meaning as no `Dictionary_ID`,
				295	in which case the frame may or may not need a dictionary to be decoded,
				296	and the ID of such a dictionary is not specified.
				297	The decoder must know this information by other means.
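
A minimal sketch of reading the field is shown below. It assumes the caller has already sliced out exactly `DID_Field_Size` bytes; the function name `read_dictionary_id` is hypothetical. An absent field and an explicit `0` both yield `0`, reflecting that they carry the same meaning.

```python
def read_dictionary_id(field: bytes) -> int:
    """Decode a Dictionary_ID field of 0, 1, 2 or 4 bytes (little-endian).

    A result of 0 means that no dictionary ID is specified.
    """
    if len(field) not in (0, 1, 2, 4):
        raise ValueError("DID_Field_Size must be 0, 1, 2 or 4 bytes")
    return int.from_bytes(field, "little")
```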
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	298
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	299	#### `Frame_Content_Size`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	300
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	301	This is the original (uncompressed) size. This information is optional.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	302	`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`.
				303	`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`.
				304	`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	305
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	306	| `FCS_Field_Size` | Range |
				307	| ---------------- | ---------- |
				308	| 0 | unknown |
				309	| 1 | 0 - 255 |
				310	| 2 | 256 - 65791 |
				311	| 4 | 0 - 2^32-1 |
				312	| 8 | 0 - 2^64-1 |
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	313
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	314	`Frame_Content_Size` format is __little-endian__.
				315	When `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly.
				316	When `FCS_Field_Size` is 2, _the offset of 256 is added_.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	317	It's allowed to represent a small size (for example `18`) using any compatible variant.
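
The decoding rules, including the offset applied to the 2-byte variant, can be sketched as follows. The function name `decode_frame_content_size` is hypothetical, and returning `None` for an absent field is a choice made for this example only.

```python
from typing import Optional


def decode_frame_content_size(field: bytes) -> Optional[int]:
    """Decode Frame_Content_Size from its 0, 1, 2, 4 or 8 byte field."""
    size = len(field)
    if size == 0:
        return None  # Frame_Content_Size is not provided
    if size not in (1, 2, 4, 8):
        raise ValueError("FCS_Field_Size must be 0, 1, 2, 4 or 8 bytes")
    value = int.from_bytes(field, "little")
    if size == 2:
        value += 256  # the 2-byte variant stores the size minus 256
    return value
```

This explains the 256 - 65791 range of the 2-byte variant: the stored values 0 to 65535 are shifted up by 256.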
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	318
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	319
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	320	Blocks
				321	-------
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	322
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	323	After `Magic_Number` and `Frame_Header`, there are some number of blocks.
				324	Each frame must have at least one block,
				325	but there is no upper limit on the number of blocks per frame.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	326
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	327	The structure of a block is as follows:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	328
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	329	| `Block_Header` | `Block_Content` |
				330	|:--------------:|:---------------:|
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	331	| 3 bytes | n bytes |
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	332
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	333	__`Block_Header`__
				334
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	335	`Block_Header` uses 3 bytes, written using __little-endian__ convention.
				336	It contains 3 fields :
				337
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	338	| `Last_Block` | `Block_Type` | `Block_Size` |
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	339	|:------------:|:------------:|:------------:|
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	340	| bit 0 | bits 1-2 | bits 3-23 |
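
The three fields of the table above can be extracted by first assembling the 3 bytes into a 24-bit little-endian integer, as in the sketch below. The function name `parse_block_header` and the tuple return type are assumptions made for illustration.

```python
from typing import Tuple


def parse_block_header(header: bytes) -> Tuple[int, int, int]:
    """Return (Last_Block, Block_Type, Block_Size) from a 3-byte Block_Header."""
    if len(header) != 3:
        raise ValueError("Block_Header is exactly 3 bytes")
    value = int.from_bytes(header, "little")  # 24-bit little-endian integer
    last_block = value & 1         # bit 0
    block_type = (value >> 1) & 3  # bits 1-2
    block_size = value >> 3        # bits 3-23
    return last_block, block_type, block_size
```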
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	341
				342	__`Last_Block`__
				343
				344	The lowest bit signals if this block is the last one.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	345	The frame will end after this last block.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	346	It may be followed by an optional `Content_Checksum`
				347	(see [Zstandard Frames](#zstandard-frames)).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	348
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	349	__`Block_Type`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	350
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	351	The next 2 bits represent the `Block_Type`.
Yann Collet	1e07eb4	2019-08-16 15:13:42 +0200	[diff] [blame]	352	`Block_Type` influences the meaning of `Block_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	353	There are 4 block types :
				354
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	355	| Value | 0 | 1 | 2 | 3 |
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	356	| ------------ | ----------- | ----------- | ------------------ | --------- |
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	357	| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved` |
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	358
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	359	- `Raw_Block` - this is an uncompressed block.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	360	`Block_Content` contains `Block_Size` bytes.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	361
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	362	- `RLE_Block` - this is a single byte, repeated `Block_Size` times.
				363	`Block_Content` consists of a single byte.
				364	On the decompression side, this byte must be repeated `Block_Size` times.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	365
				366	- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks),
				367	explained later on.
				368	`Block_Size` is the length of `Block_Content`, the compressed data.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	369	The decompressed size is not known,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	370	but its maximum possible value is guaranteed (see below).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	371
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	372	- `Reserved` - this is not a block.
				373	This value cannot be used with current version of this specification.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	374	If such a value is present, it is considered corrupted data.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	375
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	376	__`Block_Size`__
				377
				378	The upper 21 bits of `Block_Header` represent the `Block_Size`.
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	379
Yann Collet	1e07eb4	2019-08-16 15:13:42 +0200	[diff] [blame]	380	When `Block_Type` is `Compressed_Block` or `Raw_Block`,
Yann Collet	bb3c9bf	2020-05-25 08:15:09 -0700	[diff] [blame]	381	`Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`).
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	382
				383	When `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1,
				384	`Block_Size` represents the number of times this byte must be repeated.
				385
				386	`Block_Size` is limited by `Block_Maximum_Size` (see below).
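
The different meanings of `Block_Size` can be sketched for the two block types that need no entropy decoding. The example is illustrative: the constant names and both function names are hypothetical, and `Compressed_Block` decoding is deliberately left out.

```python
# Block_Type values from the table in this section.
RAW_BLOCK, RLE_BLOCK, COMPRESSED_BLOCK, RESERVED = range(4)


def block_content_length(block_type: int, block_size: int) -> int:
    """Return how many bytes Block_Content occupies in the compressed stream."""
    if block_type == RLE_BLOCK:
        return 1  # a single byte, whatever Block_Size says
    if block_type in (RAW_BLOCK, COMPRESSED_BLOCK):
        return block_size
    raise ValueError("Reserved Block_Type: corrupted data")


def regenerate_simple_block(block_type: int, block_size: int, content: bytes) -> bytes:
    """Regenerate the decoded bytes of a Raw_Block or an RLE_Block."""
    if len(content) != block_content_length(block_type, block_size):
        raise ValueError("Block_Content has an unexpected length")
    if block_type == RAW_BLOCK:
        return content               # stored as-is
    if block_type == RLE_BLOCK:
        return content * block_size  # repeat the byte Block_Size times
    raise ValueError("only Raw_Block and RLE_Block are handled in this sketch")
```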
				387
				388	__`Block_Content`__ and __`Block_Maximum_Size`__
				389
				390	The size of `Block_Content` is limited by `Block_Maximum_Size`,
				391	which is the smallest of:
				392	- `Window_Size`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	393	- 128 KB
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	394
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	395	`Block_Maximum_Size` is constant for a given frame.
				396	This maximum is applicable to both the decompressed size
				397	and the compressed size of any block in the frame.
				398
				399	The reasoning for this limit is that a decoder can read this information
				400	at the beginning of a frame and use it to allocate buffers.
				401	The guarantees on the size of blocks ensure that
				402	the buffers will be large enough for any following block of the valid frame.
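
As a small illustration, the limit can be computed once per frame and reused for buffer allocation. The sketch assumes that 128 KB means `128 * 1024` bytes, consistent with the 1 KB minimum `Window_Size` of `1 << 10` bytes; the function name `block_maximum_size` is hypothetical.

```python
def block_maximum_size(window_size: int) -> int:
    """Return Block_Maximum_Size in bytes for a frame with this Window_Size."""
    if window_size < 0:
        raise ValueError("Window_Size cannot be negative")
    # Assumption: 128 KB is taken as 128 * 1024 bytes.
    return min(window_size, 128 * 1024)
```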
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	403
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	404
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	405	Compressed Blocks
				406	-----------------
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	407	To decompress a compressed block, the compressed size must be provided
				408	from `Block_Size` field within `Block_Header`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	409
				410	A compressed block consists of 2 sections :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	411	- [Literals Section](#literals-section)
				412	- [Sequences Section](#sequences-section)
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	413
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	414	The results of the two sections are then combined to produce the decompressed
				415	data in [Sequence Execution](#sequence-execution).
				416
				417	#### Prerequisites
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	418	To decode a compressed block, the following elements are necessary :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	419	- Previous decoded data, up to a distance of `Window_Size`,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	420	or beginning of the Frame, whichever is smaller.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	421	- List of "recent offsets" from previous `Compressed_Block`.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	422	- The previous Huffman tree, required by `Treeless_Literals_Block` type
				423	- Previous FSE decoding tables, required by `Repeat_Mode`
				424	for each symbol type (literals lengths, match lengths, offsets)
				425
				426	Note that decoding tables aren't always from the previous `Compressed_Block`.
				427
				428	- Every decoding table can come from a dictionary.
				429	- The Huffman tree comes from the previous `Compressed_Literals_Block`.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	430
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	431	Literals Section
				432	----------------
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	433	All literals are grouped in the first part of the block.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	434	They can be decoded first, and then copied during [Sequence Execution],
				435	or they can be decoded on the fly during [Sequence Execution].
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	436
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	437	Literals can be stored uncompressed or compressed using Huffman prefix codes.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	438	When compressed, a tree description may optionally be present,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	439	followed by 1 or 4 streams.
				440
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	441	\| `Literals_Section_Header` \| [`Huffman_Tree_Description`] \| [jumpTable] \| Stream1 \| [Stream2] \| [Stream3] \| [Stream4] \|
				442	\| ------------------------- \| ---------------------------- \| ----------- \| ------- \| --------- \| --------- \| --------- \|
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	443
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	444
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	445	### `Literals_Section_Header`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	446
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	447	Header is in charge of describing how literals are packed.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	448	It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	449	using __little-endian__ convention.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	450
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	451	\| `Literals_Block_Type` \| `Size_Format` \| `Regenerated_Size` \| [`Compressed_Size`] \|
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	452	\| --------------------- \| ------------- \| ------------------ \| ------------------- \|
				453	\| 2 bits \| 1 - 2 bits \| 5 - 20 bits \| 0 - 18 bits \|
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	454
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	455	In this representation, bits on the left are the lowest bits.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	456
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	457	__`Literals_Block_Type`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	458
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	459	This field uses the 2 lowest bits of the first byte, describing 4 different block types :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	460
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	461	\| `Literals_Block_Type` \| Value \|
				462	\| --------------------------- \| ----- \|
				463	\| `Raw_Literals_Block` \| 0 \|
				464	\| `RLE_Literals_Block` \| 1 \|
				465	\| `Compressed_Literals_Block` \| 2 \|
				466	\| `Treeless_Literals_Block` \| 3 \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	467
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	468	- `Raw_Literals_Block` - Literals are stored uncompressed.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	469	- `RLE_Literals_Block` - Literals consist of a single byte value
				470	repeated `Regenerated_Size` times.
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	471	- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	472	starting with a Huffman tree description.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	473	In this mode, there are at least 2 different literals represented in the Huffman tree description.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	474	See details below.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	475	- `Treeless_Literals_Block` - This is a Huffman-compressed block,
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	476	using Huffman tree _from previous Huffman-compressed literals block_.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	477	`Huffman_Tree_Description` will be skipped.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	478	Note: If this mode is triggered without any previous Huffman-table in the frame
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	479	(or [dictionary](#dictionary-format)), this should be treated as data corruption.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	480
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	481	__`Size_Format`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	482
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	483	`Size_Format` is divided into 2 families :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	484
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	485	- For `Raw_Literals_Block` and `RLE_Literals_Block`,
				486	it's only necessary to decode `Regenerated_Size`.
				487	There is no `Compressed_Size` field.
				488	- For `Compressed_Literals_Block` and `Treeless_Literals_Block`,
				489	it's required to decode both `Compressed_Size`
				490	and `Regenerated_Size` (the decompressed size).
				491	It's also necessary to decode the number of streams (1 or 4).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	492
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	493	For values spanning several bytes, convention is __little-endian__.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	494
inikep	9d003c1	2016-08-04 10:41:49 +0200	[diff] [blame]	495	__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	496
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	497	`Size_Format` uses 1 _or_ 2 bits.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	498	Its value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	499
				500	- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit.
Sean Purcell	d86153d	2017-01-26 16:58:25 -0800	[diff] [blame]	501	`Regenerated_Size` uses 5 bits (0-31).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	502	`Literals_Section_Header` uses 1 byte.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	503	`Regenerated_Size = Literals_Section_Header[0]>>3`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	504	- `Size_Format` == 01 : `Size_Format` uses 2 bits.
Sean Purcell	d86153d	2017-01-26 16:58:25 -0800	[diff] [blame]	505	`Regenerated_Size` uses 12 bits (0-4095).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	506	`Literals_Section_Header` uses 2 bytes.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	507	`Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	508	- `Size_Format` == 11 : `Size_Format` uses 2 bits.
Sean Purcell	d86153d	2017-01-26 16:58:25 -0800	[diff] [blame]	509	`Regenerated_Size` uses 20 bits (0-1048575).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	510	`Literals_Section_Header` uses 3 bytes.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	511	`Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	512
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	513	Only Stream1 is present for these cases.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	514	Note : it's allowed to represent a short value (for example `27`)
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	515	using a long format, even if it's less efficient.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	516
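The three `Size_Format` cases for `Raw_Literals_Block` and `RLE_Literals_Block` can be sketched in C as follows. This is a minimal illustration rather than the reference decoder: the function name `parse_raw_rle_header` and its return convention (header length in bytes, `0` when the input is too short) are assumptions made for this example. Note that a short value such as `27` decodes identically from the 1-byte and 2-byte formats.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: parses the Literals_Section_Header of a
 * Raw_Literals_Block or RLE_Literals_Block.
 * Returns the header length in bytes (1, 2 or 3), or 0 if `len` is too short. */
static size_t parse_raw_rle_header(const uint8_t *h, size_t len,
                                   uint32_t *regenerated_size)
{
    if (len < 1) return 0;
    unsigned size_format = (h[0] >> 2) & 3;

    switch (size_format) {
    case 0:
    case 2:   /* Size_Format uses 1 bit: 5-bit size, 1-byte header */
        *regenerated_size = (uint32_t)h[0] >> 3;
        return 1;
    case 1:   /* 12-bit size, 2-byte header */
        if (len < 2) return 0;
        *regenerated_size = ((uint32_t)h[0] >> 4) + ((uint32_t)h[1] << 4);
        return 2;
    default:  /* size_format == 3: 20-bit size, 3-byte header */
        if (len < 3) return 0;
        *regenerated_size = ((uint32_t)h[0] >> 4) + ((uint32_t)h[1] << 4)
                          + ((uint32_t)h[2] << 12);
        return 3;
    }
}
```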
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	517	__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	518
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	519	`Size_Format` always uses 2 bits.
				520
				521	- `Size_Format` == 00 : _A single stream_.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	522	Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	523	`Literals_Section_Header` uses 3 bytes.
				524	- `Size_Format` == 01 : 4 streams.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	525	Both `Regenerated_Size` and `Compressed_Size` use 10 bits (6-1023).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	526	`Literals_Section_Header` uses 3 bytes.
				527	- `Size_Format` == 10 : 4 streams.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	528	Both `Regenerated_Size` and `Compressed_Size` use 14 bits (6-16383).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	529	`Literals_Section_Header` uses 4 bytes.
				530	- `Size_Format` == 11 : 4 streams.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	531	Both `Regenerated_Size` and `Compressed_Size` use 18 bits (6-262143).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	532	`Literals_Section_Header` uses 5 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	533
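A possible way to decode this second family of headers is sketched below. It assumes the field order shown in the `Literals_Section_Header` table (block type in the lowest 2 bits, then `Size_Format`, then `Regenerated_Size`, then `Compressed_Size`), read as one little-endian integer. The struct `HufLiteralsHeader` and the function `parse_huf_literals_header` are hypothetical names introduced for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    uint32_t regenerated_size;
    uint32_t compressed_size;
    unsigned num_streams;   /* 1 or 4 */
    size_t   header_size;   /* 3, 4 or 5 bytes */
} HufLiteralsHeader;

/* Parses the header of a Compressed_Literals_Block or Treeless_Literals_Block.
 * Returns 0 on success, -1 if fewer bytes than the header size are available. */
static int parse_huf_literals_header(const uint8_t *h, size_t len,
                                     HufLiteralsHeader *out)
{
    if (len < 3) return -1;
    unsigned size_format = (h[0] >> 2) & 3;
    unsigned size_bits   = (size_format <= 1) ? 10 : (size_format == 2 ? 14 : 18);

    out->header_size = (size_format <= 1) ? 3 : (size_format == 2 ? 4 : 5);
    out->num_streams = (size_format == 0) ? 1 : 4;
    if (len < out->header_size) return -1;

    /* Assemble the header bytes into one little-endian value (up to 40 bits). */
    uint64_t bits = 0;
    for (size_t i = 0; i < out->header_size; i++)
        bits |= (uint64_t)h[i] << (8 * i);

    uint64_t mask = ((uint64_t)1 << size_bits) - 1;
    out->regenerated_size = (uint32_t)((bits >> 4) & mask);
    out->compressed_size  = (uint32_t)((bits >> (4 + size_bits)) & mask);
    return 0;
}
```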
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	534	Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
				535	Note: `Compressed_Size` __includes__ the size of the Huffman Tree description
				536	_when_ it is present.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	537	Note 2: `Compressed_Size` can never be `==0`.
				538	Even in single-stream scenario, assuming an empty content, it must be `>=1`,
				539	since it contains at least the final end bit flag.
				540	In 4-streams scenario, a valid `Compressed_Size` is necessarily `>= 10`
				541	(6 bytes for the jump table, + 4x1 bytes for the 4 streams).
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	542
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	543	4 streams is faster than 1 stream in decompression speed,
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	544	by exploiting instruction level parallelism.
				545	But it's also more expensive,
				546	costing on average ~7.3 bytes more than the 1 stream mode, mostly from the jump table.
				547
				548	In general, use the 4 streams mode when there are more literals to decode,
				549	to favor higher decompression speeds.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	550	Note that beyond 1 KB of literals, the 4 streams mode is compulsory.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	551
				552	Note that a minimum of 6 bytes is required for the 4 streams mode.
				553	That's a technical minimum, but it's not recommended to employ the 4 streams mode
				554	for such a small quantity, as that would be wasteful.
				555	A more practical lower bound would be around ~256 bytes.
				556
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	557	#### Raw Literals Block
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	558	The data in Stream1 is `Regenerated_Size` bytes long,
				559	it contains the raw literals data to be used during [Sequence Execution].
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	560
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	561	#### RLE Literals Block
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	562	Stream1 consists of a single byte which should be repeated `Regenerated_Size` times
				563	to generate the decoded literals.
				564
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	565	#### Compressed Literals Block and Treeless Literals Block
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	566	Both of these modes contain Huffman encoded data.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	567
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	568	For `Treeless_Literals_Block`,
				569	the Huffman table comes from previously compressed literals block,
				570	or from a dictionary.
				571
				572
				573	### `Huffman_Tree_Description`
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	574	This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	575	The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	576	The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	577	The size of `Huffman_Tree_Description` is determined during decoding process,
				578	it must be used to determine where streams begin.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	579	`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	580
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	581
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	582	### Jump Table
				583	The Jump Table is only present when there are 4 Huffman-coded streams.
				584
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	585	Reminder : Huffman compressed data consists of either 1 or 4 streams.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	586
				587	If only one stream is present, it is a single bitstream occupying the entire
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	588	remaining portion of the literals block, encoded as described in
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	589	[Huffman-Coded Streams](#huffman-coded-streams).
				590
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	591	If there are four streams, `Literals_Section_Header` only provided
				592	enough information to know the decompressed and compressed sizes
				593	of all four streams _combined_.
				594	The decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	595	except for the last stream which may be up to 3 bytes smaller,
				596	to reach a total decompressed size as specified in `Regenerated_Size`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	597
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	598	The compressed size of each stream is provided explicitly in the Jump Table.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	599	Jump Table is 6 bytes long, and consists of three 2-byte __little-endian__ fields,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	600	describing the compressed sizes of the first three streams.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	601	`Stream4_Size` is computed from `Total_Streams_Size` minus sizes of other streams:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	602
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	603	`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	604
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	605	`Stream4_Size` is necessarily `>= 1`. Therefore,
				606	if `Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1`,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	607	data is considered corrupted.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	608
				609	Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	610	as described in [Huffman-Coded Streams](#huffman-coded-streams).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	611
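The Jump Table rules can be illustrated with the following sketch. It is not taken from the reference decoder: the names `FourStreams` and `parse_jump_table` are hypothetical, and the extra check that the first three streams do not regenerate more than `Regenerated_Size` bytes is an additional guard assumed here, consistent with the stated minimum of 6 bytes for the 4 streams mode.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    size_t compressed[4];    /* compressed size of each stream, in bytes   */
    size_t regenerated[4];   /* decompressed size of each stream, in bytes */
} FourStreams;

/* `p` points at the 6-byte Jump Table.
 * `total_streams_size` is Compressed_Size minus the size of the
 * Huffman_Tree_Description (when present).
 * Returns 0 on success, -1 when the sizes are inconsistent (corruption). */
static int parse_jump_table(const uint8_t *p, size_t total_streams_size,
                            size_t regenerated_size, FourStreams *out)
{
    if (total_streams_size < 6) return -1;

    size_t used = 6;   /* the Jump Table itself */
    for (int i = 0; i < 3; i++) {
        /* three 2-byte little-endian fields */
        out->compressed[i] = (size_t)p[2 * i] | ((size_t)p[2 * i + 1] << 8);
        used += out->compressed[i];
    }
    if (total_streams_size < used + 1) return -1;   /* Stream4_Size must be >= 1 */
    out->compressed[3] = total_streams_size - used;

    size_t per_stream = (regenerated_size + 3) / 4;
    if (regenerated_size < 3 * per_stream) return -1;   /* assumed guard */
    for (int i = 0; i < 3; i++)
        out->regenerated[i] = per_stream;
    out->regenerated[3] = regenerated_size - 3 * per_stream;
    return 0;
}
```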
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	612
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	613	Sequences Section
				614	-----------------
				615	A compressed block is a succession of _sequences_.
				616	A sequence is a literal copy command, followed by a match copy command.
				617	A literal copy command specifies a length.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	618	It is the number of bytes to be copied (or extracted) from the Literals Section.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	619	A match copy command specifies an offset and a length.
				620
				621	When all _sequences_ are decoded,
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	622	if there are literals left in the _literals section_,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	623	these bytes are added at the end of the block.
				624
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	625	This is described in more detail in [Sequence Execution](#sequence-execution).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	626
				627	The `Sequences_Section` groups all symbols required to decode commands.
				628	There are 3 symbol types : literals lengths, offsets and match lengths.
				629	They are encoded together, interleaved, in a single _bitstream_.
				630
				631	The `Sequences_Section` starts with a header,
				632	followed by optional probability tables for each symbol type,
				633	followed by the bitstream.
				634
				635	\| `Sequences_Section_Header` \| [`Literals_Length_Table`] \| [`Offset_Table`] \| [`Match_Length_Table`] \| bitStream \|
				636	\| -------------------------- \| ------------------------- \| ---------------- \| ---------------------- \| --------- \|
				637
				638	To decode the `Sequences_Section`, it's required to know its size.
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	639	Its size is deduced from the size of `Literals_Section`:
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	640	`Sequences_Section_Size = Block_Size - Literals_Section_Size`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	641
				642
				643	#### `Sequences_Section_Header`
				644
				645	Consists of 2 items:
				646	- `Number_of_Sequences`
				647	- Symbol compression modes
				648
				649	__`Number_of_Sequences`__
				650
				651	This is a variable size field using between 1 and 3 bytes.
				652	Let's call its first byte `byte0`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	653	- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
Yann Collet	1f83b7c	2023-06-05 09:51:52 -0700	[diff] [blame]	654	- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0 - 0x80) << 8) + byte1`. Uses 2 bytes.
				655	Note that the 2 bytes format fully overlaps the 1 byte format.
				656	- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00`. Uses 3 bytes.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	657
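The variable-size encoding of `Number_of_Sequences` translates directly into code. The sketch below uses a hypothetical function name, `parse_number_of_sequences`, and an assumed return convention (bytes consumed, or `0` when the input is too short).

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes Number_of_Sequences from the start of the Sequences_Section.
 * Returns the number of bytes consumed (1, 2 or 3), or 0 if `len` is too short. */
static size_t parse_number_of_sequences(const uint8_t *p, size_t len,
                                        uint32_t *nb_seq)
{
    if (len < 1) return 0;
    uint8_t byte0 = p[0];

    if (byte0 < 128) {               /* 1-byte format */
        *nb_seq = byte0;
        return 1;
    }
    if (byte0 < 255) {               /* 2-byte format */
        if (len < 2) return 0;
        *nb_seq = ((uint32_t)(byte0 - 0x80) << 8) + p[1];
        return 2;
    }
    if (len < 3) return 0;           /* 3-byte format, byte0 == 255 */
    *nb_seq = (uint32_t)p[1] + ((uint32_t)p[2] << 8) + 0x7F00;
    return 3;
}
```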
Yann Collet	3732a08	2023-06-05 16:03:00 -0700	[diff] [blame]	658	`if (Number_of_Sequences == 0)` : there are no sequences.
				659	The sequence section stops immediately,
				660	FSE tables used in `Repeat_Mode` aren't updated.
				661	Block's decompressed content is defined solely by the Literals Section content.
				662
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	663	__Symbol compression modes__
				664
				665	This is a single byte, defining the compression mode of each symbol type.
				666
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	667	\|Bit number\| 7-6 \| 5-4 \| 3-2 \| 1-0 \|
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	668	\| -------- \| ----------------------- \| -------------- \| -------------------- \| ---------- \|
				669	\|Field name\| `Literals_Lengths_Mode` \| `Offsets_Mode` \| `Match_Lengths_Mode` \| `Reserved` \|
				670
				671	The last field, `Reserved`, must be all-zeroes.
				672
				673	`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	674	literals lengths, offsets, and match lengths symbols respectively.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	675
				676	They follow the same enumeration :
				677
				678	\| Value \| 0 \| 1 \| 2 \| 3 \|
				679	\| ------------------ \| ----------------- \| ---------- \| --------------------- \| ------------- \|
				680	\| `Compression_Mode` \| `Predefined_Mode` \| `RLE_Mode` \| `FSE_Compressed_Mode` \| `Repeat_Mode` \|
				681
				682	- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
				683	[default distributions](#default-distributions).
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	684	No distribution table will be present.
Yann Collet	c1e6347	2018-06-21 18:08:11 -0700	[diff] [blame]	685	- `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value.
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	686	This symbol will be used for all sequences.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	687	- `FSE_Compressed_Mode` : standard FSE compression.
				688	A distribution table will be present.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	689	The format of this distribution table is described in [FSE Table Description](#fse-table-description).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	690	Note that the maximum allowed accuracy log for literals length and match length tables is 9,
				691	and the maximum accuracy log for the offsets table is 8.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	692	`FSE_Compressed_Mode` must not be used when only one symbol is present;
				693	`RLE_Mode` should be used instead (although any other mode will work).
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	694	- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
				695	or if this is the first block, table in the dictionary will be used.
				696	Note that this includes `RLE_Mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
				697	It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`.
				698	No distribution table will be present.
				699	If this mode is used without any previous sequence table in the frame
				700	(nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	701
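As an illustration of the symbol compression modes byte, the following sketch splits it into its three 2-bit fields and rejects a non-zero `Reserved` field. The enum and struct names, and the function `parse_symbol_modes`, are assumptions made for this example.

```c
#include <stdint.h>

/* Values follow the Compression_Mode enumeration of the table above. */
typedef enum {
    PREDEFINED_MODE     = 0,
    RLE_MODE            = 1,
    FSE_COMPRESSED_MODE = 2,
    REPEAT_MODE         = 3
} CompressionMode;

typedef struct {
    CompressionMode literals_lengths;
    CompressionMode offsets;
    CompressionMode match_lengths;
} SymbolModes;

/* Returns 0 on success, -1 if the Reserved bits (1-0) are not all zero. */
static int parse_symbol_modes(uint8_t b, SymbolModes *m)
{
    if (b & 3) return -1;
    m->literals_lengths = (CompressionMode)((b >> 6) & 3);   /* bits 7-6 */
    m->offsets          = (CompressionMode)((b >> 4) & 3);   /* bits 5-4 */
    m->match_lengths    = (CompressionMode)((b >> 2) & 3);   /* bits 3-2 */
    return 0;
}
```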
				702	#### The codes for literals lengths, match lengths, and offsets.
				703
				704	Each symbol is a _code_ in its own context,
				705	which specifies `Baseline` and `Number_of_Bits` to add.
				706	_Codes_ are FSE compressed,
				707	and interleaved with raw additional bits in the same bitstream.
				708
				709	##### Literals length codes
				710
				711	Literals length codes are values ranging from `0` to `35` included.
				712	They define lengths from 0 to 131071 bytes.
				713	The literals length is equal to the decoded `Baseline` plus
				714	the result of reading `Number_of_Bits` bits from the bitstream,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	715	as a __little-endian__ value.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	716
				717	\| `Literals_Length_Code` \| 0-15 \|
				718	\| ---------------------- \| ---------------------- \|
				719	\| length \| `Literals_Length_Code` \|
				720	\| `Number_of_Bits` \| 0 \|
				721
				722	\| `Literals_Length_Code` \| 16 \| 17 \| 18 \| 19 \| 20 \| 21 \| 22 \| 23 \|
				723	\| ---------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				724	\| `Baseline` \| 16 \| 18 \| 20 \| 22 \| 24 \| 28 \| 32 \| 40 \|
				725	\| `Number_of_Bits` \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				726
				727	\| `Literals_Length_Code` \| 24 \| 25 \| 26 \| 27 \| 28 \| 29 \| 30 \| 31 \|
				728	\| ---------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				729	\| `Baseline` \| 48 \| 64 \| 128 \| 256 \| 512 \| 1024 \| 2048 \| 4096 \|
				730	\| `Number_of_Bits` \| 4 \| 6 \| 7 \| 8 \| 9 \| 10 \| 11 \| 12 \|
				731
				732	\| `Literals_Length_Code` \| 32 \| 33 \| 34 \| 35 \|
				733	\| ---------------------- \| ---- \| ---- \| ---- \| ---- \|
				734	\| `Baseline` \| 8192 \|16384 \|32768 \|65536 \|
				735	\| `Number_of_Bits` \| 13 \| 14 \| 15 \| 16 \|
				736
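The literals length tables above can be flattened into two lookup arrays, as in the sketch below. The array and function names are hypothetical. A useful sanity check on the tables is that each code's range ends exactly where the next one begins, that is, `Baseline[c] + (1 << Number_of_Bits[c]) == Baseline[c+1]`.

```c
#include <stdint.h>

/* Baseline for each Literals_Length_Code (codes 0-15 map to themselves). */
static const uint32_t LL_BASELINE[36] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
    16, 18, 20, 22, 24, 28, 32, 40,
    48, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536
};

/* Number_of_Bits to read for each Literals_Length_Code. */
static const uint8_t LL_BITS[36] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 2, 2, 3, 3,
    4, 6, 7, 8, 9, 10, 11, 12,
    13, 14, 15, 16
};

/* Combines a code with the extra bits already read from the bitstream.
 * Returns 0 on success, -1 if the code or the extra bits are out of range. */
static int literals_length(unsigned code, uint32_t extra_bits, uint32_t *length)
{
    if (code > 35) return -1;
    if (extra_bits >> LL_BITS[code]) return -1;   /* more bits than allowed */
    *length = LL_BASELINE[code] + extra_bits;
    return 0;
}
```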
				737
				738	##### Match length codes
				739
				740	Match length codes are values ranging from `0` to `52` included.
				741	They define lengths from 3 to 131074 bytes.
				742	The match length is equal to the decoded `Baseline` plus
				743	the result of reading `Number_of_Bits` bits from the bitstream,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	744	as a __little-endian__ value.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	745
				746	\| `Match_Length_Code` \| 0-31 \|
				747	\| ------------------- \| ----------------------- \|
				748	\| value \| `Match_Length_Code` + 3 \|
				749	\| `Number_of_Bits` \| 0 \|
				750
				751	\| `Match_Length_Code` \| 32 \| 33 \| 34 \| 35 \| 36 \| 37 \| 38 \| 39 \|
				752	\| ------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				753	\| `Baseline` \| 35 \| 37 \| 39 \| 41 \| 43 \| 47 \| 51 \| 59 \|
				754	\| `Number_of_Bits` \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				755
				756	\| `Match_Length_Code` \| 40 \| 41 \| 42 \| 43 \| 44 \| 45 \| 46 \| 47 \|
				757	\| ------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				758	\| `Baseline` \| 67 \| 83 \| 99 \| 131 \| 259 \| 515 \| 1027 \| 2051 \|
				759	\| `Number_of_Bits` \| 4 \| 4 \| 5 \| 7 \| 8 \| 9 \| 10 \| 11 \|
				760
				761	\| `Match_Length_Code` \| 48 \| 49 \| 50 \| 51 \| 52 \|
				762	\| ------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				763	\| `Baseline` \| 4099 \| 8195 \|16387 \|32771 \|65539 \|
				764	\| `Number_of_Bits` \| 12 \| 13 \| 14 \| 15 \| 16 \|
				765
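A compact way to express the match length tables is sketched below: codes `0` to `31` are computed directly, and only codes `32` to `52` need a lookup. The names `ML_BASELINE_HIGH`, `ML_BITS_HIGH` and `match_length_code_info` are hypothetical and introduced only for this example.

```c
#include <stdint.h>

/* Baseline and Number_of_Bits for Match_Length_Code 32..52. */
static const uint32_t ML_BASELINE_HIGH[21] = {
    35, 37, 39, 41, 43, 47, 51, 59,
    67, 83, 99, 131, 259, 515, 1027, 2051,
    4099, 8195, 16387, 32771, 65539
};
static const uint8_t ML_BITS_HIGH[21] = {
    1, 1, 1, 1, 2, 2, 3, 3,
    4, 4, 5, 7, 8, 9, 10, 11,
    12, 13, 14, 15, 16
};

/* Looks up the Baseline and Number_of_Bits of a Match_Length_Code.
 * The match length is then baseline + the value of the extra bits read.
 * Returns 0 on success, -1 if the code is above 52. */
static int match_length_code_info(unsigned code, uint32_t *baseline,
                                  unsigned *num_bits)
{
    if (code > 52) return -1;
    if (code < 32) {
        *baseline = code + 3;     /* codes 0-31: value is code + 3 */
        *num_bits = 0;
    } else {
        *baseline = ML_BASELINE_HIGH[code - 32];
        *num_bits = ML_BITS_HIGH[code - 32];
    }
    return 0;
}
```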
				766	##### Offset codes
				767
				768	Offset codes are values ranging from `0` to `N`.
				769
				770	A decoder is free to limit its maximum `N` supported.
				771	Recommendation is to support at least up to `22`.
				772	For information, at the time of this writing,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	773	the reference decoder supports a maximum `N` value of `31`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	774
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	775	An offset code is also the number of additional bits to read in __little-endian__ fashion,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	776	and can be translated into an `Offset_Value` using the following formulas :
				777
				778	```
				779	Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
				780	if (Offset_Value > 3) offset = Offset_Value - 3;
				781	```
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	782	It means that maximum `Offset_Value` is `(2^(N+1))-1`
				783	supporting back-reference distances up to `(2^(N+1))-4`,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	784	but is limited by [maximum back-reference distance](#window_descriptor).
				785
				786	`Offset_Value` from 1 to 3 are special : they define "repeat codes".
Yann Collet	c1e6347	2018-06-21 18:08:11 -0700	[diff] [blame]	787	This is described in more detail in [Repeat Offsets](#repeat-offsets).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	788
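The offset formulas can be wrapped into a small function that also distinguishes repeat codes from real distances. This is a sketch under stated assumptions: the type `DecodedOffset` and the function `decode_offset` are hypothetical, `offset_code` is assumed to be at most `31`, and `extra_bits` is assumed to be the value of the `offset_code` bits already read from the bitstream.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    int      is_repeat_code;   /* 1 when Offset_Value is 1, 2 or 3 */
    uint64_t value;            /* repeat code (1-3) or back-reference distance */
} DecodedOffset;

static DecodedOffset decode_offset(unsigned offset_code, uint64_t extra_bits)
{
    uint64_t offset_value = ((uint64_t)1 << offset_code) + extra_bits;
    DecodedOffset d;

    if (offset_value > 3) {
        d.is_repeat_code = 0;
        d.value = offset_value - 3;    /* actual distance */
    } else {
        d.is_repeat_code = 1;
        d.value = offset_value;        /* handled by the repeat offsets rules */
    }
    return d;
}
```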
				789	#### Decoding Sequences
				790	FSE bitstreams are read in the reverse direction from which they were written. In zstd,
				791	the compressor writes bits forward into a block and the decompressor
				792	must read the bitstream _backwards_.
				793
				794	To find the start of the bitstream it is therefore necessary to
				795	know the offset of the last byte of the block which can be found
				796	by counting `Block_Size` bytes after the block header.
				797
				798	After writing the last bit containing information, the compressor
				799	writes a single `1`-bit and then fills the byte with 0-7 `0` bits of
				800	padding. The last byte of the compressed bitstream cannot be `0` for
				801	that reason.
				802
				803	When decompressing, the last byte containing the padding is the first
				804	byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
				805	the first `1`-bit it encounters. Afterwards, the useful part of the bitstream
				806	begins.
				807
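Locating the start of the useful bits can be sketched as follows. The function name `bitstream_start` is hypothetical, and the convention of returning a bit index (counted from the first byte, with bit 0 being the lowest bit of each byte) is an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the bit index of the final `1`-bit marker, which is also the number
 * of useful bits located below it, or -1 if the stream is empty or its last
 * byte is 0 (corrupted data). */
static int64_t bitstream_start(const uint8_t *src, size_t len)
{
    if (len == 0 || src[len - 1] == 0) return -1;

    unsigned last = src[len - 1];
    int highest = 7;
    while ((last >> highest) == 0)   /* skip the 0-7 padding bits */
        highest--;

    return (int64_t)(len - 1) * 8 + highest;
}
```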
				808	FSE decoding requires a 'state' to be carried from symbol to symbol.
				809	For more explanation on FSE decoding, see the [FSE section](#fse).
				810
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	811	For sequence decoding, a separate state is kept for each symbol type:
				812	literals lengths, offsets, and match lengths.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	813	Some FSE primitives are also used.
				814	For more details on the operation of these primitives, see the [FSE section](#fse).
				815
				816	##### Starting states
				817	The bitstream starts with initial FSE state values,
				818	each using the required number of bits in their respective _accuracy_,
				819	decoded previously from their normalized distribution.
				820
				821	It starts with `Literals_Length_State`,
				822	followed by `Offset_State`,
				823	and finally `Match_Length_State`.
				824
				825	Reminder : always keep in mind that all values are read _backward_,
				826	so the 'start' of the bitstream is at the highest position in memory,
				827	immediately before the last `1`-bit for padding.
				828
				829	After decoding the starting states, a single sequence is decoded
				830	`Number_of_Sequences` times.
				831	These sequences are decoded in order from first to last.
				832	Since the compressor writes the bitstream in the forward direction,
				833	this means the compressor must encode the sequences starting with the last
				834	one and ending with the first.
				835
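Reading the three starting states backward might look like the sketch below. It is not the reference implementation: `BackwardReader`, `read_bits_backward` and `read_initial_states` are hypothetical names, `bit_pos` is assumed to start at the bit index of the final `1`-bit marker, and each field is assumed to occupy the bits immediately below the current position, with the lowest-addressed bit being the least significant.

```c
#include <stdint.h>

/* Hypothetical backward reader: `bit_pos` is the index of the lowest bit
 * consumed so far (initially the index of the final 1-bit marker). */
typedef struct {
    const uint8_t *src;
    int64_t        bit_pos;
} BackwardReader;

/* Reads `n` bits (n <= 32) located just below bit_pos.
 * Bits requested before the start of the buffer read as 0; the caller
 * detects that situation by checking for bit_pos < 0. */
static uint32_t read_bits_backward(BackwardReader *r, unsigned n)
{
    uint32_t value = 0;
    r->bit_pos -= n;
    for (unsigned i = 0; i < n; i++) {
        int64_t pos = r->bit_pos + i;
        if (pos < 0) continue;
        value |= (uint32_t)((r->src[pos >> 3] >> (pos & 7)) & 1) << i;
    }
    return value;
}

typedef struct {
    uint32_t literals_length;
    uint32_t offset;
    uint32_t match_length;
} InitialStates;

/* Each state uses as many bits as the accuracy log of its table.
 * Returns 0 on success, -1 if the bitstream is too short. */
static int read_initial_states(BackwardReader *r, unsigned ll_log,
                               unsigned of_log, unsigned ml_log,
                               InitialStates *s)
{
    s->literals_length = read_bits_backward(r, ll_log);
    s->offset          = read_bits_backward(r, of_log);
    s->match_length    = read_bits_backward(r, ml_log);
    return (r->bit_pos < 0) ? -1 : 0;
}
```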
				836	##### Decoding a sequence
				837	For each of the symbol types, the FSE state can be used to determine the appropriate code.
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	838	The code then defines the `Baseline` and `Number_of_Bits` to read for each type.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	839	See the [description of the codes] for how to determine these values.
				840
				841	[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets
				842
				843	Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	844	It then does the same for `Match_Length`, and then for `Literals_Length`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	845	This sequence is then used for [sequence execution](#sequence-execution).
				846
				847	If it is not the last sequence in the block,
				848	the next operation is to update states.
				849	Using the rules pre-calculated in the decoding tables,
				850	`Literals_Length_State` is updated,
				851	followed by `Match_Length_State`,
				852	and then `Offset_State`.
				853	See the [FSE section](#fse) for details on how to update states from the bitstream.
				854
				855	This operation will be repeated `Number_of_Sequences` times.
				856	At the end, the bitstream shall be entirely consumed,
				857	otherwise the bitstream is considered corrupted.

#### Default Distributions
If `Predefined_Mode` is selected for a symbol type,
its FSE decoding table is generated from a predefined distribution table defined here.
For details on how to convert this distribution into a decoding table, see the [FSE section].

[FSE section]: #from-normalized-distribution-to-decoding-tables

##### Literals Length
The decoding table uses an accuracy log of 6 bits (64 states).
```
short literalsLength_defaultDistribution[36] =
        { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
          2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
         -1,-1,-1,-1 };
```

##### Match Length
The decoding table uses an accuracy log of 6 bits (64 states).
```
short matchLengths_defaultDistribution[53] =
        { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
         -1,-1,-1,-1,-1 };
```

##### Offset Codes
The decoding table uses an accuracy log of 5 bits (32 states),
and supports a maximum `N` value of 28, allowing offset values up to 536,870,908.

If any sequence in the compressed block requires a larger offset than this,
it's not possible to use the default distribution to represent it.
```
short offsetCodes_defaultDistribution[29] =
        { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
```
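
As a sanity check, each predefined distribution should add up to exactly `1 << Accuracy_Log` probability points, with every `-1` ("less than 1") entry counting as one point. The sketch below is illustrative only; the helper name `distribution_total` is an assumption of this example and does not appear in the specification.

```c
#include <assert.h>
#include <stddef.h>

/* Predefined distributions, copied verbatim from the tables above. */
const short literalsLength_defaultDistribution[36] =
        { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
          2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
         -1,-1,-1,-1 };
const short matchLengths_defaultDistribution[53] =
        { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
         -1,-1,-1,-1,-1 };
const short offsetCodes_defaultDistribution[29] =
        { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };

/* Hypothetical helper: number of table cells claimed by a normalized
 * distribution. A probability of -1 ("less than 1") counts as one cell. */
int distribution_total(const short *dist, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++) {
        assert(dist[i] >= -1);              /* no other negative value is defined */
        total += (dist[i] == -1) ? 1 : dist[i];
    }
    return total;
}
```

With these tables, the totals are expected to be 64, 64 and 32 respectively, matching accuracy logs of 6, 6 and 5.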


Sequence Execution
------------------
Once literals and sequences have been decoded,
they are combined to produce the decoded content of a block.

Each sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`),
decoded as described in the [Sequences Section](#sequences-section).
To execute a sequence, first copy `literals_length` bytes
from the decoded literals to the output.

Then `match_length` bytes are copied from previously decoded data.
The offset to copy from is determined by `offset_value`:
if `offset_value > 3`, then the offset is `offset_value - 3`.
If `offset_value` is between 1 and 3, the offset is a special repeat offset value.
See the [repeat offset](#repeat-offsets) section for how the offset is determined
in this case.

The offset is counted backward from the current position, so an offset of 6
and a match length of 3 means that 3 bytes should be copied from 6 bytes back.
Note that all offsets leading to previously decoded data
must be smaller than `Window_Size` defined in `Frame_Header_Descriptor`.
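
The two copy steps can be sketched as follows. This is a minimal illustration, not a complete decoder: the `OutputBuffer` type and the `execute_sequence` name are assumptions of this example, the offset is assumed to be already resolved (repeat offsets are handled elsewhere), and only data in the current output buffer is considered (no dictionary and no `Window_Size` check). The match is copied byte by byte because, as is usual in LZ77-style formats, the source may overlap the bytes being written when the offset is smaller than the match length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output cursor: `out` holds everything decoded so far. */
typedef struct {
    uint8_t *out;       /* output buffer */
    size_t   pos;       /* number of bytes already written */
    size_t   capacity;  /* total size of `out` */
} OutputBuffer;

/* Executes one sequence. Returns 0 on success, -1 on invalid input. */
int execute_sequence(OutputBuffer *dst,
                     const uint8_t *literals, size_t literals_length,
                     size_t offset, size_t match_length)
{
    assert(dst != NULL && dst->pos <= dst->capacity);

    /* Step 1: copy literals_length bytes from the decoded literals. */
    if (literals_length > dst->capacity - dst->pos) return -1;
    for (size_t i = 0; i < literals_length; i++)
        dst->out[dst->pos++] = literals[i];

    /* Step 2: copy match_length bytes from `offset` bytes back. */
    if (offset == 0 || offset > dst->pos) return -1;   /* before start of output */
    if (match_length > dst->capacity - dst->pos) return -1;
    for (size_t i = 0; i < match_length; i++) {
        /* Byte-wise on purpose: source and destination may overlap. */
        dst->out[dst->pos] = dst->out[dst->pos - offset];
        dst->pos++;
    }
    return 0;
}
```

For instance, with literals `abc`, an offset of 3 and a match length of 6, the output becomes `abcabcabc`: the match re-reads bytes that the same copy has just produced.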

#### Repeat offsets
As seen in [Sequence Execution](#sequence-execution),
the first 3 values of `offset_value` each designate a repeated offset, and we will call them
`Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`.
They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one".

If `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc.

There is one exception: when the current sequence's `literals_length` is 0.
In this case, repeated offsets are shifted by one,
so an `offset_value` of 1 means `Repeated_Offset2`,
an `offset_value` of 2 means `Repeated_Offset3`,
and an `offset_value` of 3 means `Repeated_Offset1 - 1`.

In the final case, if `Repeated_Offset1 - 1` evaluates to 0, then the
data is considered corrupted.

For the first block, the starting offset history is populated with the following values:
`Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8,
unless a dictionary is used, in which case they come from the dictionary.

Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
Note that blocks which are not `Compressed_Block` are skipped; they do not contribute to offset history.

[Offset Codes]: #offset-codes

###### Offset updates rules

During the execution of the sequences of a `Compressed_Block`, the
`Repeated_Offsets`' values are kept up to date, so that they always represent
the three most-recently used offsets. In order to achieve that, they are
updated after executing each sequence in the following way:

When the sequence's `offset_value` does not refer to one of the
`Repeated_Offsets` (when it has a value greater than 3, or when it has value 3
and the sequence's `literals_length` is zero), the `Repeated_Offsets`' values
are shifted back one, and `Repeated_Offset1` takes on the value of the
just-used offset.

Otherwise, when the sequence's `offset_value` refers to one of the
`Repeated_Offsets` (when it has value 1 or 2, or when it has value 3 and the
sequence's `literals_length` is non-zero), the `Repeated_Offsets` are re-ordered
so that `Repeated_Offset1` takes on the value of the used `Repeated_Offset`, and
the existing values are pushed back from the first `Repeated_Offset` through to
the `Repeated_Offset` selected by the `offset_value`. This effectively performs
a single-stepped wrapping rotation of the values of these offsets, so that
their order again reflects the recency of their use.

The following table shows the values of the `Repeated_Offsets` as a series of
sequences are applied to them:

| `offset_value` | `literals_length` | `Repeated_Offset1` | `Repeated_Offset2` | `Repeated_Offset3` | Comment                 |
|:--------------:|:-----------------:|:------------------:|:------------------:|:------------------:|:-----------------------:|
|                |                   |                  1 |                  4 |                  8 | starting values         |
|           1114 |                11 |               1111 |                  1 |                  4 | non-repeat              |
|              1 |                22 |               1111 |                  1 |                  4 | repeat 1: no change     |
|           2225 |                22 |               2222 |               1111 |                  1 | non-repeat              |
|           1114 |               111 |               1111 |               2222 |               1111 | non-repeat              |
|           3336 |                33 |               3333 |               1111 |               2222 | non-repeat              |
|              2 |                22 |               1111 |               3333 |               2222 | repeat 2: swap 1 & 2    |
|              3 |                33 |               2222 |               1111 |               3333 | repeat 3: rotate 3 to 1 |
|              3 |                 0 |               2221 |               2222 |               1111 | special case: insert `repeat1 - 1` |
|              1 |                 0 |               2222 |               2221 |               1111 | == repeat 2             |
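
The resolution and update rules can be combined into one small routine. The sketch below is one possible formulation, not the reference implementation: the function name `resolve_offset`, the `rep[3]` array layout (`rep[0]` standing for `Repeated_Offset1`) and the convention of returning 0 for corrupted data are all assumptions of this example.

```c
#include <assert.h>
#include <stdint.h>

/* rep[0..2] hold Repeated_Offset1..3. Returns the offset to use for the
 * match and updates the history in place. Returns 0 if the data is
 * corrupted (a valid offset is never 0). */
uint32_t resolve_offset(uint32_t rep[3], uint32_t offset_value,
                        uint32_t literals_length)
{
    uint32_t offset;
    assert(offset_value >= 1);

    if (offset_value > 3) {
        offset = offset_value - 3;               /* plain (non-repeat) offset */
    } else {
        /* History index, shifted by one when literals_length == 0. */
        uint32_t idx = (offset_value - 1) + (literals_length == 0 ? 1u : 0u);
        if (idx < 3) {
            /* Repeat offset: rotate rep[0..idx] so the used one comes first. */
            offset = rep[idx];
            if (idx == 2) rep[2] = rep[1];
            if (idx >= 1) rep[1] = rep[0];
            rep[0] = offset;
            return offset;
        }
        /* offset_value == 3 with literals_length == 0: Repeated_Offset1 - 1 */
        offset = rep[0] - 1;
        if (offset == 0) return 0;               /* corrupted */
    }
    /* New offset: shift the history back and insert it in front. */
    rep[2] = rep[1];
    rep[1] = rep[0];
    rep[0] = offset;
    return offset;
}
```

Replaying the rows of the table above through this function, starting from `{1, 4, 8}`, is expected to reproduce each listed history state.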


Skippable Frames
----------------

| `Magic_Number` | `Frame_Size` | `User_Data` |
|:--------------:|:------------:|:-----------:|
|   4 bytes      |  4 bytes     |   n bytes   |

Skippable frames allow the insertion of user-defined metadata
into a flow of concatenated frames.

Skippable frames defined in this specification are compatible with [LZ4] ones.

[LZ4]:https://lz4.github.io/lz4/

From a compliant decoder perspective, skippable frames simply need to be skipped,
and their content ignored, resuming decoding after the skippable frame.

It can be noted that a skippable frame
can be used to watermark a stream of concatenated frames
embedding any kind of tracking information (even just a UUID).
Users wary of such possibility should scan the stream of concatenated frames
in an attempt to detect such frames for analysis or removal.

__`Magic_Number`__

4 Bytes, __little-endian__ format.
Value: 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.
This specification doesn't detail any specific tagging for skippable frames.

__`Frame_Size`__

This is the size, in bytes, of the following `User_Data`
(not including the magic number or the size field itself).
This field is represented using 4 Bytes, __little-endian__ format, unsigned 32 bits.
This means `User_Data` can’t be bigger than (2^32-1) bytes.

__`User_Data`__

The `User_Data` can be anything. Data will just be skipped by the decoder.
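
A decoder that only wants to step over skippable frames could proceed roughly as follows. This is a sketch under stated assumptions: the helper names `read_le32` and `skippable_frame_size` are invented for this example, and returning 0 for "not a complete skippable frame" is just one possible convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads an unsigned 32-bit little-endian value. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* If `src` starts with a complete skippable frame, returns the total number
 * of bytes to skip (8-byte header plus User_Data). Returns 0 otherwise. */
size_t skippable_frame_size(const uint8_t *src, size_t src_size)
{
    assert(src != NULL || src_size == 0);
    if (src_size < 8) return 0;                     /* header incomplete */

    /* Any of the 16 values 0x184D2A50..0x184D2A5F identifies the frame. */
    if ((read_le32(src) & 0xFFFFFFF0u) != 0x184D2A50u) return 0;

    uint32_t frame_size = read_le32(src + 4);       /* size of User_Data only */
    if (frame_size > src_size - 8) return 0;        /* User_Data truncated */
    return 8 + (size_t)frame_size;
}
```

Decoding then resumes at `src + skippable_frame_size(src, src_size)` whenever the returned value is non-zero.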



Entropy Encoding
----------------
Two types of entropy encoding are used by the Zstandard format:
FSE, and Huffman coding.
Huffman is used to compress literals,
while FSE is used for all other symbols
(`Literals_Length_Code`, `Match_Length_Code`, offset codes)
and to compress Huffman headers.


FSE
---
FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
FSE encoding/decoding involves a state that is carried over between symbols,
so decoding must be done in the opposite direction from encoding.
Therefore, all FSE bitstreams are read from end to beginning.
Note that the order of the bits in the stream is not reversed;
the elements are simply read in the reverse of the order in which they were written.

For additional details on FSE, see [Finite State Entropy].

[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/

FSE decoding involves a decoding table whose size is a power of 2,
and in which each cell contains three elements:
`Symbol`, `Num_Bits`, and `Baseline`.
The `log2` of the table size is its `Accuracy_Log`.
An FSE state value represents an index in this table.

To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
The next symbol in the stream is the `Symbol` indicated in the table for that state.
To obtain the next state value,
the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.

[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems

### FSE Table Description
To decode FSE streams, it is necessary to construct the decoding table.
The Zstandard format encodes FSE table descriptions as follows:

An FSE distribution table describes the probabilities of all symbols
from `0` to the last present one (included)
on a normalized scale of `1 << Accuracy_Log`.
Note that there must be two or more symbols with nonzero probability.

It's a bitstream which is read forward, in __little-endian__ fashion.
It's not necessary to know the exact size of the bitstream;
it will be discovered and reported by the decoding process.

The bitstream starts by reporting on which scale it operates.
Let `low4Bits` designate the lowest 4 bits of the first byte:
`Accuracy_Log = low4Bits + 5`.

Then follows each symbol value, from `0` to last present one.
The number of bits used by each field is variable.
It depends on:

- Remaining probabilities + 1:
  __example__:
  Presuming an `Accuracy_Log` of 8,
  and presuming 100 probability points have already been distributed,
  the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
  Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
  is the smallest integer `T` that satisfies `(1 << T) > N`.

- Value decoded: small values use 1 less bit:
  __example__:
  Presuming values from 0 to 157 (inclusive) are possible,
  255-157 = 98 values are remaining in an 8-bit field.
  They are used this way:
  first 98 values (hence from 0 to 97) use only 7 bits,
  values from 98 to 157 use 8 bits.
  This is achieved through this scheme:

  | Value read | Value decoded | Number of bits used |
  | ---------- | ------------- | ------------------- |
  |   0 -  97  |   0 -  97     |  7                  |
  |  98 - 127  |  98 - 127     |  8                  |
  | 128 - 225  |   0 -  97     |  7                  |
  | 226 - 255  | 128 - 157     |  8                  |
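
The variable-width scheme in the table above can be expressed with a mask and a threshold. The following sketch is illustrative: `decode_fse_field` is a hypothetical name, and the function assumes the caller has peeked `log2sup(max_value)` bits from the forward little-endian stream into `raw`, then advances the stream by the reported number of bits.

```c
#include <assert.h>

/* log2sup(n): smallest T such that (1 << T) > n, as defined above. */
unsigned log2sup(unsigned n)
{
    unsigned t = 0;
    assert(n < (1u << 16));                  /* keeps the shifts well defined */
    while ((1u << t) <= n) t++;
    return t;
}

/* Decodes one field of the distribution table.
 * raw       : the next log2sup(max_value) bits of the stream
 * max_value : "remaining probabilities + 1" (157 in the example)
 * bits_used : receives the number of bits actually consumed */
unsigned decode_fse_field(unsigned raw, unsigned max_value, unsigned *bits_used)
{
    assert(max_value >= 1);
    unsigned bits = log2sup(max_value);                   /* 8 in the example */
    unsigned low_mask = (1u << (bits - 1)) - 1;           /* 127 in the example */
    unsigned threshold = (1u << bits) - 1 - max_value;    /* 98 in the example */
    assert(raw < (1u << bits));

    if ((raw & low_mask) < threshold) {
        *bits_used = bits - 1;     /* small value: the top bit is not consumed */
        return raw & low_mask;
    }
    *bits_used = bits;
    return (raw > low_mask) ? raw - threshold : raw;
}
```

With `max_value == 157`, a raw value of 128 decodes to 0 using 7 bits, while 226 decodes to 128 using 8 bits, as in the table.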

Symbol probabilities are read one by one, in order.

The probability is obtained from the decoded value by the following formula:
`Proba = value - 1`

This means that value `0` becomes the negative probability `-1`.
`-1` is a special probability, which means "less than 1".
Its effect on the distribution table is described in the [next section].
For the purpose of calculating total allocated probability points, it counts as one.

[next section]:#from-normalized-distribution-to-decoding-tables

When a symbol has a __probability__ of `zero`,
it is followed by a 2-bit repeat flag.
This repeat flag tells how many zero probabilities follow the current one.
It provides a number ranging from 0 to 3.
If it is a 3, another 2-bit repeat flag follows, and so on.

When the last symbol reaches a cumulated total of `1 << Accuracy_Log`,
decoding is complete.
If the last symbol makes the cumulated total go above `1 << Accuracy_Log`,
the distribution is considered corrupted.

Then the decoder can tell how many bytes were used in this process,
and how many symbols are present.
The bitstream consumes a round number of bytes.
Any remaining bit within the last byte is just unused.

#### From normalized distribution to decoding tables

The distribution of normalized probabilities is enough
to create a unique decoding table.

The table is built according to the following rules:

The table has a size of `Table_Size = 1 << Accuracy_Log`.
Each cell describes the symbol decoded,
and instructions to get the next state (`Number_of_Bits` and `Baseline`).

Symbols are scanned in their natural order for "less than 1" probabilities.
Symbols with this probability are each attributed a single cell,
starting from the end of the table and retreating.
These symbols define a full state reset, reading `Accuracy_Log` bits.

Then, all remaining symbols, sorted in natural order, are allocated cells.
Starting from symbol `0` (if it exists), and table position `0`,
each symbol gets allocated as many cells as its probability.
Cell allocation is spread, not linear:
each successor position follows this rule:

```
position += (tableSize>>1) + (tableSize>>3) + 3;
position &= tableSize-1;
```

A position is skipped if already occupied by a "less than 1" probability symbol.
`position` does not reset between symbols; it simply iterates through
each position in the table, switching to the next symbol when enough
states have been allocated to the current one.
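
Put together, the two allocation passes might look like the sketch below. The function name `fse_spread_symbols` and its error convention are assumptions of this example; it only fills in the `Symbol` of each cell and first checks that the distribution adds up to the table size, since the walk is only guaranteed to terminate in that case.

```c
#include <assert.h>
#include <stdint.h>

/* Fills symbols[0 .. (1 << accuracy_log) - 1] from a normalized distribution
 * of num_symbols entries. Returns 0 on success, -1 if the probabilities
 * do not add up to the table size. */
int fse_spread_symbols(uint8_t *symbols, const short *dist, int num_symbols,
                       unsigned accuracy_log)
{
    assert(accuracy_log >= 5 && accuracy_log <= 15);
    assert(num_symbols >= 1 && num_symbols <= 256);

    const unsigned table_size = 1u << accuracy_log;
    const unsigned step = (table_size >> 1) + (table_size >> 3) + 3;
    const unsigned mask = table_size - 1;

    /* A "-1" probability counts as one cell. */
    unsigned total = 0;
    for (int s = 0; s < num_symbols; s++) {
        if (dist[s] < -1) return -1;
        total += (dist[s] == -1) ? 1u : (unsigned)dist[s];
    }
    if (total != table_size) return -1;

    /* Pass 1: "less than 1" symbols take one cell each, from the end. */
    unsigned high = table_size;    /* cells at index >= high are reserved */
    for (int s = 0; s < num_symbols; s++)
        if (dist[s] == -1) symbols[--high] = (uint8_t)s;

    /* Pass 2: remaining symbols, spread with the step rule shown above. */
    unsigned position = 0;
    for (int s = 0; s < num_symbols; s++) {
        for (int i = 0; i < dist[s]; i++) {
            symbols[position] = (uint8_t)s;
            do {                   /* skip cells reserved in pass 1 */
                position = (position + step) & mask;
            } while (position >= high);
        }
    }
    assert(position == 0);         /* the walk has visited every free cell */
    return 0;
}
```

Applied to the predefined offset code distribution (`Accuracy_Log` of 5), this places symbol 0 in state 0, symbol 6 in states 1 and 24, and the five "less than 1" symbols 24 to 28 in states 31 down to 27.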

The process guarantees that the table is entirely filled.
Each cell corresponds to a state value, which contains the symbol being decoded.

To add the `Number_of_Bits` and `Baseline` required to retrieve the next state,
it's first necessary to sort all occurrences of each symbol in state order.
Lower states will need 1 more bit than higher ones.
The process is repeated for each symbol.

__Example__:
Presuming a symbol has a probability of 5,
it receives 5 cells, corresponding to 5 state values.
These state values are then sorted in natural order.

The next power of 2 after 5 is 8.
The space of probabilities must be divided into 8 equal parts.
Presuming the `Accuracy_Log` is 7, it defines a space of 128 states.
Divided by 8, each share is 16 large.

In order to reach 8 shares, the 8-5=3 lowest states will count "double",
doubling their shares (32 in width), hence requiring one more bit.

Baseline is assigned starting from the higher states using fewer bits,
increasing at each state, then resuming at the first state;
each state takes its allocated width from Baseline.

| state value      |   1   |  39   |   77   |  84  |  122   |
| ---------------- | ----- | ----- | ------ | ---- | ------ |
| state order      |   0   |   1   |    2   |   3  |    4   |
| width            |  32   |  32   |   32   |  16  |   16   |
| `Number_of_Bits` |   5   |   5   |    5   |   4  |    4   |
| range number     |   2   |   4   |    6   |   0  |    1   |
| `Baseline`       |  32   |  64   |   96   |   0  |   16   |
| range            | 32-63 | 64-95 | 96-127 | 0-15 | 16-31  |

During decoding, the next state value is determined from the current state value,
by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
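
The numbers in the example can also be obtained from a compact closed form. The formulation below is an assumption of this sketch rather than text from the specification, and the names `floor_log2` and `fse_cell_params` are hypothetical: for the occurrence of rank `order` (in state order) of a symbol with probability `P`, let `n = P + order`; then `Number_of_Bits = Accuracy_Log - floor(log2(n))` and `Baseline = (n << Number_of_Bits) - Table_Size`.

```c
#include <assert.h>

/* Position of the highest set bit, i.e. floor(log2(v)), for v >= 1. */
unsigned floor_log2(unsigned v)
{
    unsigned r = 0;
    assert(v >= 1);
    while (v >>= 1) r++;
    return r;
}

/* Computes Number_of_Bits and Baseline for the occurrence of rank `order`
 * (0-based, in state order) of a symbol with the given probability.
 * A "less than 1" symbol can be passed as probability 1, order 0. */
void fse_cell_params(unsigned probability, unsigned order, unsigned accuracy_log,
                     unsigned *number_of_bits, unsigned *baseline)
{
    assert(probability >= 1 && order < probability);
    unsigned n = probability + order;     /* runs over [P, 2P) */
    *number_of_bits = accuracy_log - floor_log2(n);
    *baseline = (n << *number_of_bits) - (1u << accuracy_log);
}
```

For a probability of 5 and an `Accuracy_Log` of 7, ranks 0 to 4 give (5, 32), (5, 64), (5, 96), (4, 0) and (4, 16), which matches the `Number_of_Bits` and `Baseline` rows of the table above. A "less than 1" symbol yields `Accuracy_Log` bits and a baseline of 0, i.e. a full state reset.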

See [Appendix A] for the results of this process applied to the default distributions.

[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes


Huffman Coding
--------------
Zstandard Huffman-coded streams are read backwards,
similar to the FSE bitstreams.
Therefore, to find the start of the bitstream, it is required to
know the offset of the last byte of the Huffman-coded stream.

After writing the last bit containing information, the compressor
writes a single `1`-bit and then fills the byte with 0-7 `0` bits of
padding. The last byte of the compressed bitstream cannot be `0` for
that reason.

When decompressing, the last byte containing the padding is the first
byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
the first `1`-bit it encounters. Afterwards, the useful part of the bitstream
begins.

The bitstream contains Huffman-coded symbols in __little-endian__ order,
with the codes defined by the method below.
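
A backward bit reader following this description could be sketched as below. The `BackwardBitReader` type and the `bbr_init` / `bbr_read` names are assumptions of this example. It also takes a simplifying position on one point that this section does not settle: a read that would cross the beginning of the stream is simply reported as an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader state. Bit i of the stream is bit (i & 7) of
 * src[i >> 3]; bit_pos counts the useful bits not yet consumed. */
typedef struct {
    const uint8_t *src;
    size_t bit_pos;
} BackwardBitReader;

/* Locates the start of the useful bits by skipping the 0-7 padding zeros
 * and the final 1-bit. Returns 0 on success, -1 if the stream is empty
 * or its last byte is 0. */
int bbr_init(BackwardBitReader *r, const uint8_t *src, size_t size)
{
    if (size == 0 || src[size - 1] == 0) return -1;
    unsigned skip = 1;                               /* the 1-bit marker */
    for (uint8_t last = src[size - 1]; !(last & 0x80); last <<= 1)
        skip++;                                      /* one per padding 0-bit */
    r->src = src;
    r->bit_pos = size * 8 - skip;
    return 0;
}

/* Reads nb_bits (at most 16) as a little-endian value, moving backward.
 * Returns -1 if fewer than nb_bits remain. */
int32_t bbr_read(BackwardBitReader *r, unsigned nb_bits)
{
    assert(nb_bits <= 16);
    if (nb_bits > r->bit_pos) return -1;
    r->bit_pos -= nb_bits;
    uint32_t value = 0;
    for (unsigned i = 0; i < nb_bits; i++) {
        size_t bit = r->bit_pos + i;
        value |= (uint32_t)((r->src[bit >> 3] >> (bit & 7)) & 1u) << i;
    }
    return (int32_t)value;
}
```

As an illustration, suppose an encoder wrote a 3-bit field holding 5, then a 2-bit field holding 2, then the closing `1`-bit: the single resulting byte is `0x35`. Reading it backward yields the 2-bit field first, then the 3-bit field, after which `bit_pos` is 0 and the stream is fully consumed.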

### Huffman Tree Description

Prefix coding represents symbols from an a priori known alphabet
by bit sequences (codewords), one codeword for each symbol,
in a manner such that different symbols may be represented
by bit sequences of different lengths,
but a parser can always parse an encoded string
unambiguously symbol-by-symbol.

Given an alphabet with known symbol frequencies,
the Huffman algorithm allows the construction of an optimal prefix code
using the fewest bits of any possible prefix codes for that alphabet.

A prefix code must not exceed a maximum code length.
More bits improve accuracy but increase header size,
and require more memory or more complex decoding operations.
This specification limits maximum code length to 11 bits.

#### Representation

All literal values from zero (included) to last present one (excluded)
are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
Transformation from `Weight` to `Number_of_Bits` follows this formula:
```
Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
```
When a literal value is not present, it receives a `Weight` of 0.
The least frequent symbol receives a `Weight` of 1.
If no literal has a `Weight` of 1, then the data is considered corrupted.
If there are not at least two literals with non-zero `Weight`, then the data
is considered corrupted.
The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
The last symbol's `Weight` is deduced from previously retrieved Weights,
by completing to the nearest power of 2. It's necessarily non-zero.
If it's not possible to reach a clean power of 2 with a single `Weight` value,
the Huffman Tree Description is considered invalid.
The exponent of this final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
`Max_Number_of_Bits` must be <= 11,
otherwise the representation is considered corrupted.

__Example__:
Let's presume the following Huffman tree must be described:

| literal value    |  0  |  1  |  2  |  3  |  4  |  5  |
| ---------------- | --- | --- | --- | --- | --- | --- |
| `Number_of_Bits` |  1  |  2  |  3  |  0  |  4  |  4  |

The tree depth is 4, since its longest elements use 4 bits
(the longest elements are the ones with the smallest frequency).
Literal value `5` will not be listed, as it can be determined from previous values 0-4,
nor will values above `5` as they are all 0.
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
The weight formula is:
```
Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
```
It gives the following series of weights:

| literal value |  0  |  1  |  2  |  3  |  4  |
| ------------- | --- | --- | --- | --- | --- |
| `Weight`      |  4  |  3  |  2  |  0  |  1  |

The decoder will do the inverse operation:
having collected weights of literal symbols from `0` to `4`,
it knows the last literal, `5`, is present with a non-zero `Weight`.
The `Weight` of `5` can be determined by advancing to the next power of 2.
The sum of `2^(Weight-1)` (excluding 0's) is:
`8 + 4 + 2 + 0 + 1 = 15`.
The nearest larger power of 2 is 16.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.
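
The decoder-side deduction can be sketched as follows. The function name `huffman_complete_weights`, its in-place interface and its `-1` error convention are assumptions of this example; the checks mirror the corruption conditions listed above.

```c
#include <assert.h>
#include <stdint.h>

/* weights[0 .. count-1] are the listed weights. The deduced weight of the
 * last symbol is stored in weights[count], and number_of_bits[0 .. count]
 * receives the code lengths (both arrays need room for count + 1 entries).
 * Returns Max_Number_of_Bits, or -1 if the description is corrupted. */
int huffman_complete_weights(uint8_t *weights, int count, uint8_t *number_of_bits)
{
    assert(count >= 1 && count <= 255);

    uint32_t sum = 0;
    int has_weight_one = 0;
    for (int i = 0; i < count; i++) {
        if (weights[i] > 11) return -1;
        if (weights[i] > 0) sum += 1u << (weights[i] - 1);
        if (weights[i] == 1) has_weight_one = 1;
    }
    if (sum == 0) return -1;      /* fewer than two symbols with non-zero Weight */

    /* Smallest power of 2 strictly larger than the sum. */
    int max_bits = 0;
    while ((1u << max_bits) <= sum) max_bits++;
    if (max_bits > 11) return -1;

    /* The gap must be fillable by a single weight, i.e. be a power of 2. */
    uint32_t left = (1u << max_bits) - sum;
    if (left & (left - 1)) return -1;
    int last_weight = 1;
    while ((1u << (last_weight - 1)) < left) last_weight++;
    if (!has_weight_one && last_weight != 1) return -1;   /* no Weight of 1 */
    weights[count] = (uint8_t)last_weight;

    for (int i = 0; i <= count; i++)
        number_of_bits[i] = weights[i] ? (uint8_t)(max_bits + 1 - weights[i]) : 0;
    return max_bits;
}
```

Fed with the listed weights `{4, 3, 2, 0, 1}`, it is expected to deduce a last weight of 1, return a depth of 4, and produce the code lengths `{1, 2, 3, 0, 4, 4}` of the example tree.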

#### Huffman Tree header

This is a single byte value (0-255),
which describes how the series of weights is encoded.

- if `headerByte` < 128 :
  the series of weights is compressed using FSE (see below).
  The length of the FSE-compressed series is equal to `headerByte` (0-127).

- if `headerByte` >= 128 :
  + the series of weights uses a direct representation,
    where each `Weight` is encoded directly as a 4-bit field (0-15).
  + They are encoded forward, 2 weights to a byte,
    first weight taking the top four bits and second one taking the bottom four.
    * e.g. the following operations could be used to read the weights:
      `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.
  + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes,
    meaning it uses only full bytes even if `Number_of_Weights` is odd.
  + `Number_of_Weights = headerByte - 127`.
    * Note that maximum `Number_of_Weights` is 255-127 = 128,
      therefore, only up to 128 `Weight` can be encoded using direct representation.
    * Since the last non-zero `Weight` is _not_ encoded,
      this scheme is compatible with alphabet sizes of up to 129 symbols,
      hence including literal symbol 128.
    * If any literal symbol > 128 has a non-zero `Weight`,
      direct representation is not possible.
      In such a case, it's necessary to use FSE compression.
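
As an illustration of the direct representation, the following Python sketch reads the header byte and unpacks the 4-bit weights. The function name `read_direct_weights` and its return convention are assumptions made for this example; the FSE-compressed case is deliberately left out.

```python
def read_direct_weights(data):
    """Parse a Huffman tree description that uses direct representation.

    Returns (weights, bytes_consumed), the header byte being included
    in bytes_consumed.
    """
    header_byte = data[0]
    if header_byte < 128:
        raise ValueError("headerByte < 128: weights are FSE-compressed")
    number_of_weights = header_byte - 127
    nb_bytes = (number_of_weights + 1) // 2  # Ceiling(Number_of_Weights/2)
    payload = data[1:1 + nb_bytes]
    if len(payload) < nb_bytes:
        raise ValueError("truncated weight description")
    weights = []
    for byte in payload:
        weights.append(byte >> 4)    # first weight: top four bits
        weights.append(byte & 0x0F)  # second weight: bottom four bits
    # Drop the unused trailing nibble when Number_of_Weights is odd.
    return weights[:number_of_weights], 1 + nb_bytes


# Weights 4, 3, 2, 0, 1 give headerByte = 127 + 5 = 132 and 3 payload bytes.
print(read_direct_weights(bytes([132, 0x43, 0x20, 0x10])))
# ([4, 3, 2, 0, 1], 4)
```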


#### Finite State Entropy (FSE) compression of Huffman weights

In this case, the series of Huffman weights is compressed using FSE compression.
It's a single bitstream with 2 interleaved states,
sharing a single distribution table.

To decode an FSE bitstream, it is necessary to know its compressed size.
Compressed size is provided by `headerByte`.
It's also necessary to know its _maximum possible_ decompressed size,
which is `255`, since literal values span from `0` to `255`,
and last symbol's `Weight` is not represented.

An FSE bitstream starts with a header describing the probability distribution.
This header is used to build a Decoding Table.
For a list of Huffman weights, the maximum accuracy log is 6 bits.
For more details, see the [FSE header description](#fse-table-description).

The Huffman header compression uses 2 states,
which share the same FSE distribution table.
The first state (`State1`) encodes the even indexed symbols,
and the second (`State2`) encodes the odd indexed symbols.
`State1` is initialized first, and then `State2`, and they take turns
decoding a single symbol and updating their state.
For more details on these FSE operations, see the [FSE section](#fse).

The number of symbols to decode is determined
by tracking the bitstream overflow condition:
if updating state after decoding a symbol would require more bits than
remain in the stream, it is assumed that extra bits are 0. Then,
symbols for each of the final states are decoded and the process is complete.

If this process would produce more weights than the maximum number of decoded
weights (255), then the data is considered corrupted.
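
The alternation of the two states and the overflow rule can be sketched as follows. This is a simplified illustration, not the reference decoder: the function name `decode_interleaved` is an assumption, the decoding table is passed in already built, and the 2-state table used in the usage example is a hypothetical toy (accuracy log 1) chosen only to keep the trace short, whereas real tables for Huffman weights use an accuracy log of up to 6.

```python
def decode_interleaved(data, table, accuracy_log, max_symbols=255):
    """Decode an FSE bitstream with 2 interleaved states sharing `table`.

    `table[state]` is a (symbol, number_of_bits, base) tuple.
    The bitstream is read backward, starting below the final-bit-flag.
    """
    if not data or data[-1] == 0:
        raise ValueError("missing final-bit-flag")
    value = int.from_bytes(data, "little")
    offset = value.bit_length() - 1  # number of useful bits below the flag

    def read_bits(n):
        nonlocal offset
        offset -= n
        if offset >= 0:
            return (value >> offset) & ((1 << n) - 1)
        # Fewer than n bits remain: missing bits are assumed to be 0.
        return (value << -offset) & ((1 << n) - 1)

    # State1 is initialized first, then State2.
    states = [read_bits(accuracy_log), read_bits(accuracy_log)]
    out = []
    turn = 0
    while True:
        if len(out) >= max_symbols:
            raise ValueError("corrupted data: too many weights")
        symbol, nb_bits, base = table[states[turn]]
        out.append(symbol)
        states[turn] = base + read_bits(nb_bits)
        if offset < 0:
            # Overflow: emit the symbol still held by the other state, then stop.
            out.append(table[states[1 - turn]][0])
            break
        turn = 1 - turn
    if len(out) > max_symbols:
        raise ValueError("corrupted data: too many weights")
    return out


# Hypothetical toy table: state 0 -> symbol 0, state 1 -> symbol 1, 1 bit each.
toy_table = [(0, 1, 0), (1, 1, 0)]
print(decode_interleaved(bytes([0b00011010]), toy_table, accuracy_log=1))
# [1, 0, 1, 0]
```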

#### Conversion from weights to Huffman prefix codes

All present symbols shall now have a `Weight` value.
It is possible to transform weights into `Number_of_Bits`, using this formula:
```
Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
```
Symbols are sorted by `Weight`.
Within the same `Weight`, symbols keep natural sequential order.
Symbols with a `Weight` of zero are removed.
Then, starting from the lowest `Weight`, prefix codes are distributed in sequential order.

__Example__ :
Let's presume the following list of weights has been decoded :

| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
| -------- | --- | --- | --- | --- | --- | --- |
| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |

Sorted by weight and then natural sequential order,
it gives the following distribution :

| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
| ---------------- | --- | --- | --- | --- | --- | ---- |
| `Weight` | 0 | 1 | 1 | 2 | 3 | 4 |
| `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 |
| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
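
The conversion can be sketched in Python as shown below. This is one possible way to distribute the codes, written for illustration; `build_prefix_codes` is a name chosen for this example. It walks the symbols in sorted order and hands each one a slice of a table of `2^Max_Number_of_Bits` entries, which yields the prefix codes of the table above.

```python
def build_prefix_codes(weights):
    """Map each present literal to a (Number_of_Bits, code) pair.

    `weights` lists the Weight of every literal, the last one included.
    """
    total = sum(1 << (w - 1) for w in weights if w > 0)
    if total == 0 or total & (total - 1):
        raise ValueError("sum of 2^(Weight-1) must be a power of 2")
    max_bits = total.bit_length() - 1  # Max_Number_of_Bits
    # Sort by Weight, then by literal value; zero weights are removed.
    order = sorted((w, literal) for literal, w in enumerate(weights) if w > 0)
    codes = {}
    position = 0  # next free entry in a table of 2^max_bits entries
    for w, literal in order:
        nb_bits = max_bits + 1 - w
        codes[literal] = (nb_bits, position >> (max_bits - nb_bits))
        position += 1 << (w - 1)
    return codes


for literal, (nb_bits, code) in sorted(build_prefix_codes([4, 3, 2, 0, 1, 1]).items()):
    print(literal, format(code, "0{}b".format(nb_bits)))
```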

### Huffman-coded Streams

Given a Huffman decoding table,
it's possible to decode a Huffman-coded stream.

Each bitstream must be read _backward_,
that is starting from the end down to the beginning.
Therefore it's necessary to know the size of each bitstream.

It's also necessary to know exactly which _bit_ is the last one.
This is detected by a final bit flag :
the highest set bit of the last byte is the final-bit-flag.
Consequently, a last byte of `0` is not possible.
And the final-bit-flag itself is not part of the useful bitstream.
Hence, the last byte contains between 0 and 7 useful bits.

Starting from the end,
it's possible to read the bitstream in a __little-endian__ fashion,
keeping track of already used bits. Since the bitstream is encoded in reverse
order, reading it from the end yields symbols in forward order.

For example, if the literal sequence "0145" was encoded using the above prefix codes,
it would be encoded (in reverse order) as:

|Symbol | 5 | 4 | 1 | 0 | Padding |
|--------|------|------|----|---|---------|
|Encoding|`0001`|`0000`|`01`|`1`| `00001` |

Resulting in the following 2-byte bitstream :
```
00000001 00001101
```

Here is an alternative representation with the symbol codes separated by underscore:
```
0000_0001 00001_1_01
```
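
The backward reading order can be checked with a small round trip. The sketch below is an illustration under simplifying assumptions: the `CODES` mapping is copied by hand from the prefix code table of the previous section, the function names are invented for this example, and the decoder tries code lengths one by one instead of using a `Max_Number_of_Bits`-wide lookup table, which gives the same result for a prefix code.

```python
# literal -> (Number_of_Bits, code), copied from the prefix code table above.
CODES = {0: (1, 0b1), 1: (2, 0b01), 2: (3, 0b001), 4: (4, 0b0000), 5: (4, 0b0001)}


def encode_stream(symbols, codes):
    """Write symbols in reverse order, then add the final-bit-flag."""
    acc, nb_bits = 0, 0
    for symbol in reversed(symbols):
        length, code = codes[symbol]
        acc |= code << nb_bits
        nb_bits += length
    acc |= 1 << nb_bits  # final-bit-flag
    nb_bits += 1
    return acc.to_bytes((nb_bits + 7) // 8, "little")


def decode_stream(data, codes, count):
    """Read `count` symbols backward, starting below the final-bit-flag."""
    if not data or data[-1] == 0:
        raise ValueError("a last byte of 0 is not possible")
    value = int.from_bytes(data, "little")
    pos = value.bit_length() - 1  # useful bits are those below the flag
    lookup = {pair: symbol for symbol, pair in codes.items()}
    out = []
    for _ in range(count):
        for length in range(1, pos + 1):
            code = (value >> (pos - length)) & ((1 << length) - 1)
            if (length, code) in lookup:
                out.append(lookup[(length, code)])
                pos -= length
                break
        else:
            raise ValueError("corrupted bitstream")
    if pos != 0:
        raise ValueError("bitstream not entirely and exactly consumed")
    return out


stream = encode_stream([0, 1, 4, 5], CODES)
print(decode_stream(stream, CODES, 4))  # [0, 1, 4, 5]
```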

Reading the highest `Max_Number_of_Bits` bits,
it's possible to compare the extracted value to the decoding table,
determining the symbol to decode and the number of bits to discard.

The process continues until the required number of symbols per stream has been read.
A bitstream must be entirely and exactly consumed,
reaching exactly its beginning position with _all_ bits consumed.
Otherwise, the decoding process is considered faulty.


Dictionary Format
-----------------

Zstandard is compatible with "raw content" dictionaries,
free of any format restriction, except that they must be at least 8 bytes.
These dictionaries function as if they were just the `Content` part
of a formatted dictionary.

But dictionaries created by `zstd --train` follow a format, described here.

__Pre-requisites__ : a dictionary has a size,
defined either by a buffer limit, or a file size.

| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
| -------------- | --------------- | ---------------- | --------- |

__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format

__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format.
`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
It's used by decoders to check if they use the correct dictionary.

_Reserved ranges :_
If the dictionary is going to be distributed in a public environment,
the following ranges of `Dictionary_ID` are reserved for some future registrar
and shall not be used :

- low range : <= 32767
- high range : >= (2^31)

Outside of these ranges, any value of `Dictionary_ID`
which is both `>= 32768` and `< (1<<31)` can be used freely,
even in a public environment.
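
A minimal Python sketch of the first two fields follows. The function names are assumptions made for this example, and treating a buffer without the magic number as a "raw content" dictionary is a choice of this sketch rather than a rule stated here.

```python
import struct

DICT_MAGIC_NUMBER = 0xEC30A437


def read_dictionary_id(data):
    """Return the Dictionary_ID of a formatted dictionary.

    Returns None when the magic number is absent, in which case this
    sketch treats the buffer as a "raw content" dictionary.
    """
    if len(data) < 8:
        raise ValueError("a dictionary must be at least 8 bytes")
    # Both fields are 4 bytes, little-endian.
    magic, dictionary_id = struct.unpack_from("<II", data, 0)
    if magic != DICT_MAGIC_NUMBER:
        return None
    return dictionary_id


def is_reserved_id(dictionary_id):
    """True if the value falls in the low or high reserved range."""
    return dictionary_id <= 32767 or dictionary_id >= (1 << 31)


header = bytes.fromhex("37a430ec") + struct.pack("<I", 40000)
print(read_dictionary_id(header), is_reserved_id(40000))  # 40000 False
```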


__`Entropy_Tables`__ : follow the same format as tables in [compressed blocks].
See the relevant [FSE](#fse-table-description)
and [Huffman](#huffman-tree-description) sections for how to decode these tables.
They are stored in the following order :
Huffman tables for literals, FSE table for offsets,
FSE table for match lengths, and FSE table for literals lengths.
These tables populate the Repeat Stats literals mode and
Repeat distribution mode for sequence decoding.
They are finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes.
Each recent offset must have a value <= dictionary content size, and cannot equal 0.
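
The constraint on the recent offsets can be checked with a few lines of Python. This sketch assumes the caller has already decoded the Huffman and FSE tables and therefore knows where the 12 bytes start; `read_recent_offsets` and its parameters are names invented for this example.

```python
import struct


def read_recent_offsets(raw, content_size):
    """Parse the 3 recent offsets closing `Entropy_Tables`.

    `raw` holds exactly the 12 bytes following the FSE tables,
    `content_size` is the size of the dictionary `Content` in bytes.
    """
    if len(raw) != 12:
        raise ValueError("expected 3 offsets of 4 bytes each")
    offsets = struct.unpack("<3I", raw)  # little-endian, stored in order
    for offset in offsets:
        if offset == 0 or offset > content_size:
            raise ValueError("invalid recent offset: {}".format(offset))
    return list(offsets)


print(read_recent_offsets(struct.pack("<3I", 1, 4, 8), content_size=1024))
# [1, 4, 8]
```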

__`Content`__ : The rest of the dictionary is its content.
The content acts as a "past" in front of data to compress or decompress,
so it can be referenced in sequence commands.
As long as the amount of data decoded from this frame is less than or
equal to `Window_Size`, sequence commands may specify offsets longer
than the total length of decoded output so far to reference back to the
dictionary, even parts of the dictionary with offsets larger than `Window_Size`.
After the total output has surpassed `Window_Size` however,
this is no longer allowed and the dictionary is no longer accessible.

[compressed blocks]: #the-format-of-compressed_block

If a dictionary is provided by an external source,
it should be loaded with great care, its content considered untrusted.



Appendix A - Decoding tables for predefined codes
-------------------------------------------------

This appendix contains FSE decoding tables
for the predefined literal length, match length, and offset codes.
The tables have been constructed using the algorithm as given above in chapter
"from normalized distribution to decoding tables".
The tables here can be used as examples
to crosscheck that an implementation builds its decoding tables correctly.
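
As an example of such a crosscheck, the Python sketch below rebuilds the decoding table for the predefined offset codes. It is an illustration, not the reference code: the function name is an assumption, and the default distribution and accuracy log are copied from the predefined distributions given earlier in this specification. The rows it produces can be compared with the Offset Code table of this appendix.

```python
# Predefined distribution for offset codes, accuracy log 5 (-1 means "less than 1").
OFFSET_DEFAULT_DISTRIBUTION = [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
                               1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1]
OFFSET_ACCURACY_LOG = 5


def build_fse_decoding_table(distribution, accuracy_log):
    """Return a list of (symbol, number_of_bits, base) tuples, indexed by state."""
    size = 1 << accuracy_log
    symbols = [None] * size
    # "Less than 1" symbols take one cell each, starting from the end of the table.
    high = size - 1
    for symbol, count in enumerate(distribution):
        if count == -1:
            symbols[high] = symbol
            high -= 1
    # Spread the other symbols, skipping cells taken by "less than 1" symbols.
    step = (size >> 1) + (size >> 3) + 3
    position = 0
    for symbol, count in enumerate(distribution):
        for _ in range(max(count, 0)):
            symbols[position] = symbol
            position = (position + step) & (size - 1)
            while position > high:
                position = (position + step) & (size - 1)
    assert position == 0, "distribution does not fill the table"
    # Assign Number_Of_Bits and Base, visiting states in increasing order.
    next_state = [abs(count) for count in distribution]
    table = []
    for symbol in symbols:
        x = next_state[symbol]
        next_state[symbol] += 1
        nb_bits = accuracy_log - (x.bit_length() - 1)
        table.append((symbol, nb_bits, (x << nb_bits) - size))
    return table


table = build_fse_decoding_table(OFFSET_DEFAULT_DISTRIBUTION, OFFSET_ACCURACY_LOG)
for state, (symbol, nb_bits, base) in enumerate(table):
    print("| {} | {} | {} | {} |".format(state, symbol, nb_bits, base))
```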

#### Literal Length Code:

| State | Symbol | Number_Of_Bits | Base |
| ----- | ------ | -------------- | ---- |
| 0 | 0 | 4 | 0 |
| 1 | 0 | 4 | 16 |
| 2 | 1 | 5 | 32 |
| 3 | 3 | 5 | 0 |
| 4 | 4 | 5 | 0 |
| 5 | 6 | 5 | 0 |
| 6 | 7 | 5 | 0 |
| 7 | 9 | 5 | 0 |
| 8 | 10 | 5 | 0 |
| 9 | 12 | 5 | 0 |
| 10 | 14 | 6 | 0 |
| 11 | 16 | 5 | 0 |
| 12 | 18 | 5 | 0 |
| 13 | 19 | 5 | 0 |
| 14 | 21 | 5 | 0 |
| 15 | 22 | 5 | 0 |
| 16 | 24 | 5 | 0 |
| 17 | 25 | 5 | 32 |
| 18 | 26 | 5 | 0 |
| 19 | 27 | 6 | 0 |
| 20 | 29 | 6 | 0 |
| 21 | 31 | 6 | 0 |
| 22 | 0 | 4 | 32 |
| 23 | 1 | 4 | 0 |
| 24 | 2 | 5 | 0 |
| 25 | 4 | 5 | 32 |
| 26 | 5 | 5 | 0 |
| 27 | 7 | 5 | 32 |
| 28 | 8 | 5 | 0 |
| 29 | 10 | 5 | 32 |
| 30 | 11 | 5 | 0 |
| 31 | 13 | 6 | 0 |
| 32 | 16 | 5 | 32 |
| 33 | 17 | 5 | 0 |
| 34 | 19 | 5 | 32 |
| 35 | 20 | 5 | 0 |
| 36 | 22 | 5 | 32 |
| 37 | 23 | 5 | 0 |
| 38 | 25 | 4 | 0 |
| 39 | 25 | 4 | 16 |
| 40 | 26 | 5 | 32 |
| 41 | 28 | 6 | 0 |
| 42 | 30 | 6 | 0 |
| 43 | 0 | 4 | 48 |
| 44 | 1 | 4 | 16 |
| 45 | 2 | 5 | 32 |
| 46 | 3 | 5 | 32 |
| 47 | 5 | 5 | 32 |
| 48 | 6 | 5 | 32 |
| 49 | 8 | 5 | 32 |
| 50 | 9 | 5 | 32 |
| 51 | 11 | 5 | 32 |
| 52 | 12 | 5 | 32 |
| 53 | 15 | 6 | 0 |
| 54 | 17 | 5 | 32 |
| 55 | 18 | 5 | 32 |
| 56 | 20 | 5 | 32 |
| 57 | 21 | 5 | 32 |
| 58 | 23 | 5 | 32 |
| 59 | 24 | 5 | 32 |
| 60 | 35 | 6 | 0 |
| 61 | 34 | 6 | 0 |
| 62 | 33 | 6 | 0 |
| 63 | 32 | 6 | 0 |

#### Match Length Code:

| State | Symbol | Number_Of_Bits | Base |
| ----- | ------ | -------------- | ---- |
| 0 | 0 | 6 | 0 |
| 1 | 1 | 4 | 0 |
| 2 | 2 | 5 | 32 |
| 3 | 3 | 5 | 0 |
| 4 | 5 | 5 | 0 |
| 5 | 6 | 5 | 0 |
| 6 | 8 | 5 | 0 |
| 7 | 10 | 6 | 0 |
| 8 | 13 | 6 | 0 |
| 9 | 16 | 6 | 0 |
| 10 | 19 | 6 | 0 |
| 11 | 22 | 6 | 0 |
| 12 | 25 | 6 | 0 |
| 13 | 28 | 6 | 0 |
| 14 | 31 | 6 | 0 |
| 15 | 33 | 6 | 0 |
| 16 | 35 | 6 | 0 |
| 17 | 37 | 6 | 0 |
| 18 | 39 | 6 | 0 |
| 19 | 41 | 6 | 0 |
| 20 | 43 | 6 | 0 |
| 21 | 45 | 6 | 0 |
| 22 | 1 | 4 | 16 |
| 23 | 2 | 4 | 0 |
| 24 | 3 | 5 | 32 |
| 25 | 4 | 5 | 0 |
| 26 | 6 | 5 | 32 |
| 27 | 7 | 5 | 0 |
| 28 | 9 | 6 | 0 |
| 29 | 12 | 6 | 0 |
| 30 | 15 | 6 | 0 |
| 31 | 18 | 6 | 0 |
| 32 | 21 | 6 | 0 |
| 33 | 24 | 6 | 0 |
| 34 | 27 | 6 | 0 |
| 35 | 30 | 6 | 0 |
| 36 | 32 | 6 | 0 |
| 37 | 34 | 6 | 0 |
| 38 | 36 | 6 | 0 |
| 39 | 38 | 6 | 0 |
| 40 | 40 | 6 | 0 |
| 41 | 42 | 6 | 0 |
| 42 | 44 | 6 | 0 |
| 43 | 1 | 4 | 32 |
| 44 | 1 | 4 | 48 |
| 45 | 2 | 4 | 16 |
| 46 | 4 | 5 | 32 |
| 47 | 5 | 5 | 32 |
| 48 | 7 | 5 | 32 |
| 49 | 8 | 5 | 32 |
| 50 | 11 | 6 | 0 |
| 51 | 14 | 6 | 0 |
| 52 | 17 | 6 | 0 |
| 53 | 20 | 6 | 0 |
| 54 | 23 | 6 | 0 |
| 55 | 26 | 6 | 0 |
| 56 | 29 | 6 | 0 |
| 57 | 52 | 6 | 0 |
| 58 | 51 | 6 | 0 |
| 59 | 50 | 6 | 0 |
| 60 | 49 | 6 | 0 |
| 61 | 48 | 6 | 0 |
| 62 | 47 | 6 | 0 |
| 63 | 46 | 6 | 0 |

#### Offset Code:

| State | Symbol | Number_Of_Bits | Base |
| ----- | ------ | -------------- | ---- |
| 0 | 0 | 5 | 0 |
| 1 | 6 | 4 | 0 |
| 2 | 9 | 5 | 0 |
| 3 | 15 | 5 | 0 |
| 4 | 21 | 5 | 0 |
| 5 | 3 | 5 | 0 |
| 6 | 7 | 4 | 0 |
| 7 | 12 | 5 | 0 |
| 8 | 18 | 5 | 0 |
| 9 | 23 | 5 | 0 |
| 10 | 5 | 5 | 0 |
| 11 | 8 | 4 | 0 |
| 12 | 14 | 5 | 0 |
| 13 | 20 | 5 | 0 |
| 14 | 2 | 5 | 0 |
| 15 | 7 | 4 | 16 |
| 16 | 11 | 5 | 0 |
| 17 | 17 | 5 | 0 |
| 18 | 22 | 5 | 0 |
| 19 | 4 | 5 | 0 |
| 20 | 8 | 4 | 16 |
| 21 | 13 | 5 | 0 |
| 22 | 19 | 5 | 0 |
| 23 | 1 | 5 | 0 |
| 24 | 6 | 4 | 16 |
| 25 | 10 | 5 | 0 |
| 26 | 16 | 5 | 0 |
| 27 | 28 | 5 | 0 |
| 28 | 27 | 5 | 0 |
| 29 | 26 | 5 | 0 |
| 30 | 25 | 5 | 0 |
| 31 | 24 | 5 | 0 |



Appendix B - Resources for implementers
-------------------------------------------------

An open source reference implementation is available on :
https://github.com/facebook/zstd

The project contains a frame generator, called [decodeCorpus],
which can be used by any 3rd-party implementation
to verify that a tested decoder is compliant with the specification.

[decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing

`decodeCorpus` generates random valid frames.
A compliant decoder should be able to decode them all,
or at least provide a meaningful error code explaining why it cannot
(memory limit restrictions for example).


Version changes
---------------
- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
- 0.3.9 : clarifications for Huffman-compressed literal sizes.
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
- 0.3.6 : clarifications for Dictionary_ID
- 0.3.5 : clarifications for Block_Maximum_Size
- 0.3.4 : clarifications for FSE decoding table
- 0.3.3 : clarifications for field Block_Size
- 0.3.2 : remove additional block size restriction on compressed blocks
- 0.3.1 : minor clarification regarding offset history update rules
- 0.3.0 : minor edits to match RFC8478
- 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz
- 0.2.8 : clarifications for IETF RFC discuss
- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
- 0.2.5 : minor typos and clarifications
- 0.2.4 : section restructuring, by Sean Purcell
- 0.2.3 : clarified several details, by Sean Purcell
- 0.2.2 : added predefined codes, by Johannes Rudolph
- 0.2.1 : clarify field names, by Przemyslaw Skibinski
- 0.2.0 : numerous format adjustments for zstd v0.8+
- 0.1.2 : limit Huffman tree depth to 11 bits
- 0.1.1 : reserved dictID ranges
- 0.1.0 : initial release