Blame - doc/zstd_compression_format.md - external_zstd

blob: a2bf20cbc7620e82dae6bd93f55ad348b2f9144f [file] [log] [blame] [view]

Yann Collet	5cc1882	2016-07-03 19:03:13 +0200	[diff] [blame]	1	Zstandard Compression Format
				2	============================
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	3
				4	### Notices
				5
W. Felix Handte	5d693cc	2022-12-20 12:49:47 -0500	[diff] [blame]	6	Copyright (c) Meta Platforms, Inc. and affiliates.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	7
				8	Permission is granted to copy and distribute this document
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	9	for any purpose and without charge,
				10	including translations into other languages
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	11	and incorporation into compilations,
				12	provided that the copyright notice and this notice are preserved,
				13	and that any substantive changes or deletions from the original
				14	are clearly marked.
				15	Distribution of this document is unlimited.
				16
				17	### Version
				18
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	19	0.4.3 (2024-10-07)
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	20
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	21
				22	Introduction
				23	------------
				24
				25	The purpose of this document is to define a lossless compressed data format,
				26	that is independent of CPU type, operating system,
				27	file system and character set, suitable for
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	28	file compression, pipe and streaming compression,
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	29	using the [Zstandard algorithm](https://facebook.github.io/zstd/).
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	30	The text of the specification assumes a basic background in programming
				31	at the level of bits and other primitive data representations.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	32
				33	The data can be produced or consumed,
				34	even for an arbitrarily long sequentially presented input data stream,
				35	using only an a priori bounded amount of intermediate storage,
				36	and hence can be used in data communications.
				37	The format uses the Zstandard compression method,
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	38	and optional [xxHash-64 checksum method](https://cyan4973.github.io/xxHash/),
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	39	for detection of data corruption.
				40
				41	The data format defined by this specification
				42	does not attempt to allow random access to compressed data.
				43
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	44	Unless otherwise indicated below,
				45	a compliant compressor must produce data sets
				46	that conform to the specifications presented here.
				47	It doesn’t need to support all options though.
				48
				49	A compliant decompressor must be able to decompress
				50	at least one working set of parameters
				51	that conforms to the specifications presented here.
				52	It may also ignore informative fields, such as checksum.
				53	Whenever it does not support a parameter defined in the compressed stream,
				54	it must produce a non-ambiguous error code and associated error message
				55	explaining which parameter is unsupported.
				56
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	57	This specification is intended for use by implementers of software
				58	to compress data into Zstandard format and/or decompress data from Zstandard format.
				59	The Zstandard format is supported by an open source reference implementation,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	60	written in portable C, and available at : https://github.com/facebook/zstd .
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	61
				62
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	63	### Overall conventions
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	64	In this document:
				65	- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	66	- the naming convention for identifiers is `Mixed_Case_With_Underscores`
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	67
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	68	### Definitions
				69	Content compressed by Zstandard is transformed into a Zstandard __frame__.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	70	Multiple frames can be appended into a single file or stream.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	71	A frame is completely independent, has a defined beginning and end,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	72	and a set of parameters which tells the decoder how to decompress it.
				73
				74	A frame encapsulates one or multiple __blocks__.
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	75	Each block contains arbitrary content, which is described by its header,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	76	and has a guaranteed maximum content size, which depends on frame parameters.
				77	Unlike frames, each block depends on previous blocks for proper decoding.
				78	However, each block can be decompressed without waiting for its successor,
				79	allowing streaming operations.
				80
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	81	Overview
				82	---------
				83	- [Frames](#frames)
				84	- [Zstandard frames](#zstandard-frames)
				85	- [Blocks](#blocks)
				86	- [Literals Section](#literals-section)
				87	- [Sequences Section](#sequences-section)
				88	- [Sequence Execution](#sequence-execution)
				89	- [Skippable frames](#skippable-frames)
				90	- [Entropy Encoding](#entropy-encoding)
				91	- [FSE](#fse)
				92	- [Huffman Coding](#huffman-coding)
				93	- [Dictionary Format](#dictionary-format)
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	94
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	95	Frames
				96	------
Yann Collet	fccb46f	2017-11-18 11:28:00 -0800	[diff] [blame]	97	Zstandard compressed data is made of one or more __frames__.
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	98	Each frame is independent and can be decompressed independently of other frames.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	99	The decompressed content of multiple concatenated frames is the concatenation of
Yann Collet	fccb46f	2017-11-18 11:28:00 -0800	[diff] [blame]	100	each frame decompressed content.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	101
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	102	There are two frame formats defined by Zstandard:
				103	Zstandard frames and Skippable frames.
				104	Zstandard frames contain compressed data, while
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	105	skippable frames contain custom user metadata.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	106
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	107	## Zstandard frames
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	108	The structure of a single Zstandard frame is following:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	109
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	110	\| `Magic_Number` \| `Frame_Header` \|`Data_Block`\| [More data blocks] \| [`Content_Checksum`] \|
				111	\|:--------------:\|:--------------:\|:----------:\| ------------------ \|:--------------------:\|
Yann Collet	8b12812	2017-08-19 12:17:57 -0700	[diff] [blame]	112	\| 4 bytes \| 2-14 bytes \| n bytes \| \| 0-4 bytes \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	113
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	114	__`Magic_Number`__
				115
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	116	4 Bytes, __little-endian__ format.
Yann Collet	7bdfcea	2016-09-05 17:43:31 +0200	[diff] [blame]	117	Value : 0xFD2FB528
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	118	Note: This value was selected to be less probable to find at the beginning of some random file.
				119	It avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.),
				120	contains byte values outside of ASCII range,
				121	and doesn't map into UTF8 space.
				122	It reduces the chances that a text file represent this value by accident.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	123
				124	__`Frame_Header`__
				125
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	126	2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header).
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	127
				128	__`Data_Block`__
				129
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	130	Detailed in [`Blocks`](#blocks).
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	131	That’s where compressed data is stored.
				132
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	133	__`Content_Checksum`__
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	134
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	135	An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	136	The content checksum is the result
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	137	of [xxh64() hash function](https://cyan4973.github.io/xxHash/)
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	138	digesting the original (decoded) data as input, and a seed of zero.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	139	The low 4 bytes of the checksum are stored in __little-endian__ format.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	140
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	141	### `Frame_Header`
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	142
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	143	The `Frame_Header` has a variable size, with a minimum of 2 bytes,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	144	and up to 14 bytes depending on optional parameters.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	145	The structure of `Frame_Header` is following:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	146
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	147	\| `Frame_Header_Descriptor` \| [`Window_Descriptor`] \| [`Dictionary_ID`] \| [`Frame_Content_Size`] \|
				148	\| ------------------------- \| --------------------- \| ----------------- \| ---------------------- \|
				149	\| 1 byte \| 0-1 byte \| 0-4 bytes \| 0-8 bytes \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	150
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	151	#### `Frame_Header_Descriptor`
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	152
				153	The first header's byte is called the `Frame_Header_Descriptor`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	154	It describes which other fields are present.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	155	Decoding this byte is enough to tell the size of `Frame_Header`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	156
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	157	\| Bit number \| Field name \|
				158	\| ---------- \| ---------- \|
				159	\| 7-6 \| `Frame_Content_Size_flag` \|
				160	\| 5 \| `Single_Segment_flag` \|
				161	\| 4 \| `Unused_bit` \|
				162	\| 3 \| `Reserved_bit` \|
				163	\| 2 \| `Content_Checksum_flag` \|
				164	\| 1-0 \| `Dictionary_ID_flag` \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	165
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	166	In this table, bit 7 is the highest bit, while bit 0 is the lowest one.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	167
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	168	__`Frame_Content_Size_flag`__
				169
				170	This is a 2-bits flag (`= Frame_Header_Descriptor >> 6`),
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	171	specifying if `Frame_Content_Size` (the decompressed data size)
				172	is provided within the header.
				173	`Flag_Value` provides `FCS_Field_Size`,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	174	which is the number of bytes used by `Frame_Content_Size`
				175	according to the following table:
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	176
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	177	\| `Flag_Value` \| 0 \| 1 \| 2 \| 3 \|
				178	\| -------------- \| ------ \| --- \| --- \| --- \|
				179	\|`FCS_Field_Size`\| 0 or 1 \| 2 \| 4 \| 8 \|
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	180
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	181	When `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` :
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	182	if `Single_Segment_flag` is set, `FCS_Field_Size` is 1.
				183	Otherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided.
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	184
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	185	__`Single_Segment_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	186
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	187	If this flag is set,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	188	data must be regenerated within a single continuous memory segment.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	189
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	190	In this case, `Window_Descriptor` byte is skipped,
				191	but `Frame_Content_Size` is necessarily present.
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	192	As a consequence, the decoder must allocate a memory segment
Yann Collet	fccb46f	2017-11-18 11:28:00 -0800	[diff] [blame]	193	of size equal or larger than `Frame_Content_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	194
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	195	In order to preserve the decoder from unreasonable memory requirements,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	196	a decoder is allowed to reject a compressed frame
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	197	which requests a memory size beyond decoder's authorized range.
				198
				199	For broader compatibility, decoders are recommended to support
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	200	memory sizes of at least 8 MB.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	201	This is only a recommendation,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	202	each decoder is free to support higher or lower limits,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	203	depending on local limitations.
				204
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	205	__`Unused_bit`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	206
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	207	A decoder compliant with this specification version shall not interpret this bit.
				208	It might be used in any future version,
				209	to signal a property which is transparent to properly decode the frame.
				210	An encoder compliant with this specification version must set this bit to zero.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	211
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	212	__`Reserved_bit`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	213
				214	This bit is reserved for some future feature.
				215	Its value _must be zero_.
				216	A decoder compliant with this specification version must ensure it is not set.
				217	This bit may be used in a future revision,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	218	to signal a feature that must be interpreted to decode the frame correctly.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	219
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	220	__`Content_Checksum_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	221
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	222	If this flag is set, a 32-bits `Content_Checksum` will be present at frame's end.
				223	See `Content_Checksum` paragraph.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	224
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	225	__`Dictionary_ID_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	226
				227	This is a 2-bits flag (`= FHD & 3`),
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	228	telling if a dictionary ID is provided within the header.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	229	It also specifies the size of this field as `DID_Field_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	230
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	231	\|`Flag_Value` \| 0 \| 1 \| 2 \| 3 \|
				232	\| -------------- \| --- \| --- \| --- \| --- \|
				233	\|`DID_Field_Size`\| 0 \| 1 \| 2 \| 4 \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	234
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	235	#### `Window_Descriptor`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	236
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	237	Provides guarantees on minimum memory buffer required to decompress a frame.
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	238	This information is important for decoders to allocate enough memory.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	239
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	240	The `Window_Descriptor` byte is optional.
				241	When `Single_Segment_flag` is set, `Window_Descriptor` is not present.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	242	In this case, `Window_Size` is `Frame_Content_Size`,
				243	which can be any value from 0 to 2^64-1 bytes (16 ExaBytes).
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	244
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	245	\| Bit numbers \| 7-3 \| 2-0 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	246	\| ----------- \| ---------- \| ---------- \|
				247	\| Field name \| `Exponent` \| `Mantissa` \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	248
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	249	The minimum memory buffer size is called `Window_Size`.
				250	It is described by the following formulas :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	251	```
				252	windowLog = 10 + Exponent;
				253	windowBase = 1 << windowLog;
				254	windowAdd = (windowBase / 8) * Mantissa;
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	255	Window_Size = windowBase + windowAdd;
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	256	```
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	257	The minimum `Window_Size` is 1 KB.
				258	The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	259
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	260	In general, larger `Window_Size` tend to improve compression ratio,
				261	but at the cost of memory usage.
				262
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	263	To properly decode compressed data,
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	264	a decoder will need to allocate a buffer of at least `Window_Size` bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	265
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	266	In order to preserve decoder from unreasonable memory requirements,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	267	a decoder is allowed to reject a compressed frame
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	268	which requests a memory size beyond decoder's authorized range.
				269
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	270	For improved interoperability,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	271	it's recommended for decoders to support `Window_Size` of up to 8 MB,
				272	and it's recommended for encoders to not generate frame requiring `Window_Size` larger than 8 MB.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	273	It's merely a recommendation though,
				274	decoders are free to support larger or lower limits,
				275	depending on local limitations.
				276
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	277	#### `Dictionary_ID`
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	278
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	279	This is a variable size field, which contains
				280	the ID of the dictionary required to properly decode the frame.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	281	`Dictionary_ID` field is optional. When it's not present,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	282	it's up to the decoder to know which dictionary to use.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	283
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	284	`Dictionary_ID` field size is provided by `DID_Field_Size`.
				285	`DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	286	1 byte can represent an ID 0-255.
				287	2 bytes can represent an ID 0-65535.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	288	4 bytes can represent an ID 0-4294967295.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	289	Format is __little-endian__.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	290
				291	It's allowed to represent a small ID (for example `13`)
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	292	with a large 4-bytes dictionary ID, even if it is less efficient.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	293
Yann Collet	bb3c9bf	2020-05-25 08:15:09 -0700	[diff] [blame]	294	A value of `0` has same meaning as no `Dictionary_ID`,
				295	in which case the frame may or may not need a dictionary to be decoded,
				296	and the ID of such a dictionary is not specified.
				297	The decoder must know this information by other means.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	298
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	299	#### `Frame_Content_Size`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	300
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	301	This is the original (uncompressed) size. This information is optional.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	302	`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`.
				303	`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`.
				304	`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	305
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	306	\| `FCS_Field_Size` \| Range \|
				307	\| ---------------- \| ---------- \|
				308	\| 0 \| unknown \|
				309	\| 1 \| 0 - 255 \|
				310	\| 2 \| 256 - 65791\|
				311	\| 4 \| 0 - 2^32-1 \|
				312	\| 8 \| 0 - 2^64-1 \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	313
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	314	`Frame_Content_Size` format is __little-endian__.
				315	When `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly.
				316	When `FCS_Field_Size` is 2, _the offset of 256 is added_.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	317	It's allowed to represent a small size (for example `18`) using any compatible variant.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	318
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	319
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	320	Blocks
				321	-------
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	322
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	323	After `Magic_Number` and `Frame_Header`, there are some number of blocks.
				324	Each frame must have at least one block,
				325	but there is no upper limit on the number of blocks per frame.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	326
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	327	The structure of a block is as follows:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	328
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	329	\| `Block_Header` \| `Block_Content` \|
				330	\|:--------------:\|:---------------:\|
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	331	\| 3 bytes \| n bytes \|
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	332
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	333	__`Block_Header`__
				334
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	335	`Block_Header` uses 3 bytes, written using __little-endian__ convention.
				336	It contains 3 fields :
				337
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	338	\| `Last_Block` \| `Block_Type` \| `Block_Size` \|
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	339	\|:------------:\|:------------:\|:------------:\|
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	340	\| bit 0 \| bits 1-2 \| bits 3-23 \|
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	341
				342	__`Last_Block`__
				343
				344	The lowest bit signals if this block is the last one.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	345	The frame will end after this last block.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	346	It may be followed by an optional `Content_Checksum`
				347	(see [Zstandard Frames](#zstandard-frames)).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	348
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	349	__`Block_Type`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	350
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	351	The next 2 bits represent the `Block_Type`.
Yann Collet	1e07eb4	2019-08-16 15:13:42 +0200	[diff] [blame]	352	`Block_Type` influences the meaning of `Block_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	353	There are 4 block types :
				354
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	355	\| Value \| 0 \| 1 \| 2 \| 3 \|
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	356	\| ------------ \| ----------- \| ----------- \| ------------------ \| --------- \|
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	357	\| `Block_Type` \| `Raw_Block` \| `RLE_Block` \| `Compressed_Block` \| `Reserved`\|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	358
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	359	- `Raw_Block` - this is an uncompressed block.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	360	`Block_Content` contains `Block_Size` bytes.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	361
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	362	- `RLE_Block` - this is a single byte, repeated `Block_Size` times.
				363	`Block_Content` consists of a single byte.
				364	On the decompression side, this byte must be repeated `Block_Size` times.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	365
				366	- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks),
				367	explained later on.
				368	`Block_Size` is the length of `Block_Content`, the compressed data.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	369	The decompressed size is not known,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	370	but its maximum possible value is guaranteed (see below)
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	371
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	372	- `Reserved` - this is not a block.
				373	This value cannot be used with current version of this specification.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	374	If such a value is present, it is considered corrupted data.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	375
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	376	__`Block_Size`__
				377
				378	The upper 21 bits of `Block_Header` represent the `Block_Size`.
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	379
Yann Collet	1e07eb4	2019-08-16 15:13:42 +0200	[diff] [blame]	380	When `Block_Type` is `Compressed_Block` or `Raw_Block`,
Yann Collet	bb3c9bf	2020-05-25 08:15:09 -0700	[diff] [blame]	381	`Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`).
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	382
				383	When `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1,
				384	`Block_Size` represents the number of times this byte must be repeated.
				385
				386	`Block_Size` is limited by `Block_Maximum_Size` (see below).
				387
				388	__`Block_Content`__ and __`Block_Maximum_Size`__
				389
				390	The size of `Block_Content` is limited by `Block_Maximum_Size`,
				391	which is the smallest of:
				392	- `Window_Size`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	393	- 128 KB
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	394
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	395	`Block_Maximum_Size` is constant for a given frame.
				396	This maximum is applicable to both the decompressed size
				397	and the compressed size of any block in the frame.
				398
				399	The reasoning for this limit is that a decoder can read this information
				400	at the beginning of a frame and use it to allocate buffers.
				401	The guarantees on the size of blocks ensure that
				402	the buffers will be large enough for any following block of the valid frame.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	403
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	404
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	405	Compressed Blocks
				406	-----------------
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	407	To decompress a compressed block, the compressed size must be provided
				408	from `Block_Size` field within `Block_Header`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	409
				410	A compressed block consists of 2 sections :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	411	- [Literals Section](#literals-section)
				412	- [Sequences Section](#sequences-section)
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	413
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	414	The results of the two sections are then combined to produce the decompressed
				415	data in [Sequence Execution](#sequence-execution)
				416
				417	#### Prerequisites
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	418	To decode a compressed block, the following elements are necessary :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	419	- Previous decoded data, up to a distance of `Window_Size`,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	420	or beginning of the Frame, whichever is smaller.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	421	- List of "recent offsets" from previous `Compressed_Block`.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	422	- The previous Huffman tree, required by `Treeless_Literals_Block` type
				423	- Previous FSE decoding tables, required by `Repeat_Mode`
				424	for each symbol type (literals lengths, match lengths, offsets)
				425
				426	Note that decoding tables aren't always from the previous `Compressed_Block`.
				427
				428	- Every decoding table can come from a dictionary.
				429	- The Huffman tree comes from the previous `Compressed_Literals_Block`.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	430
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	431	Literals Section
				432	----------------
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	433	All literals are regrouped in the first part of the block.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	434	They can be decoded first, and then copied during [Sequence Execution],
				435	or they can be decoded on the flow during [Sequence Execution].
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	436
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	437	Literals can be stored uncompressed or compressed using Huffman prefix codes.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	438	When compressed, a tree description may optionally be present,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	439	followed by 1 or 4 streams.
				440
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	441	\| `Literals_Section_Header` \| [`Huffman_Tree_Description`] \| [jumpTable] \| Stream1 \| [Stream2] \| [Stream3] \| [Stream4] \|
				442	\| ------------------------- \| ---------------------------- \| ----------- \| ------- \| --------- \| --------- \| --------- \|
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	443
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	444
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	445	### `Literals_Section_Header`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	446
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	447	Header is in charge of describing how literals are packed.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	448	It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	449	using __little-endian__ convention.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	450
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	451	\| `Literals_Block_Type` \| `Size_Format` \| `Regenerated_Size` \| [`Compressed_Size`] \|
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	452	\| --------------------- \| ------------- \| ------------------ \| ------------------- \|
				453	\| 2 bits \| 1 - 2 bits \| 5 - 20 bits \| 0 - 18 bits \|
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	454
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	455	In this representation, bits on the left are the lowest bits.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	456
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	457	__`Literals_Block_Type`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	458
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	459	This field uses 2 lowest bits of first byte, describing 4 different block types :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	460
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	461	\| `Literals_Block_Type` \| Value \|
				462	\| --------------------------- \| ----- \|
				463	\| `Raw_Literals_Block` \| 0 \|
				464	\| `RLE_Literals_Block` \| 1 \|
				465	\| `Compressed_Literals_Block` \| 2 \|
				466	\| `Treeless_Literals_Block` \| 3 \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	467
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	468	- `Raw_Literals_Block` - Literals are stored uncompressed.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	469	- `RLE_Literals_Block` - Literals consist of a single byte value
				470	repeated `Regenerated_Size` times.
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	471	- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	472	starting with a Huffman tree description.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	473	In this mode, there are at least 2 different literals represented in the Huffman tree description.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	474	See details below.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	475	- `Treeless_Literals_Block` - This is a Huffman-compressed block,
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	476	using Huffman tree _from previous Huffman-compressed literals block_.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	477	`Huffman_Tree_Description` will be skipped.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	478	Note: If this mode is triggered without any previous Huffman-table in the frame
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	479	(or [dictionary](#dictionary-format)), this should be treated as data corruption.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	480
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	481	__`Size_Format`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	482
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	483	`Size_Format` is divided into 2 families :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	484
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	485	- For `Raw_Literals_Block` and `RLE_Literals_Block`,
				486	it's only necessary to decode `Regenerated_Size`.
				487	There is no `Compressed_Size` field.
				488	- For `Compressed_Block` and `Treeless_Literals_Block`,
				489	it's required to decode both `Compressed_Size`
				490	and `Regenerated_Size` (the decompressed size).
				491	It's also necessary to decode the number of streams (1 or 4).
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	492
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	493	For values spanning several bytes, convention is __little-endian__.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	494
inikep	9d003c1	2016-08-04 10:41:49 +0200	[diff] [blame]	495	__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	496
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	497	`Size_Format` uses 1 _or_ 2 bits.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	498	Its value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	499
				500	- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit.
Sean Purcell	d86153d	2017-01-26 16:58:25 -0800	[diff] [blame]	501	`Regenerated_Size` uses 5 bits (0-31).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	502	`Literals_Section_Header` uses 1 byte.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	503	`Regenerated_Size = Literals_Section_Header[0]>>3`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	504	- `Size_Format` == 01 : `Size_Format` uses 2 bits.
Sean Purcell	d86153d	2017-01-26 16:58:25 -0800	[diff] [blame]	505	`Regenerated_Size` uses 12 bits (0-4095).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	506	`Literals_Section_Header` uses 2 bytes.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	507	`Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)`
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	508	- `Size_Format` == 11 : `Size_Format` uses 2 bits.
Sean Purcell	d86153d	2017-01-26 16:58:25 -0800	[diff] [blame]	509	`Regenerated_Size` uses 20 bits (0-1048575).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	510	`Literals_Section_Header` uses 3 bytes.
Nick Terrell	c1a7def	2018-07-10 15:07:36 -0700	[diff] [blame]	511	`Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	512
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	513	Only Stream1 is present for these cases.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	514	Note : it's allowed to represent a short value (for example `27`)
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	515	using a long format, even if it's less efficient.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	516
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	517	__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	518
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	519	`Size_Format` always uses 2 bits.
				520
				521	- `Size_Format` == 00 : _A single stream_.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	522	Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	523	`Literals_Section_Header` uses 3 bytes.
				524	- `Size_Format` == 01 : 4 streams.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	525	Both `Regenerated_Size` and `Compressed_Size` use 10 bits (6-1023).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	526	`Literals_Section_Header` uses 3 bytes.
				527	- `Size_Format` == 10 : 4 streams.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	528	Both `Regenerated_Size` and `Compressed_Size` use 14 bits (6-16383).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	529	`Literals_Section_Header` uses 4 bytes.
				530	- `Size_Format` == 11 : 4 streams.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	531	Both `Regenerated_Size` and `Compressed_Size` use 18 bits (6-262143).
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	532	`Literals_Section_Header` uses 5 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	533
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	534	Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
				535	Note: `Compressed_Size` __includes__ the size of the Huffman Tree description
				536	_when_ it is present.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	537	Note 2: `Compressed_Size` can never be `==0`.
				538	Even in single-stream scenario, assuming an empty content, it must be `>=1`,
				539	since it contains at least the final end bit flag.
				540	In 4-streams scenario, a valid `Compressed_Size` is necessarily `>= 10`
				541	(6 bytes for the jump table, + 4x1 bytes for the 4 streams).
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	542
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	543	4 streams is faster than 1 stream in decompression speed,
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	544	by exploiting instruction level parallelism.
				545	But it's also more expensive,
				546	costing on average ~7.3 bytes more than the 1 stream mode, mostly from the jump table.
				547
				548	In general, use the 4 streams mode when there are more literals to decode,
				549	to favor higher decompression speeds.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	550	Note that beyond >1KB of literals, the 4 streams mode is compulsory.
Yann Collet	6a9c525	2022-12-22 11:30:15 -0800	[diff] [blame]	551
				552	Note that a minimum of 6 bytes is required for the 4 streams mode.
				553	That's a technical minimum, but it's not recommended to employ the 4 streams mode
				554	for such a small quantity, that would be wasteful.
				555	A more practical lower bound would be around ~256 bytes.
				556
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	557	#### Raw Literals Block
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	558	The data in Stream1 is `Regenerated_Size` bytes long,
				559	it contains the raw literals data to be used during [Sequence Execution].
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	560
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	561	#### RLE Literals Block
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	562	Stream1 consists of a single byte which should be repeated `Regenerated_Size` times
				563	to generate the decoded literals.
				564
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	565	#### Compressed Literals Block and Treeless Literals Block
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	566	Both of these modes contain Huffman encoded data.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	567
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	568	For `Treeless_Literals_Block`,
				569	the Huffman table comes from previously compressed literals block,
				570	or from a dictionary.
				571
				572
				573	### `Huffman_Tree_Description`
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	574	This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	575	The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	576	The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	577	The size of `Huffman_Tree_Description` is determined during decoding process,
				578	it must be used to determine where streams begin.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	579	`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	580
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	581
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	582	### Jump Table
				583	The Jump Table is only present when there are 4 Huffman-coded streams.
				584
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	585	Reminder : Huffman compressed data consists of either 1 or 4 streams.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	586
				587	If only one stream is present, it is a single bitstream occupying the entire
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	588	remaining portion of the literals block, encoded as described in
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	589	[Huffman-Coded Streams](#huffman-coded-streams).
				590
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	591	If there are four streams, `Literals_Section_Header` only provided
				592	enough information to know the decompressed and compressed sizes
				593	of all four streams _combined_.
				594	The decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	595	except for the last stream which may be up to 3 bytes smaller,
				596	to reach a total decompressed size as specified in `Regenerated_Size`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	597
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	598	The compressed size of each stream is provided explicitly in the Jump Table.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	599	Jump Table is 6 bytes long, and consists of three 2-byte __little-endian__ fields,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	600	describing the compressed sizes of the first three streams.
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	601	`Stream4_Size` is computed from `Total_Streams_Size` minus sizes of other streams:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	602
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	603	`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	604
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	605	`Stream4_Size` is necessarily `>= 1`. Therefore,
				606	if `Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1`,
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	607	data is considered corrupted.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	608
				609	Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	610	as described in [Huffman-Coded Streams](#huffman-coded-streams)
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	611
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	612
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	613	Sequences Section
				614	-----------------
				615	A compressed block is a succession of _sequences_ .
				616	A sequence is a literal copy command, followed by a match copy command.
				617	A literal copy command specifies a length.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	618	It is the number of bytes to be copied (or extracted) from the Literals Section.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	619	A match copy command specifies an offset and a length.
				620
				621	When all _sequences_ are decoded,
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	622	if there are literals left in the _literals section_,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	623	these bytes are added at the end of the block.
				624
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	625	This is described in more detail in [Sequence Execution](#sequence-execution).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	626
				627	The `Sequences_Section` regroup all symbols required to decode commands.
				628	There are 3 symbol types : literals lengths, offsets and match lengths.
				629	They are encoded together, interleaved, in a single _bitstream_.
				630
				631	The `Sequences_Section` starts by a header,
				632	followed by optional probability tables for each symbol type,
				633	followed by the bitstream.
				634
				635	\| `Sequences_Section_Header` \| [`Literals_Length_Table`] \| [`Offset_Table`] \| [`Match_Length_Table`] \| bitStream \|
				636	\| -------------------------- \| ------------------------- \| ---------------- \| ---------------------- \| --------- \|
				637
				638	To decode the `Sequences_Section`, it's required to know its size.
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	639	Its size is deduced from the size of `Literals_Section`:
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	640	`Sequences_Section_Size = Block_Size - Literals_Section_Size`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	641
				642
				643	#### `Sequences_Section_Header`
				644
				645	Consists of 2 items:
				646	- `Number_of_Sequences`
				647	- Symbol compression modes
				648
				649	__`Number_of_Sequences`__
				650
				651	This is a variable size field using between 1 and 3 bytes.
				652	Let's call its first byte `byte0`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	653	- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
Yann Collet	1f83b7c	2023-06-05 09:51:52 -0700	[diff] [blame]	654	- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0 - 0x80) << 8) + byte1`. Uses 2 bytes.
				655	Note that the 2 bytes format fully overlaps the 1 byte format.
				656	- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00`. Uses 3 bytes.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	657
Yann Collet	3732a08	2023-06-05 16:03:00 -0700	[diff] [blame]	658	`if (Number_of_Sequences == 0)` : there are no sequences.
				659	The sequence section stops immediately,
				660	FSE tables used in `Repeat_Mode` aren't updated.
				661	Block's decompressed content is defined solely by the Literals Section content.
				662
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	663	__Symbol compression modes__
				664
				665	This is a single byte, defining the compression mode of each symbol type.
				666
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	667	\|Bit number\| 7-6 \| 5-4 \| 3-2 \| 1-0 \|
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	668	\| -------- \| ----------------------- \| -------------- \| -------------------- \| ---------- \|
				669	\|Field name\| `Literals_Lengths_Mode` \| `Offsets_Mode` \| `Match_Lengths_Mode` \| `Reserved` \|
				670
				671	The last field, `Reserved`, must be all-zeroes.
				672
				673	`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	674	literals lengths, offsets, and match lengths symbols respectively.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	675
				676	They follow the same enumeration :
				677
				678	\| Value \| 0 \| 1 \| 2 \| 3 \|
				679	\| ------------------ \| ----------------- \| ---------- \| --------------------- \| ------------- \|
				680	\| `Compression_Mode` \| `Predefined_Mode` \| `RLE_Mode` \| `FSE_Compressed_Mode` \| `Repeat_Mode` \|
				681
				682	- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
				683	[default distributions](#default-distributions).
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	684	No distribution table will be present.
Yann Collet	c1e6347	2018-06-21 18:08:11 -0700	[diff] [blame]	685	- `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value.
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	686	This symbol will be used for all sequences.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	687	- `FSE_Compressed_Mode` : standard FSE compression.
				688	A distribution table will be present.
Yann Collet	a935d67	2017-03-31 16:19:04 -0700	[diff] [blame]	689	The format of this distribution table is described in [FSE Table Description](#fse-table-description).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	690	Note that the maximum allowed accuracy log for literals length and match length tables is 9,
				691	and the maximum accuracy log for the offsets table is 8.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	692	`FSE_Compressed_Mode` must not be used when only one symbol is present,
				693	`RLE_Mode` should be used instead (although any other mode will work).
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	694	- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
				695	or if this is the first block, table in the dictionary will be used.
				696	Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
				697	It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`.
				698	No distribution table will be present.
				699	If this mode is used without any previous sequence table in the frame
				700	(nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	701
				702	#### The codes for literals lengths, match lengths, and offsets.
				703
				704	Each symbol is a _code_ in its own context,
				705	which specifies `Baseline` and `Number_of_Bits` to add.
				706	_Codes_ are FSE compressed,
				707	and interleaved with raw additional bits in the same bitstream.
				708
				709	##### Literals length codes
				710
				711	Literals length codes are values ranging from `0` to `35` included.
				712	They define lengths from 0 to 131071 bytes.
				713	The literals length is equal to the decoded `Baseline` plus
				714	the result of reading `Number_of_Bits` bits from the bitstream,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	715	as a __little-endian__ value.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	716
				717	\| `Literals_Length_Code` \| 0-15 \|
				718	\| ---------------------- \| ---------------------- \|
				719	\| length \| `Literals_Length_Code` \|
				720	\| `Number_of_Bits` \| 0 \|
				721
				722	\| `Literals_Length_Code` \| 16 \| 17 \| 18 \| 19 \| 20 \| 21 \| 22 \| 23 \|
				723	\| ---------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				724	\| `Baseline` \| 16 \| 18 \| 20 \| 22 \| 24 \| 28 \| 32 \| 40 \|
				725	\| `Number_of_Bits` \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				726
				727	\| `Literals_Length_Code` \| 24 \| 25 \| 26 \| 27 \| 28 \| 29 \| 30 \| 31 \|
				728	\| ---------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				729	\| `Baseline` \| 48 \| 64 \| 128 \| 256 \| 512 \| 1024 \| 2048 \| 4096 \|
				730	\| `Number_of_Bits` \| 4 \| 6 \| 7 \| 8 \| 9 \| 10 \| 11 \| 12 \|
				731
				732	\| `Literals_Length_Code` \| 32 \| 33 \| 34 \| 35 \|
				733	\| ---------------------- \| ---- \| ---- \| ---- \| ---- \|
				734	\| `Baseline` \| 8192 \|16384 \|32768 \|65536 \|
				735	\| `Number_of_Bits` \| 13 \| 14 \| 15 \| 16 \|
				736
				737
				738	##### Match length codes
				739
				740	Match length codes are values ranging from `0` to `52` included.
				741	They define lengths from 3 to 131074 bytes.
				742	The match length is equal to the decoded `Baseline` plus
				743	the result of reading `Number_of_Bits` bits from the bitstream,
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	744	as a __little-endian__ value.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	745
				746	\| `Match_Length_Code` \| 0-31 \|
				747	\| ------------------- \| ----------------------- \|
				748	\| value \| `Match_Length_Code` + 3 \|
				749	\| `Number_of_Bits` \| 0 \|
				750
				751	\| `Match_Length_Code` \| 32 \| 33 \| 34 \| 35 \| 36 \| 37 \| 38 \| 39 \|
				752	\| ------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				753	\| `Baseline` \| 35 \| 37 \| 39 \| 41 \| 43 \| 47 \| 51 \| 59 \|
				754	\| `Number_of_Bits` \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
				755
				756	\| `Match_Length_Code` \| 40 \| 41 \| 42 \| 43 \| 44 \| 45 \| 46 \| 47 \|
				757	\| ------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				758	\| `Baseline` \| 67 \| 83 \| 99 \| 131 \| 259 \| 515 \| 1027 \| 2051 \|
				759	\| `Number_of_Bits` \| 4 \| 4 \| 5 \| 7 \| 8 \| 9 \| 10 \| 11 \|
				760
				761	\| `Match_Length_Code` \| 48 \| 49 \| 50 \| 51 \| 52 \|
				762	\| ------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				763	\| `Baseline` \| 4099 \| 8195 \|16387 \|32771 \|65539 \|
				764	\| `Number_of_Bits` \| 12 \| 13 \| 14 \| 15 \| 16 \|
				765
				766	##### Offset codes
				767
				768	Offset codes are values ranging from `0` to `N`.
				769
				770	A decoder is free to limit its maximum `N` supported.
				771	Recommendation is to support at least up to `22`.
				772	For information, at the time of this writing.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	773	the reference decoder supports a maximum `N` value of `31`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	774
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	775	An offset code is also the number of additional bits to read in __little-endian__ fashion,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	776	and can be translated into an `Offset_Value` using the following formulas :
				777
				778	```
				779	Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
				780	if (Offset_Value > 3) offset = Offset_Value - 3;
				781	```
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	782	It means that maximum `Offset_Value` is `(2^(N+1))-1`
				783	supporting back-reference distances up to `(2^(N+1))-4`,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	784	but is limited by [maximum back-reference distance](#window_descriptor).
				785
				786	`Offset_Value` from 1 to 3 are special : they define "repeat codes".
Yann Collet	c1e6347	2018-06-21 18:08:11 -0700	[diff] [blame]	787	This is described in more detail in [Repeat Offsets](#repeat-offsets).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	788
				789	#### Decoding Sequences
				790	FSE bitstreams are read in reverse direction than written. In zstd,
				791	the compressor writes bits forward into a block and the decompressor
				792	must read the bitstream _backwards_.
				793
				794	To find the start of the bitstream it is therefore necessary to
				795	know the offset of the last byte of the block which can be found
				796	by counting `Block_Size` bytes after the block header.
				797
				798	After writing the last bit containing information, the compressor
				799	writes a single `1`-bit and then fills the byte with 0-7 `0` bits of
				800	padding. The last byte of the compressed bitstream cannot be `0` for
				801	that reason.
				802
				803	When decompressing, the last byte containing the padding is the first
				804	byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
				805	the first `1`-bit it occurs. Afterwards, the useful part of the bitstream
				806	begins.
				807
				808	FSE decoding requires a 'state' to be carried from symbol to symbol.
				809	For more explanation on FSE decoding, see the [FSE section](#fse).
				810
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	811	For sequence decoding, a separate state keeps track of each
				812	literal lengths, offsets, and match lengths symbols.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	813	Some FSE primitives are also used.
				814	For more details on the operation of these primitives, see the [FSE section](#fse).
				815
				816	##### Starting states
				817	The bitstream starts with initial FSE state values,
				818	each using the required number of bits in their respective _accuracy_,
				819	decoded previously from their normalized distribution.
				820
				821	It starts by `Literals_Length_State`,
				822	followed by `Offset_State`,
				823	and finally `Match_Length_State`.
				824
				825	Reminder : always keep in mind that all values are read _backward_,
				826	so the 'start' of the bitstream is at the highest position in memory,
				827	immediately before the last `1`-bit for padding.
				828
				829	After decoding the starting states, a single sequence is decoded
				830	`Number_Of_Sequences` times.
				831	These sequences are decoded in order from first to last.
				832	Since the compressor writes the bitstream in the forward direction,
				833	this means the compressor must encode the sequences starting with the last
				834	one and ending with the first.
				835
				836	##### Decoding a sequence
				837	For each of the symbol types, the FSE state can be used to determine the appropriate code.
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	838	The code then defines the `Baseline` and `Number_of_Bits` to read for each type.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	839	See the [description of the codes] for how to determine these values.
				840
				841	[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets
				842
				843	Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	844	It then does the same for `Match_Length`, and then for `Literals_Length`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	845	This sequence is then used for [sequence execution](#sequence-execution).
				846
				847	If it is not the last sequence in the block,
				848	the next operation is to update states.
				849	Using the rules pre-calculated in the decoding tables,
				850	`Literals_Length_State` is updated,
				851	followed by `Match_Length_State`,
				852	and then `Offset_State`.
				853	See the [FSE section](#fse) for details on how to update states from the bitstream.
				854
				855	This operation will be repeated `Number_of_Sequences` times.
				856	At the end, the bitstream shall be entirely consumed,
				857	otherwise the bitstream is considered corrupted.
				858
				859	#### Default Distributions
				860	If `Predefined_Mode` is selected for a symbol type,
				861	its FSE decoding table is generated from a predefined distribution table defined here.
				862	For details on how to convert this distribution into a decoding table, see the [FSE section].
				863
				864	[FSE section]: #from-normalized-distribution-to-decoding-tables
				865
Sean Purcell	3bee41a	2017-02-21 10:20:36 -0800	[diff] [blame]	866	##### Literals Length
				867	The decoding table uses an accuracy log of 6 bits (64 states).
				868	```
				869	short literalsLength_defaultDistribution[36] =
				870	{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
				871	2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
				872	-1,-1,-1,-1 };
				873	```
				874
				875	##### Match Length
				876	The decoding table uses an accuracy log of 6 bits (64 states).
				877	```
				878	short matchLengths_defaultDistribution[53] =
				879	{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				880	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
				881	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
				882	-1,-1,-1,-1,-1 };
				883	```
				884
				885	##### Offset Codes
				886	The decoding table uses an accuracy log of 5 bits (32 states),
				887	and supports a maximum `N` value of 28, allowing offset values up to 536,870,908 .
				888
				889	If any sequence in the compressed block requires a larger offset than this,
				890	it's not possible to use the default distribution to represent it.
				891	```
				892	short offsetCodes_defaultDistribution[29] =
				893	{ 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				894	1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
				895	```
				896
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	897
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	898	Sequence Execution
				899	------------------
				900	Once literals and sequences have been decoded,
				901	they are combined to produce the decoded content of a block.
				902
				903	Each sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`),
Sean Purcell	3bee41a	2017-02-21 10:20:36 -0800	[diff] [blame]	904	decoded as described in the [Sequences Section](#sequences-section).
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	905	To execute a sequence, first copy `literals_length` bytes
				906	from the decoded literals to the output.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	907
				908	Then `match_length` bytes are copied from previous decoded data.
				909	The offset to copy from is determined by `offset_value`:
				910	if `offset_value > 3`, then the offset is `offset_value - 3`.
				911	If `offset_value` is from 1-3, the offset is a special repeat offset value.
				912	See the [repeat offset](#repeat-offsets) section for how the offset is determined
				913	in this case.
				914
				915	The offset is defined as from the current position, so an offset of 6
				916	and a match length of 3 means that 3 bytes should be copied from 6 bytes back.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	917	Note that all offsets leading to previously decoded data
				918	must be smaller than `Window_Size` defined in `Frame_Header_Descriptor`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	919
				920	#### Repeat offsets
				921	As seen in [Sequence Execution](#sequence-execution),
				922	the first 3 values define a repeated offset and we will call them
				923	`Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`.
				924	They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one".
				925
				926	If `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc.
				927
				928	There is an exception though, when current sequence's `literals_length = 0`.
				929	In this case, repeated offsets are shifted by one,
				930	so an `offset_value` of 1 means `Repeated_Offset2`,
				931	an `offset_value` of 2 means `Repeated_Offset3`,
elasota	f06b18b	2023-11-19 15:33:37 -0500	[diff] [blame]	932	and an `offset_value` of 3 means `Repeated_Offset1 - 1`.
				933
				934	In the final case, if `Repeated_Offset1 - 1` evaluates to 0, then the
				935	data is considered corrupted.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	936
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	937	For the first block, the starting offset history is populated with following values :
				938	`Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	939	unless a dictionary is used, in which case they come from the dictionary.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	940
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	941	Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
				942	Note that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	943
				944	[Offset Codes]: #offset-codes
				945
				946	###### Offset updates rules
				947
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	948	During the execution of the sequences of a `Compressed_Block`, the
				949	`Repeated_Offsets`' values are kept up to date, so that they always represent
				950	the three most-recently used offsets. In order to achieve that, they are
				951	updated after executing each sequence in the following way:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	952
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	953	When the sequence's `offset_value` does not refer to one of the
				954	`Repeated_Offsets`--when it has value greater than 3, or when it has value 3
				955	and the sequence's `literals_length` is zero--the `Repeated_Offsets`' values
				956	are shifted back one, and `Repeated_Offset1` takes on the value of the
				957	just-used offset.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	958
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	959	Otherwise, when the sequence's `offset_value` refers to one of the
				960	`Repeated_Offsets`--when it has value 1 or 2, or when it has value 3 and the
				961	sequence's `literals_length` is non-zero--the `Repeated_Offsets` are re-ordered
				962	so that `Repeated_Offset1` takes on the value of the used Repeated_Offset, and
				963	the existing values are pushed back from the first `Repeated_Offset` through to
				964	the `Repeated_Offset` selected by the `offset_value`. This effectively performs
				965	a single-stepped wrapping rotation of the values of these offsets, so that
				966	their order again reflects the recency of their use.
Yann Collet	9bf0070	2018-10-26 15:51:51 -0700	[diff] [blame]	967
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	968	The following table shows the values of the `Repeated_Offsets` as a series of
				969	sequences are applied to them:
Yann Collet	9bf0070	2018-10-26 15:51:51 -0700	[diff] [blame]	970
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	971	\| `offset_value` \| `literals_length` \| `Repeated_Offset1` \| `Repeated_Offset2` \| `Repeated_Offset3` \| Comment \|
				972	\|:--------------:\|:-----------------:\|:------------------:\|:------------------:\|:------------------:\|:-----------------------:\|
				973	\| \| \| 1 \| 4 \| 8 \| starting values \|
				974	\| 1114 \| 11 \| 1111 \| 1 \| 4 \| non-repeat \|
Yann Collet	f33ccd2	2022-05-24 04:47:49 -0700	[diff] [blame]	975	\| 1 \| 22 \| 1111 \| 1 \| 4 \| repeat 1: no change \|
W. Felix Handte	2d46d76	2020-12-09 20:00:48 -0500	[diff] [blame]	976	\| 2225 \| 22 \| 2222 \| 1111 \| 1 \| non-repeat \|
				977	\| 1114 \| 111 \| 1111 \| 2222 \| 1111 \| non-repeat \|
				978	\| 3336 \| 33 \| 3333 \| 1111 \| 2222 \| non-repeat \|
Yann Collet	f33ccd2	2022-05-24 04:47:49 -0700	[diff] [blame]	979	\| 2 \| 22 \| 1111 \| 3333 \| 2222 \| repeat 2: swap 1 & 2 \|
				980	\| 3 \| 33 \| 2222 \| 1111 \| 3333 \| repeat 3: rotate 3 to 1 \|
				981	\| 3 \| 0 \| 2221 \| 2222 \| 1111 \| special case : insert `repeat1 - 1` \|
				982	\| 1 \| 0 \| 2222 \| 2221 \| 1111 \| == repeat 2 \|
Yann Collet	9bf0070	2018-10-26 15:51:51 -0700	[diff] [blame]	983
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	984
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	985	Skippable Frames
				986	----------------
				987
				988	\| `Magic_Number` \| `Frame_Size` \| `User_Data` \|
				989	\|:--------------:\|:------------:\|:-----------:\|
				990	\| 4 bytes \| 4 bytes \| n bytes \|
				991
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	992	Skippable frames allow the insertion of user-defined metadata
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	993	into a flow of concatenated frames.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	994
				995	Skippable frames defined in this specification are compatible with [LZ4] ones.
				996
Danielle Rozenblit	4dffc35	2022-12-14 06:58:35 -0800	[diff] [blame]	997	[LZ4]:https://lz4.github.io/lz4/
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	998
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	999	From a compliant decoder perspective, skippable frames need just be skipped,
				1000	and their content ignored, resuming decoding after the skippable frame.
				1001
				1002	It can be noted that a skippable frame
				1003	can be used to watermark a stream of concatenated frames
Dominique Pelle	b772f53	2022-03-12 08:52:40 +0100	[diff] [blame]	1004	embedding any kind of tracking information (even just a UUID).
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1005	Users wary of such possibility should scan the stream of concatenated frames
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	1006	in an attempt to detect such frame for analysis or removal.
				1007
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1008	__`Magic_Number`__
				1009
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1010	4 Bytes, __little-endian__ format.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1011	Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
				1012	All 16 values are valid to identify a skippable frame.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1013	This specification doesn't detail any specific tagging for skippable frames.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1014
				1015	__`Frame_Size`__
				1016
				1017	This is the size, in bytes, of the following `User_Data`
				1018	(without including the magic number nor the size field itself).
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1019	This field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1020	This means `User_Data` can’t be bigger than (2^32-1) bytes.
				1021
				1022	__`User_Data`__
				1023
				1024	The `User_Data` can be anything. Data will just be skipped by the decoder.
				1025
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1026
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1027
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1028	Entropy Encoding
				1029	----------------
				1030	Two types of entropy encoding are used by the Zstandard format:
				1031	FSE, and Huffman coding.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1032	Huffman is used to compress literals,
				1033	while FSE is used for all other symbols
				1034	(`Literals_Length_Code`, `Match_Length_Code`, offset codes)
				1035	and to compress Huffman headers.
				1036
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1037
				1038	FSE
				1039	---
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1040	FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1041	FSE encoding/decoding involves a state that is carried over between symbols.
				1042	Decoding must be done in the opposite direction as encoding.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1043	Therefore, all FSE bitstreams are read from end to beginning.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1044	Note that the order of the bits in the stream is not reversed,
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1045	we just read each multi-bits element in the reverse order they are encoded.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1046
				1047	For additional details on FSE, see [Finite State Entropy].
				1048
				1049	[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
				1050
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1051	FSE decoding is directed by a decoding table with a power of 2 size, each row containing three elements:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1052	`Symbol`, `Num_Bits`, and `Baseline`.
				1053	The `log2` of the table size is its `Accuracy_Log`.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1054	An FSE state value represents an index in this table.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1055
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1056	To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1057	The first symbol in the stream is the `Symbol` indicated in the table for that state.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1058	To obtain the next state value,
				1059	the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1060
				1061	[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
				1062
				1063	### FSE Table Description
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1064	To decode an FSE bitstream, it is necessary to build its FSE decoding table.
				1065	The decoding table is derived from a distribution of Probabilities.
				1066	The Zstandard format encodes distributions of Probabilities as follows:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1067
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1068	The distribution of probabilities is described in a bitstream which is read forward,
				1069	in __little-endian__ fashion.
				1070	The amount of bytes consumed from the bitstream to describe the distribution
				1071	is discovered at the end of the decoding process.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1072
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1073	The bitstream starts by reporting on which scale the distribution operates.
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1074	Let's `low4Bits` designate the lowest 4 bits of the first byte :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1075	`Accuracy_Log = low4bits + 5`.
				1076
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1077	An FSE distribution table describes the probabilities of all symbols
				1078	from `0` to the last present one (included) in natural order.
				1079	The sum of probabilities is normalized to reach a power of 2 total of `1 << Accuracy_Log` .
				1080	There must be two or more symbols with non-zero probabilities.
				1081
				1082	The number of bits used to decode each probability is variable.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1083	It depends on :
				1084
				1085	- Remaining probabilities + 1 :
				1086	__example__ :
				1087	Presuming an `Accuracy_Log` of 8,
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1088	and presuming 100 probability points have already been distributed,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1089	the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
elasota	324cce4	2023-10-31 11:42:00 -0400	[diff] [blame]	1090	Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
				1091	is the smallest integer `T` that satisfies `(1 << T) > N`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1092
				1093	- Value decoded : small values use 1 less bit :
				1094	__example__ :
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1095	Presuming values from 0 to 157 (inclusive) are possible,
				1096	255-157 = 98 values are remaining in an 8-bits field.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1097	They are used this way :
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1098	first 98 values (hence from 0 to 97) use only 7 bits,
				1099	values from 98 to 157 use 8 bits.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1100	This is achieved through this scheme :
				1101
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1102	\| 8-bit field read \| Value decoded \| Nb of bits consumed \|
				1103	\| ---------------- \| ------------- \| ------------------- \|
				1104	\| 0 - 97 \| 0 - 97 \| 7 \|
				1105	\| 98 - 127 \| 98 - 127 \| 8 \|
				1106	\| 128 - 225 \| 0 - 97 \| 7 \|
				1107	\| 226 - 255 \| 128 - 157 \| 8 \|
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1108
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1109	Probability is derived from Value decoded using the following formula:
				1110	`Probality = Value - 1`
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1111
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1112	Consequently, a Probability of `0` is described by a Value `1`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1113
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1114	A Value `0` is used to signal a special case, named "Probability `-1`".
				1115	It describes a probability which should have been "less than 1".
				1116	Its effect on the decoding table building process is described in the [next section].
				1117	For the purpose of counting total allocated probability points, it counts as one.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1118
				1119	[next section]:#from-normalized-distribution-to-decoding-tables
				1120
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1121	Symbols probabilities are read one by one, in order.
				1122	After each probability is decoded, the total nb of probability points is updated.
Dimitri Papadopoulos	fcf88ae	2024-11-26 11:15:39 +0100	[diff] [blame]	1123	This is used to determine how many bits must be read to decode the probability of next symbol.
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1124
				1125	When a symbol has a __probability__ of `zero` (decoded from reading a Value `1`),
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1126	it is followed by a 2-bits repeat flag.
				1127	This repeat flag tells how many probabilities of zeroes follow the current one.
				1128	It provides a number ranging from 0 to 3.
				1129	If it is a 3, another 2-bits repeat flag follows, and so on.
				1130
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1131	When the Probability for a symbol makes cumulated total reach `1 << Accuracy_Log`,
				1132	then it's the last symbol, and decoding is complete.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1133
				1134	Then the decoder can tell how many bytes were used in this process,
				1135	and how many symbols are present.
				1136	The bitstream consumes a round number of bytes.
				1137	Any remaining bit within the last byte is just unused.
				1138
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1139	If this process results in a non-zero probability for a symbol outside of the
				1140	valid range of symbols that the FSE table is defined for, even if that symbol is
				1141	not used, then the data is considered corrupted.
				1142	For the specific case of offset codes,
				1143	a decoder implementation may reject a frame containing a non-zero probability
				1144	for an offset code larger than the largest offset code supported by the decoder
				1145	implementation.
				1146
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1147	#### From normalized distribution to decoding tables
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1148
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1149	The normalized distribution of probabilities is enough
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1150	to create a unique decoding table.
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1151	It is generated using the following build rule :
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1152
				1153	The table has a size of `Table_Size = 1 << Accuracy_Log`.
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1154	Each row specifies the decoded symbol,
				1155	and instructions to reach the next state (`Number_of_Bits` and `Baseline`).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1156
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1157	Symbols are first scanned in their natural order for "less than 1" probabilities
				1158	(previously decoded from a Value of `0`).
				1159	Symbols with this special probability are being attributed a single row,
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1160	starting from the end of the table and retreating.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1161	These symbols define a full state reset, reading `Accuracy_Log` bits.
				1162
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1163	Then, all remaining symbols, sorted in natural order, are allocated rows.
				1164	Starting from smallest present symbol, and table position `0`,
				1165	each symbol gets allocated as many rows as its probability.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1166
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1167	Row allocation is not linear, it follows this order, in modular arithmetic:
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1168	```
				1169	position += (tableSize>>1) + (tableSize>>3) + 3;
				1170	position &= tableSize-1;
				1171	```
				1172
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1173	Using above ordering rule, each symbol gets allocated as many rows as its probability.
				1174	If a position is already occupied by a "less than 1" probability symbol,
				1175	it is simply skipped, and the next position is allocated instead.
				1176	Once enough rows have been allocated for the current symbol,
				1177	the allocation process continues, using the next symbol, in natural order.
				1178	This process guarantees that the table is entirely and exactly filled.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1179
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1180	Each row specifies a decoded symbol, and is accessed by current state value.
				1181	It also specifies `Number_of_Bits` and `Baseline`, which are required to determine next state value.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1182
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1183	To correctly set these fields, it's necessary to sort all occurrences of each symbol in state value order,
				1184	and then attribute N+1 bits to lower rows, and N bits to higher rows,
				1185	following the process described below (using an example):
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1186
				1187	__Example__ :
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1188	Presuming an `Accuracy_Log` of 7,
				1189	let's imagine a symbol with a Probability of 5:
				1190	it receives 5 rows, corresponding to 5 state values between `0` and `127`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1191
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1192	In this example, the first state value happens to be `1` (after unspecified previous symbols).
				1193	The next 4 states are then determined using above modular arithmetic rule,
				1194	which specifies to add `64+16+3 = 83` modulo `128` to jump to next position,
				1195	producing the following series: `1`, `84`, `39`, `122`, `77` (modular arithmetic).
				1196	(note: the next symbol will then start at `32`).
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1197
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1198	These state values are then sorted in natural order,
				1199	resulting in the following series: `1`, `39`, `77`, `84`, `122`.
				1200
				1201	The next power of 2 after 5 is 8.
				1202	Therefore, the probability space will be divided into 8 equal parts.
				1203	Since the probability space is `1<<7 = 128` large, each share is `128/8 = 16` large.
				1204
				1205	In order to reach 8 shares, the `8-5 = 3` lowest states will count "double",
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1206	doubling their shares (32 in width), hence requiring one more bit.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1207
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1208	Baseline is assigned starting from the lowest state using fewer bits,
				1209	continuing in natural state order, looping back at the beginning.
				1210	Each state takes its allocated range from Baseline, sized by its `Number_of_Bits`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1211
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1212	\| state order \| 0 \| 1 \| 2 \| 3 \| 4 \|
				1213	\| ---------------- \| ----- \| ----- \| ------ \| ---- \| ------ \|
elasota	52e41b9	2023-11-09 12:22:27 -0500	[diff] [blame]	1214	\| state value \| 1 \| 39 \| 77 \| 84 \| 122 \|
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1215	\| width \| 32 \| 32 \| 32 \| 16 \| 16 \|
				1216	\| `Number_of_Bits` \| 5 \| 5 \| 5 \| 4 \| 4 \|
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1217	\| allocation order \| 3 \| 4 \| 5 \| 1 \| 2 \|
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1218	\| `Baseline` \| 32 \| 64 \| 96 \| 0 \| 16 \|
				1219	\| range \| 32-63 \| 64-95 \| 96-127 \| 0-15 \| 16-31 \|
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1220
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1221	During decoding, the next state value is determined by using current state value as row number,
				1222	then reading the required `Number_of_Bits` from the bitstream, and adding the specified `Baseline`.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1223
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1224	Note:
				1225	as a trivial example, it follows that, for a symbol with a Probability of `1`,
				1226	`Baseline` is necessarily `0`, and `Number_of_Bits` is necessarily `Accuracy_Log`.
				1227
				1228	See [Appendix A] to see the outcome of this process applied to the default distributions.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1229
				1230	[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
				1231
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1232
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1233	Huffman Coding
				1234	--------------
				1235	Zstandard Huffman-coded streams are read backwards,
				1236	similar to the FSE bitstreams.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1237	Therefore, to find the start of the bitstream, it is required to
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1238	know the offset of the last byte of the Huffman-coded stream.
				1239
				1240	After writing the last bit containing information, the compressor
				1241	writes a single `1`-bit and then fills the byte with 0-7 `0` bits of
				1242	padding. The last byte of the compressed bitstream cannot be `0` for
				1243	that reason.
				1244
				1245	When decompressing, the last byte containing the padding is the first
				1246	byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
				1247	the first `1`-bit it occurs. Afterwards, the useful part of the bitstream
				1248	begins.
				1249
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1250	The bitstream contains Huffman-coded symbols in __little-endian__ order,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1251	with the codes defined by the method below.
				1252
				1253	### Huffman Tree Description
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1254
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1255	Prefix coding represents symbols from an a priori known alphabet
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	1256	by bit sequences (codewords), one codeword for each symbol,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1257	in a manner such that different symbols may be represented
				1258	by bit sequences of different lengths,
				1259	but a parser can always parse an encoded string
				1260	unambiguously symbol-by-symbol.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1261
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1262	Given an alphabet with known symbol frequencies,
				1263	the Huffman algorithm allows the construction of an optimal prefix code
				1264	using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1265
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1266	Prefix code must not exceed a maximum code length.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1267	More bits improve accuracy but cost more header size,
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	1268	and require more memory or more complex decoding operations.
				1269	This specification limits maximum code length to 11 bits.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1270
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1271	#### Representation
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1272
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1273	All literal symbols from zero (included) to last present one (excluded)
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1274	are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
				1275	Transformation from `Weight` to `Number_of_Bits` follows this formula :
				1276	```
				1277	Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
				1278	```
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1279	When a literal symbol is not present, it receives a `Weight` of 0.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1280	The least frequent symbol receives a `Weight` of 1.
elasota	05059e5	2023-11-08 23:46:37 -0500	[diff] [blame]	1281	If no literal has a `Weight` of 1, then the data is considered corrupted.
				1282	If there are not at least two literals with non-zero `Weight`, then the data
				1283	is considered corrupted.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1284	The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
				1285	The last symbol's `Weight` is deduced from previously retrieved Weights,
				1286	by completing to the nearest power of 2. It's necessarily non 0.
				1287	If it's not possible to reach a clean power of 2 with a single `Weight` value,
				1288	the Huffman Tree Description is considered invalid.
				1289	This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
Yann Collet	55a8f84	2018-09-05 12:25:35 -0700	[diff] [blame]	1290	`Max_Number_of_Bits` must be <= 11,
				1291	otherwise the representation is considered corrupted.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1292
				1293	__Example__ :
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	1294	Let's presume the following Huffman tree must be described :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1295
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1296	\| literal symbol \| A \| B \| C \| D \| E \| F \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1297	\| ---------------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				1298	\| `Number_of_Bits` \| 1 \| 2 \| 3 \| 0 \| 4 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1299
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1300	The tree depth is 4, since its longest elements uses 4 bits
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1301	(longest elements are the ones with smallest frequency).
				1302
				1303	All symbols will now receive a `Weight` instead of `Number_of_Bits`.
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1304	Weight formula is :
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1305	```
				1306	Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
				1307	```
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1308	It gives the following series of Weights :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1309
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1310	\| literal symbol \| A \| B \| C \| D \| E \| F \|
				1311	\| -------------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				1312	\| `Weight` \| 4 \| 3 \| 2 \| 0 \| 1 \| 1 \|
				1313
				1314	This list will be sent to the decoder, with the following modifications:
				1315
				1316	- `F` will not be listed, because it can be determined from previous symbols
				1317	- nor will symbols above `F` as they are all 0
				1318	- on the other hand, all symbols before `A`, starting with `\0`, will be listed, with a Weight of 0.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1319
				1320	The decoder will do the inverse operation :
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1321	having collected weights of literal symbols from `A` to `E`,
				1322	it knows the last literal, `F`, is present with a non-zero `Weight`.
				1323	The `Weight` of `F` can be determined by advancing to the next power of 2.
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1324	The sum of `2^(Weight-1)` (excluding 0's) is :
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1325	`8 + 4 + 2 + 0 + 1 = 15`.
Yann Collet	55a8f84	2018-09-05 12:25:35 -0700	[diff] [blame]	1326	Nearest larger power of 2 value is 16.
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1327	Therefore, `Max_Number_of_Bits = log2(16) = 4` and `Weight[F] = log_2(16 - 15) + 1 = 1`.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1328
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1329	#### Huffman Tree header
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1330
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1331	This is a single byte value (0-255),
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1332	which describes how the series of weights is encoded.
				1333
				1334	- if `headerByte` < 128 :
				1335	the series of weights is compressed using FSE (see below).
				1336	The length of the FSE-compressed series is equal to `headerByte` (0-127).
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1337
Yann Collet	55a8f84	2018-09-05 12:25:35 -0700	[diff] [blame]	1338	- if `headerByte` >= 128 :
				1339	+ the series of weights uses a direct representation,
				1340	where each `Weight` is encoded directly as a 4 bits field (0-15).
				1341	+ They are encoded forward, 2 weights to a byte,
				1342	first weight taking the top four bits and second one taking the bottom four.
				1343	* e.g. the following operations could be used to read the weights:
				1344	`Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.
				1345	+ The full representation occupies `Ceiling(Number_of_Weights/2)` bytes,
				1346	meaning it uses only full bytes even if `Number_of_Weights` is odd.
				1347	+ `Number_of_Weights = headerByte - 127`.
				1348	* Note that maximum `Number_of_Weights` is 255-127 = 128,
				1349	therefore, only up to 128 `Weight` can be encoded using direct representation.
				1350	* Since the last non-zero `Weight` is _not_ encoded,
				1351	this scheme is compatible with alphabet sizes of up to 129 symbols,
				1352	hence including literal symbol 128.
				1353	* If any literal symbol > 128 has a non-zero `Weight`,
				1354	direct representation is not possible.
				1355	In such case, it's necessary to use FSE compression.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1356
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1357
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1358	#### Finite State Entropy (FSE) compression of Huffman weights
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1359
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1360	In this case, the series of Huffman weights is compressed using FSE compression.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1361	It's a single bitstream with 2 interleaved states,
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	1362	sharing a single distribution table.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1363
				1364	To decode an FSE bitstream, it is necessary to know its compressed size.
				1365	Compressed size is provided by `headerByte`.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	1366	It's also necessary to know its _maximum possible_ decompressed size,
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1367	which is `255`, since literal symbols span from `0` to `255`,
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	1368	and last symbol's `Weight` is not represented.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1369
				1370	An FSE bitstream starts by a header, describing probabilities distribution.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1371	It will create a Decoding Table.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1372	For a list of Huffman weights, the maximum accuracy log is 6 bits.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1373	For more description see the [FSE header description](#fse-table-description)
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1374
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1375	The Huffman header compression uses 2 states,
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	1376	which share the same FSE distribution table.
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1377	The first state (`State1`) encodes the even indexed symbols,
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	1378	and the second (`State2`) encodes the odd indexed symbols.
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1379	`State1` is initialized first, and then `State2`, and they take turns
				1380	decoding a single symbol and updating their state.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1381	For more details on these FSE operations, see the [FSE section](#fse).
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	1382
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1383	The number of symbols to decode is determined
				1384	by tracking bitStream overflow condition:
				1385	If updating state after decoding a symbol would require more bits than
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1386	remain in the stream, it is assumed that extra bits are 0. Then,
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	1387	symbols for each of the final states are decoded and the process is complete.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1388
elasota	e61e3ff	2023-11-08 20:06:58 -0500	[diff] [blame]	1389	If this process would produce more weights than the maximum number of decoded
				1390	weights (255), then the data is considered corrupted.
				1391
elasota	0938308	2024-06-20 15:19:58 -0400	[diff] [blame]	1392	If either of the 2 initial states are absent or truncated, then the data is
				1393	considered corrupted. Consequently, it is not possible to encode fewer than
				1394	2 weights using this mode.
				1395
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1396	#### Conversion from weights to Huffman prefix codes
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	1397
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1398	All present symbols shall now have a `Weight` value.
Yann Collet	c1e6347	2018-06-21 18:08:11 -0700	[diff] [blame]	1399	It is possible to transform weights into `Number_of_Bits`, using this formula:
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1400	```
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1401	Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1402	```
Yann Collet	3e7c66a	2024-10-09 01:06:24 -0700	[diff] [blame]	1403	In order to determine which prefix code is assigned to each Symbol,
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1404	Symbols are first sorted by `Weight`, then by natural sequential order.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1405	Symbols with a `Weight` of zero are removed.
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1406	Then, starting from lowest `Weight` (hence highest `Number_of_Bits`),
				1407	prefix codes are assigned in ascending order.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1408
				1409	__Example__ :
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1410	Let's assume the following list of weights has been decoded:
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1411
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1412	\| Literal \| A \| B \| C \| D \| E \| F \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1413	\| -------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				1414	\| `Weight` \| 4 \| 3 \| 2 \| 0 \| 1 \| 1 \|
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1415
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	1416	Sorted by weight and then natural sequential order,
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1417	it gives the following prefix codes distribution:
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1418
Yann Collet	3e7c66a	2024-10-09 01:06:24 -0700	[diff] [blame]	1419	\| Literal \| D \| E \| F \| C \| B \| A \|
				1420	\| ---------------- \| --- \| ---- \| ---- \| ---- \| ---- \| ---- \|
				1421	\| `Weight` \| 0 \| 1 \| 1 \| 2 \| 3 \| 4 \|
				1422	\| `Number_of_Bits` \| 0 \| 4 \| 4 \| 3 \| 2 \| 1 \|
				1423	\| prefix code \| N/A \| 0000 \| 0001 \| 001 \| 01 \| 1 \|
				1424	\| ascending order \| N/A \| 0000 \| 0001 \| 001x \| 01xx \| 1xxx \|
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1425
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1426	### Huffman-coded Streams
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	1427
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1428	Given a Huffman decoding table,
				1429	it's possible to decode a Huffman-coded stream.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1430
				1431	Each bitstream must be read _backward_,
				1432	that is starting from the end down to the beginning.
				1433	Therefore it's necessary to know the size of each bitstream.
				1434
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	1435	It's also necessary to know exactly which _bit_ is the last one.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1436	This is detected by a final bit flag :
				1437	the highest bit of latest byte is a final-bit-flag.
				1438	Consequently, a last byte of `0` is not possible.
				1439	And the final-bit-flag itself is not part of the useful bitstream.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	1440	Hence, the last byte contains between 0 and 7 useful bits.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1441
Yann Collet	8b12812	2017-08-19 12:17:57 -0700	[diff] [blame]	1442	Starting from the end,
				1443	it's possible to read the bitstream in a __little-endian__ fashion,
				1444	keeping track of already used bits. Since the bitstream is encoded in reverse
				1445	order, starting from the end read symbols in forward order.
				1446
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1447	For example, if the literal sequence `ABEF` was encoded using above prefix code,
Yann Collet	8b12812	2017-08-19 12:17:57 -0700	[diff] [blame]	1448	it would be encoded (in reverse order) as:
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1449
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1450	\|Symbol \| F \| E \| B \| A \| Padding \|
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1451	\|--------\|------\|------\|----\|---\|---------\|
Yann Collet	8b12812	2017-08-19 12:17:57 -0700	[diff] [blame]	1452	\|Encoding\|`0000`\|`0001`\|`01`\|`1`\| `00001` \|
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1453
Yann Collet	8b12812	2017-08-19 12:17:57 -0700	[diff] [blame]	1454	Resulting in following 2-bytes bitstream :
				1455	```
				1456	00010000 00001101
				1457	```
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1458
Yann Collet	e8d35cc	2017-08-20 10:39:20 -0700	[diff] [blame]	1459	Here is an alternative representation with the symbol codes separated by underscore:
Yann Collet	d0d06e4	2017-08-19 12:26:09 -0700	[diff] [blame]	1460	```
				1461	0001_0000 00001_1_01
				1462	```
				1463
Yann Collet	8b12812	2017-08-19 12:17:57 -0700	[diff] [blame]	1464	Reading highest `Max_Number_of_Bits` bits,
				1465	it's possible to compare extracted value to decoding table,
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1466	determining the symbol to decode and number of bits to discard.
				1467
				1468	The process continues up to reading the required number of symbols per stream.
				1469	If a bitstream is not entirely and exactly consumed,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	1470	hence reaching exactly its beginning position with _all_ bits consumed,
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	1471	the decoding process is considered faulty.
				1472
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1473
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1474	Dictionary Format
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1475	-----------------
				1476
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1477	Zstandard is compatible with "raw content" dictionaries,
				1478	free of any format restriction, except that they must be at least 8 bytes.
				1479	These dictionaries function as if they were just the `Content` part
				1480	of a formatted dictionary.
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1481
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1482	But dictionaries created by `zstd --train` follow a format, described here.
				1483
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1484	__Pre-requisites__ : a dictionary has a size,
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1485	defined either by a buffer limit, or a file size.
				1486
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1487	\| `Magic_Number` \| `Dictionary_ID` \| `Entropy_Tables` \| `Content` \|
				1488	\| -------------- \| --------------- \| ---------------- \| --------- \|
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1489
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1490	__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1491
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1492	__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1493	`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
Yann Collet	722e14b	2016-07-08 19:22:16 +0200	[diff] [blame]	1494	It's used by decoders to check if they use the correct dictionary.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1495
				1496	_Reserved ranges :_
Yann Collet	11a392c	2020-05-26 13:15:35 -0700	[diff] [blame]	1497	If the dictionary is going to be distributed in a public environment,
				1498	the following ranges of `Dictionary_ID` are reserved for some future registrar
				1499	and shall not be used :
Yann Collet	6cacd34	2016-07-15 17:58:13 +0200	[diff] [blame]	1500
Yann Collet	11a392c	2020-05-26 13:15:35 -0700	[diff] [blame]	1501	- low range : <= 32767
				1502	- high range : >= (2^31)
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1503
Yann Collet	11a392c	2020-05-26 13:15:35 -0700	[diff] [blame]	1504	Outside of these ranges, any value of `Dictionary_ID`
				1505	which is both `>= 32768` and `< (1<<31)` can be used freely,
				1506	even in public environment.
Yann Collet	bb3c9bf	2020-05-25 08:15:09 -0700	[diff] [blame]	1507
				1508
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1509	__`Entropy_Tables`__ : follow the same format as tables in [compressed blocks].
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1510	See the relevant [FSE](#fse-table-description)
				1511	and [Huffman](#huffman-tree-description) sections for how to decode these tables.
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1512	They are stored in following order :
				1513	Huffman tables for literals, FSE table for offsets,
				1514	FSE table for match lengths, and FSE table for literals lengths.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1515	These tables populate the Repeat Stats literals mode and
				1516	Repeat distribution mode for sequence decoding.
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1517	It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1518	stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes.
senhuang42	8adeb9f	2020-09-22 13:24:27 -0400	[diff] [blame]	1519	Each recent offset must have a value <= dictionary content size, and cannot equal 0.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1520
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1521	__`Content`__ : The rest of the dictionary is its content.
Sean Purcell	ab226d4	2017-01-25 16:41:52 -0800	[diff] [blame]	1522	The content act as a "past" in front of data to compress or decompress,
				1523	so it can be referenced in sequence commands.
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1524	As long as the amount of data decoded from this frame is less than or
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1525	equal to `Window_Size`, sequence commands may specify offsets longer
				1526	than the total length of decoded output so far to reference back to the
Yann Collet	bb3c9bf	2020-05-25 08:15:09 -0700	[diff] [blame]	1527	dictionary, even parts of the dictionary with offsets larger than `Window_Size`.
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1528	After the total output has surpassed `Window_Size` however,
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1529	this is no longer allowed and the dictionary is no longer accessible.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1530
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	1531	[compressed blocks]: #the-format-of-compressed_block
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1532
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1533	If a dictionary is provided by an external source,
				1534	it should be loaded with great care, its content considered untrusted.
				1535
				1536
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1537
Johannes Rudolph	6fb4d67	2016-09-14 19:01:04 +0200	[diff] [blame]	1538	Appendix A - Decoding tables for predefined codes
				1539	-------------------------------------------------
				1540
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1541	This appendix contains FSE decoding tables
				1542	for the predefined literal length, match length, and offset codes.
				1543	The tables have been constructed using the algorithm as given above in chapter
				1544	"from normalized distribution to decoding tables".
				1545	The tables here can be used as examples
				1546	to crosscheck that an implementation build its decoding tables correctly.
Johannes Rudolph	6fb4d67	2016-09-14 19:01:04 +0200	[diff] [blame]	1547
				1548	#### Literal Length Code:
				1549
				1550	\| State \| Symbol \| Number_Of_Bits \| Base \|
				1551	\| ----- \| ------ \| -------------- \| ---- \|
				1552	\| 0 \| 0 \| 4 \| 0 \|
				1553	\| 1 \| 0 \| 4 \| 16 \|
				1554	\| 2 \| 1 \| 5 \| 32 \|
				1555	\| 3 \| 3 \| 5 \| 0 \|
				1556	\| 4 \| 4 \| 5 \| 0 \|
				1557	\| 5 \| 6 \| 5 \| 0 \|
				1558	\| 6 \| 7 \| 5 \| 0 \|
				1559	\| 7 \| 9 \| 5 \| 0 \|
				1560	\| 8 \| 10 \| 5 \| 0 \|
				1561	\| 9 \| 12 \| 5 \| 0 \|
				1562	\| 10 \| 14 \| 6 \| 0 \|
				1563	\| 11 \| 16 \| 5 \| 0 \|
				1564	\| 12 \| 18 \| 5 \| 0 \|
				1565	\| 13 \| 19 \| 5 \| 0 \|
				1566	\| 14 \| 21 \| 5 \| 0 \|
				1567	\| 15 \| 22 \| 5 \| 0 \|
				1568	\| 16 \| 24 \| 5 \| 0 \|
				1569	\| 17 \| 25 \| 5 \| 32 \|
				1570	\| 18 \| 26 \| 5 \| 0 \|
				1571	\| 19 \| 27 \| 6 \| 0 \|
				1572	\| 20 \| 29 \| 6 \| 0 \|
				1573	\| 21 \| 31 \| 6 \| 0 \|
				1574	\| 22 \| 0 \| 4 \| 32 \|
				1575	\| 23 \| 1 \| 4 \| 0 \|
				1576	\| 24 \| 2 \| 5 \| 0 \|
				1577	\| 25 \| 4 \| 5 \| 32 \|
				1578	\| 26 \| 5 \| 5 \| 0 \|
				1579	\| 27 \| 7 \| 5 \| 32 \|
				1580	\| 28 \| 8 \| 5 \| 0 \|
				1581	\| 29 \| 10 \| 5 \| 32 \|
				1582	\| 30 \| 11 \| 5 \| 0 \|
				1583	\| 31 \| 13 \| 6 \| 0 \|
				1584	\| 32 \| 16 \| 5 \| 32 \|
				1585	\| 33 \| 17 \| 5 \| 0 \|
				1586	\| 34 \| 19 \| 5 \| 32 \|
				1587	\| 35 \| 20 \| 5 \| 0 \|
				1588	\| 36 \| 22 \| 5 \| 32 \|
				1589	\| 37 \| 23 \| 5 \| 0 \|
				1590	\| 38 \| 25 \| 4 \| 0 \|
				1591	\| 39 \| 25 \| 4 \| 16 \|
				1592	\| 40 \| 26 \| 5 \| 32 \|
				1593	\| 41 \| 28 \| 6 \| 0 \|
				1594	\| 42 \| 30 \| 6 \| 0 \|
				1595	\| 43 \| 0 \| 4 \| 48 \|
				1596	\| 44 \| 1 \| 4 \| 16 \|
				1597	\| 45 \| 2 \| 5 \| 32 \|
				1598	\| 46 \| 3 \| 5 \| 32 \|
				1599	\| 47 \| 5 \| 5 \| 32 \|
				1600	\| 48 \| 6 \| 5 \| 32 \|
				1601	\| 49 \| 8 \| 5 \| 32 \|
				1602	\| 50 \| 9 \| 5 \| 32 \|
				1603	\| 51 \| 11 \| 5 \| 32 \|
				1604	\| 52 \| 12 \| 5 \| 32 \|
				1605	\| 53 \| 15 \| 6 \| 0 \|
				1606	\| 54 \| 17 \| 5 \| 32 \|
				1607	\| 55 \| 18 \| 5 \| 32 \|
				1608	\| 56 \| 20 \| 5 \| 32 \|
				1609	\| 57 \| 21 \| 5 \| 32 \|
				1610	\| 58 \| 23 \| 5 \| 32 \|
				1611	\| 59 \| 24 \| 5 \| 32 \|
				1612	\| 60 \| 35 \| 6 \| 0 \|
				1613	\| 61 \| 34 \| 6 \| 0 \|
				1614	\| 62 \| 33 \| 6 \| 0 \|
				1615	\| 63 \| 32 \| 6 \| 0 \|
				1616
				1617	#### Match Length Code:
				1618
				1619	\| State \| Symbol \| Number_Of_Bits \| Base \|
				1620	\| ----- \| ------ \| -------------- \| ---- \|
				1621	\| 0 \| 0 \| 6 \| 0 \|
				1622	\| 1 \| 1 \| 4 \| 0 \|
				1623	\| 2 \| 2 \| 5 \| 32 \|
				1624	\| 3 \| 3 \| 5 \| 0 \|
				1625	\| 4 \| 5 \| 5 \| 0 \|
				1626	\| 5 \| 6 \| 5 \| 0 \|
				1627	\| 6 \| 8 \| 5 \| 0 \|
				1628	\| 7 \| 10 \| 6 \| 0 \|
				1629	\| 8 \| 13 \| 6 \| 0 \|
				1630	\| 9 \| 16 \| 6 \| 0 \|
				1631	\| 10 \| 19 \| 6 \| 0 \|
				1632	\| 11 \| 22 \| 6 \| 0 \|
				1633	\| 12 \| 25 \| 6 \| 0 \|
				1634	\| 13 \| 28 \| 6 \| 0 \|
				1635	\| 14 \| 31 \| 6 \| 0 \|
				1636	\| 15 \| 33 \| 6 \| 0 \|
				1637	\| 16 \| 35 \| 6 \| 0 \|
				1638	\| 17 \| 37 \| 6 \| 0 \|
				1639	\| 18 \| 39 \| 6 \| 0 \|
				1640	\| 19 \| 41 \| 6 \| 0 \|
				1641	\| 20 \| 43 \| 6 \| 0 \|
				1642	\| 21 \| 45 \| 6 \| 0 \|
				1643	\| 22 \| 1 \| 4 \| 16 \|
				1644	\| 23 \| 2 \| 4 \| 0 \|
				1645	\| 24 \| 3 \| 5 \| 32 \|
				1646	\| 25 \| 4 \| 5 \| 0 \|
				1647	\| 26 \| 6 \| 5 \| 32 \|
				1648	\| 27 \| 7 \| 5 \| 0 \|
				1649	\| 28 \| 9 \| 6 \| 0 \|
				1650	\| 29 \| 12 \| 6 \| 0 \|
				1651	\| 30 \| 15 \| 6 \| 0 \|
				1652	\| 31 \| 18 \| 6 \| 0 \|
				1653	\| 32 \| 21 \| 6 \| 0 \|
				1654	\| 33 \| 24 \| 6 \| 0 \|
				1655	\| 34 \| 27 \| 6 \| 0 \|
				1656	\| 35 \| 30 \| 6 \| 0 \|
				1657	\| 36 \| 32 \| 6 \| 0 \|
				1658	\| 37 \| 34 \| 6 \| 0 \|
				1659	\| 38 \| 36 \| 6 \| 0 \|
				1660	\| 39 \| 38 \| 6 \| 0 \|
				1661	\| 40 \| 40 \| 6 \| 0 \|
				1662	\| 41 \| 42 \| 6 \| 0 \|
				1663	\| 42 \| 44 \| 6 \| 0 \|
				1664	\| 43 \| 1 \| 4 \| 32 \|
				1665	\| 44 \| 1 \| 4 \| 48 \|
				1666	\| 45 \| 2 \| 4 \| 16 \|
				1667	\| 46 \| 4 \| 5 \| 32 \|
				1668	\| 47 \| 5 \| 5 \| 32 \|
				1669	\| 48 \| 7 \| 5 \| 32 \|
				1670	\| 49 \| 8 \| 5 \| 32 \|
				1671	\| 50 \| 11 \| 6 \| 0 \|
				1672	\| 51 \| 14 \| 6 \| 0 \|
				1673	\| 52 \| 17 \| 6 \| 0 \|
				1674	\| 53 \| 20 \| 6 \| 0 \|
				1675	\| 54 \| 23 \| 6 \| 0 \|
				1676	\| 55 \| 26 \| 6 \| 0 \|
				1677	\| 56 \| 29 \| 6 \| 0 \|
				1678	\| 57 \| 52 \| 6 \| 0 \|
				1679	\| 58 \| 51 \| 6 \| 0 \|
				1680	\| 59 \| 50 \| 6 \| 0 \|
				1681	\| 60 \| 49 \| 6 \| 0 \|
				1682	\| 61 \| 48 \| 6 \| 0 \|
				1683	\| 62 \| 47 \| 6 \| 0 \|
				1684	\| 63 \| 46 \| 6 \| 0 \|
				1685
				1686	#### Offset Code:
				1687
				1688	\| State \| Symbol \| Number_Of_Bits \| Base \|
				1689	\| ----- \| ------ \| -------------- \| ---- \|
				1690	\| 0 \| 0 \| 5 \| 0 \|
				1691	\| 1 \| 6 \| 4 \| 0 \|
				1692	\| 2 \| 9 \| 5 \| 0 \|
				1693	\| 3 \| 15 \| 5 \| 0 \|
				1694	\| 4 \| 21 \| 5 \| 0 \|
				1695	\| 5 \| 3 \| 5 \| 0 \|
				1696	\| 6 \| 7 \| 4 \| 0 \|
				1697	\| 7 \| 12 \| 5 \| 0 \|
				1698	\| 8 \| 18 \| 5 \| 0 \|
				1699	\| 9 \| 23 \| 5 \| 0 \|
				1700	\| 10 \| 5 \| 5 \| 0 \|
				1701	\| 11 \| 8 \| 4 \| 0 \|
				1702	\| 12 \| 14 \| 5 \| 0 \|
				1703	\| 13 \| 20 \| 5 \| 0 \|
				1704	\| 14 \| 2 \| 5 \| 0 \|
				1705	\| 15 \| 7 \| 4 \| 16 \|
				1706	\| 16 \| 11 \| 5 \| 0 \|
				1707	\| 17 \| 17 \| 5 \| 0 \|
				1708	\| 18 \| 22 \| 5 \| 0 \|
				1709	\| 19 \| 4 \| 5 \| 0 \|
				1710	\| 20 \| 8 \| 4 \| 16 \|
				1711	\| 21 \| 13 \| 5 \| 0 \|
				1712	\| 22 \| 19 \| 5 \| 0 \|
				1713	\| 23 \| 1 \| 5 \| 0 \|
				1714	\| 24 \| 6 \| 4 \| 16 \|
				1715	\| 25 \| 10 \| 5 \| 0 \|
				1716	\| 26 \| 16 \| 5 \| 0 \|
				1717	\| 27 \| 28 \| 5 \| 0 \|
				1718	\| 28 \| 27 \| 5 \| 0 \|
				1719	\| 29 \| 26 \| 5 \| 0 \|
				1720	\| 30 \| 25 \| 5 \| 0 \|
				1721	\| 31 \| 24 \| 5 \| 0 \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1722
Yann Collet	7639db9	2018-06-21 17:48:34 -0700	[diff] [blame]	1723
				1724
				1725	Appendix B - Resources for implementers
				1726	-------------------------------------------------
				1727
				1728	An open source reference implementation is available on :
				1729	https://github.com/facebook/zstd
				1730
				1731	The project contains a frame generator, called [decodeCorpus],
				1732	which can be used by any 3rd-party implementation
				1733	to verify that a tested decoder is compliant with the specification.
				1734
				1735	[decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing
				1736
				1737	`decodeCorpus` generates random valid frames.
				1738	A compliant decoder should be able to decode them all,
				1739	or at least provide a meaningful error code explaining for which reason it cannot
				1740	(memory limit restrictions for example).
				1741
				1742
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1743	Version changes
				1744	---------------
Yann Collet	3b343dc	2024-10-07 17:15:07 -0700	[diff] [blame]	1745	- 0.4.3 : clarifications for Huffman prefix code assignment example
Yann Collet	a8b86d0	2024-10-02 22:57:36 -0700	[diff] [blame]	1746	- 0.4.2 : refactor FSE table construction process, inspired by Donald Pian
				1747	- 0.4.1 : clarifications on a few error scenarios, by Eric Lasota
Yann Collet	3732a08	2023-06-05 16:03:00 -0700	[diff] [blame]	1748	- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
Yann Collet	64e8511	2023-03-08 15:30:27 -0800	[diff] [blame]	1749	- 0.3.9 : clarifications for Huffman-compressed literal sizes.
Yann Collet	832f559	2023-02-18 18:16:00 -0800	[diff] [blame]	1750	- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
Yann Collet	0b0b62d	2021-05-15 23:04:46 -0700	[diff] [blame]	1751	- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
Yann Collet	bb3c9bf	2020-05-25 08:15:09 -0700	[diff] [blame]	1752	- 0.3.6 : clarifications for Dictionary_ID
Yann Collet	098b36e	2019-11-13 09:50:15 -0800	[diff] [blame]	1753	- 0.3.5 : clarifications for Block_Maximum_Size
Yann Collet	ff7bd16	2019-10-18 17:48:12 -0700	[diff] [blame]	1754	- 0.3.4 : clarifications for FSE decoding table
Yann Collet	1e07eb4	2019-08-16 15:13:42 +0200	[diff] [blame]	1755	- 0.3.3 : clarifications for field Block_Size
W. Felix Handte	a2861d7	2019-07-17 17:55:15 -0400	[diff] [blame]	1756	- 0.3.2 : remove additional block size restriction on compressed blocks
Yann Collet	9bf0070	2018-10-26 15:51:51 -0700	[diff] [blame]	1757	- 0.3.1 : minor clarification regarding offset history update rules
Yann Collet	72a3adf	2018-09-25 16:34:26 -0700	[diff] [blame]	1758	- 0.3.0 : minor edits to match RFC8478
Yann Collet	55a8f84	2018-09-05 12:25:35 -0700	[diff] [blame]	1759	- 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz
Yann Collet	a4c9c4d	2018-05-31 10:47:44 -0700	[diff] [blame]	1760	- 0.2.8 : clarifications for IETF RFC discuss
Yann Collet	82ad249	2018-04-30 11:35:49 -0700	[diff] [blame]	1761	- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
Yann Collet	8b12812	2017-08-19 12:17:57 -0700	[diff] [blame]	1762	- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
Yann Collet	14433ca	2017-03-31 10:54:45 -0700	[diff] [blame]	1763	- 0.2.5 : minor typos and clarifications
Sean Purcell	042419e	2017-02-17 16:24:26 -0800	[diff] [blame]	1764	- 0.2.4 : section restructuring, by Sean Purcell
Yann Collet	20bed42	2017-01-27 12:16:16 -0800	[diff] [blame]	1765	- 0.2.3 : clarified several details, by Sean Purcell
Yann Collet	55981a9	2016-09-15 02:13:18 +0200	[diff] [blame]	1766	- 0.2.2 : added predefined codes, by Johannes Rudolph
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1767	- 0.2.1 : clarify field names, by Przemyslaw Skibinski
Yann Collet	8b12812	2017-08-19 12:17:57 -0700	[diff] [blame]	1768	- 0.2.0 : numerous format adjustments for zstd v0.8+
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	1769	- 0.1.2 : limit Huffman tree depth to 11 bits
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	1770	- 0.1.1 : reserved dictID ranges
				1771	- 0.1.0 : initial release