Blame - zstd_compression_format.md - external_zstd

blob: d7f21939ae50824bb823b1f4a11c5b9974d378f2 [file] [log] [blame] [view]

Yann Collet	5cc1882	2016-07-03 19:03:13 +0200	[diff] [blame]	1	Zstandard Compression Format
				2	============================
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	3
				4	### Notices
				5
				6	Copyright (c) 2016 Yann Collet
				7
				8	Permission is granted to copy and distribute this document
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	9	for any purpose and without charge,
				10	including translations into other languages
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	11	and incorporation into compilations,
				12	provided that the copyright notice and this notice are preserved,
				13	and that any substantive changes or deletions from the original
				14	are clearly marked.
				15	Distribution of this document is unlimited.
				16
				17	### Version
				18
Yann Collet	6fa05a2	2016-07-20 14:58:49 +0200	[diff] [blame]	19	0.2.0 (22/07/16)
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	20
				21
				22	Introduction
				23	------------
				24
				25	The purpose of this document is to define a lossless compressed data format,
				26	that is independent of CPU type, operating system,
				27	file system and character set, suitable for
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	28	file compression, pipe and streaming compression,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	29	using the [Zstandard algorithm](http://www.zstandard.org).
				30
				31	The data can be produced or consumed,
				32	even for an arbitrarily long sequentially presented input data stream,
				33	using only an a priori bounded amount of intermediate storage,
				34	and hence can be used in data communications.
				35	The format uses the Zstandard compression method,
				36	and optional [xxHash-64 checksum method](http://www.xxhash.org),
				37	for detection of data corruption.
				38
				39	The data format defined by this specification
				40	does not attempt to allow random access to compressed data.
				41
				42	This specification is intended for use by implementers of software
				43	to compress data into Zstandard format and/or decompress data from Zstandard format.
				44	The text of the specification assumes a basic background in programming
				45	at the level of bits and other primitive data representations.
				46
				47	Unless otherwise indicated below,
				48	a compliant compressor must produce data sets
				49	that conform to the specifications presented here.
				50	It doesn’t need to support all options though.
				51
				52	A compliant decompressor must be able to decompress
				53	at least one working set of parameters
				54	that conforms to the specifications presented here.
				55	It may also ignore informative fields, such as checksum.
				56	Whenever it does not support a parameter defined in the compressed stream,
				57	it must produce a non-ambiguous error code and associated error message
				58	explaining which parameter is unsupported.
				59
				60
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	61	Overall conventions
				62	-----------
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	63	In this document:
				64	- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
				65	- a naming convention for identifiers is `Mixed_Case_With_Underscores`
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	66
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	67	Definitions
				68	-----------
				69	A content compressed by Zstandard is transformed into a Zstandard __frame__.
				70	Multiple frames can be appended into a single file or stream.
				71	A frame is totally independent, has a defined beginning and end,
				72	and a set of parameters which tells the decoder how to decompress it.
				73
				74	A frame encapsulates one or multiple __blocks__.
				75	Each block can be compressed or not,
				76	and has a guaranteed maximum content size, which depends on frame parameters.
				77	Unlike frames, each block depends on previous blocks for proper decoding.
				78	However, each block can be decompressed without waiting for its successor,
				79	allowing streaming operations.
				80
				81
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	82	Frame Concatenation
				83	-------------------
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	84
				85	In some circumstances, it may be required to append multiple frames,
				86	for example in order to add new data to an existing compressed file
				87	without re-framing it.
				88
				89	In such case, each frame brings its own set of descriptor flags.
				90	Each frame is considered independent.
				91	The only relation between frames is their sequential order.
				92
				93	The ability to decode multiple concatenated frames
				94	within a single stream or file is left outside of this specification.
				95	As an example, the reference `zstd` command line utility is able
				96	to decode all concatenated frames in their sequential order,
				97	delivering the final decompressed result as if it was a single content.
				98
				99
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	100	Skippable Frames
				101	----------------
				102
				103	\| `Magic_Number` \| `Frame_Size` \| `User_Data` \|
				104	\|:--------------:\|:------------:\|:-----------:\|
				105	\| 4 bytes \| 4 bytes \| n bytes \|
				106
				107	Skippable frames allow the insertion of user-defined data
				108	into a flow of concatenated frames.
				109	Its design is pretty straightforward,
				110	with the sole objective to allow the decoder to quickly skip
				111	over user-defined data and continue decoding.
				112
				113	Skippable frames defined in this specification are compatible with [LZ4] ones.
				114
				115	[LZ4]:http://www.lz4.org
				116
				117	__`Magic_Number`__
				118
				119	4 Bytes, little-endian format.
				120	Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
				121	All 16 values are valid to identify a skippable frame.
				122
				123	__`Frame_Size`__
				124
				125	This is the size, in bytes, of the following `User_Data`
				126	(without including the magic number nor the size field itself).
				127	This field is represented using 4 Bytes, little-endian format, unsigned 32-bits.
				128	This means `User_Data` can’t be bigger than (2^32-1) bytes.
				129
				130	__`User_Data`__
				131
				132	The `User_Data` can be anything. Data will just be skipped by the decoder.
				133
				134
				135
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	136	General Structure of Zstandard Frame format
				137	-------------------------------------------
				138	The structure of a single Zstandard frame is following:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	139
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	140	\| `Magic_Number` \| `Frame_Header` \|`Data_Block`\| [More data blocks] \| [`Content_Checksum`] \|
				141	\|:--------------:\|:--------------:\|:----------:\| ------------------ \|:--------------------:\|
				142	\| 4 bytes \| 2-14 bytes \| n bytes \| \| 0-4 bytes \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	143
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	144	__`Magic_Number`__
				145
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	146	4 Bytes, little-endian format.
Yann Collet	7bdfcea	2016-09-05 17:43:31 +0200	[diff] [blame^]	147	Value : 0xFD2FB528
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	148
				149	__`Frame_Header`__
				150
				151	2 to 14 Bytes, detailed in [next part](#the-structure-of-frame_header).
				152
				153	__`Data_Block`__
				154
				155	Detailed in [next chapter](#the-structure-of-data_block).
				156	That’s where compressed data is stored.
				157
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	158	__`Content_Checksum`__
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	159
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	160	An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	161	The content checksum is the result
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	162	of [xxh64() hash function](http://www.xxhash.org)
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	163	digesting the original (decoded) data as input, and a seed of zero.
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	164	The low 4 bytes of the checksum are stored in little endian format.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	165
				166
				167	The structure of `Frame_Header`
				168	-------------------------------
				169	The `Frame_Header` has a variable size, which uses a minimum of 2 bytes,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	170	and up to 14 bytes depending on optional parameters.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	171	The structure of `Frame_Header` is following:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	172
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	173	\| `Frame_Header_Descriptor` \| [`Window_Descriptor`] \| [`Dictionary_ID`] \| [`Frame_Content_Size`] \|
				174	\| ------------------------- \| --------------------- \| ----------------- \| ---------------------- \|
				175	\| 1 byte \| 0-1 byte \| 0-4 bytes \| 0-8 bytes \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	176
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	177	### `Frame_Header_Descriptor`
				178
				179	The first header's byte is called the `Frame_Header_Descriptor`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	180	It tells which other fields are present.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	181	Decoding this byte is enough to tell the size of `Frame_Header`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	182
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	183	\| Bit number \| Field name \|
				184	\| ---------- \| ---------- \|
				185	\| 7-6 \| `Frame_Content_Size_flag` \|
				186	\| 5 \| `Single_Segment_flag` \|
				187	\| 4 \| `Unused_bit` \|
				188	\| 3 \| `Reserved_bit` \|
				189	\| 2 \| `Content_Checksum_flag` \|
				190	\| 1-0 \| `Dictionary_ID_flag` \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	191
				192	In this table, bit 7 is highest bit, while bit 0 is lowest.
				193
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	194	__`Frame_Content_Size_flag`__
				195
				196	This is a 2-bits flag (`= Frame_Header_Descriptor >> 6`),
				197	specifying if decompressed data size is provided within the header.
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	198	The `Flag_Value` can be converted into `Field_Size`,
				199	which is the number of bytes used by `Frame_Content_Size`
				200	according to the following table:
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	201
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	202	\|`Flag_Value`\| 0 \| 1 \| 2 \| 3 \|
				203	\| ---------- \| ------ \| --- \| --- \| --- \|
				204	\|`Field_Size`\| 0 or 1 \| 2 \| 4 \| 8 \|
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	205
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	206	When `Flag_Value` is `0`, `Field_Size` depends on `Single_Segment_flag` :
				207	if `Single_Segment_flag` is set, `Field_Size` is 1.
				208	Otherwise, `Field_Size` is 0 (content size not provided).
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	209
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	210	__`Single_Segment_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	211
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	212	If this flag is set,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	213	data must be regenerated within a single continuous memory segment.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	214
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	215	In this case, `Frame_Content_Size` is necessarily present,
				216	but `Window_Descriptor` byte is skipped.
inikep	49ec6d1	2016-07-25 12:26:39 +0200	[diff] [blame]	217	As a consequence, the decoder must allocate a memory segment
				218	of size equal or bigger than `Frame_Content_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	219
				220	In order to preserve the decoder from unreasonable memory requirement,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	221	a decoder can reject a compressed frame
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	222	which requests a memory size beyond decoder's authorized range.
				223
				224	For broader compatibility, decoders are recommended to support
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	225	memory sizes of at least 8 MB.
				226	This is just a recommendation,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	227	each decoder is free to support higher or lower limits,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	228	depending on local limitations.
				229
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	230	__`Unused_bit`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	231
Yann Collet	f0bc673	2016-07-13 17:30:21 +0200	[diff] [blame]	232	The value of this bit should be set to zero.
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	233	A decoder compliant with this specification version shall not interpret it.
Yann Collet	f0bc673	2016-07-13 17:30:21 +0200	[diff] [blame]	234	It might be used in a future version,
				235	to signal a property which is not mandatory to properly decode the frame.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	236
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	237	__`Reserved_bit`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	238
				239	This bit is reserved for some future feature.
				240	Its value _must be zero_.
				241	A decoder compliant with this specification version must ensure it is not set.
				242	This bit may be used in a future revision,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	243	to signal a feature that must be interpreted to decode the frame correctly.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	244
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	245	__`Content_Checksum_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	246
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	247	If this flag is set, a 32-bits `Content_Checksum` will be present at frame's end.
				248	See `Content_Checksum` paragraph.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	249
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	250	__`Dictionary_ID_flag`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	251
				252	This is a 2-bits flag (`= FHD & 3`),
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	253	telling if a dictionary ID is provided within the header.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	254	It also specifies the size of this field as `Field_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	255
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	256	\|`Flag_Value`\| 0 \| 1 \| 2 \| 3 \|
				257	\| ---------- \| --- \| --- \| --- \| --- \|
				258	\|`Field_Size`\| 0 \| 1 \| 2 \| 4 \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	259
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	260	### `Window_Descriptor`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	261
				262	Provides guarantees on maximum back-reference distance
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	263	that will be used within compressed data.
				264	This information is important for decoders to allocate enough memory.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	265
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	266	The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag` is set.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	267	In this case, the maximum back-reference distance is the content size itself,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	268	which can be any value from 1 to 2^64-1 bytes (16 EB).
				269
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	270	\| Bit numbers \| 7-3 \| 0-2 \|
				271	\| ----------- \| ---------- \| ---------- \|
				272	\| Field name \| `Exponent` \| `Mantissa` \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	273
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	274	Maximum distance is given by the following formulas :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	275	```
				276	windowLog = 10 + Exponent;
				277	windowBase = 1 << windowLog;
				278	windowAdd = (windowBase / 8) * Mantissa;
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	279	Window_Size = windowBase + windowAdd;
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	280	```
				281	The minimum window size is 1 KB.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	282	The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	283
				284	To properly decode compressed data,
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	285	a decoder will need to allocate a buffer of at least `Window_Size` bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	286
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	287	In order to preserve decoder from unreasonable memory requirements,
				288	a decoder can refuse a compressed frame
				289	which requests a memory size beyond decoder's authorized range.
				290
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	291	For improved interoperability,
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	292	decoders are recommended to be compatible with window sizes of 8 MB,
				293	and encoders are recommended to not request more than 8 MB.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	294	It's merely a recommendation though,
				295	decoders are free to support larger or lower limits,
				296	depending on local limitations.
				297
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	298	### `Dictionary_ID`
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	299
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	300	This is a variable size field, which contains
				301	the ID of the dictionary required to properly decode the frame.
				302	Note that this field is optional. When it's not present,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	303	it's up to the caller to make sure it uses the correct dictionary.
				304
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	305	Field size depends on `Dictionary_ID_flag`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	306	1 byte can represent an ID 0-255.
				307	2 bytes can represent an ID 0-65535.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	308	4 bytes can represent an ID 0-4294967295.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	309
				310	It's allowed to represent a small ID (for example `13`)
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	311	with a large 4-bytes dictionary ID, losing some compacity in the process.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	312
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	313	_Reserved ranges :_
				314	If the frame is going to be distributed in a private environment,
				315	any dictionary ID can be used.
				316	However, for public distribution of compressed frames using a dictionary,
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	317	the following ranges are reserved for future use and should not be used :
				318	- low range : 1 - 32767
				319	- high range : >= (2^31)
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	320
				321
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	322	### `Frame_Content_Size`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	323
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	324	This is the original (uncompressed) size. This information is optional.
				325	The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
				326	The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	327	Format is little-endian.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	328
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	329	\| `Field_Size` \| Range \|
				330	\| ------------ \| ---------- \|
				331	\| 1 \| 0 - 255 \|
				332	\| 2 \| 256 - 65791\|
				333	\| 4 \| 0 - 2^32-1 \|
				334	\| 8 \| 0 - 2^64-1 \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	335
inikep	2fc3752	2016-07-25 12:47:02 +0200	[diff] [blame]	336	When `Field_Size` is 1, 4 or 8 bytes, the value is read directly.
				337	When `Field_Size` is 2, _the offset of 256 is added_.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	338	It's allowed to represent a small size (for example `18`) using any compatible variant.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	339
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	340
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	341	The structure of `Data_Block`
				342	-----------------------------
				343	The structure of `Data_Block` is following:
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	344
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	345	\| `Last_Block` \| `Block_Type` \| `Block_Size` \| `Block_Content` \|
				346	\|:------------:\|:------------:\|:------------:\|:---------------:\|
				347	\| 1 bit \| 2 bits \| 21 bits \| n bytes \|
				348
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	349	The block header (`Last_Block`, `Block_Type`, and `Block_Size`) uses 3-bytes.
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	350
				351	__`Last_Block`__
				352
				353	The lowest bit signals if this block is the last one.
				354	Frame ends right after this block.
				355	It may be followed by an optional `Content_Checksum` .
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	356
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	357	__`Block_Type` and `Block_Size`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	358
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	359	The next 2 bits represent the `Block_Type`,
				360	while the remaining 21 bits represent the `Block_Size`.
				361	Format is __little-endian__.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	362
				363	There are 4 block types :
				364
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	365	\| Value \| 0 \| 1 \| 2 \| 3 \|
				366	\| ------------ \| ----------- \| ----------- \| ------------------ \| --------- \|
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	367	\| `Block_Type` \| `Raw_Block` \| `RLE_Block` \| `Compressed_Block` \| `Reserved`\|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	368
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	369	- `Raw_Block` - this is an uncompressed block.
				370	`Block_Size` is the number of bytes to read and copy.
				371	- `RLE_Block` - this is a single byte, repeated N times.
				372	In which case, `Block_Size` is the size to regenerate,
				373	while the "compressed" block is just 1 byte (the byte to repeat).
				374	- `Compressed_Block` - this is a [Zstandard compressed block](#the-format-of-compressed_block),
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	375	detailed in another section of this specification.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	376	`Block_Size` is the compressed size.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	377	Decompressed size is unknown,
				378	but its maximum possible value is guaranteed (see below)
Yann Collet	c991cc1	2016-07-28 00:55:43 +0200	[diff] [blame]	379	- `Reserved` - this is not a block.
				380	This value cannot be used with current version of this specification.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	381
				382	Block sizes must respect a few rules :
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	383	- In compressed mode, compressed size if always strictly `< decompressed size`.
				384	- Block decompressed size is always <= maximum back-reference distance .
				385	- Block decompressed size is always <= 128 KB
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	386
				387
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	388	__`Block_Content`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	389
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	390	The `Block_Content` is where the actual data to decode stands.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	391	It might be compressed or not, depending on previous field indications.
				392	A data block is not necessarily "full" :
				393	since an arbitrary “flush” may happen anytime,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	394	block decompressed content can be any size,
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	395	up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	396	- Maximum back-reference distance
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	397	- 128 KB
				398
				399
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	400
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	401	The format of `Compressed_Block`
				402	--------------------------------
				403	The size of `Compressed_Block` must be provided using `Block_Size` field from `Data_Block`.
				404	The `Compressed_Block` has a guaranteed maximum regenerated size,
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	405	in order to properly allocate destination buffer.
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	406	See [`Data_Block`](#the-structure-of-data_block) for more details.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	407
				408	A compressed block consists of 2 sections :
inikep	0079425	2016-08-04 14:43:21 +0200	[diff] [blame]	409	- [`Literals_Section`](#literals_section)
				410	- [`Sequences_Section`](#sequences_section)
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	411
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	412	### Prerequisites
				413	To decode a compressed block, the following elements are necessary :
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	414	- Previous decoded blocks, up to a distance of `Window_Size`,
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	415	or all previous blocks when `Single_Segment_flag` is set.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	416	- List of "recent offsets" from previous compressed block.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	417	- Decoding tables of previous compressed block for each symbol type
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	418	(literals, literals lengths, match lengths, offsets).
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	419
				420
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	421	### `Literals_Section`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	422
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	423	During sequence phase, literals will be entangled with match copy operations.
				424	All literals are regrouped in the first part of the block.
				425	They can be decoded first, and then copied during sequence operations,
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	426	or they can be decoded on the flow, as needed by sequence commands.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	427
inikep	4f270ac	2016-08-04 11:25:52 +0200	[diff] [blame]	428	\| `Literals_Section_Header` \| [`Huffman_Tree_Description`] \| Stream1 \| [Stream2] \| [Stream3] \| [Stream4] \|
				429	\| ------------------------- \| ---------------------------- \| ------- \| --------- \| --------- \| --------- \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	430
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	431	Literals can be stored uncompressed or compressed using Huffman prefix codes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	432	When compressed, an optional tree description can be present,
				433	followed by 1 or 4 streams.
				434
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	435
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	436	#### `Literals_Section_Header`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	437
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	438	Header is in charge of describing how literals are packed.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	439	It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	440	using little-endian convention.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	441
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	442	\| `Literals_Block_Type` \| `Size_Format` \| `Regenerated_Size` \| [`Compressed_Size`] \|
				443	\| --------------------- \| ------------- \| ------------------ \| ----------------- \|
				444	\| 2 bits \| 1 - 2 bits \| 5 - 20 bits \| 0 - 18 bits \|
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	445
				446	In this representation, bits on the left are smallest bits.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	447
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	448	__`Literals_Block_Type`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	449
Yann Collet	198e6aa	2016-07-20 20:12:24 +0200	[diff] [blame]	450	This field uses 2 lowest bits of first byte, describing 4 different block types :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	451
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	452	\| `Literals_Block_Type` \| Value \|
				453	\| ----------------------------- \| ----- \|
				454	\| `Raw_Literals_Block` \| 0 \|
				455	\| `RLE_Literals_Block` \| 1 \|
				456	\| `Compressed_Literals_Block` \| 2 \|
				457	\| `Repeat_Stats_Literals_Block` \| 3 \|
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	458
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	459	- `Raw_Literals_Block` - Literals are stored uncompressed.
				460	- `RLE_Literals_Block` - Literals consist of a single byte value repeated N times.
				461	- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	462	starting with a Huffman tree description.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	463	See details below.
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	464	- `Repeat_Stats_Literals_Block` - This is a Huffman-compressed block,
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	465	using Huffman tree _from previous Huffman-compressed literals block_.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	466	Huffman tree description will be skipped.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	467
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	468	__`Size_Format`__
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	469
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	470	`Size_Format` is divided into 2 families :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	471
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	472	- For `Compressed_Block`, it requires to decode both `Compressed_Size`
				473	and `Regenerated_Size` (the decompressed size). It will also decode the number of streams.
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	474	- For `Raw_Literals_Block` and `RLE_Literals_Block` it's enough to decode `Regenerated_Size`.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	475
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	476	For values spanning several bytes, convention is little-endian.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	477
inikep	9d003c1	2016-08-04 10:41:49 +0200	[diff] [blame]	478	__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	479
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	480	- Value x0 : `Regenerated_Size` uses 5 bits (0-31).
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	481	`Literals_Section_Header` has 1 byte.
				482	`Regenerated_Size = Header[0]>>3`
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	483	- Value 01 : `Regenerated_Size` uses 12 bits (0-4095).
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	484	`Literals_Section_Header` has 2 bytes.
				485	`Regenerated_Size = (Header[0]>>4) + (Header[1]<<4)`
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	486	- Value 11 : `Regenerated_Size` uses 20 bits (0-1048575).
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	487	`Literals_Section_Header` has 3 bytes.
				488	`Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + (Header[2]<<12)`
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	489
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	490	Note : it's allowed to represent a short value (for example `13`)
				491	using a long format, accepting the increased compressed data size.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	492
inikep	9d003c1	2016-08-04 10:41:49 +0200	[diff] [blame]	493	__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	494
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	495	- Value 00 : _A single stream_.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	496	Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
				497	`Literals_Section_Header` has 3 bytes.
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	498	- Value 01 : 4 streams.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	499	Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
				500	`Literals_Section_Header` has 3 bytes.
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	501	- Value 10 : 4 streams.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	502	Both `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
				503	`Literals_Section_Header` has 4 bytes.
inikep	0132375	2016-08-25 12:20:38 +0200	[diff] [blame]	504	- Value 11 : 4 streams.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	505	Both `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
				506	`Literals_Section_Header` has 5 bytes.
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	507
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	508	Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
inikep	f896c1d	2016-08-03 16:37:42 +0200	[diff] [blame]	509
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	510
inikep	4f270ac	2016-08-04 11:25:52 +0200	[diff] [blame]	511	#### `Huffman_Tree_Description`
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	512
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	513	This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	514
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	515	Prefix coding represents symbols from an a priori known alphabet
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	516	by bit sequences (codewords), one codeword for each symbol,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	517	in a manner such that different symbols may be represented
				518	by bit sequences of different lengths,
				519	but a parser can always parse an encoded string
				520	unambiguously symbol-by-symbol.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	521
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	522	Given an alphabet with known symbol frequencies,
				523	the Huffman algorithm allows the construction of an optimal prefix code
				524	using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	525
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	526	Prefix code must not exceed a maximum code length.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	527	More bits improve accuracy but cost more header size,
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	528	and require more memory or more complex decoding operations.
				529	This specification limits maximum code length to 11 bits.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	530
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	531
				532	##### Representation
				533
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	534	All literal values from zero (included) to last present one (excluded)
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	535	are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
				536	Transformation from `Weight` to `Number_of_Bits` follows this formula :
				537	```
				538	Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
				539	```
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	540	The last symbol's `Weight` is deduced from previously decoded ones,
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	541	by completing to the nearest power of 2.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	542	This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	543
				544	__Example__ :
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	545	Let's presume the following Huffman tree must be described :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	546
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	547	\| literal \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
				548	\| ---------------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				549	\| `Number_of_Bits` \| 1 \| 2 \| 3 \| 0 \| 4 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	550
				551	The tree depth is 4, since its smallest element uses 4 bits.
				552	Value `5` will not be listed, nor will values above `5`.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	553	Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	554	Weight formula is :
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	555	```
				556	Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
				557	```
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	558	It gives the following serie of weights :
				559
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	560	\| `Weight` \| 4 \| 3 \| 2 \| 0 \| 1 \|
				561	\| -------- \| --- \| --- \| --- \| --- \| --- \|
				562	\| literal \| 0 \| 1 \| 2 \| 3 \| 4 \|
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	563
				564	The decoder will do the inverse operation :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	565	having collected weights of literals from `0` to `4`,
				566	it knows the last literal, `5`, is present with a non-zero weight.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	567	The weight of `5` can be deducted by joining to the nearest power of 2.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	568	Sum of `2^(Weight-1)` (excluding 0) is :
				569	`8 + 4 + 2 + 0 + 1 = 15`.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	570	Nearest power of 2 is 16.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	571	Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 1`.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	572
				573	##### Huffman Tree header
				574
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	575	This is a single byte value (0-255),
				576	which tells how to decode the list of weights.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	577
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	578	- if `headerByte` >= 128 : this is a direct representation,
				579	where each `Weight` is written directly as a 4 bits field (0-15).
				580	The full representation occupies `((Number_of_Symbols+1)/2)` bytes,
				581	meaning it uses a last full byte even if `Number_of_Symbols` is odd.
				582	`Number_of_Symbols = headerByte - 127`.
				583	Note that maximum `Number_of_Symbols` is 255-127 = 128.
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	584	A larger serie must necessarily use FSE compression.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	585
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	586	- if `headerByte` < 128 :
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	587	the serie of weights is compressed by FSE.
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	588	The length of the FSE-compressed serie is equal to `headerByte` (0-127).
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	589
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	590	##### Finite State Entropy (FSE) compression of Huffman weights
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	591
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	592	The serie of weights is compressed using FSE compression.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	593	It's a single bitstream with 2 interleaved states,
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	594	sharing a single distribution table.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	595
				596	To decode an FSE bitstream, it is necessary to know its compressed size.
				597	Compressed size is provided by `headerByte`.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	598	It's also necessary to know its _maximum possible_ decompressed size,
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	599	which is `255`, since literal values span from `0` to `255`,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	600	and last symbol value is not represented.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	601
				602	An FSE bitstream starts by a header, describing probabilities distribution.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	603	It will create a Decoding Table.
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	604	Table must be pre-allocated, which requires to support a maximum accuracy.
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	605	For a list of Huffman weights, maximum accuracy is 7 bits.
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	606
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	607	FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format),
				608	and so is [FSE bitstream](#bitstream).
				609	The main difference is that Huffman header compression uses 2 states,
				610	which share the same FSE distribution table.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	611	Bitstream contains only FSE symbols (no interleaved "raw bitfields").
Yann Collet	26f6814	2016-07-08 10:42:59 +0200	[diff] [blame]	612	The number of symbols to decode is discovered
				613	by tracking bitStream overflow condition.
				614	When both states have overflowed the bitstream, end is reached.
				615
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	616
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	617	##### Conversion from weights to Huffman prefix codes
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	618
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	619	All present symbols shall now have a `Weight` value.
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	620	It is possible to transform weights into Number_of_Bits, using this formula:
				621	```
				622	Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0
				623	```
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	624	Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order.
				625	Symbols with a `Weight` of zero are removed.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	626	Then, starting from lowest weight, prefix codes are distributed in order.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	627
				628	__Example__ :
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	629	Let's presume the following list of weights has been decoded :
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	630
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	631	\| Literal \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \|
				632	\| -------- \| --- \| --- \| --- \| --- \| --- \| --- \|
				633	\| `Weight` \| 4 \| 3 \| 2 \| 0 \| 1 \| 1 \|
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	634
				635	Sorted by weight and then natural order,
				636	it gives the following distribution :
				637
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	638	\| Literal \| 3 \| 4 \| 5 \| 2 \| 1 \| 0 \|
				639	\| ---------------- \| --- \| --- \| --- \| --- \| --- \| ---- \|
				640	\| `Weight` \| 0 \| 1 \| 1 \| 2 \| 3 \| 4 \|
				641	\| `Number_of_Bits` \| 0 \| 4 \| 4 \| 3 \| 2 \| 1 \|
				642	\| prefix codes \| N/A \| 0000\| 0001\| 001 \| 01 \| 1 \|
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	643
				644
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	645	#### The content of Huffman-compressed literal stream
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	646
				647	##### Bitstreams sizes
				648
				649	As seen in a previous paragraph,
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	650	there are 2 types of Huffman-compressed literals :
				651	a single stream and 4 streams.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	652
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	653	Encoding using 4 streams is useful for CPU with multiple execution units and out-of-order operations.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	654	Since each stream can be decoded independently,
				655	it's possible to decode them up to 4x faster than a single stream,
				656	presuming the CPU has enough parallelism available.
				657
				658	For single stream, header provides both the compressed and regenerated size.
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	659	For 4 streams though,
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	660	header only provides compressed and regenerated size of all 4 streams combined.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	661	In order to properly decode the 4 streams,
				662	it's necessary to know the compressed and regenerated size of each stream.
				663
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	664	Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
				665	except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	666
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	667	Compressed size is provided explicitly : in the 4-streams variant,
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	668	bitstreams are preceded by 3 unsigned little-endian 16-bits values.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	669	Each value represents the compressed size of one stream, in order.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	670	The last stream size is deducted from total compressed size
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	671	and from previously decoded stream sizes :
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	672
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	673	`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	674
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	675
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	676	##### Bitstreams read and decode
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	677
				678	Each bitstream must be read _backward_,
				679	that is starting from the end down to the beginning.
				680	Therefore it's necessary to know the size of each bitstream.
				681
				682	It's also necessary to know exactly which _bit_ is the latest.
				683	This is detected by a final bit flag :
				684	the highest bit of latest byte is a final-bit-flag.
				685	Consequently, a last byte of `0` is not possible.
				686	And the final-bit-flag itself is not part of the useful bitstream.
Yann Collet	38b75dd	2016-07-24 15:35:59 +0200	[diff] [blame]	687	Hence, the last byte contains between 0 and 7 useful bits.
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	688
				689	Starting from the end,
				690	it's possible to read the bitstream in a little-endian fashion,
				691	keeping track of already used bits.
				692
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	693	Reading the last `Max_Number_of_Bits` bits,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	694	it's then possible to compare extracted value to decoding table,
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	695	determining the symbol to decode and number of bits to discard.
				696
				697	The process continues up to reading the required number of symbols per stream.
				698	If a bitstream is not entirely and exactly consumed,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	699	hence reaching exactly its beginning position with _all_ bits consumed,
Yann Collet	d916c90	2016-07-04 00:42:58 +0200	[diff] [blame]	700	the decoding process is considered faulty.
				701
Yann Collet	698cb63	2016-07-03 18:49:35 +0200	[diff] [blame]	702
inikep	4f270ac	2016-08-04 11:25:52 +0200	[diff] [blame]	703	### `Sequences_Section`
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	704
				705	A compressed block is a succession of _sequences_ .
				706	A sequence is a literal copy command, followed by a match copy command.
				707	A literal copy command specifies a length.
				708	It is the number of bytes to be copied (or extracted) from the literal section.
				709	A match copy command specifies an offset and a length.
				710	The offset gives the position to copy from,
Yann Collet	b21e9cb	2016-07-15 17:31:13 +0200	[diff] [blame]	711	which can be within a previous block.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	712
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	713	When all _sequences_ are decoded,
				714	if there is any literal left in the _literal section_,
				715	these bytes are added at the end of the block.
				716
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	717	The `Sequences_Section` regroup all symbols required to decode commands.
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	718	There are 3 symbol types : literals lengths, offsets and match lengths.
				719	They are encoded together, interleaved, in a single _bitstream_.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	720
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	721	The `Sequences_Section` starts by a header,
				722	followed by optional probability tables for each symbol type,
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	723	followed by the bitstream.
				724
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	725	\| `Sequences_Section_Header` \| [`Literals_Length_Table`] \| [`Offset_Table`] \| [`Match_Length_Table`] \| bitStream \|
				726	\| -------------------------- \| ------------------------- \| ---------------- \| ---------------------- \| --------- \|
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	727
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	728	To decode the `Sequences_Section`, it's required to know its size.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	729	This size is deducted from `blockSize - literalSectionSize`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	730
				731
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	732	#### `Sequences_Section_Header`
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	733
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	734	Consists in 2 items :
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	735	- `Number_of_Sequences`
				736	- Symbol compression modes
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	737
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	738	__`Number_of_Sequences`__
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	739
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	740	This is a variable size field using between 1 and 3 bytes.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	741	Let's call its first byte `byte0`.
				742	- `if (byte0 == 0)` : there are no sequences.
				743	The sequence section stops there.
				744	Regenerated content is defined entirely by literals section.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	745	- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
				746	- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes.
				747	- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	748
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	749	__Symbol compression modes__
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	750
				751	This is a single byte, defining the compression mode of each symbol type.
				752
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	753	\|Bit number\| 7-6 \| 5-4 \| 3-2 \| 1-0 \|
				754	\| -------- \| ----------------------- \| -------------- \| -------------------- \| ---------- \|
				755	\|Field name\| `Literals_Lengths_Mode` \| `Offsets_Mode` \| `Match_Lengths_Mode` \| `Reserved` \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	756
				757	The last field, `Reserved`, must be all-zeroes.
				758
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	759	`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
				760	literals lengths, offsets, and match lengths respectively.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	761
				762	They follow the same enumeration :
				763
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	764	\| Value \| 0 \| 1 \| 2 \| 3 \|
				765	\| ------------------ \| ----------------- \| ---------- \| --------------------- \| ------------- \|
				766	\| `Compression_Mode` \| `Predefined_Mode` \| `RLE_Mode` \| `FSE_Compressed_Mode` \| `Repeat_Mode` \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	767
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	768	- `Predefined_Mode` : uses a predefined distribution table.
				769	- `RLE_Mode` : it's a single code, repeated `Number_of_Sequences` times.
				770	- `Repeat_Mode` : re-use distribution table from previous compressed block.
				771	- `FSE_Compressed_Mode` : standard FSE compression.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	772	A distribution table will be present.
				773	It will be described in [next part](#distribution-tables).
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	774
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	775	#### The codes for literals lengths, match lengths, and offsets.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	776
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	777	Each symbol is a _code_ in its own context,
				778	which specifies `Baseline` and `Number_of_Bits` to add.
				779	_Codes_ are FSE compressed,
				780	and interleaved with raw additional bits in the same bitstream.
				781
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	782	##### Literals length codes
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	783
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	784	Literals length codes are values ranging from `0` to `35` included.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	785	They define lengths from 0 to 131071 bytes.
				786
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	787	\| `Literals_Length_Code` \| 0-15 \|
				788	\| ---------------------- \| ---------------------- \|
				789	\| length \| `Literals_Length_Code` \|
				790	\| `Number_of_Bits` \| 0 \|
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	791
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	792	\| `Literals_Length_Code` \| 16 \| 17 \| 18 \| 19 \| 20 \| 21 \| 22 \| 23 \|
				793	\| ---------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	794	\| `Baseline` \| 16 \| 18 \| 20 \| 22 \| 24 \| 28 \| 32 \| 40 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	795	\| `Number_of_Bits` \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	796
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	797	\| `Literals_Length_Code` \| 24 \| 25 \| 26 \| 27 \| 28 \| 29 \| 30 \| 31 \|
				798	\| ---------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	799	\| `Baseline` \| 48 \| 64 \| 128 \| 256 \| 512 \| 1024 \| 2048 \| 4096 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	800	\| `Number_of_Bits` \| 4 \| 6 \| 7 \| 8 \| 9 \| 10 \| 11 \| 12 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	801
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	802	\| `Literals_Length_Code` \| 32 \| 33 \| 34 \| 35 \|
				803	\| ---------------------- \| ---- \| ---- \| ---- \| ---- \|
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	804	\| `Baseline` \| 8192 \|16384 \|32768 \|65536 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	805	\| `Number_of_Bits` \| 13 \| 14 \| 15 \| 16 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	806
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	807	##### Default distribution for literals length codes
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	808
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	809	When `Compression_Mode` is `Predefined_Mode`,
				810	a predefined distribution is used for FSE compression.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	811
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	812	Below is its definition. It uses an accuracy of 6 bits (64 states).
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	813	```
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	814	short literalsLength_defaultDistribution[36] =
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	815	{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
				816	2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
				817	-1,-1,-1,-1 };
				818	```
				819
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	820	##### Match length codes
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	821
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	822	Match length codes are values ranging from `0` to `52` included.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	823	They define lengths from 3 to 131074 bytes.
				824
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	825	\| `Match_Length_Code` \| 0-31 \|
				826	\| ------------------- \| ----------------------- \|
				827	\| value \| `Match_Length_Code` + 3 \|
				828	\| `Number_of_Bits` \| 0 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	829
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	830	\| `Match_Length_Code` \| 32 \| 33 \| 34 \| 35 \| 36 \| 37 \| 38 \| 39 \|
				831	\| ------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	832	\| `Baseline` \| 35 \| 37 \| 39 \| 41 \| 43 \| 47 \| 51 \| 59 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	833	\| `Number_of_Bits` \| 1 \| 1 \| 1 \| 1 \| 2 \| 2 \| 3 \| 3 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	834
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	835	\| `Match_Length_Code` \| 40 \| 41 \| 42 \| 43 \| 44 \| 45 \| 46 \| 47 \|
				836	\| ------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \|
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	837	\| `Baseline` \| 67 \| 83 \| 99 \| 131 \| 258 \| 514 \| 1026 \| 2050 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	838	\| `Number_of_Bits` \| 4 \| 4 \| 5 \| 7 \| 8 \| 9 \| 10 \| 11 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	839
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	840	\| `Match_Length_Code` \| 48 \| 49 \| 50 \| 51 \| 52 \|
				841	\| ------------------- \| ---- \| ---- \| ---- \| ---- \| ---- \|
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	842	\| `Baseline` \| 4098 \| 8194 \|16486 \|32770 \|65538 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	843	\| `Number_of_Bits` \| 12 \| 13 \| 14 \| 15 \| 16 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	844
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	845	##### Default distribution for match length codes
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	846
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	847	When `Compression_Mode` is defined as `Predefined_Mode`,
				848	a predefined distribution is used for FSE compression.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	849
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	850	Below is its definition. It uses an accuracy of 6 bits (64 states).
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	851	```
				852	short matchLengths_defaultDistribution[53] =
				853	{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				854	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
				855	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
				856	-1,-1,-1,-1,-1 };
				857	```
				858
				859	##### Offset codes
				860
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	861	Offset codes are values ranging from `0` to `N`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	862
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	863	A decoder is free to limit its maximum `N` supported.
				864	Recommendation is to support at least up to `22`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	865	For information, at the time of this writing.
				866	the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
				867
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	868	An offset code is also the number of additional bits to read,
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	869	and can be translated into an `Offset_Value` using the following formulas :
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	870
				871	```
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	872	Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
				873	if (Offset_Value > 3) offset = Offset_Value - 3;
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	874	```
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	875	It means that maximum `Offset_Value` is `2^(N+1))-1` and it supports back-reference distance up to `2^(N+1))-4`
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	876	but is limited by [maximum back-reference distance](#window_descriptor).
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	877
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	878	`Offset_Value` from 1 to 3 are special : they define "repeat codes",
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	879	which means one of the previous offsets will be repeated.
				880	They are sorted in recency order, with 1 meaning the most recent one.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	881	See [Repeat offsets](#repeat-offsets) paragraph.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	882
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	883
				884	##### Default distribution for offset codes
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	885
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	886	When `Compression_Mode` is defined as `Predefined_Mode`,
				887	a predefined distribution is used for FSE compression.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	888
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	889	Below is its definition. It uses an accuracy of 5 bits (32 states),
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	890	and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	891
				892	If any sequence in the compressed block requires an offset larger than this,
				893	it's not possible to use the default distribution to represent it.
				894
				895	```
				896	short offsetCodes_defaultDistribution[53] =
				897	{ 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
				898	1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
				899	```
				900
				901	#### Distribution tables
				902
				903	Following the header, up to 3 distribution tables can be described.
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	904	When present, they are in this order :
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	905	- Literals lengths
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	906	- Offsets
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	907	- Match Lengths
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	908
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	909	The content to decode depends on their respective encoding mode :
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	910	- `Predefined_Mode` : no content. Use predefined distribution table.
				911	- `RLE_Mode` : 1 byte. This is the only code to use across the whole compressed block.
				912	- `FSE_Compressed_Mode` : A distribution table is present.
				913	- `Repeat_Mode` : no content. Re-use distribution from previous compressed block.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	914
				915	##### FSE distribution table : condensed format
				916
				917	An FSE distribution table describes the probabilities of all symbols
				918	from `0` to the last present one (included)
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	919	on a normalized scale of `1 << Accuracy_Log` .
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	920
				921	It's a bitstream which is read forward, in little-endian fashion.
				922	It's not necessary to know its exact size,
				923	since it will be discovered and reported by the decoding process.
				924
				925	The bitstream starts by reporting on which scale it operates.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	926	`Accuracy_Log = low4bits + 5`.
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	927	Note that maximum `Accuracy_Log` for literal and match lengths is `9`,
				928	and for offsets is `8`. Higher values are considered errors.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	929
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	930	Then follows each symbol value, from `0` to last present one.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	931	The number of bits used by each field is variable.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	932	It depends on :
				933
				934	- Remaining probabilities + 1 :
				935	__example__ :
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	936	Presuming an `Accuracy_Log` of 8,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	937	and presuming 100 probabilities points have already been distributed,
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	938	the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	939	Therefore, it must read `log2sup(156) == 8` bits.
				940
				941	- Value decoded : small values use 1 less bit :
				942	__example__ :
				943	Presuming values from 0 to 156 (included) are possible,
				944	255-156 = 99 values are remaining in an 8-bits field.
				945	They are used this way :
				946	first 99 values (hence from 0 to 98) use only 7 bits,
				947	values from 99 to 156 use 8 bits.
				948	This is achieved through this scheme :
				949
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	950	\| Value read \| Value decoded \| Number of bits used \|
				951	\| ---------- \| ------------- \| ------------------- \|
				952	\| 0 - 98 \| 0 - 98 \| 7 \|
				953	\| 99 - 127 \| 99 - 127 \| 8 \|
				954	\| 128 - 226 \| 0 - 98 \| 7 \|
				955	\| 227 - 255 \| 128 - 156 \| 8 \|
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	956
				957	Symbols probabilities are read one by one, in order.
				958
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	959	Probability is obtained from Value decoded by following formula :
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	960	`Proba = value - 1`
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	961
				962	It means value `0` becomes negative probability `-1`.
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	963	`-1` is a special probability, which means "less than 1".
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	964	Its effect on distribution table is described in [next paragraph].
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	965	For the purpose of calculating cumulated distribution, it counts as one.
				966
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	967	[next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables
				968
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	969	When a symbol has a probability of `zero`,
				970	it is followed by a 2-bits repeat flag.
				971	This repeat flag tells how many probabilities of zeroes follow the current one.
				972	It provides a number ranging from 0 to 3.
				973	If it is a 3, another 2-bits repeat flag follows, and so on.
				974
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	975	When last symbol reaches cumulated total of `1 << Accuracy_Log`,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	976	decoding is complete.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	977	If the last symbol makes cumulated total go above `1 << Accuracy_Log`,
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	978	distribution is considered corrupted.
				979
Yann Collet	10b9c13	2016-07-24 01:21:53 +0200	[diff] [blame]	980	Then the decoder can tell how many bytes were used in this process,
				981	and how many symbols are present.
				982	The bitstream consumes a round number of bytes.
				983	Any remaining bit within the last byte is just unused.
				984
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	985	##### FSE decoding : from normalized distribution to decoding tables
				986
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	987	The distribution of normalized probabilities is enough
				988	to create a unique decoding table.
				989
				990	It follows the following build rule :
				991
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	992	The table has a size of `tableSize = 1 << Accuracy_Log`.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	993	Each cell describes the symbol decoded,
				994	and instructions to get the next state.
				995
inikep	12731a9	2016-08-25 15:19:37 +0200	[diff] [blame]	996	Symbols are scanned in their natural order for "less than 1" probabilities.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	997	Symbols with this probability are being attributed a single cell,
				998	starting from the end of the table.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	999	These symbols define a full state reset, reading `Accuracy_Log` bits.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1000
				1001	All remaining symbols are sorted in their natural order.
				1002	Starting from symbol `0` and table position `0`,
				1003	each symbol gets attributed as many cells as its probability.
				1004	Cell allocation is spreaded, not linear :
				1005	each successor position follow this rule :
				1006
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	1007	```
				1008	position += (tableSize>>1) + (tableSize>>3) + 3;
				1009	position &= tableSize-1;
				1010	```
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1011
				1012	A position is skipped if already occupied,
				1013	typically by a "less than 1" probability symbol.
				1014
				1015	The result is a list of state values.
				1016	Each state will decode the current symbol.
				1017
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1018	To get the `Number_of_Bits` and `Baseline` required for next state,
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1019	it's first necessary to sort all states in their natural order.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	1020	The lower states will need 1 more bit than higher ones.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1021
				1022	__Example__ :
				1023	Presuming a symbol has a probability of 5.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	1024	It receives 5 state values. States are sorted in natural order.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1025
				1026	Next power of 2 is 8.
				1027	Space of probabilities is divided into 8 equal parts.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1028	Presuming the `Accuracy_Log` is 7, it defines 128 states.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1029	Divided by 8, each share is 16 large.
				1030
				1031	In order to reach 8, 8-5=3 lowest states will count "double",
				1032	taking shares twice larger,
				1033	requiring one more bit in the process.
				1034
				1035	Numbering starts from higher states using less bits.
				1036
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1037	\| state order \| 0 \| 1 \| 2 \| 3 \| 4 \|
				1038	\| ---------------- \| ----- \| ----- \| ------ \| ---- \| ----- \|
				1039	\| width \| 32 \| 32 \| 32 \| 16 \| 16 \|
				1040	\| `Number_of_Bits` \| 5 \| 5 \| 5 \| 4 \| 4 \|
				1041	\| range number \| 2 \| 4 \| 6 \| 0 \| 1 \|
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1042	\| `Baseline` \| 32 \| 64 \| 96 \| 0 \| 16 \|
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1043	\| range \| 32-63 \| 64-95 \| 96-127 \| 0-15 \| 16-31 \|
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1044
				1045	Next state is determined from current state
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1046	by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
Yann Collet	23f05cc	2016-07-04 16:13:11 +0200	[diff] [blame]	1047
				1048
				1049	#### Bitstream
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1050
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1051	All sequences are stored in a single bitstream, read _backward_.
				1052	It is therefore necessary to know the bitstream size,
				1053	which is deducted from compressed block size.
				1054
Yann Collet	c40ba71	2016-07-08 15:39:02 +0200	[diff] [blame]	1055	The last useful bit of the stream is followed by an end-bit-flag.
Yann Collet	cd25a91	2016-07-05 11:50:37 +0200	[diff] [blame]	1056	Highest bit of last byte is this flag.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1057	It does not belong to the useful part of the bitstream.
				1058	Therefore, last byte has 0-7 useful bits.
				1059	Note that it also means that last byte cannot be `0`.
				1060
				1061	##### Starting states
				1062
				1063	The bitstream starts with initial state values,
				1064	each using the required number of bits in their respective _accuracy_,
				1065	decoded previously from their normalized distribution.
				1066
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1067	It starts by `Literals_Length_State`,
				1068	followed by `Offset_State`,
				1069	and finally `Match_Length_State`.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1070
				1071	Reminder : always keep in mind that all values are read _backward_.
				1072
				1073	##### Decoding a sequence
				1074
				1075	A state gives a code.
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1076	A code provides `Baseline` and `Number_of_Bits` to add.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1077	See [Symbol Decoding] section for details on each symbol.
				1078
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1079	Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
				1080	It then does the same for `Match_Length`,
				1081	and then for `Literals_Length`.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1082
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1083	`Offset`, `Match_Length`, and `Literals_Length` define a sequence.
				1084	It starts by inserting the number of literals defined by `Literals_Length`,
				1085	then continue by copying `Match_Length` bytes from `currentPos - Offset`.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1086
				1087	The next operation is to update states.
				1088	Using rules pre-calculated in the decoding tables,
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1089	`Literals_Length_State` is updated,
				1090	followed by `Match_Length_State`,
				1091	and then `Offset_State`.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1092
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1093	This operation will be repeated `Number_of_Sequences` times.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1094	At the end, the bitstream shall be entirely consumed,
				1095	otherwise bitstream is considered corrupted.
				1096
inikep	de9d130	2016-08-25 14:59:08 +0200	[diff] [blame]	1097	[Symbol Decoding]:#the-codes-for-literals-lengths-match-lengths-and-offsets
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1098
				1099	##### Repeat offsets
				1100
inikep	12731a9	2016-08-25 15:19:37 +0200	[diff] [blame]	1101	As seen in [Offset Codes], the first 3 values define a repeated offset and we will call them `Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`.
				1102	They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one".
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1103
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1104	There is an exception though, when current sequence's literals length is `0`.
inikep	12731a9	2016-08-25 15:19:37 +0200	[diff] [blame]	1105	In which case, repeated offsets are "pushed by one",
				1106	so `Repeated_Offset1` becomes `Repeated_Offset2`, `Repeated_Offset2` becomes `Repeated_Offset3`,
				1107	and `Repeated_Offset3` becomes `Repeated_Offset1 - 1_byte`.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1108
Yann Collet	917fe18	2016-07-31 04:01:57 +0200	[diff] [blame]	1109	On first block, offset history is populated by the following values : 1, 4 and 8 (in order).
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1110
				1111	Then each block receives its start value from previous compressed block.
				1112	Note that non-compressed blocks are skipped,
				1113	they do not contribute to offset history.
				1114
				1115	[Offset Codes]: #offset-codes
				1116
				1117	###### Offset updates rules
				1118
Yann Collet	917fe18	2016-07-31 04:01:57 +0200	[diff] [blame]	1119	New offset take the lead in offset history,
				1120	up to its previous place if it was already present.
Yann Collet	9ca7336	2016-07-05 10:53:38 +0200	[diff] [blame]	1121
inikep	12731a9	2016-08-25 15:19:37 +0200	[diff] [blame]	1122	It means that when `Repeated_Offset1` (most recent) is used, history is unmodified.
				1123	When `Repeated_Offset2` is used, it's swapped with `Repeated_Offset1`.
Yann Collet	00d44ab	2016-07-04 01:29:47 +0200	[diff] [blame]	1124
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1125
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1126	Dictionary format
				1127	-----------------
				1128
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1129	`zstd` is compatible with "raw content" dictionaries, free of any format restriction.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1130	But dictionaries created by `zstd --train` follow a format, described here.
				1131
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1132	__Pre-requisites__ : a dictionary has a size,
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1133	defined either by a buffer limit, or a file size.
				1134
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1135	\| `Magic_Number` \| `Dictionary_ID` \| `Entropy_Tables` \| `Content` \|
				1136	\| -------------- \| --------------- \| ---------------- \| --------- \|
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1137
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1138	__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, little-endian format
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1139
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1140	__`Dictionary_ID`__ : 4 bytes, stored in little-endian format.
				1141	`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
Yann Collet	722e14b	2016-07-08 19:22:16 +0200	[diff] [blame]	1142	It's used by decoders to check if they use the correct dictionary.
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1143
				1144	_Reserved ranges :_
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	1145	If the frame is going to be distributed in a private environment,
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1146	any `Dictionary_ID` can be used.
Yann Collet	f6ff53c	2016-07-15 17:03:38 +0200	[diff] [blame]	1147	However, for public distribution of compressed frames,
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1148	the following ranges are reserved for future use and should not be used :
Yann Collet	6cacd34	2016-07-15 17:58:13 +0200	[diff] [blame]	1149
Yann Collet	70c2326	2016-08-21 00:24:18 +0200	[diff] [blame]	1150	- low range : 1 - 32767
				1151	- high range : >= (2^31)
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1152
inikep	e81f2cb	2016-08-13 09:36:24 +0200	[diff] [blame]	1153	__`Entropy_Tables`__ : following the same format as a [compressed blocks].
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1154	They are stored in following order :
				1155	Huffman tables for literals, FSE table for offsets,
				1156	FSE table for match lengths, and FSE table for literals lengths.
				1157	It's finally followed by 3 offset values, populating recent offsets,
				1158	stored in order, 4-bytes little-endian each, for a total of 12 bytes.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1159
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1160	__`Content`__ : The rest of the dictionary is its content.
				1161	The content act as a "past" in front of data to compress or decompress.
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1162
inikep	f9c3cce	2016-07-25 11:04:56 +0200	[diff] [blame]	1163	[compressed blocks]: #the-format-of-compressed_block
Yann Collet	bd10607	2016-07-08 19:16:57 +0200	[diff] [blame]	1164
Yann Collet	2fa9904	2016-07-01 20:55:28 +0200	[diff] [blame]	1165
				1166	Version changes
				1167	---------------
Yann Collet	855766d	2016-09-02 17:04:49 -0700	[diff] [blame]	1168	- 0.2.1 : clarify field names, by Przemyslaw Skibinski
Yann Collet	6fa05a2	2016-07-20 14:58:49 +0200	[diff] [blame]	1169	- 0.2.0 : numerous format adjustments for zstd v0.8
inikep	586a055	2016-08-03 16:16:38 +0200	[diff] [blame]	1170	- 0.1.2 : limit Huffman tree depth to 11 bits
Yann Collet	e557fd5	2016-07-17 16:21:37 +0200	[diff] [blame]	1171	- 0.1.1 : reserved dictID ranges
				1172	- 0.1.0 : initial release