blob: d7f21939ae50824bb823b1f4a11c5b9974d378f2 [file] [log] [blame] [view]
Yann Collet5cc18822016-07-03 19:03:13 +02001Zstandard Compression Format
2============================
Yann Collet2fa99042016-07-01 20:55:28 +02003
4### Notices
5
6Copyright (c) 2016 Yann Collet
7
8Permission is granted to copy and distribute this document
inikepf9c3cce2016-07-25 11:04:56 +02009for any purpose and without charge,
10including translations into other languages
Yann Collet2fa99042016-07-01 20:55:28 +020011and incorporation into compilations,
12provided that the copyright notice and this notice are preserved,
13and that any substantive changes or deletions from the original
14are clearly marked.
15Distribution of this document is unlimited.
16
17### Version
18
Yann Collet6fa05a22016-07-20 14:58:49 +0200190.2.0 (22/07/16)
Yann Collet2fa99042016-07-01 20:55:28 +020020
21
22Introduction
23------------
24
25The purpose of this document is to define a lossless compressed data format,
26that is independent of CPU type, operating system,
27file system and character set, suitable for
Yann Collet9ca73362016-07-05 10:53:38 +020028file compression, pipe and streaming compression,
Yann Collet2fa99042016-07-01 20:55:28 +020029using the [Zstandard algorithm](http://www.zstandard.org).
30
31The data can be produced or consumed,
32even for an arbitrarily long sequentially presented input data stream,
33using only an a priori bounded amount of intermediate storage,
34and hence can be used in data communications.
35The format uses the Zstandard compression method,
36and optional [xxHash-64 checksum method](http://www.xxhash.org),
37for detection of data corruption.
38
39The data format defined by this specification
40does not attempt to allow random access to compressed data.
41
42This specification is intended for use by implementers of software
43to compress data into Zstandard format and/or decompress data from Zstandard format.
44The text of the specification assumes a basic background in programming
45at the level of bits and other primitive data representations.
46
47Unless otherwise indicated below,
48a compliant compressor must produce data sets
49that conform to the specifications presented here.
50It doesn’t need to support all options though.
51
52A compliant decompressor must be able to decompress
53at least one working set of parameters
54that conforms to the specifications presented here.
55It may also ignore informative fields, such as checksum.
56Whenever it does not support a parameter defined in the compressed stream,
57it must produce a non-ambiguous error code and associated error message
58explaining which parameter is unsupported.
59
60
inikepf9c3cce2016-07-25 11:04:56 +020061Overall conventions
62-----------
inikepe81f2cb2016-08-13 09:36:24 +020063In this document:
64- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
65- a naming convention for identifiers is `Mixed_Case_With_Underscores`
inikepf9c3cce2016-07-25 11:04:56 +020066
Yann Collet2fa99042016-07-01 20:55:28 +020067Definitions
68-----------
69A content compressed by Zstandard is transformed into a Zstandard __frame__.
70Multiple frames can be appended into a single file or stream.
71A frame is totally independent, has a defined beginning and end,
72and a set of parameters which tells the decoder how to decompress it.
73
74A frame encapsulates one or multiple __blocks__.
75Each block can be compressed or not,
76and has a guaranteed maximum content size, which depends on frame parameters.
77Unlike frames, each block depends on previous blocks for proper decoding.
78However, each block can be decompressed without waiting for its successor,
79allowing streaming operations.
80
81
inikepf9c3cce2016-07-25 11:04:56 +020082Frame Concatenation
83-------------------
Yann Collet2fa99042016-07-01 20:55:28 +020084
85In some circumstances, it may be required to append multiple frames,
86for example in order to add new data to an existing compressed file
87without re-framing it.
88
89In such case, each frame brings its own set of descriptor flags.
90Each frame is considered independent.
91The only relation between frames is their sequential order.
92
93The ability to decode multiple concatenated frames
94within a single stream or file is left outside of this specification.
95As an example, the reference `zstd` command line utility is able
96to decode all concatenated frames in their sequential order,
97delivering the final decompressed result as if it was a single content.
98
99
inikep01323752016-08-25 12:20:38 +0200100Skippable Frames
101----------------
102
103| `Magic_Number` | `Frame_Size` | `User_Data` |
104|:--------------:|:------------:|:-----------:|
105| 4 bytes | 4 bytes | n bytes |
106
107Skippable frames allow the insertion of user-defined data
108into a flow of concatenated frames.
109Its design is pretty straightforward,
110with the sole objective to allow the decoder to quickly skip
111over user-defined data and continue decoding.
112
113Skippable frames defined in this specification are compatible with [LZ4] ones.
114
115[LZ4]:http://www.lz4.org
116
117__`Magic_Number`__
118
1194 Bytes, little-endian format.
120Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
121All 16 values are valid to identify a skippable frame.
122
123__`Frame_Size`__
124
125This is the size, in bytes, of the following `User_Data`
126(without including the magic number nor the size field itself).
127This field is represented using 4 Bytes, little-endian format, unsigned 32-bits.
128This means `User_Data` can’t be bigger than (2^32-1) bytes.
129
130__`User_Data`__
131
132The `User_Data` can be anything. Data will just be skipped by the decoder.
133
134
135
inikepf9c3cce2016-07-25 11:04:56 +0200136General Structure of Zstandard Frame format
137-------------------------------------------
138The structure of a single Zstandard frame is following:
Yann Collet2fa99042016-07-01 20:55:28 +0200139
Yann Colletc991cc12016-07-28 00:55:43 +0200140| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] |
141|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:|
142| 4 bytes | 2-14 bytes | n bytes | | 0-4 bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200143
inikepf9c3cce2016-07-25 11:04:56 +0200144__`Magic_Number`__
145
inikepe81f2cb2016-08-13 09:36:24 +02001464 Bytes, little-endian format.
Yann Collet7bdfcea2016-09-05 17:43:31 +0200147Value : 0xFD2FB528
inikepf9c3cce2016-07-25 11:04:56 +0200148
149__`Frame_Header`__
150
1512 to 14 Bytes, detailed in [next part](#the-structure-of-frame_header).
152
153__`Data_Block`__
154
155Detailed in [next chapter](#the-structure-of-data_block).
156That’s where compressed data is stored.
157
Yann Colletc991cc12016-07-28 00:55:43 +0200158__`Content_Checksum`__
inikepf9c3cce2016-07-25 11:04:56 +0200159
Yann Colletc991cc12016-07-28 00:55:43 +0200160An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
inikepf9c3cce2016-07-25 11:04:56 +0200161The content checksum is the result
inikepe81f2cb2016-08-13 09:36:24 +0200162of [xxh64() hash function](http://www.xxhash.org)
inikepf9c3cce2016-07-25 11:04:56 +0200163digesting the original (decoded) data as input, and a seed of zero.
Yann Colletc991cc12016-07-28 00:55:43 +0200164The low 4 bytes of the checksum are stored in little endian format.
inikepf9c3cce2016-07-25 11:04:56 +0200165
166
167The structure of `Frame_Header`
168-------------------------------
169The `Frame_Header` has a variable size, which uses a minimum of 2 bytes,
Yann Collet2fa99042016-07-01 20:55:28 +0200170and up to 14 bytes depending on optional parameters.
inikepf9c3cce2016-07-25 11:04:56 +0200171The structure of `Frame_Header` is following:
Yann Collet2fa99042016-07-01 20:55:28 +0200172
inikepf9c3cce2016-07-25 11:04:56 +0200173| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
174| ------------------------- | --------------------- | ----------------- | ---------------------- |
175| 1 byte | 0-1 byte | 0-4 bytes | 0-8 bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200176
inikepf9c3cce2016-07-25 11:04:56 +0200177### `Frame_Header_Descriptor`
178
179The first header's byte is called the `Frame_Header_Descriptor`.
Yann Collet2fa99042016-07-01 20:55:28 +0200180It tells which other fields are present.
inikepf9c3cce2016-07-25 11:04:56 +0200181Decoding this byte is enough to tell the size of `Frame_Header`.
Yann Collet2fa99042016-07-01 20:55:28 +0200182
inikepf9c3cce2016-07-25 11:04:56 +0200183| Bit number | Field name |
184| ---------- | ---------- |
185| 7-6 | `Frame_Content_Size_flag` |
186| 5 | `Single_Segment_flag` |
187| 4 | `Unused_bit` |
188| 3 | `Reserved_bit` |
189| 2 | `Content_Checksum_flag` |
190| 1-0 | `Dictionary_ID_flag` |
Yann Collet2fa99042016-07-01 20:55:28 +0200191
192In this table, bit 7 is highest bit, while bit 0 is lowest.
193
inikep49ec6d12016-07-25 12:26:39 +0200194__`Frame_Content_Size_flag`__
195
196This is a 2-bits flag (`= Frame_Header_Descriptor >> 6`),
197specifying if decompressed data size is provided within the header.
Yann Colletc991cc12016-07-28 00:55:43 +0200198The `Flag_Value` can be converted into `Field_Size`,
199which is the number of bytes used by `Frame_Content_Size`
200according to the following table:
inikep49ec6d12016-07-25 12:26:39 +0200201
inikep01323752016-08-25 12:20:38 +0200202|`Flag_Value`| 0 | 1 | 2 | 3 |
203| ---------- | ------ | --- | --- | --- |
204|`Field_Size`| 0 or 1 | 2 | 4 | 8 |
inikep49ec6d12016-07-25 12:26:39 +0200205
Yann Colletc991cc12016-07-28 00:55:43 +0200206When `Flag_Value` is `0`, `Field_Size` depends on `Single_Segment_flag` :
207if `Single_Segment_flag` is set, `Field_Size` is 1.
208Otherwise, `Field_Size` is 0 (content size not provided).
inikep49ec6d12016-07-25 12:26:39 +0200209
inikepf9c3cce2016-07-25 11:04:56 +0200210__`Single_Segment_flag`__
Yann Collet2fa99042016-07-01 20:55:28 +0200211
inikep49ec6d12016-07-25 12:26:39 +0200212If this flag is set,
Yann Colletc991cc12016-07-28 00:55:43 +0200213data must be regenerated within a single continuous memory segment.
Yann Collet2fa99042016-07-01 20:55:28 +0200214
Yann Colletc991cc12016-07-28 00:55:43 +0200215In this case, `Frame_Content_Size` is necessarily present,
216but `Window_Descriptor` byte is skipped.
inikep49ec6d12016-07-25 12:26:39 +0200217As a consequence, the decoder must allocate a memory segment
218of size equal or bigger than `Frame_Content_Size`.
Yann Collet2fa99042016-07-01 20:55:28 +0200219
220In order to preserve the decoder from unreasonable memory requirement,
Yann Collet23f05cc2016-07-04 16:13:11 +0200221a decoder can reject a compressed frame
Yann Collet2fa99042016-07-01 20:55:28 +0200222which requests a memory size beyond decoder's authorized range.
223
224For broader compatibility, decoders are recommended to support
Yann Collet23f05cc2016-07-04 16:13:11 +0200225memory sizes of at least 8 MB.
226This is just a recommendation,
Yann Collet9ca73362016-07-05 10:53:38 +0200227each decoder is free to support higher or lower limits,
Yann Collet2fa99042016-07-01 20:55:28 +0200228depending on local limitations.
229
inikepf9c3cce2016-07-25 11:04:56 +0200230__`Unused_bit`__
Yann Collet2fa99042016-07-01 20:55:28 +0200231
Yann Colletf0bc6732016-07-13 17:30:21 +0200232The value of this bit should be set to zero.
Yann Colletc991cc12016-07-28 00:55:43 +0200233A decoder compliant with this specification version shall not interpret it.
Yann Colletf0bc6732016-07-13 17:30:21 +0200234It might be used in a future version,
235to signal a property which is not mandatory to properly decode the frame.
Yann Collet2fa99042016-07-01 20:55:28 +0200236
inikepf9c3cce2016-07-25 11:04:56 +0200237__`Reserved_bit`__
Yann Collet2fa99042016-07-01 20:55:28 +0200238
239This bit is reserved for some future feature.
240Its value _must be zero_.
241A decoder compliant with this specification version must ensure it is not set.
242This bit may be used in a future revision,
Yann Colletc991cc12016-07-28 00:55:43 +0200243to signal a feature that must be interpreted to decode the frame correctly.
Yann Collet2fa99042016-07-01 20:55:28 +0200244
inikepf9c3cce2016-07-25 11:04:56 +0200245__`Content_Checksum_flag`__
Yann Collet2fa99042016-07-01 20:55:28 +0200246
Yann Colletc991cc12016-07-28 00:55:43 +0200247If this flag is set, a 32-bits `Content_Checksum` will be present at frame's end.
248See `Content_Checksum` paragraph.
Yann Collet2fa99042016-07-01 20:55:28 +0200249
inikepf9c3cce2016-07-25 11:04:56 +0200250__`Dictionary_ID_flag`__
Yann Collet2fa99042016-07-01 20:55:28 +0200251
252This is a 2-bits flag (`= FHD & 3`),
Yann Collet9ca73362016-07-05 10:53:38 +0200253telling if a dictionary ID is provided within the header.
inikepe81f2cb2016-08-13 09:36:24 +0200254It also specifies the size of this field as `Field_Size`.
Yann Collet2fa99042016-07-01 20:55:28 +0200255
inikepe81f2cb2016-08-13 09:36:24 +0200256|`Flag_Value`| 0 | 1 | 2 | 3 |
257| ---------- | --- | --- | --- | --- |
258|`Field_Size`| 0 | 1 | 2 | 4 |
Yann Collet2fa99042016-07-01 20:55:28 +0200259
inikepf9c3cce2016-07-25 11:04:56 +0200260### `Window_Descriptor`
Yann Collet2fa99042016-07-01 20:55:28 +0200261
262Provides guarantees on maximum back-reference distance
Yann Colletc991cc12016-07-28 00:55:43 +0200263that will be used within compressed data.
264This information is important for decoders to allocate enough memory.
Yann Collet2fa99042016-07-01 20:55:28 +0200265
Yann Colletc991cc12016-07-28 00:55:43 +0200266The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag` is set.
inikepf9c3cce2016-07-25 11:04:56 +0200267In this case, the maximum back-reference distance is the content size itself,
Yann Colletcd25a912016-07-05 11:50:37 +0200268which can be any value from 1 to 2^64-1 bytes (16 EB).
269
inikepe81f2cb2016-08-13 09:36:24 +0200270| Bit numbers | 7-3 | 0-2 |
271| ----------- | ---------- | ---------- |
272| Field name | `Exponent` | `Mantissa` |
Yann Collet2fa99042016-07-01 20:55:28 +0200273
inikepde9d1302016-08-25 14:59:08 +0200274Maximum distance is given by the following formulas :
Yann Collet2fa99042016-07-01 20:55:28 +0200275```
276windowLog = 10 + Exponent;
277windowBase = 1 << windowLog;
278windowAdd = (windowBase / 8) * Mantissa;
inikepe81f2cb2016-08-13 09:36:24 +0200279Window_Size = windowBase + windowAdd;
Yann Collet2fa99042016-07-01 20:55:28 +0200280```
281The minimum window size is 1 KB.
Yann Colletcd25a912016-07-05 11:50:37 +0200282The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
Yann Collet2fa99042016-07-01 20:55:28 +0200283
284To properly decode compressed data,
inikepe81f2cb2016-08-13 09:36:24 +0200285a decoder will need to allocate a buffer of at least `Window_Size` bytes.
Yann Collet2fa99042016-07-01 20:55:28 +0200286
Yann Collet2fa99042016-07-01 20:55:28 +0200287In order to preserve decoder from unreasonable memory requirements,
288a decoder can refuse a compressed frame
289which requests a memory size beyond decoder's authorized range.
290
Yann Colletcd25a912016-07-05 11:50:37 +0200291For improved interoperability,
Yann Colletc991cc12016-07-28 00:55:43 +0200292decoders are recommended to be compatible with window sizes of 8 MB,
293and encoders are recommended to not request more than 8 MB.
Yann Collet2fa99042016-07-01 20:55:28 +0200294It's merely a recommendation though,
295decoders are free to support larger or lower limits,
296depending on local limitations.
297
inikepf9c3cce2016-07-25 11:04:56 +0200298### `Dictionary_ID`
Yann Collet23f05cc2016-07-04 16:13:11 +0200299
Yann Colletf6ff53c2016-07-15 17:03:38 +0200300This is a variable size field, which contains
301the ID of the dictionary required to properly decode the frame.
302Note that this field is optional. When it's not present,
Yann Collet23f05cc2016-07-04 16:13:11 +0200303it's up to the caller to make sure it uses the correct dictionary.
304
inikepf9c3cce2016-07-25 11:04:56 +0200305Field size depends on `Dictionary_ID_flag`.
Yann Collet23f05cc2016-07-04 16:13:11 +02003061 byte can represent an ID 0-255.
3072 bytes can represent an ID 0-65535.
Yann Colletcd25a912016-07-05 11:50:37 +02003084 bytes can represent an ID 0-4294967295.
Yann Collet23f05cc2016-07-04 16:13:11 +0200309
310It's allowed to represent a small ID (for example `13`)
Yann Colletcd25a912016-07-05 11:50:37 +0200311with a large 4-bytes dictionary ID, losing some compacity in the process.
Yann Collet23f05cc2016-07-04 16:13:11 +0200312
Yann Colletf6ff53c2016-07-15 17:03:38 +0200313_Reserved ranges :_
314If the frame is going to be distributed in a private environment,
315any dictionary ID can be used.
316However, for public distribution of compressed frames using a dictionary,
inikepf9c3cce2016-07-25 11:04:56 +0200317the following ranges are reserved for future use and should not be used :
318- low range : 1 - 32767
319- high range : >= (2^31)
Yann Colletf6ff53c2016-07-15 17:03:38 +0200320
321
inikepf9c3cce2016-07-25 11:04:56 +0200322### `Frame_Content_Size`
Yann Collet2fa99042016-07-01 20:55:28 +0200323
inikep2fc37522016-07-25 12:47:02 +0200324This is the original (uncompressed) size. This information is optional.
325The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
326The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
inikepe81f2cb2016-08-13 09:36:24 +0200327Format is little-endian.
Yann Collet2fa99042016-07-01 20:55:28 +0200328
inikep2fc37522016-07-25 12:47:02 +0200329| `Field_Size` | Range |
330| ------------ | ---------- |
331| 1 | 0 - 255 |
332| 2 | 256 - 65791|
333| 4 | 0 - 2^32-1 |
334| 8 | 0 - 2^64-1 |
Yann Collet2fa99042016-07-01 20:55:28 +0200335
inikep2fc37522016-07-25 12:47:02 +0200336When `Field_Size` is 1, 4 or 8 bytes, the value is read directly.
337When `Field_Size` is 2, _the offset of 256 is added_.
inikepf9c3cce2016-07-25 11:04:56 +0200338It's allowed to represent a small size (for example `18`) using any compatible variant.
Yann Collet2fa99042016-07-01 20:55:28 +0200339
Yann Collet2fa99042016-07-01 20:55:28 +0200340
inikepf9c3cce2016-07-25 11:04:56 +0200341The structure of `Data_Block`
342-----------------------------
343The structure of `Data_Block` is following:
Yann Collet2fa99042016-07-01 20:55:28 +0200344
Yann Colletc991cc12016-07-28 00:55:43 +0200345| `Last_Block` | `Block_Type` | `Block_Size` | `Block_Content` |
346|:------------:|:------------:|:------------:|:---------------:|
347| 1 bit | 2 bits | 21 bits | n bytes |
348
inikepe81f2cb2016-08-13 09:36:24 +0200349The block header (`Last_Block`, `Block_Type`, and `Block_Size`) uses 3-bytes.
Yann Colletc991cc12016-07-28 00:55:43 +0200350
351__`Last_Block`__
352
353The lowest bit signals if this block is the last one.
354Frame ends right after this block.
355It may be followed by an optional `Content_Checksum` .
Yann Collet2fa99042016-07-01 20:55:28 +0200356
inikepf9c3cce2016-07-25 11:04:56 +0200357__`Block_Type` and `Block_Size`__
Yann Collet2fa99042016-07-01 20:55:28 +0200358
Yann Colletc991cc12016-07-28 00:55:43 +0200359The next 2 bits represent the `Block_Type`,
360while the remaining 21 bits represent the `Block_Size`.
361Format is __little-endian__.
Yann Collet2fa99042016-07-01 20:55:28 +0200362
363There are 4 block types :
364
inikepf9c3cce2016-07-25 11:04:56 +0200365| Value | 0 | 1 | 2 | 3 |
366| ------------ | ----------- | ----------- | ------------------ | --------- |
Yann Colletc991cc12016-07-28 00:55:43 +0200367| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|
Yann Collet2fa99042016-07-01 20:55:28 +0200368
inikepf9c3cce2016-07-25 11:04:56 +0200369- `Raw_Block` - this is an uncompressed block.
370 `Block_Size` is the number of bytes to read and copy.
371- `RLE_Block` - this is a single byte, repeated N times.
372 In which case, `Block_Size` is the size to regenerate,
373 while the "compressed" block is just 1 byte (the byte to repeat).
374- `Compressed_Block` - this is a [Zstandard compressed block](#the-format-of-compressed_block),
Yann Collet9ca73362016-07-05 10:53:38 +0200375 detailed in another section of this specification.
inikepf9c3cce2016-07-25 11:04:56 +0200376 `Block_Size` is the compressed size.
Yann Collet2fa99042016-07-01 20:55:28 +0200377 Decompressed size is unknown,
378 but its maximum possible value is guaranteed (see below)
Yann Colletc991cc12016-07-28 00:55:43 +0200379- `Reserved` - this is not a block.
380 This value cannot be used with current version of this specification.
Yann Collet2fa99042016-07-01 20:55:28 +0200381
382Block sizes must respect a few rules :
Yann Collet9ca73362016-07-05 10:53:38 +0200383- In compressed mode, compressed size if always strictly `< decompressed size`.
384- Block decompressed size is always <= maximum back-reference distance .
385- Block decompressed size is always <= 128 KB
Yann Collet2fa99042016-07-01 20:55:28 +0200386
387
inikepf9c3cce2016-07-25 11:04:56 +0200388__`Block_Content`__
Yann Collet2fa99042016-07-01 20:55:28 +0200389
inikepf9c3cce2016-07-25 11:04:56 +0200390The `Block_Content` is where the actual data to decode stands.
Yann Collet2fa99042016-07-01 20:55:28 +0200391It might be compressed or not, depending on previous field indications.
392A data block is not necessarily "full" :
393since an arbitrary “flush” may happen anytime,
Yann Colletcd25a912016-07-05 11:50:37 +0200394block decompressed content can be any size,
inikepf9c3cce2016-07-25 11:04:56 +0200395up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
Yann Colletcd25a912016-07-05 11:50:37 +0200396- Maximum back-reference distance
Yann Collet2fa99042016-07-01 20:55:28 +0200397- 128 KB
398
399
Yann Collet2fa99042016-07-01 20:55:28 +0200400
inikepf9c3cce2016-07-25 11:04:56 +0200401The format of `Compressed_Block`
402--------------------------------
403The size of `Compressed_Block` must be provided using `Block_Size` field from `Data_Block`.
404The `Compressed_Block` has a guaranteed maximum regenerated size,
Yann Collet2fa99042016-07-01 20:55:28 +0200405in order to properly allocate destination buffer.
inikepf9c3cce2016-07-25 11:04:56 +0200406See [`Data_Block`](#the-structure-of-data_block) for more details.
Yann Collet2fa99042016-07-01 20:55:28 +0200407
408A compressed block consists of 2 sections :
inikep00794252016-08-04 14:43:21 +0200409- [`Literals_Section`](#literals_section)
410- [`Sequences_Section`](#sequences_section)
Yann Collet2fa99042016-07-01 20:55:28 +0200411
Yann Collet23f05cc2016-07-04 16:13:11 +0200412### Prerequisites
413To decode a compressed block, the following elements are necessary :
inikepe81f2cb2016-08-13 09:36:24 +0200414- Previous decoded blocks, up to a distance of `Window_Size`,
inikepf9c3cce2016-07-25 11:04:56 +0200415 or all previous blocks when `Single_Segment_flag` is set.
Yann Collet698cb632016-07-03 18:49:35 +0200416- List of "recent offsets" from previous compressed block.
Yann Collet00d44ab2016-07-04 01:29:47 +0200417- Decoding tables of previous compressed block for each symbol type
inikepde9d1302016-08-25 14:59:08 +0200418 (literals, literals lengths, match lengths, offsets).
Yann Collet698cb632016-07-03 18:49:35 +0200419
420
inikepf896c1d2016-08-03 16:37:42 +0200421### `Literals_Section`
Yann Collet2fa99042016-07-01 20:55:28 +0200422
Yann Collet2fa99042016-07-01 20:55:28 +0200423During sequence phase, literals will be entangled with match copy operations.
424All literals are regrouped in the first part of the block.
425They can be decoded first, and then copied during sequence operations,
Yann Collet00d44ab2016-07-04 01:29:47 +0200426or they can be decoded on the flow, as needed by sequence commands.
Yann Collet2fa99042016-07-01 20:55:28 +0200427
inikep4f270ac2016-08-04 11:25:52 +0200428| `Literals_Section_Header` | [`Huffman_Tree_Description`] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
429| ------------------------- | ---------------------------- | ------- | --------- | --------- | --------- |
Yann Collet2fa99042016-07-01 20:55:28 +0200430
inikepf9c3cce2016-07-25 11:04:56 +0200431Literals can be stored uncompressed or compressed using Huffman prefix codes.
Yann Collet2fa99042016-07-01 20:55:28 +0200432When compressed, an optional tree description can be present,
433followed by 1 or 4 streams.
434
inikepf9c3cce2016-07-25 11:04:56 +0200435
inikepf896c1d2016-08-03 16:37:42 +0200436#### `Literals_Section_Header`
Yann Collet2fa99042016-07-01 20:55:28 +0200437
Yann Collet00d44ab2016-07-04 01:29:47 +0200438Header is in charge of describing how literals are packed.
Yann Collet2fa99042016-07-01 20:55:28 +0200439It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
Yann Collet198e6aa2016-07-20 20:12:24 +0200440using little-endian convention.
Yann Collet2fa99042016-07-01 20:55:28 +0200441
inikepf896c1d2016-08-03 16:37:42 +0200442| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
443| --------------------- | ------------- | ------------------ | ----------------- |
444| 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits |
Yann Collet198e6aa2016-07-20 20:12:24 +0200445
446In this representation, bits on the left are smallest bits.
Yann Collet2fa99042016-07-01 20:55:28 +0200447
Yann Collet70c23262016-08-21 00:24:18 +0200448__`Literals_Block_Type`__
Yann Collet2fa99042016-07-01 20:55:28 +0200449
Yann Collet198e6aa2016-07-20 20:12:24 +0200450This field uses 2 lowest bits of first byte, describing 4 different block types :
Yann Collet2fa99042016-07-01 20:55:28 +0200451
inikep01323752016-08-25 12:20:38 +0200452| `Literals_Block_Type` | Value |
453| ----------------------------- | ----- |
454| `Raw_Literals_Block` | 0 |
455| `RLE_Literals_Block` | 1 |
456| `Compressed_Literals_Block` | 2 |
457| `Repeat_Stats_Literals_Block` | 3 |
Yann Collet2fa99042016-07-01 20:55:28 +0200458
inikepf896c1d2016-08-03 16:37:42 +0200459- `Raw_Literals_Block` - Literals are stored uncompressed.
460- `RLE_Literals_Block` - Literals consist of a single byte value repeated N times.
461- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
inikep586a0552016-08-03 16:16:38 +0200462 starting with a Huffman tree description.
Yann Colletcd25a912016-07-05 11:50:37 +0200463 See details below.
inikepf896c1d2016-08-03 16:37:42 +0200464- `Repeat_Stats_Literals_Block` - This is a Huffman-compressed block,
inikep586a0552016-08-03 16:16:38 +0200465 using Huffman tree _from previous Huffman-compressed literals block_.
Yann Colletcd25a912016-07-05 11:50:37 +0200466 Huffman tree description will be skipped.
Yann Collet2fa99042016-07-01 20:55:28 +0200467
Yann Collet70c23262016-08-21 00:24:18 +0200468__`Size_Format`__
Yann Collet2fa99042016-07-01 20:55:28 +0200469
inikepf896c1d2016-08-03 16:37:42 +0200470`Size_Format` is divided into 2 families :
Yann Collet2fa99042016-07-01 20:55:28 +0200471
inikepf896c1d2016-08-03 16:37:42 +0200472- For `Compressed_Block`, it requires to decode both `Compressed_Size`
473 and `Regenerated_Size` (the decompressed size). It will also decode the number of streams.
inikep01323752016-08-25 12:20:38 +0200474- For `Raw_Literals_Block` and `RLE_Literals_Block` it's enough to decode `Regenerated_Size`.
Yann Collet2fa99042016-07-01 20:55:28 +0200475
inikepe81f2cb2016-08-13 09:36:24 +0200476For values spanning several bytes, convention is little-endian.
Yann Collet2fa99042016-07-01 20:55:28 +0200477
inikep9d003c12016-08-04 10:41:49 +0200478__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200479
inikep01323752016-08-25 12:20:38 +0200480- Value x0 : `Regenerated_Size` uses 5 bits (0-31).
inikepe81f2cb2016-08-13 09:36:24 +0200481 `Literals_Section_Header` has 1 byte.
482 `Regenerated_Size = Header[0]>>3`
inikep01323752016-08-25 12:20:38 +0200483- Value 01 : `Regenerated_Size` uses 12 bits (0-4095).
inikepe81f2cb2016-08-13 09:36:24 +0200484 `Literals_Section_Header` has 2 bytes.
485 `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4)`
inikep01323752016-08-25 12:20:38 +0200486- Value 11 : `Regenerated_Size` uses 20 bits (0-1048575).
inikepe81f2cb2016-08-13 09:36:24 +0200487 `Literals_Section_Header` has 3 bytes.
488 `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + (Header[2]<<12)`
Yann Collet2fa99042016-07-01 20:55:28 +0200489
inikep01323752016-08-25 12:20:38 +0200490Note : it's allowed to represent a short value (for example `13`)
491using a long format, accepting the increased compressed data size.
Yann Collet2fa99042016-07-01 20:55:28 +0200492
inikep9d003c12016-08-04 10:41:49 +0200493__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200494
inikep01323752016-08-25 12:20:38 +0200495- Value 00 : _A single stream_.
inikepe81f2cb2016-08-13 09:36:24 +0200496 Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
497 `Literals_Section_Header` has 3 bytes.
inikep01323752016-08-25 12:20:38 +0200498- Value 01 : 4 streams.
inikepe81f2cb2016-08-13 09:36:24 +0200499 Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
500 `Literals_Section_Header` has 3 bytes.
inikep01323752016-08-25 12:20:38 +0200501- Value 10 : 4 streams.
inikepe81f2cb2016-08-13 09:36:24 +0200502 Both `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
503 `Literals_Section_Header` has 4 bytes.
inikep01323752016-08-25 12:20:38 +0200504- Value 11 : 4 streams.
inikepe81f2cb2016-08-13 09:36:24 +0200505 Both `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
506 `Literals_Section_Header` has 5 bytes.
Yann Collet2fa99042016-07-01 20:55:28 +0200507
inikepe81f2cb2016-08-13 09:36:24 +0200508Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
inikepf896c1d2016-08-03 16:37:42 +0200509
Yann Collet698cb632016-07-03 18:49:35 +0200510
inikep4f270ac2016-08-04 11:25:52 +0200511#### `Huffman_Tree_Description`
Yann Collet698cb632016-07-03 18:49:35 +0200512
inikepde9d1302016-08-25 14:59:08 +0200513This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
Yann Collet00d44ab2016-07-04 01:29:47 +0200514
Yann Collet9ca73362016-07-05 10:53:38 +0200515Prefix coding represents symbols from an a priori known alphabet
Yann Colletb21e9cb2016-07-15 17:31:13 +0200516by bit sequences (codewords), one codeword for each symbol,
Yann Collet9ca73362016-07-05 10:53:38 +0200517in a manner such that different symbols may be represented
518by bit sequences of different lengths,
519but a parser can always parse an encoded string
520unambiguously symbol-by-symbol.
Yann Collet00d44ab2016-07-04 01:29:47 +0200521
Yann Collet9ca73362016-07-05 10:53:38 +0200522Given an alphabet with known symbol frequencies,
523the Huffman algorithm allows the construction of an optimal prefix code
524using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet00d44ab2016-07-04 01:29:47 +0200525
Yann Collet9ca73362016-07-05 10:53:38 +0200526Prefix code must not exceed a maximum code length.
Yann Collet00d44ab2016-07-04 01:29:47 +0200527More bits improve accuracy but cost more header size,
Yann Collete557fd52016-07-17 16:21:37 +0200528and require more memory or more complex decoding operations.
529This specification limits maximum code length to 11 bits.
Yann Collet00d44ab2016-07-04 01:29:47 +0200530
Yann Collet698cb632016-07-03 18:49:35 +0200531
532##### Representation
533
Yann Collet00d44ab2016-07-04 01:29:47 +0200534All literal values from zero (included) to last present one (excluded)
inikepde9d1302016-08-25 14:59:08 +0200535are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
536Transformation from `Weight` to `Number_of_Bits` follows this formula :
537```
538Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
539```
inikepe81f2cb2016-08-13 09:36:24 +0200540The last symbol's `Weight` is deduced from previously decoded ones,
Yann Collet698cb632016-07-03 18:49:35 +0200541by completing to the nearest power of 2.
inikepe81f2cb2016-08-13 09:36:24 +0200542This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
Yann Collet698cb632016-07-03 18:49:35 +0200543
544__Example__ :
inikep586a0552016-08-03 16:16:38 +0200545Let's presume the following Huffman tree must be described :
Yann Collet698cb632016-07-03 18:49:35 +0200546
inikepe81f2cb2016-08-13 09:36:24 +0200547| literal | 0 | 1 | 2 | 3 | 4 | 5 |
548| ---------------- | --- | --- | --- | --- | --- | --- |
549| `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 |
Yann Collet698cb632016-07-03 18:49:35 +0200550
551The tree depth is 4, since its smallest element uses 4 bits.
552Value `5` will not be listed, nor will values above `5`.
inikepe81f2cb2016-08-13 09:36:24 +0200553Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Yann Collet855766d2016-09-02 17:04:49 -0700554Weight formula is :
inikepde9d1302016-08-25 14:59:08 +0200555```
556Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
557```
Yann Collet698cb632016-07-03 18:49:35 +0200558It gives the following serie of weights :
559
inikepe81f2cb2016-08-13 09:36:24 +0200560| `Weight` | 4 | 3 | 2 | 0 | 1 |
561| -------- | --- | --- | --- | --- | --- |
562| literal | 0 | 1 | 2 | 3 | 4 |
Yann Collet698cb632016-07-03 18:49:35 +0200563
564The decoder will do the inverse operation :
Yann Colletd916c902016-07-04 00:42:58 +0200565having collected weights of literals from `0` to `4`,
566it knows the last literal, `5`, is present with a non-zero weight.
Yann Colletcd25a912016-07-05 11:50:37 +0200567The weight of `5` can be deducted by joining to the nearest power of 2.
inikepe81f2cb2016-08-13 09:36:24 +0200568Sum of `2^(Weight-1)` (excluding 0) is :
569`8 + 4 + 2 + 0 + 1 = 15`.
Yann Collet698cb632016-07-03 18:49:35 +0200570Nearest power of 2 is 16.
inikepe81f2cb2016-08-13 09:36:24 +0200571Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 1`.
Yann Collet698cb632016-07-03 18:49:35 +0200572
573##### Huffman Tree header
574
Yann Collet9ca73362016-07-05 10:53:38 +0200575This is a single byte value (0-255),
576which tells how to decode the list of weights.
Yann Collet698cb632016-07-03 18:49:35 +0200577
inikepe81f2cb2016-08-13 09:36:24 +0200578- if `headerByte` >= 128 : this is a direct representation,
579 where each `Weight` is written directly as a 4 bits field (0-15).
580 The full representation occupies `((Number_of_Symbols+1)/2)` bytes,
581 meaning it uses a last full byte even if `Number_of_Symbols` is odd.
582 `Number_of_Symbols = headerByte - 127`.
583 Note that maximum `Number_of_Symbols` is 255-127 = 128.
Yann Collet26f68142016-07-08 10:42:59 +0200584 A larger serie must necessarily use FSE compression.
Yann Collet698cb632016-07-03 18:49:35 +0200585
inikepe81f2cb2016-08-13 09:36:24 +0200586- if `headerByte` < 128 :
Yann Collet698cb632016-07-03 18:49:35 +0200587 the serie of weights is compressed by FSE.
inikepde9d1302016-08-25 14:59:08 +0200588 The length of the FSE-compressed serie is equal to `headerByte` (0-127).
Yann Collet698cb632016-07-03 18:49:35 +0200589
inikepde9d1302016-08-25 14:59:08 +0200590##### Finite State Entropy (FSE) compression of Huffman weights
Yann Collet698cb632016-07-03 18:49:35 +0200591
Yann Collet26f68142016-07-08 10:42:59 +0200592The serie of weights is compressed using FSE compression.
Yann Collet698cb632016-07-03 18:49:35 +0200593It's a single bitstream with 2 interleaved states,
Yann Collet26f68142016-07-08 10:42:59 +0200594sharing a single distribution table.
Yann Collet698cb632016-07-03 18:49:35 +0200595
596To decode an FSE bitstream, it is necessary to know its compressed size.
597Compressed size is provided by `headerByte`.
Yann Collet38b75dd2016-07-24 15:35:59 +0200598It's also necessary to know its _maximum possible_ decompressed size,
Yann Collet26f68142016-07-08 10:42:59 +0200599which is `255`, since literal values span from `0` to `255`,
Yann Colletcd25a912016-07-05 11:50:37 +0200600and last symbol value is not represented.
Yann Collet698cb632016-07-03 18:49:35 +0200601
602An FSE bitstream starts by a header, describing probabilities distribution.
Yann Collet00d44ab2016-07-04 01:29:47 +0200603It will create a Decoding Table.
Yann Collet26f68142016-07-08 10:42:59 +0200604Table must be pre-allocated, which requires to support a maximum accuracy.
inikep586a0552016-08-03 16:16:38 +0200605For a list of Huffman weights, maximum accuracy is 7 bits.
Yann Collet698cb632016-07-03 18:49:35 +0200606
Yann Collet26f68142016-07-08 10:42:59 +0200607FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format),
608and so is [FSE bitstream](#bitstream).
609The main difference is that Huffman header compression uses 2 states,
610which share the same FSE distribution table.
Yann Collet38b75dd2016-07-24 15:35:59 +0200611Bitstream contains only FSE symbols (no interleaved "raw bitfields").
Yann Collet26f68142016-07-08 10:42:59 +0200612The number of symbols to decode is discovered
613by tracking bitStream overflow condition.
614When both states have overflowed the bitstream, end is reached.
615
Yann Collet698cb632016-07-03 18:49:35 +0200616
inikep586a0552016-08-03 16:16:38 +0200617##### Conversion from weights to Huffman prefix codes
Yann Collet698cb632016-07-03 18:49:35 +0200618
inikepe81f2cb2016-08-13 09:36:24 +0200619All present symbols shall now have a `Weight` value.
inikepde9d1302016-08-25 14:59:08 +0200620It is possible to transform weights into Number_of_Bits, using this formula:
621```
622Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0
623```
inikepe81f2cb2016-08-13 09:36:24 +0200624Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order.
625Symbols with a `Weight` of zero are removed.
Yann Collet38b75dd2016-07-24 15:35:59 +0200626Then, starting from lowest weight, prefix codes are distributed in order.
Yann Colletd916c902016-07-04 00:42:58 +0200627
628__Example__ :
Yann Colletb21e9cb2016-07-15 17:31:13 +0200629Let's presume the following list of weights has been decoded :
Yann Colletd916c902016-07-04 00:42:58 +0200630
inikepe81f2cb2016-08-13 09:36:24 +0200631| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
632| -------- | --- | --- | --- | --- | --- | --- |
633| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |
Yann Colletd916c902016-07-04 00:42:58 +0200634
635Sorted by weight and then natural order,
636it gives the following distribution :
637
inikepe81f2cb2016-08-13 09:36:24 +0200638| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
639| ---------------- | --- | --- | --- | --- | --- | ---- |
640| `Weight` | 0 | 1 | 1 | 2 | 3 | 4 |
641| `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 |
642| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
Yann Colletd916c902016-07-04 00:42:58 +0200643
644
inikepde9d1302016-08-25 14:59:08 +0200645#### The content of Huffman-compressed literal stream
Yann Colletd916c902016-07-04 00:42:58 +0200646
647##### Bitstreams sizes
648
649As seen in a previous paragraph,
inikepde9d1302016-08-25 14:59:08 +0200650there are 2 types of Huffman-compressed literals :
651a single stream and 4 streams.
Yann Colletd916c902016-07-04 00:42:58 +0200652
inikepde9d1302016-08-25 14:59:08 +0200653Encoding using 4 streams is useful for CPU with multiple execution units and out-of-order operations.
Yann Colletd916c902016-07-04 00:42:58 +0200654Since each stream can be decoded independently,
655it's possible to decode them up to 4x faster than a single stream,
656presuming the CPU has enough parallelism available.
657
658For single stream, header provides both the compressed and regenerated size.
inikepde9d1302016-08-25 14:59:08 +0200659For 4 streams though,
Yann Colletd916c902016-07-04 00:42:58 +0200660header only provides compressed and regenerated size of all 4 streams combined.
Yann Colletd916c902016-07-04 00:42:58 +0200661In order to properly decode the 4 streams,
662it's necessary to know the compressed and regenerated size of each stream.
663
Yann Collet38b75dd2016-07-24 15:35:59 +0200664Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
665except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
Yann Colletd916c902016-07-04 00:42:58 +0200666
Yann Collet38b75dd2016-07-24 15:35:59 +0200667Compressed size is provided explicitly : in the 4-streams variant,
inikepe81f2cb2016-08-13 09:36:24 +0200668bitstreams are preceded by 3 unsigned little-endian 16-bits values.
Yann Collet00d44ab2016-07-04 01:29:47 +0200669Each value represents the compressed size of one stream, in order.
Yann Colletd916c902016-07-04 00:42:58 +0200670The last stream size is deducted from total compressed size
Yann Collet38b75dd2016-07-24 15:35:59 +0200671and from previously decoded stream sizes :
inikepde9d1302016-08-25 14:59:08 +0200672
inikepe81f2cb2016-08-13 09:36:24 +0200673`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
Yann Colletd916c902016-07-04 00:42:58 +0200674
inikepde9d1302016-08-25 14:59:08 +0200675
Yann Collet00d44ab2016-07-04 01:29:47 +0200676##### Bitstreams read and decode
Yann Colletd916c902016-07-04 00:42:58 +0200677
678Each bitstream must be read _backward_,
679that is starting from the end down to the beginning.
680Therefore it's necessary to know the size of each bitstream.
681
682It's also necessary to know exactly which _bit_ is the latest.
683This is detected by a final bit flag :
684the highest bit of latest byte is a final-bit-flag.
685Consequently, a last byte of `0` is not possible.
686And the final-bit-flag itself is not part of the useful bitstream.
Yann Collet38b75dd2016-07-24 15:35:59 +0200687Hence, the last byte contains between 0 and 7 useful bits.
Yann Colletd916c902016-07-04 00:42:58 +0200688
689Starting from the end,
690it's possible to read the bitstream in a little-endian fashion,
691keeping track of already used bits.
692
inikepe81f2cb2016-08-13 09:36:24 +0200693Reading the last `Max_Number_of_Bits` bits,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200694it's then possible to compare extracted value to decoding table,
Yann Colletd916c902016-07-04 00:42:58 +0200695determining the symbol to decode and number of bits to discard.
696
697The process continues up to reading the required number of symbols per stream.
698If a bitstream is not entirely and exactly consumed,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200699hence reaching exactly its beginning position with _all_ bits consumed,
Yann Colletd916c902016-07-04 00:42:58 +0200700the decoding process is considered faulty.
701
Yann Collet698cb632016-07-03 18:49:35 +0200702
inikep4f270ac2016-08-04 11:25:52 +0200703### `Sequences_Section`
Yann Collet00d44ab2016-07-04 01:29:47 +0200704
705A compressed block is a succession of _sequences_ .
706A sequence is a literal copy command, followed by a match copy command.
707A literal copy command specifies a length.
708It is the number of bytes to be copied (or extracted) from the literal section.
709A match copy command specifies an offset and a length.
710The offset gives the position to copy from,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200711which can be within a previous block.
Yann Collet00d44ab2016-07-04 01:29:47 +0200712
Yann Collet70c23262016-08-21 00:24:18 +0200713When all _sequences_ are decoded,
714if there is any literal left in the _literal section_,
715these bytes are added at the end of the block.
716
inikepde9d1302016-08-25 14:59:08 +0200717The `Sequences_Section` regroup all symbols required to decode commands.
Yann Collet70c23262016-08-21 00:24:18 +0200718There are 3 symbol types : literals lengths, offsets and match lengths.
719They are encoded together, interleaved, in a single _bitstream_.
Yann Collet00d44ab2016-07-04 01:29:47 +0200720
inikepde9d1302016-08-25 14:59:08 +0200721The `Sequences_Section` starts by a header,
722followed by optional probability tables for each symbol type,
Yann Collet00d44ab2016-07-04 01:29:47 +0200723followed by the bitstream.
724
inikepe81f2cb2016-08-13 09:36:24 +0200725| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
726| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
Yann Colletc40ba712016-07-08 15:39:02 +0200727
inikepde9d1302016-08-25 14:59:08 +0200728To decode the `Sequences_Section`, it's required to know its size.
Yann Colletcd25a912016-07-05 11:50:37 +0200729This size is deducted from `blockSize - literalSectionSize`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200730
731
inikepe81f2cb2016-08-13 09:36:24 +0200732#### `Sequences_Section_Header`
Yann Collet00d44ab2016-07-04 01:29:47 +0200733
Yann Collet23f05cc2016-07-04 16:13:11 +0200734Consists in 2 items :
inikepe81f2cb2016-08-13 09:36:24 +0200735- `Number_of_Sequences`
736- Symbol compression modes
Yann Collet23f05cc2016-07-04 16:13:11 +0200737
inikepe81f2cb2016-08-13 09:36:24 +0200738__`Number_of_Sequences`__
Yann Collet23f05cc2016-07-04 16:13:11 +0200739
inikepe81f2cb2016-08-13 09:36:24 +0200740This is a variable size field using between 1 and 3 bytes.
Yann Collet23f05cc2016-07-04 16:13:11 +0200741Let's call its first byte `byte0`.
742- `if (byte0 == 0)` : there are no sequences.
743 The sequence section stops there.
744 Regenerated content is defined entirely by literals section.
inikepe81f2cb2016-08-13 09:36:24 +0200745- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
746- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes.
747- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes.
Yann Collet23f05cc2016-07-04 16:13:11 +0200748
inikepe81f2cb2016-08-13 09:36:24 +0200749__Symbol compression modes__
Yann Collet23f05cc2016-07-04 16:13:11 +0200750
751This is a single byte, defining the compression mode of each symbol type.
752
inikepe81f2cb2016-08-13 09:36:24 +0200753|Bit number| 7-6 | 5-4 | 3-2 | 1-0 |
754| -------- | ----------------------- | -------------- | -------------------- | ---------- |
755|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |
Yann Collet23f05cc2016-07-04 16:13:11 +0200756
757The last field, `Reserved`, must be all-zeroes.
758
inikepde9d1302016-08-25 14:59:08 +0200759`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
760literals lengths, offsets, and match lengths respectively.
Yann Collet23f05cc2016-07-04 16:13:11 +0200761
762They follow the same enumeration :
763
inikepe81f2cb2016-08-13 09:36:24 +0200764| Value | 0 | 1 | 2 | 3 |
765| ------------------ | ----------------- | ---------- | --------------------- | ------------- |
766| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` |
Yann Collet23f05cc2016-07-04 16:13:11 +0200767
inikepe81f2cb2016-08-13 09:36:24 +0200768- `Predefined_Mode` : uses a predefined distribution table.
769- `RLE_Mode` : it's a single code, repeated `Number_of_Sequences` times.
770- `Repeat_Mode` : re-use distribution table from previous compressed block.
771- `FSE_Compressed_Mode` : standard FSE compression.
Yann Colletcd25a912016-07-05 11:50:37 +0200772 A distribution table will be present.
773 It will be described in [next part](#distribution-tables).
Yann Collet23f05cc2016-07-04 16:13:11 +0200774
inikepde9d1302016-08-25 14:59:08 +0200775#### The codes for literals lengths, match lengths, and offsets.
Yann Collet23f05cc2016-07-04 16:13:11 +0200776
inikepde9d1302016-08-25 14:59:08 +0200777Each symbol is a _code_ in its own context,
778which specifies `Baseline` and `Number_of_Bits` to add.
779_Codes_ are FSE compressed,
780and interleaved with raw additional bits in the same bitstream.
781
Yann Collet855766d2016-09-02 17:04:49 -0700782##### Literals length codes
Yann Collet23f05cc2016-07-04 16:13:11 +0200783
inikepe81f2cb2016-08-13 09:36:24 +0200784Literals length codes are values ranging from `0` to `35` included.
Yann Collet23f05cc2016-07-04 16:13:11 +0200785They define lengths from 0 to 131071 bytes.
786
inikepe81f2cb2016-08-13 09:36:24 +0200787| `Literals_Length_Code` | 0-15 |
788| ---------------------- | ---------------------- |
789| length | `Literals_Length_Code` |
790| `Number_of_Bits` | 0 |
Yann Colletcd25a912016-07-05 11:50:37 +0200791
inikepe81f2cb2016-08-13 09:36:24 +0200792| `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
793| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
inikepde9d1302016-08-25 14:59:08 +0200794| `Baseline` | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
inikepe81f2cb2016-08-13 09:36:24 +0200795| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200796
inikepe81f2cb2016-08-13 09:36:24 +0200797| `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
798| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
inikepde9d1302016-08-25 14:59:08 +0200799| `Baseline` | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
inikepe81f2cb2016-08-13 09:36:24 +0200800| `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200801
inikepe81f2cb2016-08-13 09:36:24 +0200802| `Literals_Length_Code` | 32 | 33 | 34 | 35 |
803| ---------------------- | ---- | ---- | ---- | ---- |
inikepde9d1302016-08-25 14:59:08 +0200804| `Baseline` | 8192 |16384 |32768 |65536 |
inikepe81f2cb2016-08-13 09:36:24 +0200805| `Number_of_Bits` | 13 | 14 | 15 | 16 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200806
inikepde9d1302016-08-25 14:59:08 +0200807##### Default distribution for literals length codes
Yann Collet23f05cc2016-07-04 16:13:11 +0200808
inikepe81f2cb2016-08-13 09:36:24 +0200809When `Compression_Mode` is `Predefined_Mode`,
810a predefined distribution is used for FSE compression.
Yann Collet23f05cc2016-07-04 16:13:11 +0200811
Yann Colletc40ba712016-07-08 15:39:02 +0200812Below is its definition. It uses an accuracy of 6 bits (64 states).
Yann Collet23f05cc2016-07-04 16:13:11 +0200813```
inikepe81f2cb2016-08-13 09:36:24 +0200814short literalsLength_defaultDistribution[36] =
Yann Collet23f05cc2016-07-04 16:13:11 +0200815 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
816 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
817 -1,-1,-1,-1 };
818```
819
inikepde9d1302016-08-25 14:59:08 +0200820##### Match length codes
Yann Collet23f05cc2016-07-04 16:13:11 +0200821
inikepe81f2cb2016-08-13 09:36:24 +0200822Match length codes are values ranging from `0` to `52` included.
Yann Collet23f05cc2016-07-04 16:13:11 +0200823They define lengths from 3 to 131074 bytes.
824
inikepe81f2cb2016-08-13 09:36:24 +0200825| `Match_Length_Code` | 0-31 |
826| ------------------- | ----------------------- |
827| value | `Match_Length_Code` + 3 |
828| `Number_of_Bits` | 0 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200829
inikepe81f2cb2016-08-13 09:36:24 +0200830| `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
831| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
inikepde9d1302016-08-25 14:59:08 +0200832| `Baseline` | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
inikepe81f2cb2016-08-13 09:36:24 +0200833| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200834
inikepe81f2cb2016-08-13 09:36:24 +0200835| `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
836| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
inikepde9d1302016-08-25 14:59:08 +0200837| `Baseline` | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
inikepe81f2cb2016-08-13 09:36:24 +0200838| `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200839
inikepe81f2cb2016-08-13 09:36:24 +0200840| `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 |
841| ------------------- | ---- | ---- | ---- | ---- | ---- |
inikepde9d1302016-08-25 14:59:08 +0200842| `Baseline` | 4098 | 8194 |16486 |32770 |65538 |
inikepe81f2cb2016-08-13 09:36:24 +0200843| `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200844
inikepde9d1302016-08-25 14:59:08 +0200845##### Default distribution for match length codes
Yann Collet23f05cc2016-07-04 16:13:11 +0200846
inikepe81f2cb2016-08-13 09:36:24 +0200847When `Compression_Mode` is defined as `Predefined_Mode`,
848a predefined distribution is used for FSE compression.
Yann Collet23f05cc2016-07-04 16:13:11 +0200849
inikepde9d1302016-08-25 14:59:08 +0200850Below is its definition. It uses an accuracy of 6 bits (64 states).
Yann Collet23f05cc2016-07-04 16:13:11 +0200851```
852short matchLengths_defaultDistribution[53] =
853 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
854 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
855 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
856 -1,-1,-1,-1,-1 };
857```
858
859##### Offset codes
860
inikepe81f2cb2016-08-13 09:36:24 +0200861Offset codes are values ranging from `0` to `N`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200862
Yann Colletcd25a912016-07-05 11:50:37 +0200863A decoder is free to limit its maximum `N` supported.
864Recommendation is to support at least up to `22`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200865For information, at the time of this writing.
866the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
867
inikepe81f2cb2016-08-13 09:36:24 +0200868An offset code is also the number of additional bits to read,
inikepde9d1302016-08-25 14:59:08 +0200869and can be translated into an `Offset_Value` using the following formulas :
Yann Collet23f05cc2016-07-04 16:13:11 +0200870
871```
inikepe81f2cb2016-08-13 09:36:24 +0200872Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
873if (Offset_Value > 3) offset = Offset_Value - 3;
Yann Collet23f05cc2016-07-04 16:13:11 +0200874```
inikepde9d1302016-08-25 14:59:08 +0200875It means that maximum `Offset_Value` is `2^(N+1))-1` and it supports back-reference distance up to `2^(N+1))-4`
inikepe81f2cb2016-08-13 09:36:24 +0200876but is limited by [maximum back-reference distance](#window_descriptor).
Yann Collet23f05cc2016-07-04 16:13:11 +0200877
inikepde9d1302016-08-25 14:59:08 +0200878`Offset_Value` from 1 to 3 are special : they define "repeat codes",
Yann Collet23f05cc2016-07-04 16:13:11 +0200879which means one of the previous offsets will be repeated.
880They are sorted in recency order, with 1 meaning the most recent one.
Yann Colletcd25a912016-07-05 11:50:37 +0200881See [Repeat offsets](#repeat-offsets) paragraph.
Yann Collet23f05cc2016-07-04 16:13:11 +0200882
inikepde9d1302016-08-25 14:59:08 +0200883
884##### Default distribution for offset codes
Yann Collet23f05cc2016-07-04 16:13:11 +0200885
inikepe81f2cb2016-08-13 09:36:24 +0200886When `Compression_Mode` is defined as `Predefined_Mode`,
887a predefined distribution is used for FSE compression.
Yann Collet23f05cc2016-07-04 16:13:11 +0200888
inikepde9d1302016-08-25 14:59:08 +0200889Below is its definition. It uses an accuracy of 5 bits (32 states),
Yann Colletcd25a912016-07-05 11:50:37 +0200890and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
Yann Collet23f05cc2016-07-04 16:13:11 +0200891
892If any sequence in the compressed block requires an offset larger than this,
893it's not possible to use the default distribution to represent it.
894
895```
896short offsetCodes_defaultDistribution[53] =
897 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
898 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
899```
900
901#### Distribution tables
902
903Following the header, up to 3 distribution tables can be described.
Yann Collet10b9c132016-07-24 01:21:53 +0200904When present, they are in this order :
inikepe81f2cb2016-08-13 09:36:24 +0200905- Literals lengths
Yann Collet23f05cc2016-07-04 16:13:11 +0200906- Offsets
inikepe81f2cb2016-08-13 09:36:24 +0200907- Match Lengths
Yann Collet23f05cc2016-07-04 16:13:11 +0200908
Yann Collet10b9c132016-07-24 01:21:53 +0200909The content to decode depends on their respective encoding mode :
inikepe81f2cb2016-08-13 09:36:24 +0200910- `Predefined_Mode` : no content. Use predefined distribution table.
911- `RLE_Mode` : 1 byte. This is the only code to use across the whole compressed block.
912- `FSE_Compressed_Mode` : A distribution table is present.
913- `Repeat_Mode` : no content. Re-use distribution from previous compressed block.
Yann Collet23f05cc2016-07-04 16:13:11 +0200914
915##### FSE distribution table : condensed format
916
917An FSE distribution table describes the probabilities of all symbols
918from `0` to the last present one (included)
inikepe81f2cb2016-08-13 09:36:24 +0200919on a normalized scale of `1 << Accuracy_Log` .
Yann Collet23f05cc2016-07-04 16:13:11 +0200920
921It's a bitstream which is read forward, in little-endian fashion.
922It's not necessary to know its exact size,
923since it will be discovered and reported by the decoding process.
924
925The bitstream starts by reporting on which scale it operates.
inikepe81f2cb2016-08-13 09:36:24 +0200926`Accuracy_Log = low4bits + 5`.
Yann Collet70c23262016-08-21 00:24:18 +0200927Note that maximum `Accuracy_Log` for literal and match lengths is `9`,
928and for offsets is `8`. Higher values are considered errors.
Yann Collet23f05cc2016-07-04 16:13:11 +0200929
inikepde9d1302016-08-25 14:59:08 +0200930Then follows each symbol value, from `0` to last present one.
inikepe81f2cb2016-08-13 09:36:24 +0200931The number of bits used by each field is variable.
Yann Collet23f05cc2016-07-04 16:13:11 +0200932It depends on :
933
934- Remaining probabilities + 1 :
935 __example__ :
inikepe81f2cb2016-08-13 09:36:24 +0200936 Presuming an `Accuracy_Log` of 8,
Yann Collet23f05cc2016-07-04 16:13:11 +0200937 and presuming 100 probabilities points have already been distributed,
Yann Colletcd25a912016-07-05 11:50:37 +0200938 the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
Yann Collet23f05cc2016-07-04 16:13:11 +0200939 Therefore, it must read `log2sup(156) == 8` bits.
940
941- Value decoded : small values use 1 less bit :
942 __example__ :
943 Presuming values from 0 to 156 (included) are possible,
944 255-156 = 99 values are remaining in an 8-bits field.
945 They are used this way :
946 first 99 values (hence from 0 to 98) use only 7 bits,
947 values from 99 to 156 use 8 bits.
948 This is achieved through this scheme :
949
inikepe81f2cb2016-08-13 09:36:24 +0200950 | Value read | Value decoded | Number of bits used |
951 | ---------- | ------------- | ------------------- |
952 | 0 - 98 | 0 - 98 | 7 |
953 | 99 - 127 | 99 - 127 | 8 |
954 | 128 - 226 | 0 - 98 | 7 |
955 | 227 - 255 | 128 - 156 | 8 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200956
957Symbols probabilities are read one by one, in order.
958
inikepde9d1302016-08-25 14:59:08 +0200959Probability is obtained from Value decoded by following formula :
inikepe81f2cb2016-08-13 09:36:24 +0200960`Proba = value - 1`
Yann Collet23f05cc2016-07-04 16:13:11 +0200961
962It means value `0` becomes negative probability `-1`.
inikepde9d1302016-08-25 14:59:08 +0200963`-1` is a special probability, which means "less than 1".
Yann Colletc40ba712016-07-08 15:39:02 +0200964Its effect on distribution table is described in [next paragraph].
Yann Collet23f05cc2016-07-04 16:13:11 +0200965For the purpose of calculating cumulated distribution, it counts as one.
966
Yann Colletc40ba712016-07-08 15:39:02 +0200967[next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables
968
Yann Collet23f05cc2016-07-04 16:13:11 +0200969When a symbol has a probability of `zero`,
970it is followed by a 2-bits repeat flag.
971This repeat flag tells how many probabilities of zeroes follow the current one.
972It provides a number ranging from 0 to 3.
973If it is a 3, another 2-bits repeat flag follows, and so on.
974
inikepe81f2cb2016-08-13 09:36:24 +0200975When last symbol reaches cumulated total of `1 << Accuracy_Log`,
Yann Collet23f05cc2016-07-04 16:13:11 +0200976decoding is complete.
inikepe81f2cb2016-08-13 09:36:24 +0200977If the last symbol makes cumulated total go above `1 << Accuracy_Log`,
Yann Collet23f05cc2016-07-04 16:13:11 +0200978distribution is considered corrupted.
979
Yann Collet10b9c132016-07-24 01:21:53 +0200980Then the decoder can tell how many bytes were used in this process,
981and how many symbols are present.
982The bitstream consumes a round number of bytes.
983Any remaining bit within the last byte is just unused.
984
Yann Collet23f05cc2016-07-04 16:13:11 +0200985##### FSE decoding : from normalized distribution to decoding tables
986
Yann Collet9ca73362016-07-05 10:53:38 +0200987The distribution of normalized probabilities is enough
988to create a unique decoding table.
989
990It follows the following build rule :
991
inikepe81f2cb2016-08-13 09:36:24 +0200992The table has a size of `tableSize = 1 << Accuracy_Log`.
Yann Collet9ca73362016-07-05 10:53:38 +0200993Each cell describes the symbol decoded,
994and instructions to get the next state.
995
inikep12731a92016-08-25 15:19:37 +0200996Symbols are scanned in their natural order for "less than 1" probabilities.
Yann Collet9ca73362016-07-05 10:53:38 +0200997Symbols with this probability are being attributed a single cell,
998starting from the end of the table.
inikepe81f2cb2016-08-13 09:36:24 +0200999These symbols define a full state reset, reading `Accuracy_Log` bits.
Yann Collet9ca73362016-07-05 10:53:38 +02001000
1001All remaining symbols are sorted in their natural order.
1002Starting from symbol `0` and table position `0`,
1003each symbol gets attributed as many cells as its probability.
1004Cell allocation is spreaded, not linear :
1005each successor position follow this rule :
1006
Yann Colletcd25a912016-07-05 11:50:37 +02001007```
1008position += (tableSize>>1) + (tableSize>>3) + 3;
1009position &= tableSize-1;
1010```
Yann Collet9ca73362016-07-05 10:53:38 +02001011
1012A position is skipped if already occupied,
1013typically by a "less than 1" probability symbol.
1014
1015The result is a list of state values.
1016Each state will decode the current symbol.
1017
inikepde9d1302016-08-25 14:59:08 +02001018To get the `Number_of_Bits` and `Baseline` required for next state,
Yann Collet9ca73362016-07-05 10:53:38 +02001019it's first necessary to sort all states in their natural order.
Yann Colletcd25a912016-07-05 11:50:37 +02001020The lower states will need 1 more bit than higher ones.
Yann Collet9ca73362016-07-05 10:53:38 +02001021
1022__Example__ :
1023Presuming a symbol has a probability of 5.
Yann Colletcd25a912016-07-05 11:50:37 +02001024It receives 5 state values. States are sorted in natural order.
Yann Collet9ca73362016-07-05 10:53:38 +02001025
1026Next power of 2 is 8.
1027Space of probabilities is divided into 8 equal parts.
inikepe81f2cb2016-08-13 09:36:24 +02001028Presuming the `Accuracy_Log` is 7, it defines 128 states.
Yann Collet9ca73362016-07-05 10:53:38 +02001029Divided by 8, each share is 16 large.
1030
1031In order to reach 8, 8-5=3 lowest states will count "double",
1032taking shares twice larger,
1033requiring one more bit in the process.
1034
1035Numbering starts from higher states using less bits.
1036
inikepe81f2cb2016-08-13 09:36:24 +02001037| state order | 0 | 1 | 2 | 3 | 4 |
1038| ---------------- | ----- | ----- | ------ | ---- | ----- |
1039| width | 32 | 32 | 32 | 16 | 16 |
1040| `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 |
1041| range number | 2 | 4 | 6 | 0 | 1 |
inikepde9d1302016-08-25 14:59:08 +02001042| `Baseline` | 32 | 64 | 96 | 0 | 16 |
inikepe81f2cb2016-08-13 09:36:24 +02001043| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
Yann Collet9ca73362016-07-05 10:53:38 +02001044
1045Next state is determined from current state
inikepde9d1302016-08-25 14:59:08 +02001046by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
Yann Collet23f05cc2016-07-04 16:13:11 +02001047
1048
1049#### Bitstream
Yann Collet00d44ab2016-07-04 01:29:47 +02001050
Yann Collet9ca73362016-07-05 10:53:38 +02001051All sequences are stored in a single bitstream, read _backward_.
1052It is therefore necessary to know the bitstream size,
1053which is deducted from compressed block size.
1054
Yann Colletc40ba712016-07-08 15:39:02 +02001055The last useful bit of the stream is followed by an end-bit-flag.
Yann Colletcd25a912016-07-05 11:50:37 +02001056Highest bit of last byte is this flag.
Yann Collet9ca73362016-07-05 10:53:38 +02001057It does not belong to the useful part of the bitstream.
1058Therefore, last byte has 0-7 useful bits.
1059Note that it also means that last byte cannot be `0`.
1060
1061##### Starting states
1062
1063The bitstream starts with initial state values,
1064each using the required number of bits in their respective _accuracy_,
1065decoded previously from their normalized distribution.
1066
inikepe81f2cb2016-08-13 09:36:24 +02001067It starts by `Literals_Length_State`,
1068followed by `Offset_State`,
1069and finally `Match_Length_State`.
Yann Collet9ca73362016-07-05 10:53:38 +02001070
1071Reminder : always keep in mind that all values are read _backward_.
1072
1073##### Decoding a sequence
1074
1075A state gives a code.
inikepde9d1302016-08-25 14:59:08 +02001076A code provides `Baseline` and `Number_of_Bits` to add.
Yann Collet9ca73362016-07-05 10:53:38 +02001077See [Symbol Decoding] section for details on each symbol.
1078
inikepde9d1302016-08-25 14:59:08 +02001079Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
1080It then does the same for `Match_Length`,
1081and then for `Literals_Length`.
Yann Collet9ca73362016-07-05 10:53:38 +02001082
inikepde9d1302016-08-25 14:59:08 +02001083`Offset`, `Match_Length`, and `Literals_Length` define a sequence.
1084It starts by inserting the number of literals defined by `Literals_Length`,
1085then continue by copying `Match_Length` bytes from `currentPos - Offset`.
Yann Collet9ca73362016-07-05 10:53:38 +02001086
1087The next operation is to update states.
1088Using rules pre-calculated in the decoding tables,
inikepe81f2cb2016-08-13 09:36:24 +02001089`Literals_Length_State` is updated,
1090followed by `Match_Length_State`,
1091and then `Offset_State`.
Yann Collet9ca73362016-07-05 10:53:38 +02001092
inikepe81f2cb2016-08-13 09:36:24 +02001093This operation will be repeated `Number_of_Sequences` times.
Yann Collet9ca73362016-07-05 10:53:38 +02001094At the end, the bitstream shall be entirely consumed,
1095otherwise bitstream is considered corrupted.
1096
inikepde9d1302016-08-25 14:59:08 +02001097[Symbol Decoding]:#the-codes-for-literals-lengths-match-lengths-and-offsets
Yann Collet9ca73362016-07-05 10:53:38 +02001098
1099##### Repeat offsets
1100
inikep12731a92016-08-25 15:19:37 +02001101As seen in [Offset Codes], the first 3 values define a repeated offset and we will call them `Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`.
1102They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one".
Yann Collet9ca73362016-07-05 10:53:38 +02001103
inikepe81f2cb2016-08-13 09:36:24 +02001104There is an exception though, when current sequence's literals length is `0`.
inikep12731a92016-08-25 15:19:37 +02001105In which case, repeated offsets are "pushed by one",
1106so `Repeated_Offset1` becomes `Repeated_Offset2`, `Repeated_Offset2` becomes `Repeated_Offset3`,
1107and `Repeated_Offset3` becomes `Repeated_Offset1 - 1_byte`.
Yann Collet9ca73362016-07-05 10:53:38 +02001108
Yann Collet917fe182016-07-31 04:01:57 +02001109On first block, offset history is populated by the following values : 1, 4 and 8 (in order).
Yann Collet9ca73362016-07-05 10:53:38 +02001110
1111Then each block receives its start value from previous compressed block.
1112Note that non-compressed blocks are skipped,
1113they do not contribute to offset history.
1114
1115[Offset Codes]: #offset-codes
1116
1117###### Offset updates rules
1118
Yann Collet917fe182016-07-31 04:01:57 +02001119New offset take the lead in offset history,
1120up to its previous place if it was already present.
Yann Collet9ca73362016-07-05 10:53:38 +02001121
inikep12731a92016-08-25 15:19:37 +02001122It means that when `Repeated_Offset1` (most recent) is used, history is unmodified.
1123When `Repeated_Offset2` is used, it's swapped with `Repeated_Offset1`.
Yann Collet00d44ab2016-07-04 01:29:47 +02001124
Yann Collet2fa99042016-07-01 20:55:28 +02001125
Yann Colletbd106072016-07-08 19:16:57 +02001126Dictionary format
1127-----------------
1128
Yann Collet855766d2016-09-02 17:04:49 -07001129`zstd` is compatible with "raw content" dictionaries, free of any format restriction.
Yann Colletbd106072016-07-08 19:16:57 +02001130But dictionaries created by `zstd --train` follow a format, described here.
1131
Yann Collet855766d2016-09-02 17:04:49 -07001132__Pre-requisites__ : a dictionary has a size,
Yann Colletbd106072016-07-08 19:16:57 +02001133 defined either by a buffer limit, or a file size.
1134
inikepe81f2cb2016-08-13 09:36:24 +02001135| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
1136| -------------- | --------------- | ---------------- | --------- |
Yann Colletbd106072016-07-08 19:16:57 +02001137
inikepe81f2cb2016-08-13 09:36:24 +02001138__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, little-endian format
Yann Colletbd106072016-07-08 19:16:57 +02001139
inikepe81f2cb2016-08-13 09:36:24 +02001140__`Dictionary_ID`__ : 4 bytes, stored in little-endian format.
1141 `Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
Yann Collet722e14b2016-07-08 19:22:16 +02001142 It's used by decoders to check if they use the correct dictionary.
inikepe81f2cb2016-08-13 09:36:24 +02001143
1144_Reserved ranges :_
Yann Colletf6ff53c2016-07-15 17:03:38 +02001145 If the frame is going to be distributed in a private environment,
inikepe81f2cb2016-08-13 09:36:24 +02001146 any `Dictionary_ID` can be used.
Yann Colletf6ff53c2016-07-15 17:03:38 +02001147 However, for public distribution of compressed frames,
inikepe81f2cb2016-08-13 09:36:24 +02001148 the following ranges are reserved for future use and should not be used :
Yann Collet6cacd342016-07-15 17:58:13 +02001149
Yann Collet70c23262016-08-21 00:24:18 +02001150 - low range : 1 - 32767
1151 - high range : >= (2^31)
Yann Colletbd106072016-07-08 19:16:57 +02001152
inikepe81f2cb2016-08-13 09:36:24 +02001153__`Entropy_Tables`__ : following the same format as a [compressed blocks].
Yann Collet855766d2016-09-02 17:04:49 -07001154 They are stored in following order :
1155 Huffman tables for literals, FSE table for offsets,
1156 FSE table for match lengths, and FSE table for literals lengths.
1157 It's finally followed by 3 offset values, populating recent offsets,
1158 stored in order, 4-bytes little-endian each, for a total of 12 bytes.
Yann Colletbd106072016-07-08 19:16:57 +02001159
Yann Collet855766d2016-09-02 17:04:49 -07001160__`Content`__ : The rest of the dictionary is its content.
1161 The content act as a "past" in front of data to compress or decompress.
Yann Colletbd106072016-07-08 19:16:57 +02001162
inikepf9c3cce2016-07-25 11:04:56 +02001163[compressed blocks]: #the-format-of-compressed_block
Yann Colletbd106072016-07-08 19:16:57 +02001164
Yann Collet2fa99042016-07-01 20:55:28 +02001165
1166Version changes
1167---------------
Yann Collet855766d2016-09-02 17:04:49 -07001168- 0.2.1 : clarify field names, by Przemyslaw Skibinski
Yann Collet6fa05a22016-07-20 14:58:49 +02001169- 0.2.0 : numerous format adjustments for zstd v0.8
inikep586a0552016-08-03 16:16:38 +02001170- 0.1.2 : limit Huffman tree depth to 11 bits
Yann Collete557fd52016-07-17 16:21:37 +02001171- 0.1.1 : reserved dictID ranges
1172- 0.1.0 : initial release