blob: 1c5908fa558c100c7a73698a651dabc791f29093 [file] [log] [blame] [view]
Yann Collet5cc18822016-07-03 19:03:13 +02001Zstandard Compression Format
2============================
Yann Collet2fa99042016-07-01 20:55:28 +02003
4### Notices
5
6Copyright (c) 2016 Yann Collet
7
8Permission is granted to copy and distribute this document
inikepf9c3cce2016-07-25 11:04:56 +02009for any purpose and without charge,
10including translations into other languages
Yann Collet2fa99042016-07-01 20:55:28 +020011and incorporation into compilations,
12provided that the copyright notice and this notice are preserved,
13and that any substantive changes or deletions from the original
14are clearly marked.
15Distribution of this document is unlimited.
16
17### Version
18
Yann Collet6fa05a22016-07-20 14:58:49 +0200190.2.0 (22/07/16)
Yann Collet2fa99042016-07-01 20:55:28 +020020
21
22Introduction
23------------
24
25The purpose of this document is to define a lossless compressed data format,
26that is independent of CPU type, operating system,
27file system and character set, suitable for
Yann Collet9ca73362016-07-05 10:53:38 +020028file compression, pipe and streaming compression,
Yann Collet2fa99042016-07-01 20:55:28 +020029using the [Zstandard algorithm](http://www.zstandard.org).
30
31The data can be produced or consumed,
32even for an arbitrarily long sequentially presented input data stream,
33using only an a priori bounded amount of intermediate storage,
34and hence can be used in data communications.
35The format uses the Zstandard compression method,
36and optional [xxHash-64 checksum method](http://www.xxhash.org),
37for detection of data corruption.
38
39The data format defined by this specification
40does not attempt to allow random access to compressed data.
41
42This specification is intended for use by implementers of software
43to compress data into Zstandard format and/or decompress data from Zstandard format.
44The text of the specification assumes a basic background in programming
45at the level of bits and other primitive data representations.
46
47Unless otherwise indicated below,
48a compliant compressor must produce data sets
49that conform to the specifications presented here.
50It doesn’t need to support all options though.
51
52A compliant decompressor must be able to decompress
53at least one working set of parameters
54that conforms to the specifications presented here.
55It may also ignore informative fields, such as checksum.
56Whenever it does not support a parameter defined in the compressed stream,
57it must produce a non-ambiguous error code and associated error message
58explaining which parameter is unsupported.
59
60
inikepf9c3cce2016-07-25 11:04:56 +020061Overall conventions
62-----------
63In this document square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
64
65
Yann Collet2fa99042016-07-01 20:55:28 +020066Definitions
67-----------
68A content compressed by Zstandard is transformed into a Zstandard __frame__.
69Multiple frames can be appended into a single file or stream.
70A frame is totally independent, has a defined beginning and end,
71and a set of parameters which tells the decoder how to decompress it.
72
73A frame encapsulates one or multiple __blocks__.
74Each block can be compressed or not,
75and has a guaranteed maximum content size, which depends on frame parameters.
76Unlike frames, each block depends on previous blocks for proper decoding.
77However, each block can be decompressed without waiting for its successor,
78allowing streaming operations.
79
80
inikepf9c3cce2016-07-25 11:04:56 +020081Frame Concatenation
82-------------------
Yann Collet2fa99042016-07-01 20:55:28 +020083
84In some circumstances, it may be required to append multiple frames,
85for example in order to add new data to an existing compressed file
86without re-framing it.
87
88In such case, each frame brings its own set of descriptor flags.
89Each frame is considered independent.
90The only relation between frames is their sequential order.
91
92The ability to decode multiple concatenated frames
93within a single stream or file is left outside of this specification.
94As an example, the reference `zstd` command line utility is able
95to decode all concatenated frames in their sequential order,
96delivering the final decompressed result as if it was a single content.
97
98
inikepf9c3cce2016-07-25 11:04:56 +020099General Structure of Zstandard Frame format
100-------------------------------------------
101The structure of a single Zstandard frame is following:
Yann Collet2fa99042016-07-01 20:55:28 +0200102
Yann Colletc991cc12016-07-28 00:55:43 +0200103| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] |
104|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:|
105| 4 bytes | 2-14 bytes | n bytes | | 0-4 bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200106
inikepf9c3cce2016-07-25 11:04:56 +0200107__`Magic_Number`__
108
inikep2fc37522016-07-25 12:47:02 +02001094 Bytes, Little-endian format.
inikepf9c3cce2016-07-25 11:04:56 +0200110Value : 0xFD2FB527
111
112__`Frame_Header`__
113
1142 to 14 Bytes, detailed in [next part](#the-structure-of-frame_header).
115
116__`Data_Block`__
117
118Detailed in [next chapter](#the-structure-of-data_block).
119That’s where compressed data is stored.
120
Yann Colletc991cc12016-07-28 00:55:43 +0200121__`Content_Checksum`__
inikepf9c3cce2016-07-25 11:04:56 +0200122
Yann Colletc991cc12016-07-28 00:55:43 +0200123An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
inikepf9c3cce2016-07-25 11:04:56 +0200124The content checksum is the result
125of [xxh64() hash function](https://www.xxHash.com)
126digesting the original (decoded) data as input, and a seed of zero.
Yann Colletc991cc12016-07-28 00:55:43 +0200127The low 4 bytes of the checksum are stored in little endian format.
inikepf9c3cce2016-07-25 11:04:56 +0200128
129
130The structure of `Frame_Header`
131-------------------------------
132The `Frame_Header` has a variable size, which uses a minimum of 2 bytes,
Yann Collet2fa99042016-07-01 20:55:28 +0200133and up to 14 bytes depending on optional parameters.
inikepf9c3cce2016-07-25 11:04:56 +0200134The structure of `Frame_Header` is following:
Yann Collet2fa99042016-07-01 20:55:28 +0200135
inikepf9c3cce2016-07-25 11:04:56 +0200136| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
137| ------------------------- | --------------------- | ----------------- | ---------------------- |
138| 1 byte | 0-1 byte | 0-4 bytes | 0-8 bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200139
inikepf9c3cce2016-07-25 11:04:56 +0200140### `Frame_Header_Descriptor`
141
142The first header's byte is called the `Frame_Header_Descriptor`.
Yann Collet2fa99042016-07-01 20:55:28 +0200143It tells which other fields are present.
inikepf9c3cce2016-07-25 11:04:56 +0200144Decoding this byte is enough to tell the size of `Frame_Header`.
Yann Collet2fa99042016-07-01 20:55:28 +0200145
inikepf9c3cce2016-07-25 11:04:56 +0200146| Bit number | Field name |
147| ---------- | ---------- |
148| 7-6 | `Frame_Content_Size_flag` |
149| 5 | `Single_Segment_flag` |
150| 4 | `Unused_bit` |
151| 3 | `Reserved_bit` |
152| 2 | `Content_Checksum_flag` |
153| 1-0 | `Dictionary_ID_flag` |
Yann Collet2fa99042016-07-01 20:55:28 +0200154
155In this table, bit 7 is highest bit, while bit 0 is lowest.
156
inikep49ec6d12016-07-25 12:26:39 +0200157__`Frame_Content_Size_flag`__
158
159This is a 2-bits flag (`= Frame_Header_Descriptor >> 6`),
160specifying if decompressed data size is provided within the header.
Yann Colletc991cc12016-07-28 00:55:43 +0200161The `Flag_Value` can be converted into `Field_Size`,
162which is the number of bytes used by `Frame_Content_Size`
163according to the following table:
inikep49ec6d12016-07-25 12:26:39 +0200164
Yann Colletc991cc12016-07-28 00:55:43 +0200165|`Flag_Value`| 0 | 1 | 2 | 3 |
inikep49ec6d12016-07-25 12:26:39 +0200166| ---------- | --- | --- | --- | --- |
167|`Field_Size`| 0-1 | 2 | 4 | 8 |
168
Yann Colletc991cc12016-07-28 00:55:43 +0200169When `Flag_Value` is `0`, `Field_Size` depends on `Single_Segment_flag` :
170if `Single_Segment_flag` is set, `Field_Size` is 1.
171Otherwise, `Field_Size` is 0 (content size not provided).
inikep49ec6d12016-07-25 12:26:39 +0200172
inikepf9c3cce2016-07-25 11:04:56 +0200173__`Single_Segment_flag`__
Yann Collet2fa99042016-07-01 20:55:28 +0200174
inikep49ec6d12016-07-25 12:26:39 +0200175If this flag is set,
Yann Colletc991cc12016-07-28 00:55:43 +0200176data must be regenerated within a single continuous memory segment.
Yann Collet2fa99042016-07-01 20:55:28 +0200177
Yann Colletc991cc12016-07-28 00:55:43 +0200178In this case, `Frame_Content_Size` is necessarily present,
179but `Window_Descriptor` byte is skipped.
inikep49ec6d12016-07-25 12:26:39 +0200180As a consequence, the decoder must allocate a memory segment
181of size equal or bigger than `Frame_Content_Size`.
Yann Collet2fa99042016-07-01 20:55:28 +0200182
183In order to preserve the decoder from unreasonable memory requirement,
Yann Collet23f05cc2016-07-04 16:13:11 +0200184a decoder can reject a compressed frame
Yann Collet2fa99042016-07-01 20:55:28 +0200185which requests a memory size beyond decoder's authorized range.
186
187For broader compatibility, decoders are recommended to support
Yann Collet23f05cc2016-07-04 16:13:11 +0200188memory sizes of at least 8 MB.
189This is just a recommendation,
Yann Collet9ca73362016-07-05 10:53:38 +0200190each decoder is free to support higher or lower limits,
Yann Collet2fa99042016-07-01 20:55:28 +0200191depending on local limitations.
192
inikepf9c3cce2016-07-25 11:04:56 +0200193__`Unused_bit`__
Yann Collet2fa99042016-07-01 20:55:28 +0200194
Yann Colletf0bc6732016-07-13 17:30:21 +0200195The value of this bit should be set to zero.
Yann Colletc991cc12016-07-28 00:55:43 +0200196A decoder compliant with this specification version shall not interpret it.
Yann Colletf0bc6732016-07-13 17:30:21 +0200197It might be used in a future version,
198to signal a property which is not mandatory to properly decode the frame.
Yann Collet2fa99042016-07-01 20:55:28 +0200199
inikepf9c3cce2016-07-25 11:04:56 +0200200__`Reserved_bit`__
Yann Collet2fa99042016-07-01 20:55:28 +0200201
202This bit is reserved for some future feature.
203Its value _must be zero_.
204A decoder compliant with this specification version must ensure it is not set.
205This bit may be used in a future revision,
Yann Colletc991cc12016-07-28 00:55:43 +0200206to signal a feature that must be interpreted to decode the frame correctly.
Yann Collet2fa99042016-07-01 20:55:28 +0200207
inikepf9c3cce2016-07-25 11:04:56 +0200208__`Content_Checksum_flag`__
Yann Collet2fa99042016-07-01 20:55:28 +0200209
Yann Colletc991cc12016-07-28 00:55:43 +0200210If this flag is set, a 32-bits `Content_Checksum` will be present at frame's end.
211See `Content_Checksum` paragraph.
Yann Collet2fa99042016-07-01 20:55:28 +0200212
inikepf9c3cce2016-07-25 11:04:56 +0200213__`Dictionary_ID_flag`__
Yann Collet2fa99042016-07-01 20:55:28 +0200214
215This is a 2-bits flag (`= FHD & 3`),
Yann Collet9ca73362016-07-05 10:53:38 +0200216telling if a dictionary ID is provided within the header.
217It also specifies the size of this field.
Yann Collet2fa99042016-07-01 20:55:28 +0200218
inikepf9c3cce2016-07-25 11:04:56 +0200219| Value | 0 | 1 | 2 | 3 |
220| -------- | --- | --- | --- | --- |
221|Field size| 0 | 1 | 2 | 4 |
Yann Collet2fa99042016-07-01 20:55:28 +0200222
inikepf9c3cce2016-07-25 11:04:56 +0200223### `Window_Descriptor`
Yann Collet2fa99042016-07-01 20:55:28 +0200224
225Provides guarantees on maximum back-reference distance
Yann Colletc991cc12016-07-28 00:55:43 +0200226that will be used within compressed data.
227This information is important for decoders to allocate enough memory.
Yann Collet2fa99042016-07-01 20:55:28 +0200228
Yann Colletc991cc12016-07-28 00:55:43 +0200229The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag` is set.
inikepf9c3cce2016-07-25 11:04:56 +0200230In this case, the maximum back-reference distance is the content size itself,
Yann Colletcd25a912016-07-05 11:50:37 +0200231which can be any value from 1 to 2^64-1 bytes (16 EB).
232
inikepf9c3cce2016-07-25 11:04:56 +0200233| Bit numbers | 7-3 | 0-2 |
234| ----------- | -------- | -------- |
235| Field name | Exponent | Mantissa |
Yann Collet2fa99042016-07-01 20:55:28 +0200236
237Maximum distance is given by the following formulae :
238```
239windowLog = 10 + Exponent;
240windowBase = 1 << windowLog;
241windowAdd = (windowBase / 8) * Mantissa;
242windowSize = windowBase + windowAdd;
243```
244The minimum window size is 1 KB.
Yann Colletcd25a912016-07-05 11:50:37 +0200245The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
Yann Collet2fa99042016-07-01 20:55:28 +0200246
247To properly decode compressed data,
248a decoder will need to allocate a buffer of at least `windowSize` bytes.
249
Yann Collet2fa99042016-07-01 20:55:28 +0200250In order to preserve decoder from unreasonable memory requirements,
251a decoder can refuse a compressed frame
252which requests a memory size beyond decoder's authorized range.
253
Yann Colletcd25a912016-07-05 11:50:37 +0200254For improved interoperability,
Yann Colletc991cc12016-07-28 00:55:43 +0200255decoders are recommended to be compatible with window sizes of 8 MB,
256and encoders are recommended to not request more than 8 MB.
Yann Collet2fa99042016-07-01 20:55:28 +0200257It's merely a recommendation though,
258decoders are free to support larger or lower limits,
259depending on local limitations.
260
inikepf9c3cce2016-07-25 11:04:56 +0200261### `Dictionary_ID`
Yann Collet23f05cc2016-07-04 16:13:11 +0200262
Yann Colletf6ff53c2016-07-15 17:03:38 +0200263This is a variable size field, which contains
264the ID of the dictionary required to properly decode the frame.
265Note that this field is optional. When it's not present,
Yann Collet23f05cc2016-07-04 16:13:11 +0200266it's up to the caller to make sure it uses the correct dictionary.
267
inikepf9c3cce2016-07-25 11:04:56 +0200268Field size depends on `Dictionary_ID_flag`.
Yann Collet23f05cc2016-07-04 16:13:11 +02002691 byte can represent an ID 0-255.
2702 bytes can represent an ID 0-65535.
Yann Colletcd25a912016-07-05 11:50:37 +02002714 bytes can represent an ID 0-4294967295.
Yann Collet23f05cc2016-07-04 16:13:11 +0200272
273It's allowed to represent a small ID (for example `13`)
Yann Colletcd25a912016-07-05 11:50:37 +0200274with a large 4-bytes dictionary ID, losing some compacity in the process.
Yann Collet23f05cc2016-07-04 16:13:11 +0200275
Yann Colletf6ff53c2016-07-15 17:03:38 +0200276_Reserved ranges :_
277If the frame is going to be distributed in a private environment,
278any dictionary ID can be used.
279However, for public distribution of compressed frames using a dictionary,
inikepf9c3cce2016-07-25 11:04:56 +0200280the following ranges are reserved for future use and should not be used :
281- low range : 1 - 32767
282- high range : >= (2^31)
Yann Colletf6ff53c2016-07-15 17:03:38 +0200283
284
inikepf9c3cce2016-07-25 11:04:56 +0200285### `Frame_Content_Size`
Yann Collet2fa99042016-07-01 20:55:28 +0200286
inikep2fc37522016-07-25 12:47:02 +0200287This is the original (uncompressed) size. This information is optional.
288The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
289The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
290Format is Little-endian.
Yann Collet2fa99042016-07-01 20:55:28 +0200291
inikep2fc37522016-07-25 12:47:02 +0200292| `Field_Size` | Range |
293| ------------ | ---------- |
294| 1 | 0 - 255 |
295| 2 | 256 - 65791|
296| 4 | 0 - 2^32-1 |
297| 8 | 0 - 2^64-1 |
Yann Collet2fa99042016-07-01 20:55:28 +0200298
inikep2fc37522016-07-25 12:47:02 +0200299When `Field_Size` is 1, 4 or 8 bytes, the value is read directly.
300When `Field_Size` is 2, _the offset of 256 is added_.
inikepf9c3cce2016-07-25 11:04:56 +0200301It's allowed to represent a small size (for example `18`) using any compatible variant.
Yann Collet2fa99042016-07-01 20:55:28 +0200302
Yann Collet2fa99042016-07-01 20:55:28 +0200303
inikepf9c3cce2016-07-25 11:04:56 +0200304The structure of `Data_Block`
305-----------------------------
306The structure of `Data_Block` is following:
Yann Collet2fa99042016-07-01 20:55:28 +0200307
Yann Colletc991cc12016-07-28 00:55:43 +0200308| `Last_Block` | `Block_Type` | `Block_Size` | `Block_Content` |
309|:------------:|:------------:|:------------:|:---------------:|
310| 1 bit | 2 bits | 21 bits | n bytes |
311
312The block header uses 3-bytes.
313
314__`Last_Block`__
315
316The lowest bit signals if this block is the last one.
317Frame ends right after this block.
318It may be followed by an optional `Content_Checksum` .
Yann Collet2fa99042016-07-01 20:55:28 +0200319
inikepf9c3cce2016-07-25 11:04:56 +0200320__`Block_Type` and `Block_Size`__
Yann Collet2fa99042016-07-01 20:55:28 +0200321
Yann Colletc991cc12016-07-28 00:55:43 +0200322The next 2 bits represent the `Block_Type`,
323while the remaining 21 bits represent the `Block_Size`.
324Format is __little-endian__.
Yann Collet2fa99042016-07-01 20:55:28 +0200325
326There are 4 block types :
327
inikepf9c3cce2016-07-25 11:04:56 +0200328| Value | 0 | 1 | 2 | 3 |
329| ------------ | ----------- | ----------- | ------------------ | --------- |
Yann Colletc991cc12016-07-28 00:55:43 +0200330| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|
Yann Collet2fa99042016-07-01 20:55:28 +0200331
inikepf9c3cce2016-07-25 11:04:56 +0200332- `Raw_Block` - this is an uncompressed block.
333 `Block_Size` is the number of bytes to read and copy.
334- `RLE_Block` - this is a single byte, repeated N times.
335 In which case, `Block_Size` is the size to regenerate,
336 while the "compressed" block is just 1 byte (the byte to repeat).
337- `Compressed_Block` - this is a [Zstandard compressed block](#the-format-of-compressed_block),
Yann Collet9ca73362016-07-05 10:53:38 +0200338 detailed in another section of this specification.
inikepf9c3cce2016-07-25 11:04:56 +0200339 `Block_Size` is the compressed size.
Yann Collet2fa99042016-07-01 20:55:28 +0200340 Decompressed size is unknown,
341 but its maximum possible value is guaranteed (see below)
Yann Colletc991cc12016-07-28 00:55:43 +0200342- `Reserved` - this is not a block.
343 This value cannot be used with current version of this specification.
Yann Collet2fa99042016-07-01 20:55:28 +0200344
345Block sizes must respect a few rules :
Yann Collet9ca73362016-07-05 10:53:38 +0200346- In compressed mode, compressed size if always strictly `< decompressed size`.
347- Block decompressed size is always <= maximum back-reference distance .
348- Block decompressed size is always <= 128 KB
Yann Collet2fa99042016-07-01 20:55:28 +0200349
350
inikepf9c3cce2016-07-25 11:04:56 +0200351__`Block_Content`__
Yann Collet2fa99042016-07-01 20:55:28 +0200352
inikepf9c3cce2016-07-25 11:04:56 +0200353The `Block_Content` is where the actual data to decode stands.
Yann Collet2fa99042016-07-01 20:55:28 +0200354It might be compressed or not, depending on previous field indications.
355A data block is not necessarily "full" :
356since an arbitrary “flush” may happen anytime,
Yann Colletcd25a912016-07-05 11:50:37 +0200357block decompressed content can be any size,
inikepf9c3cce2016-07-25 11:04:56 +0200358up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
Yann Colletcd25a912016-07-05 11:50:37 +0200359- Maximum back-reference distance
Yann Collet2fa99042016-07-01 20:55:28 +0200360- 128 KB
361
362
363Skippable Frames
364----------------
365
inikepf9c3cce2016-07-25 11:04:56 +0200366| `Magic_Number` | `Frame_Size` | `User_Data` |
367|:--------------:|:------------:|:-----------:|
368| 4 bytes | 4 bytes | n bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200369
370Skippable frames allow the insertion of user-defined data
371into a flow of concatenated frames.
372Its design is pretty straightforward,
373with the sole objective to allow the decoder to quickly skip
374over user-defined data and continue decoding.
375
Yann Colletcd25a912016-07-05 11:50:37 +0200376Skippable frames defined in this specification are compatible with [LZ4] ones.
377
378[LZ4]:http://www.lz4.org
Yann Collet2fa99042016-07-01 20:55:28 +0200379
inikepf9c3cce2016-07-25 11:04:56 +0200380__`Magic_Number`__
Yann Collet2fa99042016-07-01 20:55:28 +0200381
inikep2fc37522016-07-25 12:47:02 +02003824 Bytes, Little-endian format.
Yann Collet2fa99042016-07-01 20:55:28 +0200383Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
384All 16 values are valid to identify a skippable frame.
385
inikepf9c3cce2016-07-25 11:04:56 +0200386__`Frame_Size`__
Yann Collet2fa99042016-07-01 20:55:28 +0200387
inikepf9c3cce2016-07-25 11:04:56 +0200388This is the size, in bytes, of the following `User_Data`
Yann Collet2fa99042016-07-01 20:55:28 +0200389(without including the magic number nor the size field itself).
inikep2fc37522016-07-25 12:47:02 +0200390This field is represented using 4 Bytes, Little-endian format, unsigned 32-bits.
inikepf9c3cce2016-07-25 11:04:56 +0200391This means `User_Data` can’t be bigger than (2^32-1) bytes.
Yann Collet2fa99042016-07-01 20:55:28 +0200392
inikepf9c3cce2016-07-25 11:04:56 +0200393__`User_Data`__
Yann Collet2fa99042016-07-01 20:55:28 +0200394
inikepf9c3cce2016-07-25 11:04:56 +0200395The `User_Data` can be anything. Data will just be skipped by the decoder.
Yann Collet2fa99042016-07-01 20:55:28 +0200396
397
inikepf9c3cce2016-07-25 11:04:56 +0200398The format of `Compressed_Block`
399--------------------------------
400The size of `Compressed_Block` must be provided using `Block_Size` field from `Data_Block`.
401The `Compressed_Block` has a guaranteed maximum regenerated size,
Yann Collet2fa99042016-07-01 20:55:28 +0200402in order to properly allocate destination buffer.
inikepf9c3cce2016-07-25 11:04:56 +0200403See [`Data_Block`](#the-structure-of-data_block) for more details.
Yann Collet2fa99042016-07-01 20:55:28 +0200404
405A compressed block consists of 2 sections :
inikepf896c1d2016-08-03 16:37:42 +0200406- [Literals_Section](#literals_section)
407- [Sequences_Section](#sequences_section)
Yann Collet2fa99042016-07-01 20:55:28 +0200408
Yann Collet23f05cc2016-07-04 16:13:11 +0200409### Prerequisites
410To decode a compressed block, the following elements are necessary :
Yann Collet698cb632016-07-03 18:49:35 +0200411- Previous decoded blocks, up to a distance of `windowSize`,
inikepf9c3cce2016-07-25 11:04:56 +0200412 or all previous blocks when `Single_Segment_flag` is set.
Yann Collet698cb632016-07-03 18:49:35 +0200413- List of "recent offsets" from previous compressed block.
Yann Collet00d44ab2016-07-04 01:29:47 +0200414- Decoding tables of previous compressed block for each symbol type
Yann Collet9ca73362016-07-05 10:53:38 +0200415 (literals, litLength, matchLength, offset).
Yann Collet698cb632016-07-03 18:49:35 +0200416
417
inikepf896c1d2016-08-03 16:37:42 +0200418### `Literals_Section`
Yann Collet2fa99042016-07-01 20:55:28 +0200419
Yann Collet2fa99042016-07-01 20:55:28 +0200420During sequence phase, literals will be entangled with match copy operations.
421All literals are regrouped in the first part of the block.
422They can be decoded first, and then copied during sequence operations,
Yann Collet00d44ab2016-07-04 01:29:47 +0200423or they can be decoded on the flow, as needed by sequence commands.
Yann Collet2fa99042016-07-01 20:55:28 +0200424
inikep4f270ac2016-08-04 11:25:52 +0200425| `Literals_Section_Header` | [`Huffman_Tree_Description`] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
426| ------------------------- | ---------------------------- | ------- | --------- | --------- | --------- |
Yann Collet2fa99042016-07-01 20:55:28 +0200427
inikepf9c3cce2016-07-25 11:04:56 +0200428Literals can be stored uncompressed or compressed using Huffman prefix codes.
Yann Collet2fa99042016-07-01 20:55:28 +0200429When compressed, an optional tree description can be present,
430followed by 1 or 4 streams.
431
inikepf9c3cce2016-07-25 11:04:56 +0200432
inikepf896c1d2016-08-03 16:37:42 +0200433#### `Literals_Section_Header`
Yann Collet2fa99042016-07-01 20:55:28 +0200434
Yann Collet00d44ab2016-07-04 01:29:47 +0200435Header is in charge of describing how literals are packed.
Yann Collet2fa99042016-07-01 20:55:28 +0200436It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
Yann Collet198e6aa2016-07-20 20:12:24 +0200437using little-endian convention.
Yann Collet2fa99042016-07-01 20:55:28 +0200438
inikepf896c1d2016-08-03 16:37:42 +0200439| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
440| --------------------- | ------------- | ------------------ | ----------------- |
441| 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits |
Yann Collet198e6aa2016-07-20 20:12:24 +0200442
443In this representation, bits on the left are smallest bits.
Yann Collet2fa99042016-07-01 20:55:28 +0200444
inikep9d003c12016-08-04 10:41:49 +0200445__`Literals_Block_Type`__
Yann Collet2fa99042016-07-01 20:55:28 +0200446
Yann Collet198e6aa2016-07-20 20:12:24 +0200447This field uses 2 lowest bits of first byte, describing 4 different block types :
Yann Collet2fa99042016-07-01 20:55:28 +0200448
inikep9d003c12016-08-04 10:41:49 +0200449| Value | 0 | 1 | 2 | 3 |
450| --------------------- | -------------------- | -------------------- | --------------------------- | ----------------------------- |
451| `Literals_Block_Type` | `Raw_Literals_Block` | `RLE_Literals_Block` | `Compressed_Literals_Block` | `Repeat_Stats_Literals_Block` |
Yann Collet2fa99042016-07-01 20:55:28 +0200452
inikepf896c1d2016-08-03 16:37:42 +0200453- `Raw_Literals_Block` - Literals are stored uncompressed.
454- `RLE_Literals_Block` - Literals consist of a single byte value repeated N times.
455- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
inikep586a0552016-08-03 16:16:38 +0200456 starting with a Huffman tree description.
Yann Colletcd25a912016-07-05 11:50:37 +0200457 See details below.
inikepf896c1d2016-08-03 16:37:42 +0200458- `Repeat_Stats_Literals_Block` - This is a Huffman-compressed block,
inikep586a0552016-08-03 16:16:38 +0200459 using Huffman tree _from previous Huffman-compressed literals block_.
Yann Colletcd25a912016-07-05 11:50:37 +0200460 Huffman tree description will be skipped.
Yann Collet2fa99042016-07-01 20:55:28 +0200461
inikep9d003c12016-08-04 10:41:49 +0200462__`Size_Format`__
Yann Collet2fa99042016-07-01 20:55:28 +0200463
inikepf896c1d2016-08-03 16:37:42 +0200464`Size_Format` is divided into 2 families :
Yann Collet2fa99042016-07-01 20:55:28 +0200465
inikepf896c1d2016-08-03 16:37:42 +0200466- For `Compressed_Block`, it requires to decode both `Compressed_Size`
467 and `Regenerated_Size` (the decompressed size). It will also decode the number of streams.
468- For `Raw_Block` and `RLE_Block` it's enough to decode `Regenerated_Size`.
Yann Collet2fa99042016-07-01 20:55:28 +0200469
Yann Collet198e6aa2016-07-20 20:12:24 +0200470For values spanning several bytes, convention is Little-endian.
Yann Collet2fa99042016-07-01 20:55:28 +0200471
inikep9d003c12016-08-04 10:41:49 +0200472__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200473
inikepf896c1d2016-08-03 16:37:42 +0200474- Value : x0 : `Regenerated_Size` uses 5 bits (0-31).
Yann Collet2fa99042016-07-01 20:55:28 +0200475 Total literal header size is 1 byte.
Yann Collet198e6aa2016-07-20 20:12:24 +0200476 `size = h[0]>>3;`
inikepf896c1d2016-08-03 16:37:42 +0200477- Value : 01 : `Regenerated_Size` uses 12 bits (0-4095).
Yann Collet2fa99042016-07-01 20:55:28 +0200478 Total literal header size is 2 bytes.
Yann Collet198e6aa2016-07-20 20:12:24 +0200479 `size = (h[0]>>4) + (h[1]<<4);`
inikepf896c1d2016-08-03 16:37:42 +0200480- Value : 11 : `Regenerated_Size` uses 20 bits (0-1048575).
Yann Colletd916c902016-07-04 00:42:58 +0200481 Total literal header size is 3 bytes.
Yann Collet198e6aa2016-07-20 20:12:24 +0200482 `size = (h[0]>>4) + (h[1]<<4) + (h[2]<<12);`
Yann Collet2fa99042016-07-01 20:55:28 +0200483
484Note : it's allowed to represent a short value (ex : `13`)
485using a long format, accepting the reduced compacity.
486
inikep9d003c12016-08-04 10:41:49 +0200487__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200488
Yann Colletc2e1a682016-07-22 17:30:52 +0200489- Value : 00 : _Single stream_.
inikepf896c1d2016-08-03 16:37:42 +0200490 `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
Yann Colletcd25a912016-07-05 11:50:37 +0200491 Total literal header size is 3 bytes.
Yann Colletc2e1a682016-07-22 17:30:52 +0200492- Value : 01 : 4 streams.
inikepf896c1d2016-08-03 16:37:42 +0200493 `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
Yann Colletcd25a912016-07-05 11:50:37 +0200494 Total literal header size is 3 bytes.
495- Value : 10 : 4 streams.
inikepf896c1d2016-08-03 16:37:42 +0200496 `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
Yann Colletcd25a912016-07-05 11:50:37 +0200497 Total literal header size is 4 bytes.
Yann Colletd9cc4422016-07-22 19:15:27 +0200498- Value : 11 : 4 streams.
inikepf896c1d2016-08-03 16:37:42 +0200499 `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
Yann Colletcd25a912016-07-05 11:50:37 +0200500 Total literal header size is 5 bytes.
Yann Collet2fa99042016-07-01 20:55:28 +0200501
inikep4f270ac2016-08-04 11:25:52 +0200502`Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
inikepf896c1d2016-08-03 16:37:42 +0200503
Yann Collet698cb632016-07-03 18:49:35 +0200504
inikep4f270ac2016-08-04 11:25:52 +0200505#### `Huffman_Tree_Description`
Yann Collet698cb632016-07-03 18:49:35 +0200506
inikep4f270ac2016-08-04 11:25:52 +0200507This section is only present when literals block type is `Compressed_Block` (`2`).
Yann Collet00d44ab2016-07-04 01:29:47 +0200508
Yann Collet9ca73362016-07-05 10:53:38 +0200509Prefix coding represents symbols from an a priori known alphabet
Yann Colletb21e9cb2016-07-15 17:31:13 +0200510by bit sequences (codewords), one codeword for each symbol,
Yann Collet9ca73362016-07-05 10:53:38 +0200511in a manner such that different symbols may be represented
512by bit sequences of different lengths,
513but a parser can always parse an encoded string
514unambiguously symbol-by-symbol.
Yann Collet00d44ab2016-07-04 01:29:47 +0200515
Yann Collet9ca73362016-07-05 10:53:38 +0200516Given an alphabet with known symbol frequencies,
517the Huffman algorithm allows the construction of an optimal prefix code
518using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet00d44ab2016-07-04 01:29:47 +0200519
Yann Collet9ca73362016-07-05 10:53:38 +0200520Prefix code must not exceed a maximum code length.
Yann Collet00d44ab2016-07-04 01:29:47 +0200521More bits improve accuracy but cost more header size,
Yann Collete557fd52016-07-17 16:21:37 +0200522and require more memory or more complex decoding operations.
523This specification limits maximum code length to 11 bits.
Yann Collet00d44ab2016-07-04 01:29:47 +0200524
Yann Collet698cb632016-07-03 18:49:35 +0200525
526##### Representation
527
Yann Collet00d44ab2016-07-04 01:29:47 +0200528All literal values from zero (included) to last present one (excluded)
Yann Collet698cb632016-07-03 18:49:35 +0200529are represented by `weight` values, from 0 to `maxBits`.
530Transformation from `weight` to `nbBits` follows this formulae :
531`nbBits = weight ? maxBits + 1 - weight : 0;` .
532The last symbol's weight is deduced from previously decoded ones,
533by completing to the nearest power of 2.
534This power of 2 gives `maxBits`, the depth of the current tree.
535
536__Example__ :
inikep586a0552016-08-03 16:16:38 +0200537Let's presume the following Huffman tree must be described :
Yann Collet698cb632016-07-03 18:49:35 +0200538
Yann Colletd916c902016-07-04 00:42:58 +0200539| literal | 0 | 1 | 2 | 3 | 4 | 5 |
540| ------- | --- | --- | --- | --- | --- | --- |
541| nbBits | 1 | 2 | 3 | 0 | 4 | 4 |
Yann Collet698cb632016-07-03 18:49:35 +0200542
543The tree depth is 4, since its smallest element uses 4 bits.
544Value `5` will not be listed, nor will values above `5`.
545Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
546Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
547It gives the following serie of weights :
548
Yann Colletd916c902016-07-04 00:42:58 +0200549| weights | 4 | 3 | 2 | 0 | 1 |
550| ------- | --- | --- | --- | --- | --- |
551| literal | 0 | 1 | 2 | 3 | 4 |
Yann Collet698cb632016-07-03 18:49:35 +0200552
553The decoder will do the inverse operation :
Yann Colletd916c902016-07-04 00:42:58 +0200554having collected weights of literals from `0` to `4`,
555it knows the last literal, `5`, is present with a non-zero weight.
Yann Colletcd25a912016-07-05 11:50:37 +0200556The weight of `5` can be deducted by joining to the nearest power of 2.
Yann Collet698cb632016-07-03 18:49:35 +0200557Sum of 2^(weight-1) (excluding 0) is :
Yann Colletd916c902016-07-04 00:42:58 +0200558`8 + 4 + 2 + 0 + 1 = 15`
Yann Collet698cb632016-07-03 18:49:35 +0200559Nearest power of 2 is 16.
560Therefore, `maxBits = 4` and `weight[5] = 1`.
Yann Collet698cb632016-07-03 18:49:35 +0200561
562##### Huffman Tree header
563
Yann Collet9ca73362016-07-05 10:53:38 +0200564This is a single byte value (0-255),
565which tells how to decode the list of weights.
Yann Collet698cb632016-07-03 18:49:35 +0200566
Yann Collet698cb632016-07-03 18:49:35 +0200567- if headerByte >= 128 : this is a direct representation,
568 where each weight is written directly as a 4 bits field (0-15).
Yann Colletcd25a912016-07-05 11:50:37 +0200569 The full representation occupies `((nbSymbols+1)/2)` bytes,
Yann Collet698cb632016-07-03 18:49:35 +0200570 meaning it uses a last full byte even if nbSymbols is odd.
Yann Collet26f68142016-07-08 10:42:59 +0200571 `nbSymbols = headerByte - 127;`.
Yann Collet38b75dd2016-07-24 15:35:59 +0200572 Note that maximum nbSymbols is 255-127 = 128.
Yann Collet26f68142016-07-08 10:42:59 +0200573 A larger serie must necessarily use FSE compression.
Yann Collet698cb632016-07-03 18:49:35 +0200574
575- if headerByte < 128 :
576 the serie of weights is compressed by FSE.
Yann Collet26f68142016-07-08 10:42:59 +0200577 The length of the FSE-compressed serie is `headerByte` (0-127).
Yann Collet698cb632016-07-03 18:49:35 +0200578
inikep586a0552016-08-03 16:16:38 +0200579##### FSE (Finite State Entropy) compression of Huffman weights
Yann Collet698cb632016-07-03 18:49:35 +0200580
Yann Collet26f68142016-07-08 10:42:59 +0200581The serie of weights is compressed using FSE compression.
Yann Collet698cb632016-07-03 18:49:35 +0200582It's a single bitstream with 2 interleaved states,
Yann Collet26f68142016-07-08 10:42:59 +0200583sharing a single distribution table.
Yann Collet698cb632016-07-03 18:49:35 +0200584
585To decode an FSE bitstream, it is necessary to know its compressed size.
586Compressed size is provided by `headerByte`.
Yann Collet38b75dd2016-07-24 15:35:59 +0200587It's also necessary to know its _maximum possible_ decompressed size,
Yann Collet26f68142016-07-08 10:42:59 +0200588which is `255`, since literal values span from `0` to `255`,
Yann Colletcd25a912016-07-05 11:50:37 +0200589and last symbol value is not represented.
Yann Collet698cb632016-07-03 18:49:35 +0200590
591An FSE bitstream starts by a header, describing probabilities distribution.
Yann Collet00d44ab2016-07-04 01:29:47 +0200592It will create a Decoding Table.
Yann Collet26f68142016-07-08 10:42:59 +0200593Table must be pre-allocated, which requires to support a maximum accuracy.
inikep586a0552016-08-03 16:16:38 +0200594For a list of Huffman weights, maximum accuracy is 7 bits.
Yann Collet698cb632016-07-03 18:49:35 +0200595
Yann Collet26f68142016-07-08 10:42:59 +0200596FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format),
597and so is [FSE bitstream](#bitstream).
598The main difference is that Huffman header compression uses 2 states,
599which share the same FSE distribution table.
Yann Collet38b75dd2016-07-24 15:35:59 +0200600Bitstream contains only FSE symbols (no interleaved "raw bitfields").
Yann Collet26f68142016-07-08 10:42:59 +0200601The number of symbols to decode is discovered
602by tracking bitStream overflow condition.
603When both states have overflowed the bitstream, end is reached.
604
Yann Collet698cb632016-07-03 18:49:35 +0200605
inikep586a0552016-08-03 16:16:38 +0200606##### Conversion from weights to Huffman prefix codes
Yann Collet698cb632016-07-03 18:49:35 +0200607
Yann Colletd916c902016-07-04 00:42:58 +0200608All present symbols shall now have a `weight` value.
Yann Collet38b75dd2016-07-24 15:35:59 +0200609It is possible to transform weights into nbBits, using this formula :
Yann Colletd916c902016-07-04 00:42:58 +0200610`nbBits = nbBits ? maxBits + 1 - weight : 0;` .
611
Yann Collet38b75dd2016-07-24 15:35:59 +0200612Symbols are sorted by weight. Within same weight, symbols keep natural order.
613Symbols with a weight of zero are removed.
614Then, starting from lowest weight, prefix codes are distributed in order.
Yann Colletd916c902016-07-04 00:42:58 +0200615
616__Example__ :
Yann Colletb21e9cb2016-07-15 17:31:13 +0200617Let's presume the following list of weights has been decoded :
Yann Colletd916c902016-07-04 00:42:58 +0200618
619| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
620| ------- | --- | --- | --- | --- | --- | --- |
621| weight | 4 | 3 | 2 | 0 | 1 | 1 |
622
623Sorted by weight and then natural order,
624it gives the following distribution :
625
626| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
627| ------------ | --- | --- | --- | --- | --- | ---- |
628| weight | 0 | 1 | 1 | 2 | 3 | 4 |
Yann Colletd916c902016-07-04 00:42:58 +0200629| nb bits | 0 | 4 | 4 | 3 | 2 | 1 |
Yann Colletb21e9cb2016-07-15 17:31:13 +0200630| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
Yann Colletd916c902016-07-04 00:42:58 +0200631
632
Yann Colletd916c902016-07-04 00:42:58 +0200633#### Literals bitstreams
634
635##### Bitstreams sizes
636
637As seen in a previous paragraph,
inikep586a0552016-08-03 16:16:38 +0200638there are 2 flavors of Huffman-compressed literals :
Yann Colletd916c902016-07-04 00:42:58 +0200639single stream, and 4-streams.
640
6414-streams is useful for CPU with multiple execution units and OoO operations.
642Since each stream can be decoded independently,
643it's possible to decode them up to 4x faster than a single stream,
644presuming the CPU has enough parallelism available.
645
646For single stream, header provides both the compressed and regenerated size.
647For 4-streams though,
648header only provides compressed and regenerated size of all 4 streams combined.
Yann Colletd916c902016-07-04 00:42:58 +0200649In order to properly decode the 4 streams,
650it's necessary to know the compressed and regenerated size of each stream.
651
Yann Collet38b75dd2016-07-24 15:35:59 +0200652Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
653except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
Yann Colletd916c902016-07-04 00:42:58 +0200654
Yann Collet38b75dd2016-07-24 15:35:59 +0200655Compressed size is provided explicitly : in the 4-streams variant,
inikep2fc37522016-07-25 12:47:02 +0200656bitstreams are preceded by 3 unsigned Little-Endian 16-bits values.
Yann Collet00d44ab2016-07-04 01:29:47 +0200657Each value represents the compressed size of one stream, in order.
Yann Colletd916c902016-07-04 00:42:58 +0200658The last stream size is deducted from total compressed size
Yann Collet38b75dd2016-07-24 15:35:59 +0200659and from previously decoded stream sizes :
Yann Colletd916c902016-07-04 00:42:58 +0200660`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
661
Yann Collet00d44ab2016-07-04 01:29:47 +0200662##### Bitstreams read and decode
Yann Colletd916c902016-07-04 00:42:58 +0200663
664Each bitstream must be read _backward_,
665that is starting from the end down to the beginning.
666Therefore it's necessary to know the size of each bitstream.
667
668It's also necessary to know exactly which _bit_ is the latest.
669This is detected by a final bit flag :
670the highest bit of latest byte is a final-bit-flag.
671Consequently, a last byte of `0` is not possible.
672And the final-bit-flag itself is not part of the useful bitstream.
Yann Collet38b75dd2016-07-24 15:35:59 +0200673Hence, the last byte contains between 0 and 7 useful bits.
Yann Colletd916c902016-07-04 00:42:58 +0200674
675Starting from the end,
676it's possible to read the bitstream in a little-endian fashion,
677keeping track of already used bits.
678
Yann Collet00d44ab2016-07-04 01:29:47 +0200679Reading the last `maxBits` bits,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200680it's then possible to compare extracted value to decoding table,
Yann Colletd916c902016-07-04 00:42:58 +0200681determining the symbol to decode and number of bits to discard.
682
683The process continues up to reading the required number of symbols per stream.
684If a bitstream is not entirely and exactly consumed,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200685hence reaching exactly its beginning position with _all_ bits consumed,
Yann Colletd916c902016-07-04 00:42:58 +0200686the decoding process is considered faulty.
687
Yann Collet698cb632016-07-03 18:49:35 +0200688
inikep4f270ac2016-08-04 11:25:52 +0200689### `Sequences_Section`
Yann Collet00d44ab2016-07-04 01:29:47 +0200690
691A compressed block is a succession of _sequences_ .
692A sequence is a literal copy command, followed by a match copy command.
693A literal copy command specifies a length.
694It is the number of bytes to be copied (or extracted) from the literal section.
695A match copy command specifies an offset and a length.
696The offset gives the position to copy from,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200697which can be within a previous block.
Yann Collet00d44ab2016-07-04 01:29:47 +0200698
Yann Colletcd25a912016-07-05 11:50:37 +0200699There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
Yann Collet00d44ab2016-07-04 01:29:47 +0200700which are encoded together, interleaved in a single _bitstream_.
701
Yann Colletcd25a912016-07-05 11:50:37 +0200702Each symbol is a _code_ in its own context,
703which specifies a baseline and a number of bits to add.
Yann Collet00d44ab2016-07-04 01:29:47 +0200704_Codes_ are FSE compressed,
705and interleaved with raw additional bits in the same bitstream.
706
Yann Collet23f05cc2016-07-04 16:13:11 +0200707The Sequences section starts by a header,
708followed by optional Probability tables for each symbol type,
Yann Collet00d44ab2016-07-04 01:29:47 +0200709followed by the bitstream.
710
Yann Collet6fa05a22016-07-20 14:58:49 +0200711| Header | [LitLengthTable] | [OffsetTable] | [MatchLengthTable] | bitStream |
Yann Colletc40ba712016-07-08 15:39:02 +0200712| ------ | ---------------- | ------------- | ------------------ | --------- |
713
Yann Collet23f05cc2016-07-04 16:13:11 +0200714To decode the Sequence section, it's required to know its size.
Yann Colletcd25a912016-07-05 11:50:37 +0200715This size is deducted from `blockSize - literalSectionSize`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200716
717
Yann Collet00d44ab2016-07-04 01:29:47 +0200718#### Sequences section header
719
Yann Collet23f05cc2016-07-04 16:13:11 +0200720Consists in 2 items :
721- Nb of Sequences
722- Flags providing Symbol compression types
723
724__Nb of Sequences__
725
726This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
727Let's call its first byte `byte0`.
728- `if (byte0 == 0)` : there are no sequences.
729 The sequence section stops there.
730 Regenerated content is defined entirely by literals section.
Yann Colletcd25a912016-07-05 11:50:37 +0200731- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
732- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
733- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
Yann Collet23f05cc2016-07-04 16:13:11 +0200734
Yann Collet10b9c132016-07-24 01:21:53 +0200735__Symbol encoding modes__
Yann Collet23f05cc2016-07-04 16:13:11 +0200736
737This is a single byte, defining the compression mode of each symbol type.
738
739| BitNb | 7-6 | 5-4 | 3-2 | 1-0 |
740| ------- | ------ | ------ | ------ | -------- |
Yann Collet10b9c132016-07-24 01:21:53 +0200741|FieldName| LLType | OFType | MLType | Reserved |
Yann Collet23f05cc2016-07-04 16:13:11 +0200742
743The last field, `Reserved`, must be all-zeroes.
744
Yann Collet10b9c132016-07-24 01:21:53 +0200745`LLType`, `OFType` and `MLType` define the compression mode of
Yann Collet23f05cc2016-07-04 16:13:11 +0200746Literal Lengths, Offsets and Match Lengths respectively.
747
748They follow the same enumeration :
749
Yann Colletf8e7b532016-07-23 16:31:49 +0200750| Value | 0 | 1 | 2 | 3 |
751| ---------------- | ------ | --- | ---------- | ------ |
752| Compression Mode | predef | RLE | Compressed | Repeat |
Yann Collet23f05cc2016-07-04 16:13:11 +0200753
754- "predef" : uses a pre-defined distribution table.
755- "RLE" : it's a single code, repeated `nbSeqs` times.
756- "Repeat" : re-use distribution table from previous compressed block.
Yann Colletf8e7b532016-07-23 16:31:49 +0200757- "Compressed" : standard FSE compression.
Yann Colletcd25a912016-07-05 11:50:37 +0200758 A distribution table will be present.
759 It will be described in [next part](#distribution-tables).
Yann Collet23f05cc2016-07-04 16:13:11 +0200760
761#### Symbols decoding
762
763##### Literal Lengths codes
764
765Literal lengths codes are values ranging from `0` to `35` included.
766They define lengths from 0 to 131071 bytes.
767
768| Code | 0-15 |
769| ------ | ---- |
Yann Colletc40ba712016-07-08 15:39:02 +0200770| length | Code |
Yann Colletcd25a912016-07-05 11:50:37 +0200771| nbBits | 0 |
772
Yann Collet23f05cc2016-07-04 16:13:11 +0200773
774| Code | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
775| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
776| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
777| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
778
779| Code | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
780| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
781| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
782| nb Bits | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
783
784| Code | 32 | 33 | 34 | 35 |
785| -------- | ---- | ---- | ---- | ---- |
786| Baseline | 8192 |16384 |32768 |65536 |
787| nb Bits | 13 | 14 | 15 | 16 |
788
789__Default distribution__
790
Yann Colletcd25a912016-07-05 11:50:37 +0200791When "compression mode" is "predef"",
Yann Collet23f05cc2016-07-04 16:13:11 +0200792a pre-defined distribution is used for FSE compression.
793
Yann Colletc40ba712016-07-08 15:39:02 +0200794Below is its definition. It uses an accuracy of 6 bits (64 states).
Yann Collet23f05cc2016-07-04 16:13:11 +0200795```
796short literalLengths_defaultDistribution[36] =
797 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
798 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
799 -1,-1,-1,-1 };
800```
801
802##### Match Lengths codes
803
804Match lengths codes are values ranging from `0` to `52` included.
805They define lengths from 3 to 131074 bytes.
806
807| Code | 0-31 |
808| ------ | -------- |
Yann Collet23f05cc2016-07-04 16:13:11 +0200809| value | Code + 3 |
Yann Colletcd25a912016-07-05 11:50:37 +0200810| nbBits | 0 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200811
812| Code | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
813| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
814| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
815| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
816
817| Code | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
818| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
819| Baseline | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
820| nb Bits | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
821
822| Code | 48 | 49 | 50 | 51 | 52 |
823| -------- | ---- | ---- | ---- | ---- | ---- |
824| Baseline | 4098 | 8194 |16486 |32770 |65538 |
825| nb Bits | 12 | 13 | 14 | 15 | 16 |
826
827__Default distribution__
828
Yann Colletc40ba712016-07-08 15:39:02 +0200829When "compression mode" is defined as "predef",
Yann Collet23f05cc2016-07-04 16:13:11 +0200830a pre-defined distribution is used for FSE compression.
831
832Here is its definition. It uses an accuracy of 6 bits (64 states).
833```
834short matchLengths_defaultDistribution[53] =
835 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
836 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
837 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
838 -1,-1,-1,-1,-1 };
839```
840
841##### Offset codes
842
843Offset codes are values ranging from `0` to `N`,
844with `N` being limited by maximum backreference distance.
845
Yann Colletcd25a912016-07-05 11:50:37 +0200846A decoder is free to limit its maximum `N` supported.
847Recommendation is to support at least up to `22`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200848For information, at the time of this writing.
849the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
850
851An offset code is also the nb of additional bits to read,
852and can be translated into an `OFValue` using the following formulae :
853
854```
855OFValue = (1 << offsetCode) + readNBits(offsetCode);
856if (OFValue > 3) offset = OFValue - 3;
857```
858
859OFValue from 1 to 3 are special : they define "repeat codes",
860which means one of the previous offsets will be repeated.
861They are sorted in recency order, with 1 meaning the most recent one.
Yann Colletcd25a912016-07-05 11:50:37 +0200862See [Repeat offsets](#repeat-offsets) paragraph.
Yann Collet23f05cc2016-07-04 16:13:11 +0200863
864__Default distribution__
865
Yann Colletcd25a912016-07-05 11:50:37 +0200866When "compression mode" is defined as "predef",
Yann Collet23f05cc2016-07-04 16:13:11 +0200867a pre-defined distribution is used for FSE compression.
868
869Here is its definition. It uses an accuracy of 5 bits (32 states),
Yann Colletcd25a912016-07-05 11:50:37 +0200870and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
Yann Collet23f05cc2016-07-04 16:13:11 +0200871
872If any sequence in the compressed block requires an offset larger than this,
873it's not possible to use the default distribution to represent it.
874
875```
876short offsetCodes_defaultDistribution[53] =
877 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
878 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
879```
880
881#### Distribution tables
882
883Following the header, up to 3 distribution tables can be described.
Yann Collet10b9c132016-07-24 01:21:53 +0200884When present, they are in this order :
Yann Collet23f05cc2016-07-04 16:13:11 +0200885- Literal lengthes
886- Offsets
887- Match Lengthes
888
Yann Collet10b9c132016-07-24 01:21:53 +0200889The content to decode depends on their respective encoding mode :
Yann Collet23f05cc2016-07-04 16:13:11 +0200890- Predef : no content. Use pre-defined distribution table.
891- RLE : 1 byte. This is the only code to use across the whole compressed block.
892- FSE : A distribution table is present.
Yann Collet10b9c132016-07-24 01:21:53 +0200893- Repeat mode : no content. Re-use distribution from previous compressed block.
Yann Collet23f05cc2016-07-04 16:13:11 +0200894
895##### FSE distribution table : condensed format
896
897An FSE distribution table describes the probabilities of all symbols
898from `0` to the last present one (included)
Yann Collet9ca73362016-07-05 10:53:38 +0200899on a normalized scale of `1 << AccuracyLog` .
Yann Collet23f05cc2016-07-04 16:13:11 +0200900
901It's a bitstream which is read forward, in little-endian fashion.
902It's not necessary to know its exact size,
903since it will be discovered and reported by the decoding process.
904
905The bitstream starts by reporting on which scale it operates.
906`AccuracyLog = low4bits + 5;`
Yann Collet9d6e9492016-07-22 19:32:07 +0200907Note that maximum `AccuracyLog` for literal and match lengthes is `9`,
908and for offsets it is `8`. Higher values are considered errors.
Yann Collet23f05cc2016-07-04 16:13:11 +0200909
910Then follow each symbol value, from `0` to last present one.
911The nb of bits used by each field is variable.
912It depends on :
913
914- Remaining probabilities + 1 :
915 __example__ :
916 Presuming an AccuracyLog of 8,
917 and presuming 100 probabilities points have already been distributed,
Yann Colletcd25a912016-07-05 11:50:37 +0200918 the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
Yann Collet23f05cc2016-07-04 16:13:11 +0200919 Therefore, it must read `log2sup(156) == 8` bits.
920
921- Value decoded : small values use 1 less bit :
922 __example__ :
923 Presuming values from 0 to 156 (included) are possible,
924 255-156 = 99 values are remaining in an 8-bits field.
925 They are used this way :
926 first 99 values (hence from 0 to 98) use only 7 bits,
927 values from 99 to 156 use 8 bits.
928 This is achieved through this scheme :
929
930 | Value read | Value decoded | nb Bits used |
931 | ---------- | ------------- | ------------ |
932 | 0 - 98 | 0 - 98 | 7 |
933 | 99 - 127 | 99 - 127 | 8 |
934 | 128 - 226 | 0 - 98 | 7 |
935 | 227 - 255 | 128 - 156 | 8 |
936
937Symbols probabilities are read one by one, in order.
938
939Probability is obtained from Value decoded by following formulae :
940`Proba = value - 1;`
941
942It means value `0` becomes negative probability `-1`.
943`-1` is a special probability, which means `less than 1`.
Yann Colletc40ba712016-07-08 15:39:02 +0200944Its effect on distribution table is described in [next paragraph].
Yann Collet23f05cc2016-07-04 16:13:11 +0200945For the purpose of calculating cumulated distribution, it counts as one.
946
Yann Colletc40ba712016-07-08 15:39:02 +0200947[next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables
948
Yann Collet23f05cc2016-07-04 16:13:11 +0200949When a symbol has a probability of `zero`,
950it is followed by a 2-bits repeat flag.
951This repeat flag tells how many probabilities of zeroes follow the current one.
952It provides a number ranging from 0 to 3.
953If it is a 3, another 2-bits repeat flag follows, and so on.
954
Yann Collet9ca73362016-07-05 10:53:38 +0200955When last symbol reaches cumulated total of `1 << AccuracyLog`,
Yann Collet23f05cc2016-07-04 16:13:11 +0200956decoding is complete.
Yann Collet9ca73362016-07-05 10:53:38 +0200957If the last symbol makes cumulated total go above `1 << AccuracyLog`,
Yann Collet23f05cc2016-07-04 16:13:11 +0200958distribution is considered corrupted.
959
Yann Collet10b9c132016-07-24 01:21:53 +0200960Then the decoder can tell how many bytes were used in this process,
961and how many symbols are present.
962The bitstream consumes a round number of bytes.
963Any remaining bit within the last byte is just unused.
964
Yann Collet23f05cc2016-07-04 16:13:11 +0200965##### FSE decoding : from normalized distribution to decoding tables
966
Yann Collet9ca73362016-07-05 10:53:38 +0200967The distribution of normalized probabilities is enough
968to create a unique decoding table.
969
970It follows the following build rule :
971
972The table has a size of `tableSize = 1 << AccuracyLog;`.
973Each cell describes the symbol decoded,
974and instructions to get the next state.
975
976Symbols are scanned in their natural order for `less than 1` probabilities.
977Symbols with this probability are being attributed a single cell,
978starting from the end of the table.
979These symbols define a full state reset, reading `AccuracyLog` bits.
980
981All remaining symbols are sorted in their natural order.
982Starting from symbol `0` and table position `0`,
983each symbol gets attributed as many cells as its probability.
984Cell allocation is spreaded, not linear :
985each successor position follow this rule :
986
Yann Colletcd25a912016-07-05 11:50:37 +0200987```
988position += (tableSize>>1) + (tableSize>>3) + 3;
989position &= tableSize-1;
990```
Yann Collet9ca73362016-07-05 10:53:38 +0200991
992A position is skipped if already occupied,
993typically by a "less than 1" probability symbol.
994
995The result is a list of state values.
996Each state will decode the current symbol.
997
998To get the Number of bits and baseline required for next state,
999it's first necessary to sort all states in their natural order.
Yann Colletcd25a912016-07-05 11:50:37 +02001000The lower states will need 1 more bit than higher ones.
Yann Collet9ca73362016-07-05 10:53:38 +02001001
1002__Example__ :
1003Presuming a symbol has a probability of 5.
Yann Colletcd25a912016-07-05 11:50:37 +02001004It receives 5 state values. States are sorted in natural order.
Yann Collet9ca73362016-07-05 10:53:38 +02001005
1006Next power of 2 is 8.
1007Space of probabilities is divided into 8 equal parts.
1008Presuming the AccuracyLog is 7, it defines 128 states.
1009Divided by 8, each share is 16 large.
1010
1011In order to reach 8, 8-5=3 lowest states will count "double",
1012taking shares twice larger,
1013requiring one more bit in the process.
1014
1015Numbering starts from higher states using less bits.
1016
1017| state order | 0 | 1 | 2 | 3 | 4 |
1018| ----------- | ----- | ----- | ------ | ---- | ----- |
1019| width | 32 | 32 | 32 | 16 | 16 |
1020| nb Bits | 5 | 5 | 5 | 4 | 4 |
1021| range nb | 2 | 4 | 6 | 0 | 1 |
1022| baseline | 32 | 64 | 96 | 0 | 16 |
1023| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
1024
1025Next state is determined from current state
1026by reading the required number of bits, and adding the specified baseline.
Yann Collet23f05cc2016-07-04 16:13:11 +02001027
1028
1029#### Bitstream
Yann Collet00d44ab2016-07-04 01:29:47 +02001030
Yann Collet9ca73362016-07-05 10:53:38 +02001031All sequences are stored in a single bitstream, read _backward_.
1032It is therefore necessary to know the bitstream size,
1033which is deducted from compressed block size.
1034
Yann Colletc40ba712016-07-08 15:39:02 +02001035The last useful bit of the stream is followed by an end-bit-flag.
Yann Colletcd25a912016-07-05 11:50:37 +02001036Highest bit of last byte is this flag.
Yann Collet9ca73362016-07-05 10:53:38 +02001037It does not belong to the useful part of the bitstream.
1038Therefore, last byte has 0-7 useful bits.
1039Note that it also means that last byte cannot be `0`.
1040
1041##### Starting states
1042
1043The bitstream starts with initial state values,
1044each using the required number of bits in their respective _accuracy_,
1045decoded previously from their normalized distribution.
1046
1047It starts by `Literal Length State`,
1048followed by `Offset State`,
1049and finally `Match Length State`.
1050
1051Reminder : always keep in mind that all values are read _backward_.
1052
1053##### Decoding a sequence
1054
1055A state gives a code.
1056A code provides a baseline and number of bits to add.
1057See [Symbol Decoding] section for details on each symbol.
1058
1059Decoding starts by reading the nb of bits required to decode offset.
1060It then does the same for match length,
1061and then for literal length.
1062
Yann Colletc40ba712016-07-08 15:39:02 +02001063Offset / matchLength / litLength define a sequence.
1064It starts by inserting the number of literals defined by `litLength`,
1065then continue by copying `matchLength` bytes from `currentPos - offset`.
Yann Collet9ca73362016-07-05 10:53:38 +02001066
1067The next operation is to update states.
1068Using rules pre-calculated in the decoding tables,
1069`Literal Length State` is updated,
1070followed by `Match Length State`,
1071and then `Offset State`.
1072
1073This operation will be repeated `NbSeqs` times.
1074At the end, the bitstream shall be entirely consumed,
1075otherwise bitstream is considered corrupted.
1076
1077[Symbol Decoding]:#symbols-decoding
1078
1079##### Repeat offsets
1080
1081As seen in [Offset Codes], the first 3 values define a repeated offset.
1082They are sorted in recency order, with 1 meaning "most recent one".
1083
1084There is an exception though, when current sequence's literal length is `0`.
Yann Collet917fe182016-07-31 04:01:57 +02001085In which case, repcodes are "pushed by one",
1086so 1 becomes 2, 2 becomes 3,
1087and 3 becomes "offset_1 - 1_byte".
Yann Collet9ca73362016-07-05 10:53:38 +02001088
Yann Collet917fe182016-07-31 04:01:57 +02001089On first block, offset history is populated by the following values : 1, 4 and 8 (in order).
Yann Collet9ca73362016-07-05 10:53:38 +02001090
1091Then each block receives its start value from previous compressed block.
1092Note that non-compressed blocks are skipped,
1093they do not contribute to offset history.
1094
1095[Offset Codes]: #offset-codes
1096
1097###### Offset updates rules
1098
Yann Collet917fe182016-07-31 04:01:57 +02001099New offset take the lead in offset history,
1100up to its previous place if it was already present.
Yann Collet9ca73362016-07-05 10:53:38 +02001101
Yann Collet917fe182016-07-31 04:01:57 +02001102It means that when repeat offset 1 (most recent) is used, history is unmodified.
1103When repeat offset 2 is used, it's swapped with offset 1.
Yann Collet00d44ab2016-07-04 01:29:47 +02001104
Yann Collet2fa99042016-07-01 20:55:28 +02001105
Yann Colletbd106072016-07-08 19:16:57 +02001106Dictionary format
1107-----------------
1108
1109`zstd` is compatible with "pure content" dictionaries, free of any format restriction.
1110But dictionaries created by `zstd --train` follow a format, described here.
1111
1112__Pre-requisites__ : a dictionary has a known length,
1113 defined either by a buffer limit, or a file size.
1114
1115| Header | DictID | Stats | Content |
1116| ------ | ------ | ----- | ------- |
1117
inikep2fc37522016-07-25 12:47:02 +02001118__Header__ : 4 bytes ID, value 0xEC30A437, Little-Endian format
Yann Colletbd106072016-07-08 19:16:57 +02001119
inikep2fc37522016-07-25 12:47:02 +02001120__Dict_ID__ : 4 bytes, stored in Little-Endian format.
Yann Colletbd106072016-07-08 19:16:57 +02001121 DictID can be any value, except 0 (which means no DictID).
Yann Collet722e14b2016-07-08 19:22:16 +02001122 It's used by decoders to check if they use the correct dictionary.
Yann Colletf6ff53c2016-07-15 17:03:38 +02001123 _Reserved ranges :_
1124 If the frame is going to be distributed in a private environment,
1125 any dictionary ID can be used.
1126 However, for public distribution of compressed frames,
1127 some ranges are reserved for future use :
Yann Collet6cacd342016-07-15 17:58:13 +02001128
1129 - low range : 1 - 32767 : reserved
1130 - high range : >= (2^31) : reserved
Yann Colletbd106072016-07-08 19:16:57 +02001131
1132__Stats__ : Entropy tables, following the same format as a [compressed blocks].
1133 They are stored in following order :
1134 Huffman tables for literals, FSE table for offset,
Yann Collet722e14b2016-07-08 19:22:16 +02001135 FSE table for matchLenth, and FSE table for litLength.
1136 It's finally followed by 3 offset values, populating recent offsets,
inikep2fc37522016-07-25 12:47:02 +02001137 stored in order, 4-bytes little-endian each, for a total of 12 bytes.
Yann Colletbd106072016-07-08 19:16:57 +02001138
1139__Content__ : Where the actual dictionary content is.
Yann Collet722e14b2016-07-08 19:22:16 +02001140 Content size depends on Dictionary size.
Yann Colletbd106072016-07-08 19:16:57 +02001141
inikepf9c3cce2016-07-25 11:04:56 +02001142[compressed blocks]: #the-format-of-compressed_block
Yann Colletbd106072016-07-08 19:16:57 +02001143
Yann Collet2fa99042016-07-01 20:55:28 +02001144
1145Version changes
1146---------------
Yann Collet6fa05a22016-07-20 14:58:49 +02001147- 0.2.0 : numerous format adjustments for zstd v0.8
inikep586a0552016-08-03 16:16:38 +02001148- 0.1.2 : limit Huffman tree depth to 11 bits
Yann Collete557fd52016-07-17 16:21:37 +02001149- 0.1.1 : reserved dictID ranges
1150- 0.1.0 : initial release