blob: 13276d09a2ba008643719b25b0d8b2caa27a5368 [file] [log] [blame] [view]
Yann Collet5cc18822016-07-03 19:03:13 +02001Zstandard Compression Format
2============================
Yann Collet2fa99042016-07-01 20:55:28 +02003
4### Notices
5
6Copyright (c) 2016 Yann Collet
7
8Permission is granted to copy and distribute this document
inikepf9c3cce2016-07-25 11:04:56 +02009for any purpose and without charge,
10including translations into other languages
Yann Collet2fa99042016-07-01 20:55:28 +020011and incorporation into compilations,
12provided that the copyright notice and this notice are preserved,
13and that any substantive changes or deletions from the original
14are clearly marked.
15Distribution of this document is unlimited.
16
17### Version
18
Yann Collet6fa05a22016-07-20 14:58:49 +0200190.2.0 (22/07/16)
Yann Collet2fa99042016-07-01 20:55:28 +020020
21
22Introduction
23------------
24
25The purpose of this document is to define a lossless compressed data format,
26that is independent of CPU type, operating system,
27file system and character set, suitable for
Yann Collet9ca73362016-07-05 10:53:38 +020028file compression, pipe and streaming compression,
Yann Collet2fa99042016-07-01 20:55:28 +020029using the [Zstandard algorithm](http://www.zstandard.org).
30
31The data can be produced or consumed,
32even for an arbitrarily long sequentially presented input data stream,
33using only an a priori bounded amount of intermediate storage,
34and hence can be used in data communications.
35The format uses the Zstandard compression method,
36and optional [xxHash-64 checksum method](http://www.xxhash.org),
37for detection of data corruption.
38
39The data format defined by this specification
40does not attempt to allow random access to compressed data.
41
42This specification is intended for use by implementers of software
43to compress data into Zstandard format and/or decompress data from Zstandard format.
44The text of the specification assumes a basic background in programming
45at the level of bits and other primitive data representations.
46
47Unless otherwise indicated below,
48a compliant compressor must produce data sets
49that conform to the specifications presented here.
50It doesn’t need to support all options though.
51
52A compliant decompressor must be able to decompress
53at least one working set of parameters
54that conforms to the specifications presented here.
55It may also ignore informative fields, such as checksum.
56Whenever it does not support a parameter defined in the compressed stream,
57it must produce a non-ambiguous error code and associated error message
58explaining which parameter is unsupported.
59
60
inikepf9c3cce2016-07-25 11:04:56 +020061Overall conventions
62-----------
63In this document square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
64
65
Yann Collet2fa99042016-07-01 20:55:28 +020066Definitions
67-----------
68A content compressed by Zstandard is transformed into a Zstandard __frame__.
69Multiple frames can be appended into a single file or stream.
70A frame is totally independent, has a defined beginning and end,
71and a set of parameters which tells the decoder how to decompress it.
72
73A frame encapsulates one or multiple __blocks__.
74Each block can be compressed or not,
75and has a guaranteed maximum content size, which depends on frame parameters.
76Unlike frames, each block depends on previous blocks for proper decoding.
77However, each block can be decompressed without waiting for its successor,
78allowing streaming operations.
79
80
inikepf9c3cce2016-07-25 11:04:56 +020081Frame Concatenation
82-------------------
Yann Collet2fa99042016-07-01 20:55:28 +020083
84In some circumstances, it may be required to append multiple frames,
85for example in order to add new data to an existing compressed file
86without re-framing it.
87
88In such case, each frame brings its own set of descriptor flags.
89Each frame is considered independent.
90The only relation between frames is their sequential order.
91
92The ability to decode multiple concatenated frames
93within a single stream or file is left outside of this specification.
94As an example, the reference `zstd` command line utility is able
95to decode all concatenated frames in their sequential order,
96delivering the final decompressed result as if it was a single content.
97
98
inikepf9c3cce2016-07-25 11:04:56 +020099General Structure of Zstandard Frame format
100-------------------------------------------
101The structure of a single Zstandard frame is following:
Yann Collet2fa99042016-07-01 20:55:28 +0200102
inikepf9c3cce2016-07-25 11:04:56 +0200103| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] |`End_Marker`|
104|:--------------:|:--------------:|:----------:| ------------------ |:----------:|
105| 4 bytes | 2-14 bytes | n bytes | | 3 bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200106
inikepf9c3cce2016-07-25 11:04:56 +0200107__`Magic_Number`__
108
1094 Bytes, Little endian format.
110Value : 0xFD2FB527
111
112__`Frame_Header`__
113
1142 to 14 Bytes, detailed in [next part](#the-structure-of-frame_header).
115
116__`Data_Block`__
117
118Detailed in [next chapter](#the-structure-of-data_block).
119That’s where compressed data is stored.
120
121__`End_Marker`__
122
123The flow of blocks ends when the last block header brings an _end signal_.
124This last block header may optionally host a `Content_Checksum`.
125
126##### __`Content_Checksum`__
127
128`Content_Checksum` allow to verify that frame content has been regenerated correctly.
129The content checksum is the result
130of [xxh64() hash function](https://www.xxHash.com)
131digesting the original (decoded) data as input, and a seed of zero.
132Bits from 11 to 32 (included) are extracted to form a 22 bits checksum
133stored within `End_Marker`.
134```
135mask22bits = (1<<22)-1;
136contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits;
137```
138`Content_Checksum` is only present when its associated flag
139is set in the frame descriptor.
140Its usage is optional.
141
142
143
144The structure of `Frame_Header`
145-------------------------------
146The `Frame_Header` has a variable size, which uses a minimum of 2 bytes,
Yann Collet2fa99042016-07-01 20:55:28 +0200147and up to 14 bytes depending on optional parameters.
inikepf9c3cce2016-07-25 11:04:56 +0200148The structure of `Frame_Header` is following:
Yann Collet2fa99042016-07-01 20:55:28 +0200149
inikepf9c3cce2016-07-25 11:04:56 +0200150| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
151| ------------------------- | --------------------- | ----------------- | ---------------------- |
152| 1 byte | 0-1 byte | 0-4 bytes | 0-8 bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200153
inikepf9c3cce2016-07-25 11:04:56 +0200154### `Frame_Header_Descriptor`
155
156The first header's byte is called the `Frame_Header_Descriptor`.
Yann Collet2fa99042016-07-01 20:55:28 +0200157It tells which other fields are present.
inikepf9c3cce2016-07-25 11:04:56 +0200158Decoding this byte is enough to tell the size of `Frame_Header`.
Yann Collet2fa99042016-07-01 20:55:28 +0200159
inikepf9c3cce2016-07-25 11:04:56 +0200160| Bit number | Field name |
161| ---------- | ---------- |
162| 7-6 | `Frame_Content_Size_flag` |
163| 5 | `Single_Segment_flag` |
164| 4 | `Unused_bit` |
165| 3 | `Reserved_bit` |
166| 2 | `Content_Checksum_flag` |
167| 1-0 | `Dictionary_ID_flag` |
Yann Collet2fa99042016-07-01 20:55:28 +0200168
169In this table, bit 7 is highest bit, while bit 0 is lowest.
170
inikepf9c3cce2016-07-25 11:04:56 +0200171__`Single_Segment_flag`__
Yann Collet2fa99042016-07-01 20:55:28 +0200172
inikepf9c3cce2016-07-25 11:04:56 +0200173If `Single_Segment_flag` is not set then `Window_Descriptor` is mandatory and `Frame_Content_Size_flag` will be ignored.
Yann Collet2fa99042016-07-01 20:55:28 +0200174
inikepf9c3cce2016-07-25 11:04:56 +0200175If `Single_Segment_flag` is set then `Window_Descriptor` should be absent and `Frame_Content_Size_flag` will be used along with a mandatory `Frame_Content_Size` field.
176As a consequence, the decoder must allocate a single continuous memory segment of size equal or bigger than `Frame_Content_Size`.
Yann Collet2fa99042016-07-01 20:55:28 +0200177
178In order to preserve the decoder from unreasonable memory requirement,
Yann Collet23f05cc2016-07-04 16:13:11 +0200179a decoder can reject a compressed frame
Yann Collet2fa99042016-07-01 20:55:28 +0200180which requests a memory size beyond decoder's authorized range.
181
182For broader compatibility, decoders are recommended to support
Yann Collet23f05cc2016-07-04 16:13:11 +0200183memory sizes of at least 8 MB.
184This is just a recommendation,
Yann Collet9ca73362016-07-05 10:53:38 +0200185each decoder is free to support higher or lower limits,
Yann Collet2fa99042016-07-01 20:55:28 +0200186depending on local limitations.
187
inikepf9c3cce2016-07-25 11:04:56 +0200188__`Frame_Content_Size_flag`__
189
190This is a 2-bits flag (`= FHD >> 6`) used only if `Single_Segment_flag` is set.
191In this case Value can be converted to Field size that is number of bytes used by `Frame_Content_Size` according to the following table:
192
193| Value | 0 | 1 | 2 | 3 |
194|----------| --- | --- | --- | --- |
195|Field size| 1 | 2 | 4 | 8 |
196
197__`Unused_bit`__
Yann Collet2fa99042016-07-01 20:55:28 +0200198
Yann Colletf0bc6732016-07-13 17:30:21 +0200199The value of this bit should be set to zero.
200A decoder compliant with this specification version should not interpret it.
201It might be used in a future version,
202to signal a property which is not mandatory to properly decode the frame.
Yann Collet2fa99042016-07-01 20:55:28 +0200203
inikepf9c3cce2016-07-25 11:04:56 +0200204__`Reserved_bit`__
Yann Collet2fa99042016-07-01 20:55:28 +0200205
206This bit is reserved for some future feature.
207Its value _must be zero_.
208A decoder compliant with this specification version must ensure it is not set.
209This bit may be used in a future revision,
210to signal a feature that must be interpreted in order to decode the frame.
211
inikepf9c3cce2016-07-25 11:04:56 +0200212__`Content_Checksum_flag`__
Yann Collet2fa99042016-07-01 20:55:28 +0200213
inikepf9c3cce2016-07-25 11:04:56 +0200214If this flag is set, a content checksum will be present within `End_Marker`.
Yann Collet9ca73362016-07-05 10:53:38 +0200215The checksum is a 22 bits value extracted from the XXH64() of data,
inikepf9c3cce2016-07-25 11:04:56 +0200216and stored within `End_Marker`. See [`Content_Checksum`](#content_checksum) .
Yann Collet2fa99042016-07-01 20:55:28 +0200217
inikepf9c3cce2016-07-25 11:04:56 +0200218__`Dictionary_ID_flag`__
Yann Collet2fa99042016-07-01 20:55:28 +0200219
220This is a 2-bits flag (`= FHD & 3`),
Yann Collet9ca73362016-07-05 10:53:38 +0200221telling if a dictionary ID is provided within the header.
222It also specifies the size of this field.
Yann Collet2fa99042016-07-01 20:55:28 +0200223
inikepf9c3cce2016-07-25 11:04:56 +0200224| Value | 0 | 1 | 2 | 3 |
225| -------- | --- | --- | --- | --- |
226|Field size| 0 | 1 | 2 | 4 |
Yann Collet2fa99042016-07-01 20:55:28 +0200227
inikepf9c3cce2016-07-25 11:04:56 +0200228### `Window_Descriptor`
Yann Collet2fa99042016-07-01 20:55:28 +0200229
230Provides guarantees on maximum back-reference distance
231that will be present within compressed data.
232This information is useful for decoders to allocate enough memory.
233
inikepf9c3cce2016-07-25 11:04:56 +0200234The `Window_Descriptor` byte is optional. It should be absent if `Single_Segment_flag` is set.
235In this case, the maximum back-reference distance is the content size itself,
Yann Colletcd25a912016-07-05 11:50:37 +0200236which can be any value from 1 to 2^64-1 bytes (16 EB).
237
inikepf9c3cce2016-07-25 11:04:56 +0200238| Bit numbers | 7-3 | 0-2 |
239| ----------- | -------- | -------- |
240| Field name | Exponent | Mantissa |
Yann Collet2fa99042016-07-01 20:55:28 +0200241
242Maximum distance is given by the following formulae :
243```
244windowLog = 10 + Exponent;
245windowBase = 1 << windowLog;
246windowAdd = (windowBase / 8) * Mantissa;
247windowSize = windowBase + windowAdd;
248```
249The minimum window size is 1 KB.
Yann Colletcd25a912016-07-05 11:50:37 +0200250The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
Yann Collet2fa99042016-07-01 20:55:28 +0200251
252To properly decode compressed data,
253a decoder will need to allocate a buffer of at least `windowSize` bytes.
254
Yann Collet2fa99042016-07-01 20:55:28 +0200255In order to preserve decoder from unreasonable memory requirements,
256a decoder can refuse a compressed frame
257which requests a memory size beyond decoder's authorized range.
258
Yann Colletcd25a912016-07-05 11:50:37 +0200259For improved interoperability,
Yann Collet2fa99042016-07-01 20:55:28 +0200260decoders are recommended to be compatible with window sizes of 8 MB.
261Encoders are recommended to not request more than 8 MB.
262It's merely a recommendation though,
263decoders are free to support larger or lower limits,
264depending on local limitations.
265
inikepf9c3cce2016-07-25 11:04:56 +0200266### `Dictionary_ID`
Yann Collet23f05cc2016-07-04 16:13:11 +0200267
Yann Colletf6ff53c2016-07-15 17:03:38 +0200268This is a variable size field, which contains
269the ID of the dictionary required to properly decode the frame.
270Note that this field is optional. When it's not present,
Yann Collet23f05cc2016-07-04 16:13:11 +0200271it's up to the caller to make sure it uses the correct dictionary.
272
inikepf9c3cce2016-07-25 11:04:56 +0200273Field size depends on `Dictionary_ID_flag`.
Yann Collet23f05cc2016-07-04 16:13:11 +02002741 byte can represent an ID 0-255.
2752 bytes can represent an ID 0-65535.
Yann Colletcd25a912016-07-05 11:50:37 +02002764 bytes can represent an ID 0-4294967295.
Yann Collet23f05cc2016-07-04 16:13:11 +0200277
278It's allowed to represent a small ID (for example `13`)
Yann Colletcd25a912016-07-05 11:50:37 +0200279with a large 4-bytes dictionary ID, losing some compacity in the process.
Yann Collet23f05cc2016-07-04 16:13:11 +0200280
Yann Colletf6ff53c2016-07-15 17:03:38 +0200281_Reserved ranges :_
282If the frame is going to be distributed in a private environment,
283any dictionary ID can be used.
284However, for public distribution of compressed frames using a dictionary,
inikepf9c3cce2016-07-25 11:04:56 +0200285the following ranges are reserved for future use and should not be used :
286- low range : 1 - 32767
287- high range : >= (2^31)
Yann Colletf6ff53c2016-07-15 17:03:38 +0200288
289
inikepf9c3cce2016-07-25 11:04:56 +0200290### `Frame_Content_Size`
Yann Collet2fa99042016-07-01 20:55:28 +0200291
292This is the original (uncompressed) size.
inikepf9c3cce2016-07-25 11:04:56 +0200293This information is optional, and only present if `Single_Segment_flag` is set.
294Content size is provided using 1, 2, 4 or 8 bytes according to `Frame_Content_Size_flag`.
Yann Collet2fa99042016-07-01 20:55:28 +0200295Format is Little endian.
296
297| Field Size | Range |
298| ---------- | ---------- |
Yann Collet2fa99042016-07-01 20:55:28 +0200299| 1 | 0 - 255 |
300| 2 | 256 - 65791|
301| 4 | 0 - 2^32-1 |
302| 8 | 0 - 2^64-1 |
303
304When field size is 1, 4 or 8 bytes, the value is read directly.
inikepf9c3cce2016-07-25 11:04:56 +0200305When field size is 2, _the offset of 256 is added_.
306It's allowed to represent a small size (for example `18`) using any compatible variant.
Yann Collet2fa99042016-07-01 20:55:28 +0200307
308In order to preserve decoder from unreasonable memory requirement,
309a decoder can refuse a compressed frame
310which requests a memory size beyond decoder's authorized range.
311
Yann Collet2fa99042016-07-01 20:55:28 +0200312
inikepf9c3cce2016-07-25 11:04:56 +0200313The structure of `Data_Block`
314-----------------------------
315The structure of `Data_Block` is following:
Yann Collet2fa99042016-07-01 20:55:28 +0200316
inikepf9c3cce2016-07-25 11:04:56 +0200317| `Block_Type` | `Block_Size` | `Block_Content` |
318|:------------:|:------------:|:---------------:|
319| 2 bits | 22 bits | n bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200320
inikepf9c3cce2016-07-25 11:04:56 +0200321__`Block_Type` and `Block_Size`__
Yann Collet2fa99042016-07-01 20:55:28 +0200322
inikepf9c3cce2016-07-25 11:04:56 +0200323The block header uses 3-bytes, format is __little-endian__.
324The 2 highest bits represent the `Block_Type`,
325while the remaining 22 bits represent the (compressed) `Block_Size`.
Yann Collet2fa99042016-07-01 20:55:28 +0200326
327There are 4 block types :
328
inikepf9c3cce2016-07-25 11:04:56 +0200329| Value | 0 | 1 | 2 | 3 |
330| ------------ | ----------- | ----------- | ------------------ | --------- |
331| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `EndMark` |
Yann Collet2fa99042016-07-01 20:55:28 +0200332
inikepf9c3cce2016-07-25 11:04:56 +0200333- `Raw_Block` - this is an uncompressed block.
334 `Block_Size` is the number of bytes to read and copy.
335- `RLE_Block` - this is a single byte, repeated N times.
336 In which case, `Block_Size` is the size to regenerate,
337 while the "compressed" block is just 1 byte (the byte to repeat).
338- `Compressed_Block` - this is a [Zstandard compressed block](#the-format-of-compressed_block),
Yann Collet9ca73362016-07-05 10:53:38 +0200339 detailed in another section of this specification.
inikepf9c3cce2016-07-25 11:04:56 +0200340 `Block_Size` is the compressed size.
Yann Collet2fa99042016-07-01 20:55:28 +0200341 Decompressed size is unknown,
342 but its maximum possible value is guaranteed (see below)
inikepf9c3cce2016-07-25 11:04:56 +0200343- `EndMark` - this is not a block. It signals the end of the frame.
Yann Collet2fa99042016-07-01 20:55:28 +0200344 The rest of the field may be optionally filled by a checksum
inikepf9c3cce2016-07-25 11:04:56 +0200345 (see [`Content_Checksum`](#content_checksum)).
Yann Collet2fa99042016-07-01 20:55:28 +0200346
347Block sizes must respect a few rules :
Yann Collet9ca73362016-07-05 10:53:38 +0200348- In compressed mode, compressed size if always strictly `< decompressed size`.
349- Block decompressed size is always <= maximum back-reference distance .
350- Block decompressed size is always <= 128 KB
Yann Collet2fa99042016-07-01 20:55:28 +0200351
352
inikepf9c3cce2016-07-25 11:04:56 +0200353__`Block_Content`__
Yann Collet2fa99042016-07-01 20:55:28 +0200354
inikepf9c3cce2016-07-25 11:04:56 +0200355The `Block_Content` is where the actual data to decode stands.
Yann Collet2fa99042016-07-01 20:55:28 +0200356It might be compressed or not, depending on previous field indications.
357A data block is not necessarily "full" :
358since an arbitrary “flush” may happen anytime,
Yann Colletcd25a912016-07-05 11:50:37 +0200359block decompressed content can be any size,
inikepf9c3cce2016-07-25 11:04:56 +0200360up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
Yann Colletcd25a912016-07-05 11:50:37 +0200361- Maximum back-reference distance
Yann Collet2fa99042016-07-01 20:55:28 +0200362- 128 KB
363
364
365Skippable Frames
366----------------
367
inikepf9c3cce2016-07-25 11:04:56 +0200368| `Magic_Number` | `Frame_Size` | `User_Data` |
369|:--------------:|:------------:|:-----------:|
370| 4 bytes | 4 bytes | n bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200371
372Skippable frames allow the insertion of user-defined data
373into a flow of concatenated frames.
374Its design is pretty straightforward,
375with the sole objective to allow the decoder to quickly skip
376over user-defined data and continue decoding.
377
Yann Colletcd25a912016-07-05 11:50:37 +0200378Skippable frames defined in this specification are compatible with [LZ4] ones.
379
380[LZ4]:http://www.lz4.org
Yann Collet2fa99042016-07-01 20:55:28 +0200381
inikepf9c3cce2016-07-25 11:04:56 +0200382__`Magic_Number`__
Yann Collet2fa99042016-07-01 20:55:28 +0200383
3844 Bytes, Little endian format.
385Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
386All 16 values are valid to identify a skippable frame.
387
inikepf9c3cce2016-07-25 11:04:56 +0200388__`Frame_Size`__
Yann Collet2fa99042016-07-01 20:55:28 +0200389
inikepf9c3cce2016-07-25 11:04:56 +0200390This is the size, in bytes, of the following `User_Data`
Yann Collet2fa99042016-07-01 20:55:28 +0200391(without including the magic number nor the size field itself).
inikepf9c3cce2016-07-25 11:04:56 +0200392This field is represented using 4 Bytes, Little endian format, unsigned 32-bits.
393This means `User_Data` can’t be bigger than (2^32-1) bytes.
Yann Collet2fa99042016-07-01 20:55:28 +0200394
inikepf9c3cce2016-07-25 11:04:56 +0200395__`User_Data`__
Yann Collet2fa99042016-07-01 20:55:28 +0200396
inikepf9c3cce2016-07-25 11:04:56 +0200397The `User_Data` can be anything. Data will just be skipped by the decoder.
Yann Collet2fa99042016-07-01 20:55:28 +0200398
399
inikepf9c3cce2016-07-25 11:04:56 +0200400The format of `Compressed_Block`
401--------------------------------
402The size of `Compressed_Block` must be provided using `Block_Size` field from `Data_Block`.
403The `Compressed_Block` has a guaranteed maximum regenerated size,
Yann Collet2fa99042016-07-01 20:55:28 +0200404in order to properly allocate destination buffer.
inikepf9c3cce2016-07-25 11:04:56 +0200405See [`Data_Block`](#the-structure-of-data_block) for more details.
Yann Collet2fa99042016-07-01 20:55:28 +0200406
407A compressed block consists of 2 sections :
Yann Colletcd25a912016-07-05 11:50:37 +0200408- [Literals section](#literals-section)
409- [Sequences section](#sequences-section)
Yann Collet2fa99042016-07-01 20:55:28 +0200410
Yann Collet23f05cc2016-07-04 16:13:11 +0200411### Prerequisites
412To decode a compressed block, the following elements are necessary :
Yann Collet698cb632016-07-03 18:49:35 +0200413- Previous decoded blocks, up to a distance of `windowSize`,
inikepf9c3cce2016-07-25 11:04:56 +0200414 or all previous blocks when `Single_Segment_flag` is set.
Yann Collet698cb632016-07-03 18:49:35 +0200415- List of "recent offsets" from previous compressed block.
Yann Collet00d44ab2016-07-04 01:29:47 +0200416- Decoding tables of previous compressed block for each symbol type
Yann Collet9ca73362016-07-05 10:53:38 +0200417 (literals, litLength, matchLength, offset).
Yann Collet698cb632016-07-03 18:49:35 +0200418
419
Yann Collet00d44ab2016-07-04 01:29:47 +0200420### Literals section
Yann Collet2fa99042016-07-01 20:55:28 +0200421
Yann Collet2fa99042016-07-01 20:55:28 +0200422During sequence phase, literals will be entangled with match copy operations.
423All literals are regrouped in the first part of the block.
424They can be decoded first, and then copied during sequence operations,
Yann Collet00d44ab2016-07-04 01:29:47 +0200425or they can be decoded on the flow, as needed by sequence commands.
Yann Collet2fa99042016-07-01 20:55:28 +0200426
inikepf9c3cce2016-07-25 11:04:56 +0200427| Literals section header | [Huffman Tree Description] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
428| ----------------------- | -------------------------- | ------- | --------- | --------- | --------- |
Yann Collet2fa99042016-07-01 20:55:28 +0200429
inikepf9c3cce2016-07-25 11:04:56 +0200430Literals can be stored uncompressed or compressed using Huffman prefix codes.
Yann Collet2fa99042016-07-01 20:55:28 +0200431When compressed, an optional tree description can be present,
432followed by 1 or 4 streams.
433
inikepf9c3cce2016-07-25 11:04:56 +0200434
Yann Collet00d44ab2016-07-04 01:29:47 +0200435#### Literals section header
Yann Collet2fa99042016-07-01 20:55:28 +0200436
Yann Collet00d44ab2016-07-04 01:29:47 +0200437Header is in charge of describing how literals are packed.
Yann Collet2fa99042016-07-01 20:55:28 +0200438It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
Yann Collet198e6aa2016-07-20 20:12:24 +0200439using little-endian convention.
Yann Collet2fa99042016-07-01 20:55:28 +0200440
inikepf9c3cce2016-07-25 11:04:56 +0200441| Literals Block Type | sizes format | regenerated size | [compressed size] |
442| ------------------- | ------------ | ---------------- | ----------------- |
443| 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits |
Yann Collet198e6aa2016-07-20 20:12:24 +0200444
445In this representation, bits on the left are smallest bits.
Yann Collet2fa99042016-07-01 20:55:28 +0200446
inikepf9c3cce2016-07-25 11:04:56 +0200447__Literals Block Type__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200448
Yann Collet198e6aa2016-07-20 20:12:24 +0200449This field uses 2 lowest bits of first byte, describing 4 different block types :
Yann Collet2fa99042016-07-01 20:55:28 +0200450
inikepf9c3cce2016-07-25 11:04:56 +0200451| Value | 0 | 1 | 2 | 3 |
452| ------------------- | --- | --- | ---------- | ----------- |
453| Literals Block Type | Raw | RLE | Compressed | RepeatStats |
Yann Collet2fa99042016-07-01 20:55:28 +0200454
inikepf9c3cce2016-07-25 11:04:56 +0200455- Raw literals block - Literals are stored uncompressed.
456- RLE literals block - Literals consist of a single byte value repeated N times.
457- Compressed literals block - This is a standard huffman-compressed block,
Yann Colletcd25a912016-07-05 11:50:37 +0200458 starting with a huffman tree description.
459 See details below.
inikepf9c3cce2016-07-25 11:04:56 +0200460- Repeat Stats literals block - This is a huffman-compressed block,
Yann Colletcd25a912016-07-05 11:50:37 +0200461 using huffman tree _from previous huffman-compressed literals block_.
462 Huffman tree description will be skipped.
Yann Collet2fa99042016-07-01 20:55:28 +0200463
464__Sizes format__ :
465
466Sizes format are divided into 2 families :
467
468- For compressed block, it requires to decode both the compressed size
469 and the decompressed size. It will also decode the number of streams.
470- For Raw or RLE blocks, it's enough to decode the size to regenerate.
471
Yann Collet198e6aa2016-07-20 20:12:24 +0200472For values spanning several bytes, convention is Little-endian.
Yann Collet2fa99042016-07-01 20:55:28 +0200473
Yann Collet198e6aa2016-07-20 20:12:24 +0200474__Sizes format for Raw and RLE literals block__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200475
Yann Collet198e6aa2016-07-20 20:12:24 +0200476- Value : x0 : Regenerated size uses 5 bits (0-31).
Yann Collet2fa99042016-07-01 20:55:28 +0200477 Total literal header size is 1 byte.
Yann Collet198e6aa2016-07-20 20:12:24 +0200478 `size = h[0]>>3;`
479- Value : 01 : Regenerated size uses 12 bits (0-4095).
Yann Collet2fa99042016-07-01 20:55:28 +0200480 Total literal header size is 2 bytes.
Yann Collet198e6aa2016-07-20 20:12:24 +0200481 `size = (h[0]>>4) + (h[1]<<4);`
Yann Collet2fa99042016-07-01 20:55:28 +0200482- Value : 11 : Regenerated size uses 20 bits (0-1048575).
Yann Colletd916c902016-07-04 00:42:58 +0200483 Total literal header size is 3 bytes.
Yann Collet198e6aa2016-07-20 20:12:24 +0200484 `size = (h[0]>>4) + (h[1]<<4) + (h[2]<<12);`
Yann Collet2fa99042016-07-01 20:55:28 +0200485
486Note : it's allowed to represent a short value (ex : `13`)
487using a long format, accepting the reduced compacity.
488
inikepf9c3cce2016-07-25 11:04:56 +0200489__Sizes format for Compressed literals block and Repeat Stats literals block__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200490
Yann Colletc2e1a682016-07-22 17:30:52 +0200491- Value : 00 : _Single stream_.
Yann Colletcd25a912016-07-05 11:50:37 +0200492 Compressed and regenerated sizes use 10 bits (0-1023).
493 Total literal header size is 3 bytes.
Yann Colletc2e1a682016-07-22 17:30:52 +0200494- Value : 01 : 4 streams.
Yann Colletcd25a912016-07-05 11:50:37 +0200495 Compressed and regenerated sizes use 10 bits (0-1023).
496 Total literal header size is 3 bytes.
497- Value : 10 : 4 streams.
498 Compressed and regenerated sizes use 14 bits (0-16383).
499 Total literal header size is 4 bytes.
Yann Colletd9cc4422016-07-22 19:15:27 +0200500- Value : 11 : 4 streams.
Yann Colletcd25a912016-07-05 11:50:37 +0200501 Compressed and regenerated sizes use 18 bits (0-262143).
502 Total literal header size is 5 bytes.
Yann Collet2fa99042016-07-01 20:55:28 +0200503
inikepf9c3cce2016-07-25 11:04:56 +0200504Compressed and regenerated size fields follow little-endian convention.
Yann Collet698cb632016-07-03 18:49:35 +0200505
506#### Huffman Tree description
507
Yann Colletcd25a912016-07-05 11:50:37 +0200508This section is only present when literals block type is `Compressed` (`0`).
Yann Collet00d44ab2016-07-04 01:29:47 +0200509
Yann Collet9ca73362016-07-05 10:53:38 +0200510Prefix coding represents symbols from an a priori known alphabet
Yann Colletb21e9cb2016-07-15 17:31:13 +0200511by bit sequences (codewords), one codeword for each symbol,
Yann Collet9ca73362016-07-05 10:53:38 +0200512in a manner such that different symbols may be represented
513by bit sequences of different lengths,
514but a parser can always parse an encoded string
515unambiguously symbol-by-symbol.
Yann Collet00d44ab2016-07-04 01:29:47 +0200516
Yann Collet9ca73362016-07-05 10:53:38 +0200517Given an alphabet with known symbol frequencies,
518the Huffman algorithm allows the construction of an optimal prefix code
519using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet00d44ab2016-07-04 01:29:47 +0200520
Yann Collet9ca73362016-07-05 10:53:38 +0200521Prefix code must not exceed a maximum code length.
Yann Collet00d44ab2016-07-04 01:29:47 +0200522More bits improve accuracy but cost more header size,
Yann Collete557fd52016-07-17 16:21:37 +0200523and require more memory or more complex decoding operations.
524This specification limits maximum code length to 11 bits.
Yann Collet00d44ab2016-07-04 01:29:47 +0200525
Yann Collet698cb632016-07-03 18:49:35 +0200526
527##### Representation
528
Yann Collet00d44ab2016-07-04 01:29:47 +0200529All literal values from zero (included) to last present one (excluded)
Yann Collet698cb632016-07-03 18:49:35 +0200530are represented by `weight` values, from 0 to `maxBits`.
531Transformation from `weight` to `nbBits` follows this formulae :
532`nbBits = weight ? maxBits + 1 - weight : 0;` .
533The last symbol's weight is deduced from previously decoded ones,
534by completing to the nearest power of 2.
535This power of 2 gives `maxBits`, the depth of the current tree.
536
537__Example__ :
538Let's presume the following huffman tree must be described :
539
Yann Colletd916c902016-07-04 00:42:58 +0200540| literal | 0 | 1 | 2 | 3 | 4 | 5 |
541| ------- | --- | --- | --- | --- | --- | --- |
542| nbBits | 1 | 2 | 3 | 0 | 4 | 4 |
Yann Collet698cb632016-07-03 18:49:35 +0200543
544The tree depth is 4, since its smallest element uses 4 bits.
545Value `5` will not be listed, nor will values above `5`.
546Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
547Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
548It gives the following serie of weights :
549
Yann Colletd916c902016-07-04 00:42:58 +0200550| weights | 4 | 3 | 2 | 0 | 1 |
551| ------- | --- | --- | --- | --- | --- |
552| literal | 0 | 1 | 2 | 3 | 4 |
Yann Collet698cb632016-07-03 18:49:35 +0200553
554The decoder will do the inverse operation :
Yann Colletd916c902016-07-04 00:42:58 +0200555having collected weights of literals from `0` to `4`,
556it knows the last literal, `5`, is present with a non-zero weight.
Yann Colletcd25a912016-07-05 11:50:37 +0200557The weight of `5` can be deducted by joining to the nearest power of 2.
Yann Collet698cb632016-07-03 18:49:35 +0200558Sum of 2^(weight-1) (excluding 0) is :
Yann Colletd916c902016-07-04 00:42:58 +0200559`8 + 4 + 2 + 0 + 1 = 15`
Yann Collet698cb632016-07-03 18:49:35 +0200560Nearest power of 2 is 16.
561Therefore, `maxBits = 4` and `weight[5] = 1`.
Yann Collet698cb632016-07-03 18:49:35 +0200562
563##### Huffman Tree header
564
Yann Collet9ca73362016-07-05 10:53:38 +0200565This is a single byte value (0-255),
566which tells how to decode the list of weights.
Yann Collet698cb632016-07-03 18:49:35 +0200567
Yann Collet698cb632016-07-03 18:49:35 +0200568- if headerByte >= 128 : this is a direct representation,
569 where each weight is written directly as a 4 bits field (0-15).
Yann Colletcd25a912016-07-05 11:50:37 +0200570 The full representation occupies `((nbSymbols+1)/2)` bytes,
Yann Collet698cb632016-07-03 18:49:35 +0200571 meaning it uses a last full byte even if nbSymbols is odd.
Yann Collet26f68142016-07-08 10:42:59 +0200572 `nbSymbols = headerByte - 127;`.
Yann Collet38b75dd2016-07-24 15:35:59 +0200573 Note that maximum nbSymbols is 255-127 = 128.
Yann Collet26f68142016-07-08 10:42:59 +0200574 A larger serie must necessarily use FSE compression.
Yann Collet698cb632016-07-03 18:49:35 +0200575
576- if headerByte < 128 :
577 the serie of weights is compressed by FSE.
Yann Collet26f68142016-07-08 10:42:59 +0200578 The length of the FSE-compressed serie is `headerByte` (0-127).
Yann Collet698cb632016-07-03 18:49:35 +0200579
580##### FSE (Finite State Entropy) compression of huffman weights
581
Yann Collet26f68142016-07-08 10:42:59 +0200582The serie of weights is compressed using FSE compression.
Yann Collet698cb632016-07-03 18:49:35 +0200583It's a single bitstream with 2 interleaved states,
Yann Collet26f68142016-07-08 10:42:59 +0200584sharing a single distribution table.
Yann Collet698cb632016-07-03 18:49:35 +0200585
586To decode an FSE bitstream, it is necessary to know its compressed size.
587Compressed size is provided by `headerByte`.
Yann Collet38b75dd2016-07-24 15:35:59 +0200588It's also necessary to know its _maximum possible_ decompressed size,
Yann Collet26f68142016-07-08 10:42:59 +0200589which is `255`, since literal values span from `0` to `255`,
Yann Colletcd25a912016-07-05 11:50:37 +0200590and last symbol value is not represented.
Yann Collet698cb632016-07-03 18:49:35 +0200591
592An FSE bitstream starts by a header, describing probabilities distribution.
Yann Collet00d44ab2016-07-04 01:29:47 +0200593It will create a Decoding Table.
Yann Collet26f68142016-07-08 10:42:59 +0200594Table must be pre-allocated, which requires to support a maximum accuracy.
Yann Collet38b75dd2016-07-24 15:35:59 +0200595For a list of huffman weights, maximum accuracy is 7 bits.
Yann Collet698cb632016-07-03 18:49:35 +0200596
Yann Collet26f68142016-07-08 10:42:59 +0200597FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format),
598and so is [FSE bitstream](#bitstream).
599The main difference is that Huffman header compression uses 2 states,
600which share the same FSE distribution table.
Yann Collet38b75dd2016-07-24 15:35:59 +0200601Bitstream contains only FSE symbols (no interleaved "raw bitfields").
Yann Collet26f68142016-07-08 10:42:59 +0200602The number of symbols to decode is discovered
603by tracking bitStream overflow condition.
604When both states have overflowed the bitstream, end is reached.
605
Yann Collet698cb632016-07-03 18:49:35 +0200606
607##### Conversion from weights to huffman prefix codes
608
Yann Colletd916c902016-07-04 00:42:58 +0200609All present symbols shall now have a `weight` value.
Yann Collet38b75dd2016-07-24 15:35:59 +0200610It is possible to transform weights into nbBits, using this formula :
Yann Colletd916c902016-07-04 00:42:58 +0200611`nbBits = nbBits ? maxBits + 1 - weight : 0;` .
612
Yann Collet38b75dd2016-07-24 15:35:59 +0200613Symbols are sorted by weight. Within same weight, symbols keep natural order.
614Symbols with a weight of zero are removed.
615Then, starting from lowest weight, prefix codes are distributed in order.
Yann Colletd916c902016-07-04 00:42:58 +0200616
617__Example__ :
Yann Colletb21e9cb2016-07-15 17:31:13 +0200618Let's presume the following list of weights has been decoded :
Yann Colletd916c902016-07-04 00:42:58 +0200619
620| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
621| ------- | --- | --- | --- | --- | --- | --- |
622| weight | 4 | 3 | 2 | 0 | 1 | 1 |
623
624Sorted by weight and then natural order,
625it gives the following distribution :
626
627| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
628| ------------ | --- | --- | --- | --- | --- | ---- |
629| weight | 0 | 1 | 1 | 2 | 3 | 4 |
Yann Colletd916c902016-07-04 00:42:58 +0200630| nb bits | 0 | 4 | 4 | 3 | 2 | 1 |
Yann Colletb21e9cb2016-07-15 17:31:13 +0200631| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
Yann Colletd916c902016-07-04 00:42:58 +0200632
633
Yann Colletd916c902016-07-04 00:42:58 +0200634#### Literals bitstreams
635
636##### Bitstreams sizes
637
638As seen in a previous paragraph,
639there are 2 flavors of huffman-compressed literals :
640single stream, and 4-streams.
641
6424-streams is useful for CPU with multiple execution units and OoO operations.
643Since each stream can be decoded independently,
644it's possible to decode them up to 4x faster than a single stream,
645presuming the CPU has enough parallelism available.
646
647For single stream, header provides both the compressed and regenerated size.
648For 4-streams though,
649header only provides compressed and regenerated size of all 4 streams combined.
Yann Colletd916c902016-07-04 00:42:58 +0200650In order to properly decode the 4 streams,
651it's necessary to know the compressed and regenerated size of each stream.
652
Yann Collet38b75dd2016-07-24 15:35:59 +0200653Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
654except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
Yann Colletd916c902016-07-04 00:42:58 +0200655
Yann Collet38b75dd2016-07-24 15:35:59 +0200656Compressed size is provided explicitly : in the 4-streams variant,
Yann Collet00d44ab2016-07-04 01:29:47 +0200657bitstreams are preceded by 3 unsigned Little Endian 16-bits values.
658Each value represents the compressed size of one stream, in order.
Yann Colletd916c902016-07-04 00:42:58 +0200659The last stream size is deducted from total compressed size
Yann Collet38b75dd2016-07-24 15:35:59 +0200660and from previously decoded stream sizes :
Yann Colletd916c902016-07-04 00:42:58 +0200661`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
662
Yann Collet00d44ab2016-07-04 01:29:47 +0200663##### Bitstreams read and decode
Yann Colletd916c902016-07-04 00:42:58 +0200664
665Each bitstream must be read _backward_,
666that is starting from the end down to the beginning.
667Therefore it's necessary to know the size of each bitstream.
668
669It's also necessary to know exactly which _bit_ is the latest.
670This is detected by a final bit flag :
671the highest bit of latest byte is a final-bit-flag.
672Consequently, a last byte of `0` is not possible.
673And the final-bit-flag itself is not part of the useful bitstream.
Yann Collet38b75dd2016-07-24 15:35:59 +0200674Hence, the last byte contains between 0 and 7 useful bits.
Yann Colletd916c902016-07-04 00:42:58 +0200675
676Starting from the end,
677it's possible to read the bitstream in a little-endian fashion,
678keeping track of already used bits.
679
Yann Collet00d44ab2016-07-04 01:29:47 +0200680Reading the last `maxBits` bits,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200681it's then possible to compare extracted value to decoding table,
Yann Colletd916c902016-07-04 00:42:58 +0200682determining the symbol to decode and number of bits to discard.
683
684The process continues up to reading the required number of symbols per stream.
685If a bitstream is not entirely and exactly consumed,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200686hence reaching exactly its beginning position with _all_ bits consumed,
Yann Colletd916c902016-07-04 00:42:58 +0200687the decoding process is considered faulty.
688
Yann Collet698cb632016-07-03 18:49:35 +0200689
Yann Collet00d44ab2016-07-04 01:29:47 +0200690### Sequences section
691
692A compressed block is a succession of _sequences_ .
693A sequence is a literal copy command, followed by a match copy command.
694A literal copy command specifies a length.
695It is the number of bytes to be copied (or extracted) from the literal section.
696A match copy command specifies an offset and a length.
697The offset gives the position to copy from,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200698which can be within a previous block.
Yann Collet00d44ab2016-07-04 01:29:47 +0200699
Yann Colletcd25a912016-07-05 11:50:37 +0200700There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
Yann Collet00d44ab2016-07-04 01:29:47 +0200701which are encoded together, interleaved in a single _bitstream_.
702
Yann Colletcd25a912016-07-05 11:50:37 +0200703Each symbol is a _code_ in its own context,
704which specifies a baseline and a number of bits to add.
Yann Collet00d44ab2016-07-04 01:29:47 +0200705_Codes_ are FSE compressed,
706and interleaved with raw additional bits in the same bitstream.
707
Yann Collet23f05cc2016-07-04 16:13:11 +0200708The Sequences section starts by a header,
709followed by optional Probability tables for each symbol type,
Yann Collet00d44ab2016-07-04 01:29:47 +0200710followed by the bitstream.
711
Yann Collet6fa05a22016-07-20 14:58:49 +0200712| Header | [LitLengthTable] | [OffsetTable] | [MatchLengthTable] | bitStream |
Yann Colletc40ba712016-07-08 15:39:02 +0200713| ------ | ---------------- | ------------- | ------------------ | --------- |
714
Yann Collet23f05cc2016-07-04 16:13:11 +0200715To decode the Sequence section, it's required to know its size.
Yann Colletcd25a912016-07-05 11:50:37 +0200716This size is deducted from `blockSize - literalSectionSize`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200717
718
Yann Collet00d44ab2016-07-04 01:29:47 +0200719#### Sequences section header
720
Yann Collet23f05cc2016-07-04 16:13:11 +0200721Consists in 2 items :
722- Nb of Sequences
723- Flags providing Symbol compression types
724
725__Nb of Sequences__
726
727This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
728Let's call its first byte `byte0`.
729- `if (byte0 == 0)` : there are no sequences.
730 The sequence section stops there.
731 Regenerated content is defined entirely by literals section.
Yann Colletcd25a912016-07-05 11:50:37 +0200732- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
733- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
734- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
Yann Collet23f05cc2016-07-04 16:13:11 +0200735
Yann Collet10b9c132016-07-24 01:21:53 +0200736__Symbol encoding modes__
Yann Collet23f05cc2016-07-04 16:13:11 +0200737
738This is a single byte, defining the compression mode of each symbol type.
739
740| BitNb | 7-6 | 5-4 | 3-2 | 1-0 |
741| ------- | ------ | ------ | ------ | -------- |
Yann Collet10b9c132016-07-24 01:21:53 +0200742|FieldName| LLType | OFType | MLType | Reserved |
Yann Collet23f05cc2016-07-04 16:13:11 +0200743
744The last field, `Reserved`, must be all-zeroes.
745
Yann Collet10b9c132016-07-24 01:21:53 +0200746`LLType`, `OFType` and `MLType` define the compression mode of
Yann Collet23f05cc2016-07-04 16:13:11 +0200747Literal Lengths, Offsets and Match Lengths respectively.
748
749They follow the same enumeration :
750
Yann Colletf8e7b532016-07-23 16:31:49 +0200751| Value | 0 | 1 | 2 | 3 |
752| ---------------- | ------ | --- | ---------- | ------ |
753| Compression Mode | predef | RLE | Compressed | Repeat |
Yann Collet23f05cc2016-07-04 16:13:11 +0200754
755- "predef" : uses a pre-defined distribution table.
756- "RLE" : it's a single code, repeated `nbSeqs` times.
757- "Repeat" : re-use distribution table from previous compressed block.
Yann Colletf8e7b532016-07-23 16:31:49 +0200758- "Compressed" : standard FSE compression.
Yann Colletcd25a912016-07-05 11:50:37 +0200759 A distribution table will be present.
760 It will be described in [next part](#distribution-tables).
Yann Collet23f05cc2016-07-04 16:13:11 +0200761
762#### Symbols decoding
763
764##### Literal Lengths codes
765
766Literal lengths codes are values ranging from `0` to `35` included.
767They define lengths from 0 to 131071 bytes.
768
769| Code | 0-15 |
770| ------ | ---- |
Yann Colletc40ba712016-07-08 15:39:02 +0200771| length | Code |
Yann Colletcd25a912016-07-05 11:50:37 +0200772| nbBits | 0 |
773
Yann Collet23f05cc2016-07-04 16:13:11 +0200774
775| Code | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
776| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
777| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
778| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
779
780| Code | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
781| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
782| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
783| nb Bits | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
784
785| Code | 32 | 33 | 34 | 35 |
786| -------- | ---- | ---- | ---- | ---- |
787| Baseline | 8192 |16384 |32768 |65536 |
788| nb Bits | 13 | 14 | 15 | 16 |
789
790__Default distribution__
791
Yann Colletcd25a912016-07-05 11:50:37 +0200792When "compression mode" is "predef"",
Yann Collet23f05cc2016-07-04 16:13:11 +0200793a pre-defined distribution is used for FSE compression.
794
Yann Colletc40ba712016-07-08 15:39:02 +0200795Below is its definition. It uses an accuracy of 6 bits (64 states).
Yann Collet23f05cc2016-07-04 16:13:11 +0200796```
797short literalLengths_defaultDistribution[36] =
798 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
799 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
800 -1,-1,-1,-1 };
801```
802
803##### Match Lengths codes
804
805Match lengths codes are values ranging from `0` to `52` included.
806They define lengths from 3 to 131074 bytes.
807
808| Code | 0-31 |
809| ------ | -------- |
Yann Collet23f05cc2016-07-04 16:13:11 +0200810| value | Code + 3 |
Yann Colletcd25a912016-07-05 11:50:37 +0200811| nbBits | 0 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200812
813| Code | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
814| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
815| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
816| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
817
818| Code | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
819| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
820| Baseline | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
821| nb Bits | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
822
823| Code | 48 | 49 | 50 | 51 | 52 |
824| -------- | ---- | ---- | ---- | ---- | ---- |
825| Baseline | 4098 | 8194 |16486 |32770 |65538 |
826| nb Bits | 12 | 13 | 14 | 15 | 16 |
827
828__Default distribution__
829
Yann Colletc40ba712016-07-08 15:39:02 +0200830When "compression mode" is defined as "predef",
Yann Collet23f05cc2016-07-04 16:13:11 +0200831a pre-defined distribution is used for FSE compression.
832
833Here is its definition. It uses an accuracy of 6 bits (64 states).
834```
835short matchLengths_defaultDistribution[53] =
836 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
837 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
838 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
839 -1,-1,-1,-1,-1 };
840```
841
842##### Offset codes
843
844Offset codes are values ranging from `0` to `N`,
845with `N` being limited by maximum backreference distance.
846
Yann Colletcd25a912016-07-05 11:50:37 +0200847A decoder is free to limit its maximum `N` supported.
848Recommendation is to support at least up to `22`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200849For information, at the time of this writing.
850the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
851
852An offset code is also the nb of additional bits to read,
853and can be translated into an `OFValue` using the following formulae :
854
855```
856OFValue = (1 << offsetCode) + readNBits(offsetCode);
857if (OFValue > 3) offset = OFValue - 3;
858```
859
860OFValue from 1 to 3 are special : they define "repeat codes",
861which means one of the previous offsets will be repeated.
862They are sorted in recency order, with 1 meaning the most recent one.
Yann Colletcd25a912016-07-05 11:50:37 +0200863See [Repeat offsets](#repeat-offsets) paragraph.
Yann Collet23f05cc2016-07-04 16:13:11 +0200864
865__Default distribution__
866
Yann Colletcd25a912016-07-05 11:50:37 +0200867When "compression mode" is defined as "predef",
Yann Collet23f05cc2016-07-04 16:13:11 +0200868a pre-defined distribution is used for FSE compression.
869
870Here is its definition. It uses an accuracy of 5 bits (32 states),
Yann Colletcd25a912016-07-05 11:50:37 +0200871and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
Yann Collet23f05cc2016-07-04 16:13:11 +0200872
873If any sequence in the compressed block requires an offset larger than this,
874it's not possible to use the default distribution to represent it.
875
876```
877short offsetCodes_defaultDistribution[53] =
878 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
879 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
880```
881
882#### Distribution tables
883
884Following the header, up to 3 distribution tables can be described.
Yann Collet10b9c132016-07-24 01:21:53 +0200885When present, they are in this order :
Yann Collet23f05cc2016-07-04 16:13:11 +0200886- Literal lengthes
887- Offsets
888- Match Lengthes
889
Yann Collet10b9c132016-07-24 01:21:53 +0200890The content to decode depends on their respective encoding mode :
Yann Collet23f05cc2016-07-04 16:13:11 +0200891- Predef : no content. Use pre-defined distribution table.
892- RLE : 1 byte. This is the only code to use across the whole compressed block.
893- FSE : A distribution table is present.
Yann Collet10b9c132016-07-24 01:21:53 +0200894- Repeat mode : no content. Re-use distribution from previous compressed block.
Yann Collet23f05cc2016-07-04 16:13:11 +0200895
896##### FSE distribution table : condensed format
897
898An FSE distribution table describes the probabilities of all symbols
899from `0` to the last present one (included)
Yann Collet9ca73362016-07-05 10:53:38 +0200900on a normalized scale of `1 << AccuracyLog` .
Yann Collet23f05cc2016-07-04 16:13:11 +0200901
902It's a bitstream which is read forward, in little-endian fashion.
903It's not necessary to know its exact size,
904since it will be discovered and reported by the decoding process.
905
906The bitstream starts by reporting on which scale it operates.
907`AccuracyLog = low4bits + 5;`
Yann Collet9d6e9492016-07-22 19:32:07 +0200908Note that maximum `AccuracyLog` for literal and match lengthes is `9`,
909and for offsets it is `8`. Higher values are considered errors.
Yann Collet23f05cc2016-07-04 16:13:11 +0200910
911Then follow each symbol value, from `0` to last present one.
912The nb of bits used by each field is variable.
913It depends on :
914
915- Remaining probabilities + 1 :
916 __example__ :
917 Presuming an AccuracyLog of 8,
918 and presuming 100 probabilities points have already been distributed,
Yann Colletcd25a912016-07-05 11:50:37 +0200919 the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
Yann Collet23f05cc2016-07-04 16:13:11 +0200920 Therefore, it must read `log2sup(156) == 8` bits.
921
922- Value decoded : small values use 1 less bit :
923 __example__ :
924 Presuming values from 0 to 156 (included) are possible,
925 255-156 = 99 values are remaining in an 8-bits field.
926 They are used this way :
927 first 99 values (hence from 0 to 98) use only 7 bits,
928 values from 99 to 156 use 8 bits.
929 This is achieved through this scheme :
930
931 | Value read | Value decoded | nb Bits used |
932 | ---------- | ------------- | ------------ |
933 | 0 - 98 | 0 - 98 | 7 |
934 | 99 - 127 | 99 - 127 | 8 |
935 | 128 - 226 | 0 - 98 | 7 |
936 | 227 - 255 | 128 - 156 | 8 |
937
938Symbols probabilities are read one by one, in order.
939
940Probability is obtained from Value decoded by following formulae :
941`Proba = value - 1;`
942
943It means value `0` becomes negative probability `-1`.
944`-1` is a special probability, which means `less than 1`.
Yann Colletc40ba712016-07-08 15:39:02 +0200945Its effect on distribution table is described in [next paragraph].
Yann Collet23f05cc2016-07-04 16:13:11 +0200946For the purpose of calculating cumulated distribution, it counts as one.
947
Yann Colletc40ba712016-07-08 15:39:02 +0200948[next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables
949
Yann Collet23f05cc2016-07-04 16:13:11 +0200950When a symbol has a probability of `zero`,
951it is followed by a 2-bits repeat flag.
952This repeat flag tells how many probabilities of zeroes follow the current one.
953It provides a number ranging from 0 to 3.
954If it is a 3, another 2-bits repeat flag follows, and so on.
955
Yann Collet9ca73362016-07-05 10:53:38 +0200956When last symbol reaches cumulated total of `1 << AccuracyLog`,
Yann Collet23f05cc2016-07-04 16:13:11 +0200957decoding is complete.
Yann Collet9ca73362016-07-05 10:53:38 +0200958If the last symbol makes cumulated total go above `1 << AccuracyLog`,
Yann Collet23f05cc2016-07-04 16:13:11 +0200959distribution is considered corrupted.
960
Yann Collet10b9c132016-07-24 01:21:53 +0200961Then the decoder can tell how many bytes were used in this process,
962and how many symbols are present.
963The bitstream consumes a round number of bytes.
964Any remaining bit within the last byte is just unused.
965
Yann Collet23f05cc2016-07-04 16:13:11 +0200966##### FSE decoding : from normalized distribution to decoding tables
967
Yann Collet9ca73362016-07-05 10:53:38 +0200968The distribution of normalized probabilities is enough
969to create a unique decoding table.
970
971It follows the following build rule :
972
973The table has a size of `tableSize = 1 << AccuracyLog;`.
974Each cell describes the symbol decoded,
975and instructions to get the next state.
976
977Symbols are scanned in their natural order for `less than 1` probabilities.
978Symbols with this probability are being attributed a single cell,
979starting from the end of the table.
980These symbols define a full state reset, reading `AccuracyLog` bits.
981
982All remaining symbols are sorted in their natural order.
983Starting from symbol `0` and table position `0`,
984each symbol gets attributed as many cells as its probability.
985Cell allocation is spreaded, not linear :
986each successor position follow this rule :
987
Yann Colletcd25a912016-07-05 11:50:37 +0200988```
989position += (tableSize>>1) + (tableSize>>3) + 3;
990position &= tableSize-1;
991```
Yann Collet9ca73362016-07-05 10:53:38 +0200992
993A position is skipped if already occupied,
994typically by a "less than 1" probability symbol.
995
996The result is a list of state values.
997Each state will decode the current symbol.
998
999To get the Number of bits and baseline required for next state,
1000it's first necessary to sort all states in their natural order.
Yann Colletcd25a912016-07-05 11:50:37 +02001001The lower states will need 1 more bit than higher ones.
Yann Collet9ca73362016-07-05 10:53:38 +02001002
1003__Example__ :
1004Presuming a symbol has a probability of 5.
Yann Colletcd25a912016-07-05 11:50:37 +02001005It receives 5 state values. States are sorted in natural order.
Yann Collet9ca73362016-07-05 10:53:38 +02001006
1007Next power of 2 is 8.
1008Space of probabilities is divided into 8 equal parts.
1009Presuming the AccuracyLog is 7, it defines 128 states.
1010Divided by 8, each share is 16 large.
1011
1012In order to reach 8, 8-5=3 lowest states will count "double",
1013taking shares twice larger,
1014requiring one more bit in the process.
1015
1016Numbering starts from higher states using less bits.
1017
1018| state order | 0 | 1 | 2 | 3 | 4 |
1019| ----------- | ----- | ----- | ------ | ---- | ----- |
1020| width | 32 | 32 | 32 | 16 | 16 |
1021| nb Bits | 5 | 5 | 5 | 4 | 4 |
1022| range nb | 2 | 4 | 6 | 0 | 1 |
1023| baseline | 32 | 64 | 96 | 0 | 16 |
1024| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
1025
1026Next state is determined from current state
1027by reading the required number of bits, and adding the specified baseline.
Yann Collet23f05cc2016-07-04 16:13:11 +02001028
1029
1030#### Bitstream
Yann Collet00d44ab2016-07-04 01:29:47 +02001031
Yann Collet9ca73362016-07-05 10:53:38 +02001032All sequences are stored in a single bitstream, read _backward_.
1033It is therefore necessary to know the bitstream size,
1034which is deducted from compressed block size.
1035
Yann Colletc40ba712016-07-08 15:39:02 +02001036The last useful bit of the stream is followed by an end-bit-flag.
Yann Colletcd25a912016-07-05 11:50:37 +02001037Highest bit of last byte is this flag.
Yann Collet9ca73362016-07-05 10:53:38 +02001038It does not belong to the useful part of the bitstream.
1039Therefore, last byte has 0-7 useful bits.
1040Note that it also means that last byte cannot be `0`.
1041
1042##### Starting states
1043
1044The bitstream starts with initial state values,
1045each using the required number of bits in their respective _accuracy_,
1046decoded previously from their normalized distribution.
1047
1048It starts by `Literal Length State`,
1049followed by `Offset State`,
1050and finally `Match Length State`.
1051
1052Reminder : always keep in mind that all values are read _backward_.
1053
1054##### Decoding a sequence
1055
1056A state gives a code.
1057A code provides a baseline and number of bits to add.
1058See [Symbol Decoding] section for details on each symbol.
1059
1060Decoding starts by reading the nb of bits required to decode offset.
1061It then does the same for match length,
1062and then for literal length.
1063
Yann Colletc40ba712016-07-08 15:39:02 +02001064Offset / matchLength / litLength define a sequence.
1065It starts by inserting the number of literals defined by `litLength`,
1066then continue by copying `matchLength` bytes from `currentPos - offset`.
Yann Collet9ca73362016-07-05 10:53:38 +02001067
1068The next operation is to update states.
1069Using rules pre-calculated in the decoding tables,
1070`Literal Length State` is updated,
1071followed by `Match Length State`,
1072and then `Offset State`.
1073
1074This operation will be repeated `NbSeqs` times.
1075At the end, the bitstream shall be entirely consumed,
1076otherwise bitstream is considered corrupted.
1077
1078[Symbol Decoding]:#symbols-decoding
1079
1080##### Repeat offsets
1081
1082As seen in [Offset Codes], the first 3 values define a repeated offset.
1083They are sorted in recency order, with 1 meaning "most recent one".
1084
1085There is an exception though, when current sequence's literal length is `0`.
1086In which case, 1 would just make previous match longer.
1087Therefore, in such case, 1 means in fact 2, and 2 is impossible.
1088Meaning of 3 is unmodified.
1089
1090Repeat offsets start with the following values : 1, 4 and 8 (in order).
1091
1092Then each block receives its start value from previous compressed block.
1093Note that non-compressed blocks are skipped,
1094they do not contribute to offset history.
1095
1096[Offset Codes]: #offset-codes
1097
1098###### Offset updates rules
1099
1100When the new offset is a normal one,
1101offset history is simply translated by one position,
1102with the new offset taking first spot.
1103
1104- When repeat offset 1 (most recent) is used, history is unmodified.
1105- When repeat offset 2 is used, it's swapped with offset 1.
1106- When repeat offset 3 is used, it takes first spot,
1107 pushing the other ones by one position.
Yann Collet00d44ab2016-07-04 01:29:47 +02001108
Yann Collet2fa99042016-07-01 20:55:28 +02001109
Yann Colletbd106072016-07-08 19:16:57 +02001110Dictionary format
1111-----------------
1112
1113`zstd` is compatible with "pure content" dictionaries, free of any format restriction.
1114But dictionaries created by `zstd --train` follow a format, described here.
1115
1116__Pre-requisites__ : a dictionary has a known length,
1117 defined either by a buffer limit, or a file size.
1118
1119| Header | DictID | Stats | Content |
1120| ------ | ------ | ----- | ------- |
1121
1122__Header__ : 4 bytes ID, value 0xEC30A437, Little Endian format
1123
1124__Dict_ID__ : 4 bytes, stored in Little Endian format.
1125 DictID can be any value, except 0 (which means no DictID).
Yann Collet722e14b2016-07-08 19:22:16 +02001126 It's used by decoders to check if they use the correct dictionary.
Yann Colletf6ff53c2016-07-15 17:03:38 +02001127 _Reserved ranges :_
1128 If the frame is going to be distributed in a private environment,
1129 any dictionary ID can be used.
1130 However, for public distribution of compressed frames,
1131 some ranges are reserved for future use :
Yann Collet6cacd342016-07-15 17:58:13 +02001132
1133 - low range : 1 - 32767 : reserved
1134 - high range : >= (2^31) : reserved
Yann Colletbd106072016-07-08 19:16:57 +02001135
1136__Stats__ : Entropy tables, following the same format as a [compressed blocks].
1137 They are stored in following order :
1138 Huffman tables for literals, FSE table for offset,
Yann Collet722e14b2016-07-08 19:22:16 +02001139 FSE table for matchLenth, and FSE table for litLength.
1140 It's finally followed by 3 offset values, populating recent offsets,
1141 stored in order, 4-bytes little endian each, for a total of 12 bytes.
Yann Colletbd106072016-07-08 19:16:57 +02001142
1143__Content__ : Where the actual dictionary content is.
Yann Collet722e14b2016-07-08 19:22:16 +02001144 Content size depends on Dictionary size.
Yann Colletbd106072016-07-08 19:16:57 +02001145
inikepf9c3cce2016-07-25 11:04:56 +02001146[compressed blocks]: #the-format-of-compressed_block
Yann Colletbd106072016-07-08 19:16:57 +02001147
Yann Collet2fa99042016-07-01 20:55:28 +02001148
1149Version changes
1150---------------
Yann Collet6fa05a22016-07-20 14:58:49 +02001151- 0.2.0 : numerous format adjustments for zstd v0.8
Yann Collete557fd52016-07-17 16:21:37 +02001152- 0.1.2 : limit huffman tree depth to 11 bits
1153- 0.1.1 : reserved dictID ranges
1154- 0.1.0 : initial release