| Yann Collet | 5cc1882 | 2016-07-03 19:03:13 +0200 | [diff] [blame] | 1 | Zstandard Compression Format |
| 2 | ============================ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 3 | |
| 4 | ### Notices |
| 5 | |
| W. Felix Handte | 5d693cc | 2022-12-20 12:49:47 -0500 | [diff] [blame] | 6 | Copyright (c) Meta Platforms, Inc. and affiliates. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 7 | |
| 8 | Permission is granted to copy and distribute this document |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 9 | for any purpose and without charge, |
| 10 | including translations into other languages |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 11 | and incorporation into compilations, |
| 12 | provided that the copyright notice and this notice are preserved, |
| 13 | and that any substantive changes or deletions from the original |
| 14 | are clearly marked. |
| 15 | Distribution of this document is unlimited. |
| 16 | |
| 17 | ### Version |
| 18 | |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 19 | 0.4.3 (2024-10-07) |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 20 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 21 | |
| 22 | Introduction |
| 23 | ------------ |
| 24 | |
| 25 | The purpose of this document is to define a lossless compressed data format, |
| 26 | that is independent of CPU type, operating system, |
| 27 | file system and character set, suitable for |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 28 | file compression, pipe and streaming compression, |
| Danielle Rozenblit | 4dffc35 | 2022-12-14 06:58:35 -0800 | [diff] [blame] | 29 | using the [Zstandard algorithm](https://facebook.github.io/zstd/). |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 30 | The text of the specification assumes a basic background in programming |
| 31 | at the level of bits and other primitive data representations. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 32 | |
| 33 | The data can be produced or consumed, |
| 34 | even for an arbitrarily long sequentially presented input data stream, |
| 35 | using only an a priori bounded amount of intermediate storage, |
| 36 | and hence can be used in data communications. |
| 37 | The format uses the Zstandard compression method, |
| Danielle Rozenblit | 4dffc35 | 2022-12-14 06:58:35 -0800 | [diff] [blame] | 38 | and optional [xxHash-64 checksum method](https://cyan4973.github.io/xxHash/), |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 39 | for detection of data corruption. |
| 40 | |
| 41 | The data format defined by this specification |
| 42 | does not attempt to allow random access to compressed data. |
| 43 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 44 | Unless otherwise indicated below, |
| 45 | a compliant compressor must produce data sets |
| 46 | that conform to the specifications presented here. |
| 47 | It doesn’t need to support all options though. |
| 48 | |
| 49 | A compliant decompressor must be able to decompress |
| 50 | at least one working set of parameters |
| 51 | that conforms to the specifications presented here. |
| 52 | It may also ignore informative fields, such as checksum. |
| 53 | Whenever it does not support a parameter defined in the compressed stream, |
| 54 | it must produce a non-ambiguous error code and associated error message |
| 55 | explaining which parameter is unsupported. |
| 56 | |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 57 | This specification is intended for use by implementers of software |
| 58 | to compress data into Zstandard format and/or decompress data from Zstandard format. |
| 59 | The Zstandard format is supported by an open source reference implementation, |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 60 | written in portable C, and available at : https://github.com/facebook/zstd . |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 61 | |
| 62 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 63 | ### Overall conventions |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 64 | In this document: |
| 65 | - square brackets i.e. `[` and `]` are used to indicate optional fields or parameters. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 66 | - the naming convention for identifiers is `Mixed_Case_With_Underscores` |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 67 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 68 | ### Definitions |
| 69 | Content compressed by Zstandard is transformed into a Zstandard __frame__. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 70 | Multiple frames can be appended into a single file or stream. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 71 | A frame is completely independent, has a defined beginning and end, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 72 | and a set of parameters which tells the decoder how to decompress it. |
| 73 | |
| 74 | A frame encapsulates one or multiple __blocks__. |
| Yann Collet | 72a3adf | 2018-09-25 16:34:26 -0700 | [diff] [blame] | 75 | Each block contains arbitrary content, which is described by its header, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 76 | and has a guaranteed maximum content size, which depends on frame parameters. |
| 77 | Unlike frames, each block depends on previous blocks for proper decoding. |
| 78 | However, each block can be decompressed without waiting for its successor, |
| 79 | allowing streaming operations. |
| 80 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 81 | Overview |
| 82 | --------- |
| 83 | - [Frames](#frames) |
| 84 | - [Zstandard frames](#zstandard-frames) |
| 85 | - [Blocks](#blocks) |
| 86 | - [Literals Section](#literals-section) |
| 87 | - [Sequences Section](#sequences-section) |
| 88 | - [Sequence Execution](#sequence-execution) |
| 89 | - [Skippable frames](#skippable-frames) |
| 90 | - [Entropy Encoding](#entropy-encoding) |
| 91 | - [FSE](#fse) |
| 92 | - [Huffman Coding](#huffman-coding) |
| 93 | - [Dictionary Format](#dictionary-format) |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 94 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 95 | Frames |
| 96 | ------ |
| Yann Collet | fccb46f | 2017-11-18 11:28:00 -0800 | [diff] [blame] | 97 | Zstandard compressed data is made of one or more __frames__. |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 98 | Each frame is independent and can be decompressed independently of other frames. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 99 | The decompressed content of multiple concatenated frames is the concatenation of |
| Yann Collet | fccb46f | 2017-11-18 11:28:00 -0800 | [diff] [blame] | 100 | each frame decompressed content. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 101 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 102 | There are two frame formats defined by Zstandard: |
| 103 | Zstandard frames and Skippable frames. |
| 104 | Zstandard frames contain compressed data, while |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 105 | skippable frames contain custom user metadata. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 106 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 107 | ## Zstandard frames |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 108 | The structure of a single Zstandard frame is following: |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 109 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 110 | | `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] | |
| 111 | |:--------------:|:--------------:|:----------:| ------------------ |:--------------------:| |
| Yann Collet | 8b12812 | 2017-08-19 12:17:57 -0700 | [diff] [blame] | 112 | | 4 bytes | 2-14 bytes | n bytes | | 0-4 bytes | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 113 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 114 | __`Magic_Number`__ |
| 115 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 116 | 4 Bytes, __little-endian__ format. |
| Yann Collet | 7bdfcea | 2016-09-05 17:43:31 +0200 | [diff] [blame] | 117 | Value : 0xFD2FB528 |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 118 | Note: This value was selected to be less probable to find at the beginning of some random file. |
| 119 | It avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.), |
| 120 | contains byte values outside of ASCII range, |
| 121 | and doesn't map into UTF8 space. |
| 122 | It reduces the chances that a text file represent this value by accident. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 123 | |
| 124 | __`Frame_Header`__ |
| 125 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 126 | 2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header). |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 127 | |
| 128 | __`Data_Block`__ |
| 129 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 130 | Detailed in [`Blocks`](#blocks). |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 131 | That’s where compressed data is stored. |
| 132 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 133 | __`Content_Checksum`__ |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 134 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 135 | An optional 32-bit checksum, only present if `Content_Checksum_flag` is set. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 136 | The content checksum is the result |
| Danielle Rozenblit | 4dffc35 | 2022-12-14 06:58:35 -0800 | [diff] [blame] | 137 | of [xxh64() hash function](https://cyan4973.github.io/xxHash/) |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 138 | digesting the original (decoded) data as input, and a seed of zero. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 139 | The low 4 bytes of the checksum are stored in __little-endian__ format. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 140 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 141 | ### `Frame_Header` |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 142 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 143 | The `Frame_Header` has a variable size, with a minimum of 2 bytes, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 144 | and up to 14 bytes depending on optional parameters. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 145 | The structure of `Frame_Header` is following: |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 146 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 147 | | `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] | |
| 148 | | ------------------------- | --------------------- | ----------------- | ---------------------- | |
| 149 | | 1 byte | 0-1 byte | 0-4 bytes | 0-8 bytes | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 150 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 151 | #### `Frame_Header_Descriptor` |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 152 | |
| 153 | The first header's byte is called the `Frame_Header_Descriptor`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 154 | It describes which other fields are present. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 155 | Decoding this byte is enough to tell the size of `Frame_Header`. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 156 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 157 | | Bit number | Field name | |
| 158 | | ---------- | ---------- | |
| 159 | | 7-6 | `Frame_Content_Size_flag` | |
| 160 | | 5 | `Single_Segment_flag` | |
| 161 | | 4 | `Unused_bit` | |
| 162 | | 3 | `Reserved_bit` | |
| 163 | | 2 | `Content_Checksum_flag` | |
| 164 | | 1-0 | `Dictionary_ID_flag` | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 165 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 166 | In this table, bit 7 is the highest bit, while bit 0 is the lowest one. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 167 | |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 168 | __`Frame_Content_Size_flag`__ |
| 169 | |
| 170 | This is a 2-bits flag (`= Frame_Header_Descriptor >> 6`), |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 171 | specifying if `Frame_Content_Size` (the decompressed data size) |
| 172 | is provided within the header. |
| 173 | `Flag_Value` provides `FCS_Field_Size`, |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 174 | which is the number of bytes used by `Frame_Content_Size` |
| 175 | according to the following table: |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 176 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 177 | | `Flag_Value` | 0 | 1 | 2 | 3 | |
| 178 | | -------------- | ------ | --- | --- | --- | |
| 179 | |`FCS_Field_Size`| 0 or 1 | 2 | 4 | 8 | |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 180 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 181 | When `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` : |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 182 | if `Single_Segment_flag` is set, `FCS_Field_Size` is 1. |
| 183 | Otherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided. |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 184 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 185 | __`Single_Segment_flag`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 186 | |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 187 | If this flag is set, |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 188 | data must be regenerated within a single continuous memory segment. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 189 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 190 | In this case, `Window_Descriptor` byte is skipped, |
| 191 | but `Frame_Content_Size` is necessarily present. |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 192 | As a consequence, the decoder must allocate a memory segment |
| Yann Collet | fccb46f | 2017-11-18 11:28:00 -0800 | [diff] [blame] | 193 | of size equal or larger than `Frame_Content_Size`. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 194 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 195 | In order to preserve the decoder from unreasonable memory requirements, |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 196 | a decoder is allowed to reject a compressed frame |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 197 | which requests a memory size beyond decoder's authorized range. |
| 198 | |
| 199 | For broader compatibility, decoders are recommended to support |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 200 | memory sizes of at least 8 MB. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 201 | This is only a recommendation, |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 202 | each decoder is free to support higher or lower limits, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 203 | depending on local limitations. |
| 204 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 205 | __`Unused_bit`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 206 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 207 | A decoder compliant with this specification version shall not interpret this bit. |
| 208 | It might be used in any future version, |
| 209 | to signal a property which is transparent to properly decode the frame. |
| 210 | An encoder compliant with this specification version must set this bit to zero. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 211 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 212 | __`Reserved_bit`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 213 | |
| 214 | This bit is reserved for some future feature. |
| 215 | Its value _must be zero_. |
| 216 | A decoder compliant with this specification version must ensure it is not set. |
| 217 | This bit may be used in a future revision, |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 218 | to signal a feature that must be interpreted to decode the frame correctly. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 219 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 220 | __`Content_Checksum_flag`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 221 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 222 | If this flag is set, a 32-bits `Content_Checksum` will be present at frame's end. |
| 223 | See `Content_Checksum` paragraph. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 224 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 225 | __`Dictionary_ID_flag`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 226 | |
| 227 | This is a 2-bits flag (`= FHD & 3`), |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 228 | telling if a dictionary ID is provided within the header. |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 229 | It also specifies the size of this field as `DID_Field_Size`. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 230 | |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 231 | |`Flag_Value` | 0 | 1 | 2 | 3 | |
| 232 | | -------------- | --- | --- | --- | --- | |
| 233 | |`DID_Field_Size`| 0 | 1 | 2 | 4 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 234 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 235 | #### `Window_Descriptor` |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 236 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 237 | Provides guarantees on minimum memory buffer required to decompress a frame. |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 238 | This information is important for decoders to allocate enough memory. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 239 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 240 | The `Window_Descriptor` byte is optional. |
| 241 | When `Single_Segment_flag` is set, `Window_Descriptor` is not present. |
| Yann Collet | a935d67 | 2017-03-31 16:19:04 -0700 | [diff] [blame] | 242 | In this case, `Window_Size` is `Frame_Content_Size`, |
| 243 | which can be any value from 0 to 2^64-1 bytes (16 ExaBytes). |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 244 | |
| Sean Purcell | ab226d4 | 2017-01-25 16:41:52 -0800 | [diff] [blame] | 245 | | Bit numbers | 7-3 | 2-0 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 246 | | ----------- | ---------- | ---------- | |
| 247 | | Field name | `Exponent` | `Mantissa` | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 248 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 249 | The minimum memory buffer size is called `Window_Size`. |
| 250 | It is described by the following formulas : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 251 | ``` |
| 252 | windowLog = 10 + Exponent; |
| 253 | windowBase = 1 << windowLog; |
| 254 | windowAdd = (windowBase / 8) * Mantissa; |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 255 | Window_Size = windowBase + windowAdd; |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 256 | ``` |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 257 | The minimum `Window_Size` is 1 KB. |
| 258 | The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 259 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 260 | In general, larger `Window_Size` tend to improve compression ratio, |
| 261 | but at the cost of memory usage. |
| 262 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 263 | To properly decode compressed data, |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 264 | a decoder will need to allocate a buffer of at least `Window_Size` bytes. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 265 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 266 | In order to preserve decoder from unreasonable memory requirements, |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 267 | a decoder is allowed to reject a compressed frame |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 268 | which requests a memory size beyond decoder's authorized range. |
| 269 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 270 | For improved interoperability, |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 271 | it's recommended for decoders to support `Window_Size` of up to 8 MB, |
| 272 | and it's recommended for encoders to not generate frame requiring `Window_Size` larger than 8 MB. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 273 | It's merely a recommendation though, |
| 274 | decoders are free to support larger or lower limits, |
| 275 | depending on local limitations. |
| 276 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 277 | #### `Dictionary_ID` |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 278 | |
| Yann Collet | f6ff53c | 2016-07-15 17:03:38 +0200 | [diff] [blame] | 279 | This is a variable size field, which contains |
| 280 | the ID of the dictionary required to properly decode the frame. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 281 | `Dictionary_ID` field is optional. When it's not present, |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 282 | it's up to the decoder to know which dictionary to use. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 283 | |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 284 | `Dictionary_ID` field size is provided by `DID_Field_Size`. |
| 285 | `DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 286 | 1 byte can represent an ID 0-255. |
| 287 | 2 bytes can represent an ID 0-65535. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 288 | 4 bytes can represent an ID 0-4294967295. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 289 | Format is __little-endian__. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 290 | |
| 291 | It's allowed to represent a small ID (for example `13`) |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 292 | with a large 4-bytes dictionary ID, even if it is less efficient. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 293 | |
| Yann Collet | bb3c9bf | 2020-05-25 08:15:09 -0700 | [diff] [blame] | 294 | A value of `0` has same meaning as no `Dictionary_ID`, |
| 295 | in which case the frame may or may not need a dictionary to be decoded, |
| 296 | and the ID of such a dictionary is not specified. |
| 297 | The decoder must know this information by other means. |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 298 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 299 | #### `Frame_Content_Size` |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 300 | |
| inikep | 2fc3752 | 2016-07-25 12:47:02 +0200 | [diff] [blame] | 301 | This is the original (uncompressed) size. This information is optional. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 302 | `Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`. |
| 303 | `FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`. |
| 304 | `FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 305 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 306 | | `FCS_Field_Size` | Range | |
| 307 | | ---------------- | ---------- | |
| 308 | | 0 | unknown | |
| 309 | | 1 | 0 - 255 | |
| 310 | | 2 | 256 - 65791| |
| 311 | | 4 | 0 - 2^32-1 | |
| 312 | | 8 | 0 - 2^64-1 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 313 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 314 | `Frame_Content_Size` format is __little-endian__. |
| 315 | When `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly. |
| 316 | When `FCS_Field_Size` is 2, _the offset of 256 is added_. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 317 | It's allowed to represent a small size (for example `18`) using any compatible variant. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 318 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 319 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 320 | Blocks |
| 321 | ------- |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 322 | |
| Yann Collet | a935d67 | 2017-03-31 16:19:04 -0700 | [diff] [blame] | 323 | After `Magic_Number` and `Frame_Header`, there are some number of blocks. |
| 324 | Each frame must have at least one block, |
| 325 | but there is no upper limit on the number of blocks per frame. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 326 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 327 | The structure of a block is as follows: |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 328 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 329 | | `Block_Header` | `Block_Content` | |
| 330 | |:--------------:|:---------------:| |
| Yann Collet | a935d67 | 2017-03-31 16:19:04 -0700 | [diff] [blame] | 331 | | 3 bytes | n bytes | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 332 | |
| Yann Collet | 098b36e | 2019-11-13 09:50:15 -0800 | [diff] [blame] | 333 | __`Block_Header`__ |
| 334 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 335 | `Block_Header` uses 3 bytes, written using __little-endian__ convention. |
| 336 | It contains 3 fields : |
| 337 | |
| Yann Collet | a935d67 | 2017-03-31 16:19:04 -0700 | [diff] [blame] | 338 | | `Last_Block` | `Block_Type` | `Block_Size` | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 339 | |:------------:|:------------:|:------------:| |
| Yann Collet | a935d67 | 2017-03-31 16:19:04 -0700 | [diff] [blame] | 340 | | bit 0 | bits 1-2 | bits 3-23 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 341 | |
| 342 | __`Last_Block`__ |
| 343 | |
| 344 | The lowest bit signals if this block is the last one. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 345 | The frame will end after this last block. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 346 | It may be followed by an optional `Content_Checksum` |
| 347 | (see [Zstandard Frames](#zstandard-frames)). |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 348 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 349 | __`Block_Type`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 350 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 351 | The next 2 bits represent the `Block_Type`. |
| Yann Collet | 1e07eb4 | 2019-08-16 15:13:42 +0200 | [diff] [blame] | 352 | `Block_Type` influences the meaning of `Block_Size`. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 353 | There are 4 block types : |
| 354 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 355 | | Value | 0 | 1 | 2 | 3 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 356 | | ------------ | ----------- | ----------- | ------------------ | --------- | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 357 | | `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`| |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 358 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 359 | - `Raw_Block` - this is an uncompressed block. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 360 | `Block_Content` contains `Block_Size` bytes. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 361 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 362 | - `RLE_Block` - this is a single byte, repeated `Block_Size` times. |
| 363 | `Block_Content` consists of a single byte. |
| 364 | On the decompression side, this byte must be repeated `Block_Size` times. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 365 | |
| 366 | - `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks), |
| 367 | explained later on. |
| 368 | `Block_Size` is the length of `Block_Content`, the compressed data. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 369 | The decompressed size is not known, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 370 | but its maximum possible value is guaranteed (see below) |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 371 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 372 | - `Reserved` - this is not a block. |
| 373 | This value cannot be used with current version of this specification. |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 374 | If such a value is present, it is considered corrupted data. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 375 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 376 | __`Block_Size`__ |
| 377 | |
| 378 | The upper 21 bits of `Block_Header` represent the `Block_Size`. |
| Yann Collet | 098b36e | 2019-11-13 09:50:15 -0800 | [diff] [blame] | 379 | |
| Yann Collet | 1e07eb4 | 2019-08-16 15:13:42 +0200 | [diff] [blame] | 380 | When `Block_Type` is `Compressed_Block` or `Raw_Block`, |
| Yann Collet | bb3c9bf | 2020-05-25 08:15:09 -0700 | [diff] [blame] | 381 | `Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`). |
| Yann Collet | 098b36e | 2019-11-13 09:50:15 -0800 | [diff] [blame] | 382 | |
| 383 | When `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1, |
| 384 | `Block_Size` represents the number of times this byte must be repeated. |
| 385 | |
| 386 | `Block_Size` is limited by `Block_Maximum_Size` (see below). |
| 387 | |
| 388 | __`Block_Content`__ and __`Block_Maximum_Size`__ |
| 389 | |
| 390 | The size of `Block_Content` is limited by `Block_Maximum_Size`, |
| 391 | which is the smallest of: |
| 392 | - `Window_Size` |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 393 | - 128 KB |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 394 | |
| Yann Collet | 098b36e | 2019-11-13 09:50:15 -0800 | [diff] [blame] | 395 | `Block_Maximum_Size` is constant for a given frame. |
| 396 | This maximum is applicable to both the decompressed size |
| 397 | and the compressed size of any block in the frame. |
| 398 | |
| 399 | The reasoning for this limit is that a decoder can read this information |
| 400 | at the beginning of a frame and use it to allocate buffers. |
| 401 | The guarantees on the size of blocks ensure that |
| 402 | the buffers will be large enough for any following block of the valid frame. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 403 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 404 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 405 | Compressed Blocks |
| 406 | ----------------- |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 407 | To decompress a compressed block, the compressed size must be provided |
| 408 | from `Block_Size` field within `Block_Header`. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 409 | |
| 410 | A compressed block consists of 2 sections : |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 411 | - [Literals Section](#literals-section) |
| 412 | - [Sequences Section](#sequences-section) |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 413 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 414 | The results of the two sections are then combined to produce the decompressed |
| 415 | data in [Sequence Execution](#sequence-execution) |
| 416 | |
| 417 | #### Prerequisites |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 418 | To decode a compressed block, the following elements are necessary : |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 419 | - Previous decoded data, up to a distance of `Window_Size`, |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 420 | or beginning of the Frame, whichever is smaller. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 421 | - List of "recent offsets" from previous `Compressed_Block`. |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 422 | - The previous Huffman tree, required by `Treeless_Literals_Block` type |
| 423 | - Previous FSE decoding tables, required by `Repeat_Mode` |
| 424 | for each symbol type (literals lengths, match lengths, offsets) |
| 425 | |
| 426 | Note that decoding tables aren't always from the previous `Compressed_Block`. |
| 427 | |
| 428 | - Every decoding table can come from a dictionary. |
| 429 | - The Huffman tree comes from the previous `Compressed_Literals_Block`. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 430 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 431 | Literals Section |
| 432 | ---------------- |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 433 | All literals are regrouped in the first part of the block. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 434 | They can be decoded first, and then copied during [Sequence Execution], |
| 435 | or they can be decoded on the flow during [Sequence Execution]. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 436 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 437 | Literals can be stored uncompressed or compressed using Huffman prefix codes. |
| Yann Collet | 6a9c525 | 2022-12-22 11:30:15 -0800 | [diff] [blame] | 438 | When compressed, a tree description may optionally be present, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 439 | followed by 1 or 4 streams. |
| 440 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 441 | | `Literals_Section_Header` | [`Huffman_Tree_Description`] | [jumpTable] | Stream1 | [Stream2] | [Stream3] | [Stream4] | |
| 442 | | ------------------------- | ---------------------------- | ----------- | ------- | --------- | --------- | --------- | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 443 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 444 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 445 | ### `Literals_Section_Header` |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 446 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 447 | Header is in charge of describing how literals are packed. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 448 | It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes, |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 449 | using __little-endian__ convention. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 450 | |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 451 | | `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 452 | | --------------------- | ------------- | ------------------ | ------------------- | |
| 453 | | 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits | |
| Yann Collet | 198e6aa | 2016-07-20 20:12:24 +0200 | [diff] [blame] | 454 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 455 | In this representation, bits on the left are the lowest bits. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 456 | |
| Yann Collet | 70c2326 | 2016-08-21 00:24:18 +0200 | [diff] [blame] | 457 | __`Literals_Block_Type`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 458 | |
| Yann Collet | 198e6aa | 2016-07-20 20:12:24 +0200 | [diff] [blame] | 459 | This field uses 2 lowest bits of first byte, describing 4 different block types : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 460 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 461 | | `Literals_Block_Type` | Value | |
| 462 | | --------------------------- | ----- | |
| 463 | | `Raw_Literals_Block` | 0 | |
| 464 | | `RLE_Literals_Block` | 1 | |
| 465 | | `Compressed_Literals_Block` | 2 | |
| 466 | | `Treeless_Literals_Block` | 3 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 467 | |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 468 | - `Raw_Literals_Block` - Literals are stored uncompressed. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 469 | - `RLE_Literals_Block` - Literals consist of a single byte value |
| 470 | repeated `Regenerated_Size` times. |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 471 | - `Compressed_Literals_Block` - This is a standard Huffman-compressed block, |
| inikep | 586a055 | 2016-08-03 16:16:38 +0200 | [diff] [blame] | 472 | starting with a Huffman tree description. |
| Yann Collet | 832f559 | 2023-02-18 18:16:00 -0800 | [diff] [blame] | 473 | In this mode, there are at least 2 different literals represented in the Huffman tree description. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 474 | See details below. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 475 | - `Treeless_Literals_Block` - This is a Huffman-compressed block, |
| inikep | 586a055 | 2016-08-03 16:16:38 +0200 | [diff] [blame] | 476 | using Huffman tree _from previous Huffman-compressed literals block_. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 477 | `Huffman_Tree_Description` will be skipped. |
| Yann Collet | a935d67 | 2017-03-31 16:19:04 -0700 | [diff] [blame] | 478 | Note: If this mode is triggered without any previous Huffman-table in the frame |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 479 | (or [dictionary](#dictionary-format)), this should be treated as data corruption. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 480 | |
| Yann Collet | 70c2326 | 2016-08-21 00:24:18 +0200 | [diff] [blame] | 481 | __`Size_Format`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 482 | |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 483 | `Size_Format` is divided into 2 families : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 484 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 485 | - For `Raw_Literals_Block` and `RLE_Literals_Block`, |
| 486 | it's only necessary to decode `Regenerated_Size`. |
| 487 | There is no `Compressed_Size` field. |
| 488 | - For `Compressed_Block` and `Treeless_Literals_Block`, |
| 489 | it's required to decode both `Compressed_Size` |
| 490 | and `Regenerated_Size` (the decompressed size). |
| 491 | It's also necessary to decode the number of streams (1 or 4). |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 492 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 493 | For values spanning several bytes, convention is __little-endian__. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 494 | |
| inikep | 9d003c1 | 2016-08-04 10:41:49 +0200 | [diff] [blame] | 495 | __`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 496 | |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 497 | `Size_Format` uses 1 _or_ 2 bits. |
| Nick Terrell | c1a7def | 2018-07-10 15:07:36 -0700 | [diff] [blame] | 498 | Its value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3` |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 499 | |
| 500 | - `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit. |
| Sean Purcell | d86153d | 2017-01-26 16:58:25 -0800 | [diff] [blame] | 501 | `Regenerated_Size` uses 5 bits (0-31). |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 502 | `Literals_Section_Header` uses 1 byte. |
| Nick Terrell | c1a7def | 2018-07-10 15:07:36 -0700 | [diff] [blame] | 503 | `Regenerated_Size = Literals_Section_Header[0]>>3` |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 504 | - `Size_Format` == 01 : `Size_Format` uses 2 bits. |
| Sean Purcell | d86153d | 2017-01-26 16:58:25 -0800 | [diff] [blame] | 505 | `Regenerated_Size` uses 12 bits (0-4095). |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 506 | `Literals_Section_Header` uses 2 bytes. |
| Nick Terrell | c1a7def | 2018-07-10 15:07:36 -0700 | [diff] [blame] | 507 | `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)` |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 508 | - `Size_Format` == 11 : `Size_Format` uses 2 bits. |
| Sean Purcell | d86153d | 2017-01-26 16:58:25 -0800 | [diff] [blame] | 509 | `Regenerated_Size` uses 20 bits (0-1048575). |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 510 | `Literals_Section_Header` uses 3 bytes. |
| Nick Terrell | c1a7def | 2018-07-10 15:07:36 -0700 | [diff] [blame] | 511 | `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)` |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 512 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 513 | Only Stream1 is present for these cases. |
| Yann Collet | 6a9c525 | 2022-12-22 11:30:15 -0800 | [diff] [blame] | 514 | Note : it's allowed to represent a short value (for example `27`) |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 515 | using a long format, even if it's less efficient. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 516 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 517 | __`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 518 | |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 519 | `Size_Format` always uses 2 bits. |
| 520 | |
| 521 | - `Size_Format` == 00 : _A single stream_. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 522 | Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023). |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 523 | `Literals_Section_Header` uses 3 bytes. |
| 524 | - `Size_Format` == 01 : 4 streams. |
| Yann Collet | 6a9c525 | 2022-12-22 11:30:15 -0800 | [diff] [blame] | 525 | Both `Regenerated_Size` and `Compressed_Size` use 10 bits (6-1023). |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 526 | `Literals_Section_Header` uses 3 bytes. |
| 527 | - `Size_Format` == 10 : 4 streams. |
| Yann Collet | 6a9c525 | 2022-12-22 11:30:15 -0800 | [diff] [blame] | 528 | Both `Regenerated_Size` and `Compressed_Size` use 14 bits (6-16383). |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 529 | `Literals_Section_Header` uses 4 bytes. |
| 530 | - `Size_Format` == 11 : 4 streams. |
| Yann Collet | 6a9c525 | 2022-12-22 11:30:15 -0800 | [diff] [blame] | 531 | Both `Regenerated_Size` and `Compressed_Size` use 18 bits (6-262143). |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 532 | `Literals_Section_Header` uses 5 bytes. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 533 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 534 | Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention. |
| 535 | Note: `Compressed_Size` __includes__ the size of the Huffman Tree description |
| 536 | _when_ it is present. |
| Yann Collet | 64e8511 | 2023-03-08 15:30:27 -0800 | [diff] [blame] | 537 | Note 2: `Compressed_Size` can never be `==0`. |
| 538 | Even in single-stream scenario, assuming an empty content, it must be `>=1`, |
| 539 | since it contains at least the final end bit flag. |
| 540 | In 4-streams scenario, a valid `Compressed_Size` is necessarily `>= 10` |
| 541 | (6 bytes for the jump table, + 4x1 bytes for the 4 streams). |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 542 | |
| Yann Collet | 64e8511 | 2023-03-08 15:30:27 -0800 | [diff] [blame] | 543 | 4 streams is faster than 1 stream in decompression speed, |
| Yann Collet | 6a9c525 | 2022-12-22 11:30:15 -0800 | [diff] [blame] | 544 | by exploiting instruction level parallelism. |
| 545 | But it's also more expensive, |
| 546 | costing on average ~7.3 bytes more than the 1 stream mode, mostly from the jump table. |
| 547 | |
| 548 | In general, use the 4 streams mode when there are more literals to decode, |
| 549 | to favor higher decompression speeds. |
| Yann Collet | 64e8511 | 2023-03-08 15:30:27 -0800 | [diff] [blame] | 550 | Note that beyond >1KB of literals, the 4 streams mode is compulsory. |
| Yann Collet | 6a9c525 | 2022-12-22 11:30:15 -0800 | [diff] [blame] | 551 | |
| 552 | Note that a minimum of 6 bytes is required for the 4 streams mode. |
| 553 | That's a technical minimum, but it's not recommended to employ the 4 streams mode |
| 554 | for such a small quantity, that would be wasteful. |
| 555 | A more practical lower bound would be around ~256 bytes. |
| 556 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 557 | #### Raw Literals Block |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 558 | The data in Stream1 is `Regenerated_Size` bytes long, |
| 559 | it contains the raw literals data to be used during [Sequence Execution]. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 560 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 561 | #### RLE Literals Block |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 562 | Stream1 consists of a single byte which should be repeated `Regenerated_Size` times |
| 563 | to generate the decoded literals. |
| 564 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 565 | #### Compressed Literals Block and Treeless Literals Block |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 566 | Both of these modes contain Huffman encoded data. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 567 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 568 | For `Treeless_Literals_Block`, |
| 569 | the Huffman table comes from previously compressed literals block, |
| 570 | or from a dictionary. |
| 571 | |
| 572 | |
| 573 | ### `Huffman_Tree_Description` |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 574 | This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`). |
| Yann Collet | 832f559 | 2023-02-18 18:16:00 -0800 | [diff] [blame] | 575 | The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 576 | The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description). |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 577 | The size of `Huffman_Tree_Description` is determined during decoding process, |
| 578 | it must be used to determine where streams begin. |
| Yann Collet | a935d67 | 2017-03-31 16:19:04 -0700 | [diff] [blame] | 579 | `Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 580 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 581 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 582 | ### Jump Table |
| 583 | The Jump Table is only present when there are 4 Huffman-coded streams. |
| 584 | |
| Yann Collet | 64e8511 | 2023-03-08 15:30:27 -0800 | [diff] [blame] | 585 | Reminder : Huffman compressed data consists of either 1 or 4 streams. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 586 | |
| 587 | If only one stream is present, it is a single bitstream occupying the entire |
| Yann Collet | 64e8511 | 2023-03-08 15:30:27 -0800 | [diff] [blame] | 588 | remaining portion of the literals block, encoded as described in |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 589 | [Huffman-Coded Streams](#huffman-coded-streams). |
| 590 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 591 | If there are four streams, `Literals_Section_Header` only provided |
| 592 | enough information to know the decompressed and compressed sizes |
| 593 | of all four streams _combined_. |
| 594 | The decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`, |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 595 | except for the last stream which may be up to 3 bytes smaller, |
| 596 | to reach a total decompressed size as specified in `Regenerated_Size`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 597 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 598 | The compressed size of each stream is provided explicitly in the Jump Table. |
| Yann Collet | 64e8511 | 2023-03-08 15:30:27 -0800 | [diff] [blame] | 599 | Jump Table is 6 bytes long, and consists of three 2-byte __little-endian__ fields, |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 600 | describing the compressed sizes of the first three streams. |
| Yann Collet | 64e8511 | 2023-03-08 15:30:27 -0800 | [diff] [blame] | 601 | `Stream4_Size` is computed from `Total_Streams_Size` minus sizes of other streams: |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 602 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 603 | `Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 604 | |
| Yann Collet | 64e8511 | 2023-03-08 15:30:27 -0800 | [diff] [blame] | 605 | `Stream4_Size` is necessarily `>= 1`. Therefore, |
| 606 | if `Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1`, |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 607 | data is considered corrupted. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 608 | |
| 609 | Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream, |
| Yann Collet | 64e8511 | 2023-03-08 15:30:27 -0800 | [diff] [blame] | 610 | as described in [Huffman-Coded Streams](#huffman-coded-streams) |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 611 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 612 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 613 | Sequences Section |
| 614 | ----------------- |
| 615 | A compressed block is a succession of _sequences_ . |
| 616 | A sequence is a literal copy command, followed by a match copy command. |
| 617 | A literal copy command specifies a length. |
| Yann Collet | a935d67 | 2017-03-31 16:19:04 -0700 | [diff] [blame] | 618 | It is the number of bytes to be copied (or extracted) from the Literals Section. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 619 | A match copy command specifies an offset and a length. |
| 620 | |
| 621 | When all _sequences_ are decoded, |
| Yann Collet | 72a3adf | 2018-09-25 16:34:26 -0700 | [diff] [blame] | 622 | if there are literals left in the _literals section_, |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 623 | these bytes are added at the end of the block. |
| 624 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 625 | This is described in more detail in [Sequence Execution](#sequence-execution). |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 626 | |
| 627 | The `Sequences_Section` regroup all symbols required to decode commands. |
| 628 | There are 3 symbol types : literals lengths, offsets and match lengths. |
| 629 | They are encoded together, interleaved, in a single _bitstream_. |
| 630 | |
| 631 | The `Sequences_Section` starts by a header, |
| 632 | followed by optional probability tables for each symbol type, |
| 633 | followed by the bitstream. |
| 634 | |
| 635 | | `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream | |
| 636 | | -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- | |
| 637 | |
| 638 | To decode the `Sequences_Section`, it's required to know its size. |
| Yann Collet | 72a3adf | 2018-09-25 16:34:26 -0700 | [diff] [blame] | 639 | Its size is deduced from the size of `Literals_Section`: |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 640 | `Sequences_Section_Size = Block_Size - Literals_Section_Size`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 641 | |
| 642 | |
| 643 | #### `Sequences_Section_Header` |
| 644 | |
| 645 | Consists of 2 items: |
| 646 | - `Number_of_Sequences` |
| 647 | - Symbol compression modes |
| 648 | |
| 649 | __`Number_of_Sequences`__ |
| 650 | |
| 651 | This is a variable size field using between 1 and 3 bytes. |
| 652 | Let's call its first byte `byte0`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 653 | - `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte. |
| Yann Collet | 1f83b7c | 2023-06-05 09:51:52 -0700 | [diff] [blame] | 654 | - `if (byte0 < 255)` : `Number_of_Sequences = ((byte0 - 0x80) << 8) + byte1`. Uses 2 bytes. |
| 655 | Note that the 2 bytes format fully overlaps the 1 byte format. |
| 656 | - `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00`. Uses 3 bytes. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 657 | |
| Yann Collet | 3732a08 | 2023-06-05 16:03:00 -0700 | [diff] [blame] | 658 | `if (Number_of_Sequences == 0)` : there are no sequences. |
| 659 | The sequence section stops immediately, |
| 660 | FSE tables used in `Repeat_Mode` aren't updated. |
| 661 | Block's decompressed content is defined solely by the Literals Section content. |
| 662 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 663 | __Symbol compression modes__ |
| 664 | |
| 665 | This is a single byte, defining the compression mode of each symbol type. |
| 666 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 667 | |Bit number| 7-6 | 5-4 | 3-2 | 1-0 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 668 | | -------- | ----------------------- | -------------- | -------------------- | ---------- | |
| 669 | |Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` | |
| 670 | |
| 671 | The last field, `Reserved`, must be all-zeroes. |
| 672 | |
| 673 | `Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 674 | literals lengths, offsets, and match lengths symbols respectively. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 675 | |
| 676 | They follow the same enumeration : |
| 677 | |
| 678 | | Value | 0 | 1 | 2 | 3 | |
| 679 | | ------------------ | ----------------- | ---------- | --------------------- | ------------- | |
| 680 | | `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` | |
| 681 | |
| 682 | - `Predefined_Mode` : A predefined FSE distribution table is used, defined in |
| 683 | [default distributions](#default-distributions). |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 684 | No distribution table will be present. |
| Yann Collet | c1e6347 | 2018-06-21 18:08:11 -0700 | [diff] [blame] | 685 | - `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value. |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 686 | This symbol will be used for all sequences. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 687 | - `FSE_Compressed_Mode` : standard FSE compression. |
| 688 | A distribution table will be present. |
| Yann Collet | a935d67 | 2017-03-31 16:19:04 -0700 | [diff] [blame] | 689 | The format of this distribution table is described in [FSE Table Description](#fse-table-description). |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 690 | Note that the maximum allowed accuracy log for literals length and match length tables is 9, |
| 691 | and the maximum accuracy log for the offsets table is 8. |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 692 | `FSE_Compressed_Mode` must not be used when only one symbol is present, |
| 693 | `RLE_Mode` should be used instead (although any other mode will work). |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 694 | - `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again, |
| 695 | or if this is the first block, table in the dictionary will be used. |
| 696 | Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated. |
| 697 | It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`. |
| 698 | No distribution table will be present. |
| 699 | If this mode is used without any previous sequence table in the frame |
| 700 | (nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 701 | |
| 702 | #### The codes for literals lengths, match lengths, and offsets. |
| 703 | |
| 704 | Each symbol is a _code_ in its own context, |
| 705 | which specifies `Baseline` and `Number_of_Bits` to add. |
| 706 | _Codes_ are FSE compressed, |
| 707 | and interleaved with raw additional bits in the same bitstream. |
| 708 | |
| 709 | ##### Literals length codes |
| 710 | |
| 711 | Literals length codes are values ranging from `0` to `35` included. |
| 712 | They define lengths from 0 to 131071 bytes. |
| 713 | The literals length is equal to the decoded `Baseline` plus |
| 714 | the result of reading `Number_of_Bits` bits from the bitstream, |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 715 | as a __little-endian__ value. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 716 | |
| 717 | | `Literals_Length_Code` | 0-15 | |
| 718 | | ---------------------- | ---------------------- | |
| 719 | | length | `Literals_Length_Code` | |
| 720 | | `Number_of_Bits` | 0 | |
| 721 | |
| 722 | | `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | |
| 723 | | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| 724 | | `Baseline` | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 | |
| 725 | | `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | |
| 726 | |
| 727 | | `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | |
| 728 | | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| 729 | | `Baseline` | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | |
| 730 | | `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
| 731 | |
| 732 | | `Literals_Length_Code` | 32 | 33 | 34 | 35 | |
| 733 | | ---------------------- | ---- | ---- | ---- | ---- | |
| 734 | | `Baseline` | 8192 |16384 |32768 |65536 | |
| 735 | | `Number_of_Bits` | 13 | 14 | 15 | 16 | |
| 736 | |
| 737 | |
| 738 | ##### Match length codes |
| 739 | |
| 740 | Match length codes are values ranging from `0` to `52` included. |
| 741 | They define lengths from 3 to 131074 bytes. |
| 742 | The match length is equal to the decoded `Baseline` plus |
| 743 | the result of reading `Number_of_Bits` bits from the bitstream, |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 744 | as a __little-endian__ value. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 745 | |
| 746 | | `Match_Length_Code` | 0-31 | |
| 747 | | ------------------- | ----------------------- | |
| 748 | | value | `Match_Length_Code` + 3 | |
| 749 | | `Number_of_Bits` | 0 | |
| 750 | |
| 751 | | `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | |
| 752 | | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| 753 | | `Baseline` | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 | |
| 754 | | `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | |
| 755 | |
| 756 | | `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | |
| 757 | | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| 758 | | `Baseline` | 67 | 83 | 99 | 131 | 259 | 515 | 1027 | 2051 | |
| 759 | | `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 | |
| 760 | |
| 761 | | `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 | |
| 762 | | ------------------- | ---- | ---- | ---- | ---- | ---- | |
| 763 | | `Baseline` | 4099 | 8195 |16387 |32771 |65539 | |
| 764 | | `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 | |
| 765 | |
| 766 | ##### Offset codes |
| 767 | |
| 768 | Offset codes are values ranging from `0` to `N`. |
| 769 | |
| 770 | A decoder is free to limit its maximum `N` supported. |
| 771 | Recommendation is to support at least up to `22`. |
| 772 | For information, at the time of this writing. |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 773 | the reference decoder supports a maximum `N` value of `31`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 774 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 775 | An offset code is also the number of additional bits to read in __little-endian__ fashion, |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 776 | and can be translated into an `Offset_Value` using the following formulas : |
| 777 | |
| 778 | ``` |
| 779 | Offset_Value = (1 << offsetCode) + readNBits(offsetCode); |
| 780 | if (Offset_Value > 3) offset = Offset_Value - 3; |
| 781 | ``` |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 782 | It means that maximum `Offset_Value` is `(2^(N+1))-1` |
| 783 | supporting back-reference distances up to `(2^(N+1))-4`, |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 784 | but is limited by [maximum back-reference distance](#window_descriptor). |
| 785 | |
| 786 | `Offset_Value` from 1 to 3 are special : they define "repeat codes". |
| Yann Collet | c1e6347 | 2018-06-21 18:08:11 -0700 | [diff] [blame] | 787 | This is described in more detail in [Repeat Offsets](#repeat-offsets). |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 788 | |
| 789 | #### Decoding Sequences |
| 790 | FSE bitstreams are read in reverse direction than written. In zstd, |
| 791 | the compressor writes bits forward into a block and the decompressor |
| 792 | must read the bitstream _backwards_. |
| 793 | |
| 794 | To find the start of the bitstream it is therefore necessary to |
| 795 | know the offset of the last byte of the block which can be found |
| 796 | by counting `Block_Size` bytes after the block header. |
| 797 | |
| 798 | After writing the last bit containing information, the compressor |
| 799 | writes a single `1`-bit and then fills the byte with 0-7 `0` bits of |
| 800 | padding. The last byte of the compressed bitstream cannot be `0` for |
| 801 | that reason. |
| 802 | |
| 803 | When decompressing, the last byte containing the padding is the first |
| 804 | byte to read. The decompressor needs to skip 0-7 initial `0`-bits and |
| 805 | the first `1`-bit it occurs. Afterwards, the useful part of the bitstream |
| 806 | begins. |
| 807 | |
| 808 | FSE decoding requires a 'state' to be carried from symbol to symbol. |
| 809 | For more explanation on FSE decoding, see the [FSE section](#fse). |
| 810 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 811 | For sequence decoding, a separate state keeps track of each |
| 812 | literal lengths, offsets, and match lengths symbols. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 813 | Some FSE primitives are also used. |
| 814 | For more details on the operation of these primitives, see the [FSE section](#fse). |
| 815 | |
| 816 | ##### Starting states |
| 817 | The bitstream starts with initial FSE state values, |
| 818 | each using the required number of bits in their respective _accuracy_, |
| 819 | decoded previously from their normalized distribution. |
| 820 | |
| 821 | It starts by `Literals_Length_State`, |
| 822 | followed by `Offset_State`, |
| 823 | and finally `Match_Length_State`. |
| 824 | |
| 825 | Reminder : always keep in mind that all values are read _backward_, |
| 826 | so the 'start' of the bitstream is at the highest position in memory, |
| 827 | immediately before the last `1`-bit for padding. |
| 828 | |
| 829 | After decoding the starting states, a single sequence is decoded |
| 830 | `Number_Of_Sequences` times. |
| 831 | These sequences are decoded in order from first to last. |
| 832 | Since the compressor writes the bitstream in the forward direction, |
| 833 | this means the compressor must encode the sequences starting with the last |
| 834 | one and ending with the first. |
| 835 | |
| 836 | ##### Decoding a sequence |
| 837 | For each of the symbol types, the FSE state can be used to determine the appropriate code. |
| Yann Collet | 72a3adf | 2018-09-25 16:34:26 -0700 | [diff] [blame] | 838 | The code then defines the `Baseline` and `Number_of_Bits` to read for each type. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 839 | See the [description of the codes] for how to determine these values. |
| 840 | |
| 841 | [description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets |
| 842 | |
| 843 | Decoding starts by reading the `Number_of_Bits` required to decode `Offset`. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 844 | It then does the same for `Match_Length`, and then for `Literals_Length`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 845 | This sequence is then used for [sequence execution](#sequence-execution). |
| 846 | |
| 847 | If it is not the last sequence in the block, |
| 848 | the next operation is to update states. |
| 849 | Using the rules pre-calculated in the decoding tables, |
| 850 | `Literals_Length_State` is updated, |
| 851 | followed by `Match_Length_State`, |
| 852 | and then `Offset_State`. |
| 853 | See the [FSE section](#fse) for details on how to update states from the bitstream. |
| 854 | |
| 855 | This operation will be repeated `Number_of_Sequences` times. |
| 856 | At the end, the bitstream shall be entirely consumed, |
| 857 | otherwise the bitstream is considered corrupted. |
| 858 | |
| 859 | #### Default Distributions |
| 860 | If `Predefined_Mode` is selected for a symbol type, |
| 861 | its FSE decoding table is generated from a predefined distribution table defined here. |
| 862 | For details on how to convert this distribution into a decoding table, see the [FSE section]. |
| 863 | |
| 864 | [FSE section]: #from-normalized-distribution-to-decoding-tables |
| 865 | |
| Sean Purcell | 3bee41a | 2017-02-21 10:20:36 -0800 | [diff] [blame] | 866 | ##### Literals Length |
| 867 | The decoding table uses an accuracy log of 6 bits (64 states). |
| 868 | ``` |
| 869 | short literalsLength_defaultDistribution[36] = |
| 870 | { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, |
| 871 | 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, |
| 872 | -1,-1,-1,-1 }; |
| 873 | ``` |
| 874 | |
| 875 | ##### Match Length |
| 876 | The decoding table uses an accuracy log of 6 bits (64 states). |
| 877 | ``` |
| 878 | short matchLengths_defaultDistribution[53] = |
| 879 | { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, |
| 880 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
| 881 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, |
| 882 | -1,-1,-1,-1,-1 }; |
| 883 | ``` |
| 884 | |
| 885 | ##### Offset Codes |
| 886 | The decoding table uses an accuracy log of 5 bits (32 states), |
| 887 | and supports a maximum `N` value of 28, allowing offset values up to 536,870,908 . |
| 888 | |
| 889 | If any sequence in the compressed block requires a larger offset than this, |
| 890 | it's not possible to use the default distribution to represent it. |
| 891 | ``` |
| 892 | short offsetCodes_defaultDistribution[29] = |
| 893 | { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, |
| 894 | 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 }; |
| 895 | ``` |
| 896 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 897 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 898 | Sequence Execution |
| 899 | ------------------ |
| 900 | Once literals and sequences have been decoded, |
| 901 | they are combined to produce the decoded content of a block. |
| 902 | |
| 903 | Each sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`), |
| Sean Purcell | 3bee41a | 2017-02-21 10:20:36 -0800 | [diff] [blame] | 904 | decoded as described in the [Sequences Section](#sequences-section). |
| Yann Collet | 72a3adf | 2018-09-25 16:34:26 -0700 | [diff] [blame] | 905 | To execute a sequence, first copy `literals_length` bytes |
| 906 | from the decoded literals to the output. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 907 | |
| 908 | Then `match_length` bytes are copied from previous decoded data. |
| 909 | The offset to copy from is determined by `offset_value`: |
| 910 | if `offset_value > 3`, then the offset is `offset_value - 3`. |
| 911 | If `offset_value` is from 1-3, the offset is a special repeat offset value. |
| 912 | See the [repeat offset](#repeat-offsets) section for how the offset is determined |
| 913 | in this case. |
| 914 | |
| 915 | The offset is defined as from the current position, so an offset of 6 |
| 916 | and a match length of 3 means that 3 bytes should be copied from 6 bytes back. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 917 | Note that all offsets leading to previously decoded data |
| 918 | must be smaller than `Window_Size` defined in `Frame_Header_Descriptor`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 919 | |
| 920 | #### Repeat offsets |
| 921 | As seen in [Sequence Execution](#sequence-execution), |
| 922 | the first 3 values define a repeated offset and we will call them |
| 923 | `Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`. |
| 924 | They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one". |
| 925 | |
| 926 | If `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc. |
| 927 | |
| 928 | There is an exception though, when current sequence's `literals_length = 0`. |
| 929 | In this case, repeated offsets are shifted by one, |
| 930 | so an `offset_value` of 1 means `Repeated_Offset2`, |
| 931 | an `offset_value` of 2 means `Repeated_Offset3`, |
| elasota | f06b18b | 2023-11-19 15:33:37 -0500 | [diff] [blame] | 932 | and an `offset_value` of 3 means `Repeated_Offset1 - 1`. |
| 933 | |
| 934 | In the final case, if `Repeated_Offset1 - 1` evaluates to 0, then the |
| 935 | data is considered corrupted. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 936 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 937 | For the first block, the starting offset history is populated with following values : |
| 938 | `Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8, |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 939 | unless a dictionary is used, in which case they come from the dictionary. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 940 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 941 | Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`. |
| 942 | Note that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 943 | |
| 944 | [Offset Codes]: #offset-codes |
| 945 | |
| 946 | ###### Offset updates rules |
| 947 | |
| W. Felix Handte | 2d46d76 | 2020-12-09 20:00:48 -0500 | [diff] [blame] | 948 | During the execution of the sequences of a `Compressed_Block`, the |
| 949 | `Repeated_Offsets`' values are kept up to date, so that they always represent |
| 950 | the three most-recently used offsets. In order to achieve that, they are |
| 951 | updated after executing each sequence in the following way: |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 952 | |
| W. Felix Handte | 2d46d76 | 2020-12-09 20:00:48 -0500 | [diff] [blame] | 953 | When the sequence's `offset_value` does not refer to one of the |
| 954 | `Repeated_Offsets`--when it has value greater than 3, or when it has value 3 |
| 955 | and the sequence's `literals_length` is zero--the `Repeated_Offsets`' values |
| 956 | are shifted back one, and `Repeated_Offset1` takes on the value of the |
| 957 | just-used offset. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 958 | |
| W. Felix Handte | 2d46d76 | 2020-12-09 20:00:48 -0500 | [diff] [blame] | 959 | Otherwise, when the sequence's `offset_value` refers to one of the |
| 960 | `Repeated_Offsets`--when it has value 1 or 2, or when it has value 3 and the |
| 961 | sequence's `literals_length` is non-zero--the `Repeated_Offsets` are re-ordered |
| 962 | so that `Repeated_Offset1` takes on the value of the used Repeated_Offset, and |
| 963 | the existing values are pushed back from the first `Repeated_Offset` through to |
| 964 | the `Repeated_Offset` selected by the `offset_value`. This effectively performs |
| 965 | a single-stepped wrapping rotation of the values of these offsets, so that |
| 966 | their order again reflects the recency of their use. |
| Yann Collet | 9bf0070 | 2018-10-26 15:51:51 -0700 | [diff] [blame] | 967 | |
| W. Felix Handte | 2d46d76 | 2020-12-09 20:00:48 -0500 | [diff] [blame] | 968 | The following table shows the values of the `Repeated_Offsets` as a series of |
| 969 | sequences are applied to them: |
| Yann Collet | 9bf0070 | 2018-10-26 15:51:51 -0700 | [diff] [blame] | 970 | |
| W. Felix Handte | 2d46d76 | 2020-12-09 20:00:48 -0500 | [diff] [blame] | 971 | | `offset_value` | `literals_length` | `Repeated_Offset1` | `Repeated_Offset2` | `Repeated_Offset3` | Comment | |
| 972 | |:--------------:|:-----------------:|:------------------:|:------------------:|:------------------:|:-----------------------:| |
| 973 | | | | 1 | 4 | 8 | starting values | |
| 974 | | 1114 | 11 | 1111 | 1 | 4 | non-repeat | |
| Yann Collet | f33ccd2 | 2022-05-24 04:47:49 -0700 | [diff] [blame] | 975 | | 1 | 22 | 1111 | 1 | 4 | repeat 1: no change | |
| W. Felix Handte | 2d46d76 | 2020-12-09 20:00:48 -0500 | [diff] [blame] | 976 | | 2225 | 22 | 2222 | 1111 | 1 | non-repeat | |
| 977 | | 1114 | 111 | 1111 | 2222 | 1111 | non-repeat | |
| 978 | | 3336 | 33 | 3333 | 1111 | 2222 | non-repeat | |
| Yann Collet | f33ccd2 | 2022-05-24 04:47:49 -0700 | [diff] [blame] | 979 | | 2 | 22 | 1111 | 3333 | 2222 | repeat 2: swap 1 & 2 | |
| 980 | | 3 | 33 | 2222 | 1111 | 3333 | repeat 3: rotate 3 to 1 | |
| 981 | | 3 | 0 | 2221 | 2222 | 1111 | special case : insert `repeat1 - 1` | |
| 982 | | 1 | 0 | 2222 | 2221 | 1111 | == repeat 2 | |
| Yann Collet | 9bf0070 | 2018-10-26 15:51:51 -0700 | [diff] [blame] | 983 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 984 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 985 | Skippable Frames |
| 986 | ---------------- |
| 987 | |
| 988 | | `Magic_Number` | `Frame_Size` | `User_Data` | |
| 989 | |:--------------:|:------------:|:-----------:| |
| 990 | | 4 bytes | 4 bytes | n bytes | |
| 991 | |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 992 | Skippable frames allow the insertion of user-defined metadata |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 993 | into a flow of concatenated frames. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 994 | |
| 995 | Skippable frames defined in this specification are compatible with [LZ4] ones. |
| 996 | |
| Danielle Rozenblit | 4dffc35 | 2022-12-14 06:58:35 -0800 | [diff] [blame] | 997 | [LZ4]:https://lz4.github.io/lz4/ |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 998 | |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 999 | From a compliant decoder perspective, skippable frames need just be skipped, |
| 1000 | and their content ignored, resuming decoding after the skippable frame. |
| 1001 | |
| 1002 | It can be noted that a skippable frame |
| 1003 | can be used to watermark a stream of concatenated frames |
| Dominique Pelle | b772f53 | 2022-03-12 08:52:40 +0100 | [diff] [blame] | 1004 | embedding any kind of tracking information (even just a UUID). |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1005 | Users wary of such possibility should scan the stream of concatenated frames |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 1006 | in an attempt to detect such frame for analysis or removal. |
| 1007 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1008 | __`Magic_Number`__ |
| 1009 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1010 | 4 Bytes, __little-endian__ format. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1011 | Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F. |
| 1012 | All 16 values are valid to identify a skippable frame. |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1013 | This specification doesn't detail any specific tagging for skippable frames. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1014 | |
| 1015 | __`Frame_Size`__ |
| 1016 | |
| 1017 | This is the size, in bytes, of the following `User_Data` |
| 1018 | (without including the magic number nor the size field itself). |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1019 | This field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1020 | This means `User_Data` can’t be bigger than (2^32-1) bytes. |
| 1021 | |
| 1022 | __`User_Data`__ |
| 1023 | |
| 1024 | The `User_Data` can be anything. Data will just be skipped by the decoder. |
| 1025 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1026 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1027 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1028 | Entropy Encoding |
| 1029 | ---------------- |
| 1030 | Two types of entropy encoding are used by the Zstandard format: |
| 1031 | FSE, and Huffman coding. |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1032 | Huffman is used to compress literals, |
| 1033 | while FSE is used for all other symbols |
| 1034 | (`Literals_Length_Code`, `Match_Length_Code`, offset codes) |
| 1035 | and to compress Huffman headers. |
| 1036 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1037 | |
| 1038 | FSE |
| 1039 | --- |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1040 | FSE, short for Finite State Entropy, is an entropy codec based on [ANS]. |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1041 | FSE encoding/decoding involves a state that is carried over between symbols. |
| 1042 | Decoding must be done in the opposite direction as encoding. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1043 | Therefore, all FSE bitstreams are read from end to beginning. |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 1044 | Note that the order of the bits in the stream is not reversed, |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1045 | we just read each multi-bits element in the reverse order they are encoded. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1046 | |
| 1047 | For additional details on FSE, see [Finite State Entropy]. |
| 1048 | |
| 1049 | [Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/ |
| 1050 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1051 | FSE decoding is directed by a decoding table with a power of 2 size, each row containing three elements: |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1052 | `Symbol`, `Num_Bits`, and `Baseline`. |
| 1053 | The `log2` of the table size is its `Accuracy_Log`. |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1054 | An FSE state value represents an index in this table. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1055 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1056 | To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value. |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1057 | The first symbol in the stream is the `Symbol` indicated in the table for that state. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1058 | To obtain the next state value, |
| 1059 | the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1060 | |
| 1061 | [ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems |
| 1062 | |
| 1063 | ### FSE Table Description |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1064 | To decode an FSE bitstream, it is necessary to build its FSE decoding table. |
| 1065 | The decoding table is derived from a distribution of Probabilities. |
| 1066 | The Zstandard format encodes distributions of Probabilities as follows: |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1067 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1068 | The distribution of probabilities is described in a bitstream which is read forward, |
| 1069 | in __little-endian__ fashion. |
| 1070 | The amount of bytes consumed from the bitstream to describe the distribution |
| 1071 | is discovered at the end of the decoding process. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1072 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1073 | The bitstream starts by reporting on which scale the distribution operates. |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1074 | Let's `low4Bits` designate the lowest 4 bits of the first byte : |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1075 | `Accuracy_Log = low4bits + 5`. |
| 1076 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1077 | An FSE distribution table describes the probabilities of all symbols |
| 1078 | from `0` to the last present one (included) in natural order. |
| 1079 | The sum of probabilities is normalized to reach a power of 2 total of `1 << Accuracy_Log` . |
| 1080 | There must be two or more symbols with non-zero probabilities. |
| 1081 | |
| 1082 | The number of bits used to decode each probability is variable. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1083 | It depends on : |
| 1084 | |
| 1085 | - Remaining probabilities + 1 : |
| 1086 | __example__ : |
| 1087 | Presuming an `Accuracy_Log` of 8, |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1088 | and presuming 100 probability points have already been distributed, |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 1089 | the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive). |
| elasota | 324cce4 | 2023-10-31 11:42:00 -0400 | [diff] [blame] | 1090 | Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)` |
| 1091 | is the smallest integer `T` that satisfies `(1 << T) > N`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1092 | |
| 1093 | - Value decoded : small values use 1 less bit : |
| 1094 | __example__ : |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 1095 | Presuming values from 0 to 157 (inclusive) are possible, |
| 1096 | 255-157 = 98 values are remaining in an 8-bits field. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1097 | They are used this way : |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 1098 | first 98 values (hence from 0 to 97) use only 7 bits, |
| 1099 | values from 98 to 157 use 8 bits. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1100 | This is achieved through this scheme : |
| 1101 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1102 | | 8-bit field read | Value decoded | Nb of bits consumed | |
| 1103 | | ---------------- | ------------- | ------------------- | |
| 1104 | | 0 - 97 | 0 - 97 | 7 | |
| 1105 | | 98 - 127 | 98 - 127 | 8 | |
| 1106 | | 128 - 225 | 0 - 97 | 7 | |
| 1107 | | 226 - 255 | 128 - 157 | 8 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1108 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1109 | Probability is derived from Value decoded using the following formula: |
| 1110 | `Probality = Value - 1` |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1111 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1112 | Consequently, a Probability of `0` is described by a Value `1`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1113 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1114 | A Value `0` is used to signal a special case, named "Probability `-1`". |
| 1115 | It describes a probability which should have been "less than 1". |
| 1116 | Its effect on the decoding table building process is described in the [next section]. |
| 1117 | For the purpose of counting total allocated probability points, it counts as one. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1118 | |
| 1119 | [next section]:#from-normalized-distribution-to-decoding-tables |
| 1120 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1121 | Symbols probabilities are read one by one, in order. |
| 1122 | After each probability is decoded, the total nb of probability points is updated. |
| Dimitri Papadopoulos | fcf88ae | 2024-11-26 11:15:39 +0100 | [diff] [blame] | 1123 | This is used to determine how many bits must be read to decode the probability of next symbol. |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1124 | |
| 1125 | When a symbol has a __probability__ of `zero` (decoded from reading a Value `1`), |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1126 | it is followed by a 2-bits repeat flag. |
| 1127 | This repeat flag tells how many probabilities of zeroes follow the current one. |
| 1128 | It provides a number ranging from 0 to 3. |
| 1129 | If it is a 3, another 2-bits repeat flag follows, and so on. |
| 1130 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1131 | When the Probability for a symbol makes cumulated total reach `1 << Accuracy_Log`, |
| 1132 | then it's the last symbol, and decoding is complete. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1133 | |
| 1134 | Then the decoder can tell how many bytes were used in this process, |
| 1135 | and how many symbols are present. |
| 1136 | The bitstream consumes a round number of bytes. |
| 1137 | Any remaining bit within the last byte is just unused. |
| 1138 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1139 | If this process results in a non-zero probability for a symbol outside of the |
| 1140 | valid range of symbols that the FSE table is defined for, even if that symbol is |
| 1141 | not used, then the data is considered corrupted. |
| 1142 | For the specific case of offset codes, |
| 1143 | a decoder implementation may reject a frame containing a non-zero probability |
| 1144 | for an offset code larger than the largest offset code supported by the decoder |
| 1145 | implementation. |
| 1146 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1147 | #### From normalized distribution to decoding tables |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1148 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1149 | The normalized distribution of probabilities is enough |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1150 | to create a unique decoding table. |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1151 | It is generated using the following build rule : |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1152 | |
| 1153 | The table has a size of `Table_Size = 1 << Accuracy_Log`. |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1154 | Each row specifies the decoded symbol, |
| 1155 | and instructions to reach the next state (`Number_of_Bits` and `Baseline`). |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1156 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1157 | Symbols are first scanned in their natural order for "less than 1" probabilities |
| 1158 | (previously decoded from a Value of `0`). |
| 1159 | Symbols with this special probability are being attributed a single row, |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 1160 | starting from the end of the table and retreating. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1161 | These symbols define a full state reset, reading `Accuracy_Log` bits. |
| 1162 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1163 | Then, all remaining symbols, sorted in natural order, are allocated rows. |
| 1164 | Starting from smallest present symbol, and table position `0`, |
| 1165 | each symbol gets allocated as many rows as its probability. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1166 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1167 | Row allocation is not linear, it follows this order, in modular arithmetic: |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1168 | ``` |
| 1169 | position += (tableSize>>1) + (tableSize>>3) + 3; |
| 1170 | position &= tableSize-1; |
| 1171 | ``` |
| 1172 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1173 | Using above ordering rule, each symbol gets allocated as many rows as its probability. |
| 1174 | If a position is already occupied by a "less than 1" probability symbol, |
| 1175 | it is simply skipped, and the next position is allocated instead. |
| 1176 | Once enough rows have been allocated for the current symbol, |
| 1177 | the allocation process continues, using the next symbol, in natural order. |
| 1178 | This process guarantees that the table is entirely and exactly filled. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1179 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1180 | Each row specifies a decoded symbol, and is accessed by current state value. |
| 1181 | It also specifies `Number_of_Bits` and `Baseline`, which are required to determine next state value. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1182 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1183 | To correctly set these fields, it's necessary to sort all occurrences of each symbol in state value order, |
| 1184 | and then attribute N+1 bits to lower rows, and N bits to higher rows, |
| 1185 | following the process described below (using an example): |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1186 | |
| 1187 | __Example__ : |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1188 | Presuming an `Accuracy_Log` of 7, |
| 1189 | let's imagine a symbol with a Probability of 5: |
| 1190 | it receives 5 rows, corresponding to 5 state values between `0` and `127`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1191 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1192 | In this example, the first state value happens to be `1` (after unspecified previous symbols). |
| 1193 | The next 4 states are then determined using above modular arithmetic rule, |
| 1194 | which specifies to add `64+16+3 = 83` modulo `128` to jump to next position, |
| 1195 | producing the following series: `1`, `84`, `39`, `122`, `77` (modular arithmetic). |
| 1196 | (note: the next symbol will then start at `32`). |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1197 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1198 | These state values are then sorted in natural order, |
| 1199 | resulting in the following series: `1`, `39`, `77`, `84`, `122`. |
| 1200 | |
| 1201 | The next power of 2 after 5 is 8. |
| 1202 | Therefore, the probability space will be divided into 8 equal parts. |
| 1203 | Since the probability space is `1<<7 = 128` large, each share is `128/8 = 16` large. |
| 1204 | |
| 1205 | In order to reach 8 shares, the `8-5 = 3` lowest states will count "double", |
| Yann Collet | ff7bd16 | 2019-10-18 17:48:12 -0700 | [diff] [blame] | 1206 | doubling their shares (32 in width), hence requiring one more bit. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1207 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1208 | Baseline is assigned starting from the lowest state using fewer bits, |
| 1209 | continuing in natural state order, looping back at the beginning. |
| 1210 | Each state takes its allocated range from Baseline, sized by its `Number_of_Bits`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1211 | |
| Yann Collet | ff7bd16 | 2019-10-18 17:48:12 -0700 | [diff] [blame] | 1212 | | state order | 0 | 1 | 2 | 3 | 4 | |
| 1213 | | ---------------- | ----- | ----- | ------ | ---- | ------ | |
| elasota | 52e41b9 | 2023-11-09 12:22:27 -0500 | [diff] [blame] | 1214 | | state value | 1 | 39 | 77 | 84 | 122 | |
| Yann Collet | ff7bd16 | 2019-10-18 17:48:12 -0700 | [diff] [blame] | 1215 | | width | 32 | 32 | 32 | 16 | 16 | |
| 1216 | | `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1217 | | allocation order | 3 | 4 | 5 | 1 | 2 | |
| Yann Collet | ff7bd16 | 2019-10-18 17:48:12 -0700 | [diff] [blame] | 1218 | | `Baseline` | 32 | 64 | 96 | 0 | 16 | |
| 1219 | | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1220 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1221 | During decoding, the next state value is determined by using current state value as row number, |
| 1222 | then reading the required `Number_of_Bits` from the bitstream, and adding the specified `Baseline`. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1223 | |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1224 | Note: |
| 1225 | as a trivial example, it follows that, for a symbol with a Probability of `1`, |
| 1226 | `Baseline` is necessarily `0`, and `Number_of_Bits` is necessarily `Accuracy_Log`. |
| 1227 | |
| 1228 | See [Appendix A] to see the outcome of this process applied to the default distributions. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1229 | |
| 1230 | [Appendix A]: #appendix-a---decoding-tables-for-predefined-codes |
| 1231 | |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 1232 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1233 | Huffman Coding |
| 1234 | -------------- |
| 1235 | Zstandard Huffman-coded streams are read backwards, |
| 1236 | similar to the FSE bitstreams. |
| Yann Collet | 832f559 | 2023-02-18 18:16:00 -0800 | [diff] [blame] | 1237 | Therefore, to find the start of the bitstream, it is required to |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1238 | know the offset of the last byte of the Huffman-coded stream. |
| 1239 | |
| 1240 | After writing the last bit containing information, the compressor |
| 1241 | writes a single `1`-bit and then fills the byte with 0-7 `0` bits of |
| 1242 | padding. The last byte of the compressed bitstream cannot be `0` for |
| 1243 | that reason. |
| 1244 | |
| 1245 | When decompressing, the last byte containing the padding is the first |
| 1246 | byte to read. The decompressor needs to skip 0-7 initial `0`-bits and |
| 1247 | the first `1`-bit it occurs. Afterwards, the useful part of the bitstream |
| 1248 | begins. |
| 1249 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1250 | The bitstream contains Huffman-coded symbols in __little-endian__ order, |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1251 | with the codes defined by the method below. |
| 1252 | |
| 1253 | ### Huffman Tree Description |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 1254 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1255 | Prefix coding represents symbols from an a priori known alphabet |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 1256 | by bit sequences (codewords), one codeword for each symbol, |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1257 | in a manner such that different symbols may be represented |
| 1258 | by bit sequences of different lengths, |
| 1259 | but a parser can always parse an encoded string |
| 1260 | unambiguously symbol-by-symbol. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 1261 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1262 | Given an alphabet with known symbol frequencies, |
| 1263 | the Huffman algorithm allows the construction of an optimal prefix code |
| 1264 | using the fewest bits of any possible prefix codes for that alphabet. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 1265 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1266 | Prefix code must not exceed a maximum code length. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 1267 | More bits improve accuracy but cost more header size, |
| Yann Collet | e557fd5 | 2016-07-17 16:21:37 +0200 | [diff] [blame] | 1268 | and require more memory or more complex decoding operations. |
| 1269 | This specification limits maximum code length to 11 bits. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 1270 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1271 | #### Representation |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1272 | |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1273 | All literal symbols from zero (included) to last present one (excluded) |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1274 | are represented by `Weight` with values from `0` to `Max_Number_of_Bits`. |
| 1275 | Transformation from `Weight` to `Number_of_Bits` follows this formula : |
| 1276 | ``` |
| 1277 | Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0 |
| 1278 | ``` |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1279 | When a literal symbol is not present, it receives a `Weight` of 0. |
| Yann Collet | 832f559 | 2023-02-18 18:16:00 -0800 | [diff] [blame] | 1280 | The least frequent symbol receives a `Weight` of 1. |
| elasota | 05059e5 | 2023-11-08 23:46:37 -0500 | [diff] [blame] | 1281 | If no literal has a `Weight` of 1, then the data is considered corrupted. |
| 1282 | If there are not at least two literals with non-zero `Weight`, then the data |
| 1283 | is considered corrupted. |
| Yann Collet | 832f559 | 2023-02-18 18:16:00 -0800 | [diff] [blame] | 1284 | The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max). |
| 1285 | The last symbol's `Weight` is deduced from previously retrieved Weights, |
| 1286 | by completing to the nearest power of 2. It's necessarily non 0. |
| 1287 | If it's not possible to reach a clean power of 2 with a single `Weight` value, |
| 1288 | the Huffman Tree Description is considered invalid. |
| 1289 | This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree. |
| Yann Collet | 55a8f84 | 2018-09-05 12:25:35 -0700 | [diff] [blame] | 1290 | `Max_Number_of_Bits` must be <= 11, |
| 1291 | otherwise the representation is considered corrupted. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1292 | |
| 1293 | __Example__ : |
| inikep | 586a055 | 2016-08-03 16:16:38 +0200 | [diff] [blame] | 1294 | Let's presume the following Huffman tree must be described : |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1295 | |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1296 | | literal symbol | A | B | C | D | E | F | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1297 | | ---------------- | --- | --- | --- | --- | --- | --- | |
| 1298 | | `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1299 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1300 | The tree depth is 4, since its longest elements uses 4 bits |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1301 | (longest elements are the ones with smallest frequency). |
| 1302 | |
| 1303 | All symbols will now receive a `Weight` instead of `Number_of_Bits`. |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 1304 | Weight formula is : |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1305 | ``` |
| 1306 | Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0 |
| 1307 | ``` |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1308 | It gives the following series of Weights : |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1309 | |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1310 | | literal symbol | A | B | C | D | E | F | |
| 1311 | | -------------- | --- | --- | --- | --- | --- | --- | |
| 1312 | | `Weight` | 4 | 3 | 2 | 0 | 1 | 1 | |
| 1313 | |
| 1314 | This list will be sent to the decoder, with the following modifications: |
| 1315 | |
| 1316 | - `F` will not be listed, because it can be determined from previous symbols |
| 1317 | - nor will symbols above `F` as they are all 0 |
| 1318 | - on the other hand, all symbols before `A`, starting with `\0`, will be listed, with a Weight of 0. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1319 | |
| 1320 | The decoder will do the inverse operation : |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1321 | having collected weights of literal symbols from `A` to `E`, |
| 1322 | it knows the last literal, `F`, is present with a non-zero `Weight`. |
| 1323 | The `Weight` of `F` can be determined by advancing to the next power of 2. |
| Sean Purcell | ab226d4 | 2017-01-25 16:41:52 -0800 | [diff] [blame] | 1324 | The sum of `2^(Weight-1)` (excluding 0's) is : |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1325 | `8 + 4 + 2 + 0 + 1 = 15`. |
| Yann Collet | 55a8f84 | 2018-09-05 12:25:35 -0700 | [diff] [blame] | 1326 | Nearest larger power of 2 value is 16. |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1327 | Therefore, `Max_Number_of_Bits = log2(16) = 4` and `Weight[F] = log_2(16 - 15) + 1 = 1`. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1328 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1329 | #### Huffman Tree header |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1330 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1331 | This is a single byte value (0-255), |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1332 | which describes how the series of weights is encoded. |
| 1333 | |
| 1334 | - if `headerByte` < 128 : |
| 1335 | the series of weights is compressed using FSE (see below). |
| 1336 | The length of the FSE-compressed series is equal to `headerByte` (0-127). |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1337 | |
| Yann Collet | 55a8f84 | 2018-09-05 12:25:35 -0700 | [diff] [blame] | 1338 | - if `headerByte` >= 128 : |
| 1339 | + the series of weights uses a direct representation, |
| 1340 | where each `Weight` is encoded directly as a 4 bits field (0-15). |
| 1341 | + They are encoded forward, 2 weights to a byte, |
| 1342 | first weight taking the top four bits and second one taking the bottom four. |
| 1343 | * e.g. the following operations could be used to read the weights: |
| 1344 | `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc. |
| 1345 | + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes, |
| 1346 | meaning it uses only full bytes even if `Number_of_Weights` is odd. |
| 1347 | + `Number_of_Weights = headerByte - 127`. |
| 1348 | * Note that maximum `Number_of_Weights` is 255-127 = 128, |
| 1349 | therefore, only up to 128 `Weight` can be encoded using direct representation. |
| 1350 | * Since the last non-zero `Weight` is _not_ encoded, |
| 1351 | this scheme is compatible with alphabet sizes of up to 129 symbols, |
| 1352 | hence including literal symbol 128. |
| 1353 | * If any literal symbol > 128 has a non-zero `Weight`, |
| 1354 | direct representation is not possible. |
| 1355 | In such case, it's necessary to use FSE compression. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1356 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1357 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1358 | #### Finite State Entropy (FSE) compression of Huffman weights |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1359 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1360 | In this case, the series of Huffman weights is compressed using FSE compression. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1361 | It's a single bitstream with 2 interleaved states, |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 1362 | sharing a single distribution table. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1363 | |
| 1364 | To decode an FSE bitstream, it is necessary to know its compressed size. |
| 1365 | Compressed size is provided by `headerByte`. |
| Yann Collet | 38b75dd | 2016-07-24 15:35:59 +0200 | [diff] [blame] | 1366 | It's also necessary to know its _maximum possible_ decompressed size, |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1367 | which is `255`, since literal symbols span from `0` to `255`, |
| Yann Collet | 72a3adf | 2018-09-25 16:34:26 -0700 | [diff] [blame] | 1368 | and last symbol's `Weight` is not represented. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1369 | |
| 1370 | An FSE bitstream starts by a header, describing probabilities distribution. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 1371 | It will create a Decoding Table. |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 1372 | For a list of Huffman weights, the maximum accuracy log is 6 bits. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1373 | For more description see the [FSE header description](#fse-table-description) |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1374 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1375 | The Huffman header compression uses 2 states, |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 1376 | which share the same FSE distribution table. |
| Sean Purcell | ab226d4 | 2017-01-25 16:41:52 -0800 | [diff] [blame] | 1377 | The first state (`State1`) encodes the even indexed symbols, |
| Yann Collet | 72a3adf | 2018-09-25 16:34:26 -0700 | [diff] [blame] | 1378 | and the second (`State2`) encodes the odd indexed symbols. |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1379 | `State1` is initialized first, and then `State2`, and they take turns |
| 1380 | decoding a single symbol and updating their state. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1381 | For more details on these FSE operations, see the [FSE section](#fse). |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 1382 | |
| Sean Purcell | ab226d4 | 2017-01-25 16:41:52 -0800 | [diff] [blame] | 1383 | The number of symbols to decode is determined |
| 1384 | by tracking bitStream overflow condition: |
| 1385 | If updating state after decoding a symbol would require more bits than |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1386 | remain in the stream, it is assumed that extra bits are 0. Then, |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 1387 | symbols for each of the final states are decoded and the process is complete. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1388 | |
| elasota | e61e3ff | 2023-11-08 20:06:58 -0500 | [diff] [blame] | 1389 | If this process would produce more weights than the maximum number of decoded |
| 1390 | weights (255), then the data is considered corrupted. |
| 1391 | |
| elasota | 0938308 | 2024-06-20 15:19:58 -0400 | [diff] [blame] | 1392 | If either of the 2 initial states are absent or truncated, then the data is |
| 1393 | considered corrupted. Consequently, it is not possible to encode fewer than |
| 1394 | 2 weights using this mode. |
| 1395 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1396 | #### Conversion from weights to Huffman prefix codes |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 1397 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1398 | All present symbols shall now have a `Weight` value. |
| Yann Collet | c1e6347 | 2018-06-21 18:08:11 -0700 | [diff] [blame] | 1399 | It is possible to transform weights into `Number_of_Bits`, using this formula: |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1400 | ``` |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1401 | Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0 |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1402 | ``` |
| Yann Collet | 3e7c66a | 2024-10-09 01:06:24 -0700 | [diff] [blame] | 1403 | In order to determine which prefix code is assigned to each Symbol, |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1404 | Symbols are first sorted by `Weight`, then by natural sequential order. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1405 | Symbols with a `Weight` of zero are removed. |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1406 | Then, starting from lowest `Weight` (hence highest `Number_of_Bits`), |
| 1407 | prefix codes are assigned in ascending order. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1408 | |
| 1409 | __Example__ : |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1410 | Let's assume the following list of weights has been decoded: |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1411 | |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1412 | | Literal | A | B | C | D | E | F | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1413 | | -------- | --- | --- | --- | --- | --- | --- | |
| 1414 | | `Weight` | 4 | 3 | 2 | 0 | 1 | 1 | |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1415 | |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 1416 | Sorted by weight and then natural sequential order, |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1417 | it gives the following prefix codes distribution: |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1418 | |
| Yann Collet | 3e7c66a | 2024-10-09 01:06:24 -0700 | [diff] [blame] | 1419 | | Literal | D | E | F | C | B | A | |
| 1420 | | ---------------- | --- | ---- | ---- | ---- | ---- | ---- | |
| 1421 | | `Weight` | 0 | 1 | 1 | 2 | 3 | 4 | |
| 1422 | | `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 | |
| 1423 | | prefix code | N/A | 0000 | 0001 | 001 | 01 | 1 | |
| 1424 | | ascending order | N/A | 0000 | 0001 | 001x | 01xx | 1xxx | |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1425 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1426 | ### Huffman-coded Streams |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 1427 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1428 | Given a Huffman decoding table, |
| 1429 | it's possible to decode a Huffman-coded stream. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1430 | |
| 1431 | Each bitstream must be read _backward_, |
| 1432 | that is starting from the end down to the beginning. |
| 1433 | Therefore it's necessary to know the size of each bitstream. |
| 1434 | |
| Yann Collet | 72a3adf | 2018-09-25 16:34:26 -0700 | [diff] [blame] | 1435 | It's also necessary to know exactly which _bit_ is the last one. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1436 | This is detected by a final bit flag : |
| 1437 | the highest bit of latest byte is a final-bit-flag. |
| 1438 | Consequently, a last byte of `0` is not possible. |
| 1439 | And the final-bit-flag itself is not part of the useful bitstream. |
| Yann Collet | 38b75dd | 2016-07-24 15:35:59 +0200 | [diff] [blame] | 1440 | Hence, the last byte contains between 0 and 7 useful bits. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1441 | |
| Yann Collet | 8b12812 | 2017-08-19 12:17:57 -0700 | [diff] [blame] | 1442 | Starting from the end, |
| 1443 | it's possible to read the bitstream in a __little-endian__ fashion, |
| 1444 | keeping track of already used bits. Since the bitstream is encoded in reverse |
| 1445 | order, starting from the end read symbols in forward order. |
| 1446 | |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1447 | For example, if the literal sequence `ABEF` was encoded using above prefix code, |
| Yann Collet | 8b12812 | 2017-08-19 12:17:57 -0700 | [diff] [blame] | 1448 | it would be encoded (in reverse order) as: |
| Sean Purcell | ab226d4 | 2017-01-25 16:41:52 -0800 | [diff] [blame] | 1449 | |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1450 | |Symbol | F | E | B | A | Padding | |
| Sean Purcell | ab226d4 | 2017-01-25 16:41:52 -0800 | [diff] [blame] | 1451 | |--------|------|------|----|---|---------| |
| Yann Collet | 8b12812 | 2017-08-19 12:17:57 -0700 | [diff] [blame] | 1452 | |Encoding|`0000`|`0001`|`01`|`1`| `00001` | |
| Sean Purcell | ab226d4 | 2017-01-25 16:41:52 -0800 | [diff] [blame] | 1453 | |
| Yann Collet | 8b12812 | 2017-08-19 12:17:57 -0700 | [diff] [blame] | 1454 | Resulting in following 2-bytes bitstream : |
| 1455 | ``` |
| 1456 | 00010000 00001101 |
| 1457 | ``` |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1458 | |
| Yann Collet | e8d35cc | 2017-08-20 10:39:20 -0700 | [diff] [blame] | 1459 | Here is an alternative representation with the symbol codes separated by underscore: |
| Yann Collet | d0d06e4 | 2017-08-19 12:26:09 -0700 | [diff] [blame] | 1460 | ``` |
| 1461 | 0001_0000 00001_1_01 |
| 1462 | ``` |
| 1463 | |
| Yann Collet | 8b12812 | 2017-08-19 12:17:57 -0700 | [diff] [blame] | 1464 | Reading highest `Max_Number_of_Bits` bits, |
| 1465 | it's possible to compare extracted value to decoding table, |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1466 | determining the symbol to decode and number of bits to discard. |
| 1467 | |
| 1468 | The process continues up to reading the required number of symbols per stream. |
| 1469 | If a bitstream is not entirely and exactly consumed, |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 1470 | hence reaching exactly its beginning position with _all_ bits consumed, |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 1471 | the decoding process is considered faulty. |
| 1472 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1473 | |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1474 | Dictionary Format |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1475 | ----------------- |
| 1476 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1477 | Zstandard is compatible with "raw content" dictionaries, |
| 1478 | free of any format restriction, except that they must be at least 8 bytes. |
| 1479 | These dictionaries function as if they were just the `Content` part |
| 1480 | of a formatted dictionary. |
| Sean Purcell | ab226d4 | 2017-01-25 16:41:52 -0800 | [diff] [blame] | 1481 | |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1482 | But dictionaries created by `zstd --train` follow a format, described here. |
| 1483 | |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 1484 | __Pre-requisites__ : a dictionary has a size, |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1485 | defined either by a buffer limit, or a file size. |
| 1486 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1487 | | `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` | |
| 1488 | | -------------- | --------------- | ---------------- | --------- | |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1489 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1490 | __`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1491 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1492 | __`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1493 | `Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`). |
| Yann Collet | 722e14b | 2016-07-08 19:22:16 +0200 | [diff] [blame] | 1494 | It's used by decoders to check if they use the correct dictionary. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1495 | |
| 1496 | _Reserved ranges :_ |
| Yann Collet | 11a392c | 2020-05-26 13:15:35 -0700 | [diff] [blame] | 1497 | If the dictionary is going to be distributed in a public environment, |
| 1498 | the following ranges of `Dictionary_ID` are reserved for some future registrar |
| 1499 | and shall not be used : |
| Yann Collet | 6cacd34 | 2016-07-15 17:58:13 +0200 | [diff] [blame] | 1500 | |
| Yann Collet | 11a392c | 2020-05-26 13:15:35 -0700 | [diff] [blame] | 1501 | - low range : <= 32767 |
| 1502 | - high range : >= (2^31) |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1503 | |
| Yann Collet | 11a392c | 2020-05-26 13:15:35 -0700 | [diff] [blame] | 1504 | Outside of these ranges, any value of `Dictionary_ID` |
| 1505 | which is both `>= 32768` and `< (1<<31)` can be used freely, |
| 1506 | even in public environment. |
| Yann Collet | bb3c9bf | 2020-05-25 08:15:09 -0700 | [diff] [blame] | 1507 | |
| 1508 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1509 | __`Entropy_Tables`__ : follow the same format as tables in [compressed blocks]. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1510 | See the relevant [FSE](#fse-table-description) |
| 1511 | and [Huffman](#huffman-tree-description) sections for how to decode these tables. |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 1512 | They are stored in following order : |
| 1513 | Huffman tables for literals, FSE table for offsets, |
| 1514 | FSE table for match lengths, and FSE table for literals lengths. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1515 | These tables populate the Repeat Stats literals mode and |
| 1516 | Repeat distribution mode for sequence decoding. |
| Sean Purcell | ab226d4 | 2017-01-25 16:41:52 -0800 | [diff] [blame] | 1517 | It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`), |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1518 | stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes. |
| senhuang42 | 8adeb9f | 2020-09-22 13:24:27 -0400 | [diff] [blame] | 1519 | Each recent offset must have a value <= dictionary content size, and cannot equal 0. |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1520 | |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 1521 | __`Content`__ : The rest of the dictionary is its content. |
| Sean Purcell | ab226d4 | 2017-01-25 16:41:52 -0800 | [diff] [blame] | 1522 | The content act as a "past" in front of data to compress or decompress, |
| 1523 | so it can be referenced in sequence commands. |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1524 | As long as the amount of data decoded from this frame is less than or |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1525 | equal to `Window_Size`, sequence commands may specify offsets longer |
| 1526 | than the total length of decoded output so far to reference back to the |
| Yann Collet | bb3c9bf | 2020-05-25 08:15:09 -0700 | [diff] [blame] | 1527 | dictionary, even parts of the dictionary with offsets larger than `Window_Size`. |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 1528 | After the total output has surpassed `Window_Size` however, |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1529 | this is no longer allowed and the dictionary is no longer accessible. |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1530 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 1531 | [compressed blocks]: #the-format-of-compressed_block |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1532 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1533 | If a dictionary is provided by an external source, |
| 1534 | it should be loaded with great care, its content considered untrusted. |
| 1535 | |
| 1536 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1537 | |
| Johannes Rudolph | 6fb4d67 | 2016-09-14 19:01:04 +0200 | [diff] [blame] | 1538 | Appendix A - Decoding tables for predefined codes |
| 1539 | ------------------------------------------------- |
| 1540 | |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1541 | This appendix contains FSE decoding tables |
| 1542 | for the predefined literal length, match length, and offset codes. |
| 1543 | The tables have been constructed using the algorithm as given above in chapter |
| 1544 | "from normalized distribution to decoding tables". |
| 1545 | The tables here can be used as examples |
| 1546 | to crosscheck that an implementation build its decoding tables correctly. |
| Johannes Rudolph | 6fb4d67 | 2016-09-14 19:01:04 +0200 | [diff] [blame] | 1547 | |
| 1548 | #### Literal Length Code: |
| 1549 | |
| 1550 | | State | Symbol | Number_Of_Bits | Base | |
| 1551 | | ----- | ------ | -------------- | ---- | |
| 1552 | | 0 | 0 | 4 | 0 | |
| 1553 | | 1 | 0 | 4 | 16 | |
| 1554 | | 2 | 1 | 5 | 32 | |
| 1555 | | 3 | 3 | 5 | 0 | |
| 1556 | | 4 | 4 | 5 | 0 | |
| 1557 | | 5 | 6 | 5 | 0 | |
| 1558 | | 6 | 7 | 5 | 0 | |
| 1559 | | 7 | 9 | 5 | 0 | |
| 1560 | | 8 | 10 | 5 | 0 | |
| 1561 | | 9 | 12 | 5 | 0 | |
| 1562 | | 10 | 14 | 6 | 0 | |
| 1563 | | 11 | 16 | 5 | 0 | |
| 1564 | | 12 | 18 | 5 | 0 | |
| 1565 | | 13 | 19 | 5 | 0 | |
| 1566 | | 14 | 21 | 5 | 0 | |
| 1567 | | 15 | 22 | 5 | 0 | |
| 1568 | | 16 | 24 | 5 | 0 | |
| 1569 | | 17 | 25 | 5 | 32 | |
| 1570 | | 18 | 26 | 5 | 0 | |
| 1571 | | 19 | 27 | 6 | 0 | |
| 1572 | | 20 | 29 | 6 | 0 | |
| 1573 | | 21 | 31 | 6 | 0 | |
| 1574 | | 22 | 0 | 4 | 32 | |
| 1575 | | 23 | 1 | 4 | 0 | |
| 1576 | | 24 | 2 | 5 | 0 | |
| 1577 | | 25 | 4 | 5 | 32 | |
| 1578 | | 26 | 5 | 5 | 0 | |
| 1579 | | 27 | 7 | 5 | 32 | |
| 1580 | | 28 | 8 | 5 | 0 | |
| 1581 | | 29 | 10 | 5 | 32 | |
| 1582 | | 30 | 11 | 5 | 0 | |
| 1583 | | 31 | 13 | 6 | 0 | |
| 1584 | | 32 | 16 | 5 | 32 | |
| 1585 | | 33 | 17 | 5 | 0 | |
| 1586 | | 34 | 19 | 5 | 32 | |
| 1587 | | 35 | 20 | 5 | 0 | |
| 1588 | | 36 | 22 | 5 | 32 | |
| 1589 | | 37 | 23 | 5 | 0 | |
| 1590 | | 38 | 25 | 4 | 0 | |
| 1591 | | 39 | 25 | 4 | 16 | |
| 1592 | | 40 | 26 | 5 | 32 | |
| 1593 | | 41 | 28 | 6 | 0 | |
| 1594 | | 42 | 30 | 6 | 0 | |
| 1595 | | 43 | 0 | 4 | 48 | |
| 1596 | | 44 | 1 | 4 | 16 | |
| 1597 | | 45 | 2 | 5 | 32 | |
| 1598 | | 46 | 3 | 5 | 32 | |
| 1599 | | 47 | 5 | 5 | 32 | |
| 1600 | | 48 | 6 | 5 | 32 | |
| 1601 | | 49 | 8 | 5 | 32 | |
| 1602 | | 50 | 9 | 5 | 32 | |
| 1603 | | 51 | 11 | 5 | 32 | |
| 1604 | | 52 | 12 | 5 | 32 | |
| 1605 | | 53 | 15 | 6 | 0 | |
| 1606 | | 54 | 17 | 5 | 32 | |
| 1607 | | 55 | 18 | 5 | 32 | |
| 1608 | | 56 | 20 | 5 | 32 | |
| 1609 | | 57 | 21 | 5 | 32 | |
| 1610 | | 58 | 23 | 5 | 32 | |
| 1611 | | 59 | 24 | 5 | 32 | |
| 1612 | | 60 | 35 | 6 | 0 | |
| 1613 | | 61 | 34 | 6 | 0 | |
| 1614 | | 62 | 33 | 6 | 0 | |
| 1615 | | 63 | 32 | 6 | 0 | |
| 1616 | |
| 1617 | #### Match Length Code: |
| 1618 | |
| 1619 | | State | Symbol | Number_Of_Bits | Base | |
| 1620 | | ----- | ------ | -------------- | ---- | |
| 1621 | | 0 | 0 | 6 | 0 | |
| 1622 | | 1 | 1 | 4 | 0 | |
| 1623 | | 2 | 2 | 5 | 32 | |
| 1624 | | 3 | 3 | 5 | 0 | |
| 1625 | | 4 | 5 | 5 | 0 | |
| 1626 | | 5 | 6 | 5 | 0 | |
| 1627 | | 6 | 8 | 5 | 0 | |
| 1628 | | 7 | 10 | 6 | 0 | |
| 1629 | | 8 | 13 | 6 | 0 | |
| 1630 | | 9 | 16 | 6 | 0 | |
| 1631 | | 10 | 19 | 6 | 0 | |
| 1632 | | 11 | 22 | 6 | 0 | |
| 1633 | | 12 | 25 | 6 | 0 | |
| 1634 | | 13 | 28 | 6 | 0 | |
| 1635 | | 14 | 31 | 6 | 0 | |
| 1636 | | 15 | 33 | 6 | 0 | |
| 1637 | | 16 | 35 | 6 | 0 | |
| 1638 | | 17 | 37 | 6 | 0 | |
| 1639 | | 18 | 39 | 6 | 0 | |
| 1640 | | 19 | 41 | 6 | 0 | |
| 1641 | | 20 | 43 | 6 | 0 | |
| 1642 | | 21 | 45 | 6 | 0 | |
| 1643 | | 22 | 1 | 4 | 16 | |
| 1644 | | 23 | 2 | 4 | 0 | |
| 1645 | | 24 | 3 | 5 | 32 | |
| 1646 | | 25 | 4 | 5 | 0 | |
| 1647 | | 26 | 6 | 5 | 32 | |
| 1648 | | 27 | 7 | 5 | 0 | |
| 1649 | | 28 | 9 | 6 | 0 | |
| 1650 | | 29 | 12 | 6 | 0 | |
| 1651 | | 30 | 15 | 6 | 0 | |
| 1652 | | 31 | 18 | 6 | 0 | |
| 1653 | | 32 | 21 | 6 | 0 | |
| 1654 | | 33 | 24 | 6 | 0 | |
| 1655 | | 34 | 27 | 6 | 0 | |
| 1656 | | 35 | 30 | 6 | 0 | |
| 1657 | | 36 | 32 | 6 | 0 | |
| 1658 | | 37 | 34 | 6 | 0 | |
| 1659 | | 38 | 36 | 6 | 0 | |
| 1660 | | 39 | 38 | 6 | 0 | |
| 1661 | | 40 | 40 | 6 | 0 | |
| 1662 | | 41 | 42 | 6 | 0 | |
| 1663 | | 42 | 44 | 6 | 0 | |
| 1664 | | 43 | 1 | 4 | 32 | |
| 1665 | | 44 | 1 | 4 | 48 | |
| 1666 | | 45 | 2 | 4 | 16 | |
| 1667 | | 46 | 4 | 5 | 32 | |
| 1668 | | 47 | 5 | 5 | 32 | |
| 1669 | | 48 | 7 | 5 | 32 | |
| 1670 | | 49 | 8 | 5 | 32 | |
| 1671 | | 50 | 11 | 6 | 0 | |
| 1672 | | 51 | 14 | 6 | 0 | |
| 1673 | | 52 | 17 | 6 | 0 | |
| 1674 | | 53 | 20 | 6 | 0 | |
| 1675 | | 54 | 23 | 6 | 0 | |
| 1676 | | 55 | 26 | 6 | 0 | |
| 1677 | | 56 | 29 | 6 | 0 | |
| 1678 | | 57 | 52 | 6 | 0 | |
| 1679 | | 58 | 51 | 6 | 0 | |
| 1680 | | 59 | 50 | 6 | 0 | |
| 1681 | | 60 | 49 | 6 | 0 | |
| 1682 | | 61 | 48 | 6 | 0 | |
| 1683 | | 62 | 47 | 6 | 0 | |
| 1684 | | 63 | 46 | 6 | 0 | |
| 1685 | |
| 1686 | #### Offset Code: |
| 1687 | |
| 1688 | | State | Symbol | Number_Of_Bits | Base | |
| 1689 | | ----- | ------ | -------------- | ---- | |
| 1690 | | 0 | 0 | 5 | 0 | |
| 1691 | | 1 | 6 | 4 | 0 | |
| 1692 | | 2 | 9 | 5 | 0 | |
| 1693 | | 3 | 15 | 5 | 0 | |
| 1694 | | 4 | 21 | 5 | 0 | |
| 1695 | | 5 | 3 | 5 | 0 | |
| 1696 | | 6 | 7 | 4 | 0 | |
| 1697 | | 7 | 12 | 5 | 0 | |
| 1698 | | 8 | 18 | 5 | 0 | |
| 1699 | | 9 | 23 | 5 | 0 | |
| 1700 | | 10 | 5 | 5 | 0 | |
| 1701 | | 11 | 8 | 4 | 0 | |
| 1702 | | 12 | 14 | 5 | 0 | |
| 1703 | | 13 | 20 | 5 | 0 | |
| 1704 | | 14 | 2 | 5 | 0 | |
| 1705 | | 15 | 7 | 4 | 16 | |
| 1706 | | 16 | 11 | 5 | 0 | |
| 1707 | | 17 | 17 | 5 | 0 | |
| 1708 | | 18 | 22 | 5 | 0 | |
| 1709 | | 19 | 4 | 5 | 0 | |
| 1710 | | 20 | 8 | 4 | 16 | |
| 1711 | | 21 | 13 | 5 | 0 | |
| 1712 | | 22 | 19 | 5 | 0 | |
| 1713 | | 23 | 1 | 5 | 0 | |
| 1714 | | 24 | 6 | 4 | 16 | |
| 1715 | | 25 | 10 | 5 | 0 | |
| 1716 | | 26 | 16 | 5 | 0 | |
| 1717 | | 27 | 28 | 5 | 0 | |
| 1718 | | 28 | 27 | 5 | 0 | |
| 1719 | | 29 | 26 | 5 | 0 | |
| 1720 | | 30 | 25 | 5 | 0 | |
| 1721 | | 31 | 24 | 5 | 0 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 1722 | |
| Yann Collet | 7639db9 | 2018-06-21 17:48:34 -0700 | [diff] [blame] | 1723 | |
| 1724 | |
| 1725 | Appendix B - Resources for implementers |
| 1726 | ------------------------------------------------- |
| 1727 | |
| 1728 | An open source reference implementation is available on : |
| 1729 | https://github.com/facebook/zstd |
| 1730 | |
| 1731 | The project contains a frame generator, called [decodeCorpus], |
| 1732 | which can be used by any 3rd-party implementation |
| 1733 | to verify that a tested decoder is compliant with the specification. |
| 1734 | |
| 1735 | [decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing |
| 1736 | |
| 1737 | `decodeCorpus` generates random valid frames. |
| 1738 | A compliant decoder should be able to decode them all, |
| 1739 | or at least provide a meaningful error code explaining for which reason it cannot |
| 1740 | (memory limit restrictions for example). |
| 1741 | |
| 1742 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 1743 | Version changes |
| 1744 | --------------- |
| Yann Collet | 3b343dc | 2024-10-07 17:15:07 -0700 | [diff] [blame] | 1745 | - 0.4.3 : clarifications for Huffman prefix code assignment example |
| Yann Collet | a8b86d0 | 2024-10-02 22:57:36 -0700 | [diff] [blame] | 1746 | - 0.4.2 : refactor FSE table construction process, inspired by Donald Pian |
| 1747 | - 0.4.1 : clarifications on a few error scenarios, by Eric Lasota |
| Yann Collet | 3732a08 | 2023-06-05 16:03:00 -0700 | [diff] [blame] | 1748 | - 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov |
| Yann Collet | 64e8511 | 2023-03-08 15:30:27 -0800 | [diff] [blame] | 1749 | - 0.3.9 : clarifications for Huffman-compressed literal sizes. |
| Yann Collet | 832f559 | 2023-02-18 18:16:00 -0800 | [diff] [blame] | 1750 | - 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions. |
| Yann Collet | 0b0b62d | 2021-05-15 23:04:46 -0700 | [diff] [blame] | 1751 | - 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878 |
| Yann Collet | bb3c9bf | 2020-05-25 08:15:09 -0700 | [diff] [blame] | 1752 | - 0.3.6 : clarifications for Dictionary_ID |
| Yann Collet | 098b36e | 2019-11-13 09:50:15 -0800 | [diff] [blame] | 1753 | - 0.3.5 : clarifications for Block_Maximum_Size |
| Yann Collet | ff7bd16 | 2019-10-18 17:48:12 -0700 | [diff] [blame] | 1754 | - 0.3.4 : clarifications for FSE decoding table |
| Yann Collet | 1e07eb4 | 2019-08-16 15:13:42 +0200 | [diff] [blame] | 1755 | - 0.3.3 : clarifications for field Block_Size |
| W. Felix Handte | a2861d7 | 2019-07-17 17:55:15 -0400 | [diff] [blame] | 1756 | - 0.3.2 : remove additional block size restriction on compressed blocks |
| Yann Collet | 9bf0070 | 2018-10-26 15:51:51 -0700 | [diff] [blame] | 1757 | - 0.3.1 : minor clarification regarding offset history update rules |
| Yann Collet | 72a3adf | 2018-09-25 16:34:26 -0700 | [diff] [blame] | 1758 | - 0.3.0 : minor edits to match RFC8478 |
| Yann Collet | 55a8f84 | 2018-09-05 12:25:35 -0700 | [diff] [blame] | 1759 | - 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz |
| Yann Collet | a4c9c4d | 2018-05-31 10:47:44 -0700 | [diff] [blame] | 1760 | - 0.2.8 : clarifications for IETF RFC discuss |
| Yann Collet | 82ad249 | 2018-04-30 11:35:49 -0700 | [diff] [blame] | 1761 | - 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell |
| Yann Collet | 8b12812 | 2017-08-19 12:17:57 -0700 | [diff] [blame] | 1762 | - 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz |
| Yann Collet | 14433ca | 2017-03-31 10:54:45 -0700 | [diff] [blame] | 1763 | - 0.2.5 : minor typos and clarifications |
| Sean Purcell | 042419e | 2017-02-17 16:24:26 -0800 | [diff] [blame] | 1764 | - 0.2.4 : section restructuring, by Sean Purcell |
| Yann Collet | 20bed42 | 2017-01-27 12:16:16 -0800 | [diff] [blame] | 1765 | - 0.2.3 : clarified several details, by Sean Purcell |
| Yann Collet | 55981a9 | 2016-09-15 02:13:18 +0200 | [diff] [blame] | 1766 | - 0.2.2 : added predefined codes, by Johannes Rudolph |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 1767 | - 0.2.1 : clarify field names, by Przemyslaw Skibinski |
| Yann Collet | 8b12812 | 2017-08-19 12:17:57 -0700 | [diff] [blame] | 1768 | - 0.2.0 : numerous format adjustments for zstd v0.8+ |
| inikep | 586a055 | 2016-08-03 16:16:38 +0200 | [diff] [blame] | 1769 | - 0.1.2 : limit Huffman tree depth to 11 bits |
| Yann Collet | e557fd5 | 2016-07-17 16:21:37 +0200 | [diff] [blame] | 1770 | - 0.1.1 : reserved dictID ranges |
| 1771 | - 0.1.0 : initial release |