| Yann Collet | 5cc1882 | 2016-07-03 19:03:13 +0200 | [diff] [blame] | 1 | Zstandard Compression Format |
| 2 | ============================ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 3 | |
| 4 | ### Notices |
| 5 | |
| 6 | Copyright (c) 2016 Yann Collet |
| 7 | |
| 8 | Permission is granted to copy and distribute this document |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 9 | for any purpose and without charge, |
| 10 | including translations into other languages |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 11 | and incorporation into compilations, |
| 12 | provided that the copyright notice and this notice are preserved, |
| 13 | and that any substantive changes or deletions from the original |
| 14 | are clearly marked. |
| 15 | Distribution of this document is unlimited. |
| 16 | |
| 17 | ### Version |
| 18 | |
| Yann Collet | 6fa05a2 | 2016-07-20 14:58:49 +0200 | [diff] [blame] | 19 | 0.2.0 (22/07/16) |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 20 | |
| 21 | |
| 22 | Introduction |
| 23 | ------------ |
| 24 | |
| 25 | The purpose of this document is to define a lossless compressed data format, |
| 26 | that is independent of CPU type, operating system, |
| 27 | file system and character set, suitable for |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 28 | file compression, pipe and streaming compression, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 29 | using the [Zstandard algorithm](http://www.zstandard.org). |
| 30 | |
| 31 | The data can be produced or consumed, |
| 32 | even for an arbitrarily long sequentially presented input data stream, |
| 33 | using only an a priori bounded amount of intermediate storage, |
| 34 | and hence can be used in data communications. |
| 35 | The format uses the Zstandard compression method, |
| 36 | and optional [xxHash-64 checksum method](http://www.xxhash.org), |
| 37 | for detection of data corruption. |
| 38 | |
| 39 | The data format defined by this specification |
| 40 | does not attempt to allow random access to compressed data. |
| 41 | |
| 42 | This specification is intended for use by implementers of software |
| 43 | to compress data into Zstandard format and/or decompress data from Zstandard format. |
| 44 | The text of the specification assumes a basic background in programming |
| 45 | at the level of bits and other primitive data representations. |
| 46 | |
| 47 | Unless otherwise indicated below, |
| 48 | a compliant compressor must produce data sets |
| 49 | that conform to the specifications presented here. |
| 50 | It doesn’t need to support all options though. |
| 51 | |
| 52 | A compliant decompressor must be able to decompress |
| 53 | at least one working set of parameters |
| 54 | that conforms to the specifications presented here. |
| 55 | It may also ignore informative fields, such as checksum. |
| 56 | Whenever it does not support a parameter defined in the compressed stream, |
| 57 | it must produce a non-ambiguous error code and associated error message |
| 58 | explaining which parameter is unsupported. |
| 59 | |
| 60 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 61 | Overall conventions |
| 62 | ----------- |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 63 | In this document: |
| 64 | - square brackets i.e. `[` and `]` are used to indicate optional fields or parameters. |
| 65 | - a naming convention for identifiers is `Mixed_Case_With_Underscores` |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 66 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 67 | Definitions |
| 68 | ----------- |
| 69 | A content compressed by Zstandard is transformed into a Zstandard __frame__. |
| 70 | Multiple frames can be appended into a single file or stream. |
| 71 | A frame is totally independent, has a defined beginning and end, |
| 72 | and a set of parameters which tells the decoder how to decompress it. |
| 73 | |
| 74 | A frame encapsulates one or multiple __blocks__. |
| 75 | Each block can be compressed or not, |
| 76 | and has a guaranteed maximum content size, which depends on frame parameters. |
| 77 | Unlike frames, each block depends on previous blocks for proper decoding. |
| 78 | However, each block can be decompressed without waiting for its successor, |
| 79 | allowing streaming operations. |
| 80 | |
| 81 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 82 | Frame Concatenation |
| 83 | ------------------- |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 84 | |
| 85 | In some circumstances, it may be required to append multiple frames, |
| 86 | for example in order to add new data to an existing compressed file |
| 87 | without re-framing it. |
| 88 | |
| 89 | In such case, each frame brings its own set of descriptor flags. |
| 90 | Each frame is considered independent. |
| 91 | The only relation between frames is their sequential order. |
| 92 | |
| 93 | The ability to decode multiple concatenated frames |
| 94 | within a single stream or file is left outside of this specification. |
| 95 | As an example, the reference `zstd` command line utility is able |
| 96 | to decode all concatenated frames in their sequential order, |
| 97 | delivering the final decompressed result as if it was a single content. |
| 98 | |
| 99 | |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 100 | Skippable Frames |
| 101 | ---------------- |
| 102 | |
| 103 | | `Magic_Number` | `Frame_Size` | `User_Data` | |
| 104 | |:--------------:|:------------:|:-----------:| |
| 105 | | 4 bytes | 4 bytes | n bytes | |
| 106 | |
| 107 | Skippable frames allow the insertion of user-defined data |
| 108 | into a flow of concatenated frames. |
| 109 | Its design is pretty straightforward, |
| 110 | with the sole objective to allow the decoder to quickly skip |
| 111 | over user-defined data and continue decoding. |
| 112 | |
| 113 | Skippable frames defined in this specification are compatible with [LZ4] ones. |
| 114 | |
| 115 | [LZ4]:http://www.lz4.org |
| 116 | |
| 117 | __`Magic_Number`__ |
| 118 | |
| 119 | 4 Bytes, little-endian format. |
| 120 | Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F. |
| 121 | All 16 values are valid to identify a skippable frame. |
| 122 | |
| 123 | __`Frame_Size`__ |
| 124 | |
| 125 | This is the size, in bytes, of the following `User_Data` |
| 126 | (without including the magic number nor the size field itself). |
| 127 | This field is represented using 4 Bytes, little-endian format, unsigned 32-bits. |
| 128 | This means `User_Data` can’t be bigger than (2^32-1) bytes. |
| 129 | |
| 130 | __`User_Data`__ |
| 131 | |
| 132 | The `User_Data` can be anything. Data will just be skipped by the decoder. |
| 133 | |
| 134 | |
| 135 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 136 | General Structure of Zstandard Frame format |
| 137 | ------------------------------------------- |
| 138 | The structure of a single Zstandard frame is following: |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 139 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 140 | | `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] | |
| 141 | |:--------------:|:--------------:|:----------:| ------------------ |:--------------------:| |
| 142 | | 4 bytes | 2-14 bytes | n bytes | | 0-4 bytes | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 143 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 144 | __`Magic_Number`__ |
| 145 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 146 | 4 Bytes, little-endian format. |
| Yann Collet | 7bdfcea | 2016-09-05 17:43:31 +0200 | [diff] [blame^] | 147 | Value : 0xFD2FB528 |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 148 | |
| 149 | __`Frame_Header`__ |
| 150 | |
| 151 | 2 to 14 Bytes, detailed in [next part](#the-structure-of-frame_header). |
| 152 | |
| 153 | __`Data_Block`__ |
| 154 | |
| 155 | Detailed in [next chapter](#the-structure-of-data_block). |
| 156 | That’s where compressed data is stored. |
| 157 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 158 | __`Content_Checksum`__ |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 159 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 160 | An optional 32-bit checksum, only present if `Content_Checksum_flag` is set. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 161 | The content checksum is the result |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 162 | of [xxh64() hash function](http://www.xxhash.org) |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 163 | digesting the original (decoded) data as input, and a seed of zero. |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 164 | The low 4 bytes of the checksum are stored in little endian format. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 165 | |
| 166 | |
| 167 | The structure of `Frame_Header` |
| 168 | ------------------------------- |
| 169 | The `Frame_Header` has a variable size, which uses a minimum of 2 bytes, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 170 | and up to 14 bytes depending on optional parameters. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 171 | The structure of `Frame_Header` is following: |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 172 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 173 | | `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] | |
| 174 | | ------------------------- | --------------------- | ----------------- | ---------------------- | |
| 175 | | 1 byte | 0-1 byte | 0-4 bytes | 0-8 bytes | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 176 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 177 | ### `Frame_Header_Descriptor` |
| 178 | |
| 179 | The first header's byte is called the `Frame_Header_Descriptor`. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 180 | It tells which other fields are present. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 181 | Decoding this byte is enough to tell the size of `Frame_Header`. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 182 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 183 | | Bit number | Field name | |
| 184 | | ---------- | ---------- | |
| 185 | | 7-6 | `Frame_Content_Size_flag` | |
| 186 | | 5 | `Single_Segment_flag` | |
| 187 | | 4 | `Unused_bit` | |
| 188 | | 3 | `Reserved_bit` | |
| 189 | | 2 | `Content_Checksum_flag` | |
| 190 | | 1-0 | `Dictionary_ID_flag` | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 191 | |
| 192 | In this table, bit 7 is highest bit, while bit 0 is lowest. |
| 193 | |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 194 | __`Frame_Content_Size_flag`__ |
| 195 | |
| 196 | This is a 2-bits flag (`= Frame_Header_Descriptor >> 6`), |
| 197 | specifying if decompressed data size is provided within the header. |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 198 | The `Flag_Value` can be converted into `Field_Size`, |
| 199 | which is the number of bytes used by `Frame_Content_Size` |
| 200 | according to the following table: |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 201 | |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 202 | |`Flag_Value`| 0 | 1 | 2 | 3 | |
| 203 | | ---------- | ------ | --- | --- | --- | |
| 204 | |`Field_Size`| 0 or 1 | 2 | 4 | 8 | |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 205 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 206 | When `Flag_Value` is `0`, `Field_Size` depends on `Single_Segment_flag` : |
| 207 | if `Single_Segment_flag` is set, `Field_Size` is 1. |
| 208 | Otherwise, `Field_Size` is 0 (content size not provided). |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 209 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 210 | __`Single_Segment_flag`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 211 | |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 212 | If this flag is set, |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 213 | data must be regenerated within a single continuous memory segment. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 214 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 215 | In this case, `Frame_Content_Size` is necessarily present, |
| 216 | but `Window_Descriptor` byte is skipped. |
| inikep | 49ec6d1 | 2016-07-25 12:26:39 +0200 | [diff] [blame] | 217 | As a consequence, the decoder must allocate a memory segment |
| 218 | of size equal or bigger than `Frame_Content_Size`. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 219 | |
| 220 | In order to preserve the decoder from unreasonable memory requirement, |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 221 | a decoder can reject a compressed frame |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 222 | which requests a memory size beyond decoder's authorized range. |
| 223 | |
| 224 | For broader compatibility, decoders are recommended to support |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 225 | memory sizes of at least 8 MB. |
| 226 | This is just a recommendation, |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 227 | each decoder is free to support higher or lower limits, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 228 | depending on local limitations. |
| 229 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 230 | __`Unused_bit`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 231 | |
| Yann Collet | f0bc673 | 2016-07-13 17:30:21 +0200 | [diff] [blame] | 232 | The value of this bit should be set to zero. |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 233 | A decoder compliant with this specification version shall not interpret it. |
| Yann Collet | f0bc673 | 2016-07-13 17:30:21 +0200 | [diff] [blame] | 234 | It might be used in a future version, |
| 235 | to signal a property which is not mandatory to properly decode the frame. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 236 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 237 | __`Reserved_bit`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 238 | |
| 239 | This bit is reserved for some future feature. |
| 240 | Its value _must be zero_. |
| 241 | A decoder compliant with this specification version must ensure it is not set. |
| 242 | This bit may be used in a future revision, |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 243 | to signal a feature that must be interpreted to decode the frame correctly. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 244 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 245 | __`Content_Checksum_flag`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 246 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 247 | If this flag is set, a 32-bits `Content_Checksum` will be present at frame's end. |
| 248 | See `Content_Checksum` paragraph. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 249 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 250 | __`Dictionary_ID_flag`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 251 | |
| 252 | This is a 2-bits flag (`= FHD & 3`), |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 253 | telling if a dictionary ID is provided within the header. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 254 | It also specifies the size of this field as `Field_Size`. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 255 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 256 | |`Flag_Value`| 0 | 1 | 2 | 3 | |
| 257 | | ---------- | --- | --- | --- | --- | |
| 258 | |`Field_Size`| 0 | 1 | 2 | 4 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 259 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 260 | ### `Window_Descriptor` |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 261 | |
| 262 | Provides guarantees on maximum back-reference distance |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 263 | that will be used within compressed data. |
| 264 | This information is important for decoders to allocate enough memory. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 265 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 266 | The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag` is set. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 267 | In this case, the maximum back-reference distance is the content size itself, |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 268 | which can be any value from 1 to 2^64-1 bytes (16 EB). |
| 269 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 270 | | Bit numbers | 7-3 | 0-2 | |
| 271 | | ----------- | ---------- | ---------- | |
| 272 | | Field name | `Exponent` | `Mantissa` | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 273 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 274 | Maximum distance is given by the following formulas : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 275 | ``` |
| 276 | windowLog = 10 + Exponent; |
| 277 | windowBase = 1 << windowLog; |
| 278 | windowAdd = (windowBase / 8) * Mantissa; |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 279 | Window_Size = windowBase + windowAdd; |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 280 | ``` |
| 281 | The minimum window size is 1 KB. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 282 | The maximum size is `15*(1<<38)` bytes, which is 1.875 TB. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 283 | |
| 284 | To properly decode compressed data, |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 285 | a decoder will need to allocate a buffer of at least `Window_Size` bytes. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 286 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 287 | In order to preserve decoder from unreasonable memory requirements, |
| 288 | a decoder can refuse a compressed frame |
| 289 | which requests a memory size beyond decoder's authorized range. |
| 290 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 291 | For improved interoperability, |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 292 | decoders are recommended to be compatible with window sizes of 8 MB, |
| 293 | and encoders are recommended to not request more than 8 MB. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 294 | It's merely a recommendation though, |
| 295 | decoders are free to support larger or lower limits, |
| 296 | depending on local limitations. |
| 297 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 298 | ### `Dictionary_ID` |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 299 | |
| Yann Collet | f6ff53c | 2016-07-15 17:03:38 +0200 | [diff] [blame] | 300 | This is a variable size field, which contains |
| 301 | the ID of the dictionary required to properly decode the frame. |
| 302 | Note that this field is optional. When it's not present, |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 303 | it's up to the caller to make sure it uses the correct dictionary. |
| 304 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 305 | Field size depends on `Dictionary_ID_flag`. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 306 | 1 byte can represent an ID 0-255. |
| 307 | 2 bytes can represent an ID 0-65535. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 308 | 4 bytes can represent an ID 0-4294967295. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 309 | |
| 310 | It's allowed to represent a small ID (for example `13`) |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 311 | with a large 4-bytes dictionary ID, losing some compacity in the process. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 312 | |
| Yann Collet | f6ff53c | 2016-07-15 17:03:38 +0200 | [diff] [blame] | 313 | _Reserved ranges :_ |
| 314 | If the frame is going to be distributed in a private environment, |
| 315 | any dictionary ID can be used. |
| 316 | However, for public distribution of compressed frames using a dictionary, |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 317 | the following ranges are reserved for future use and should not be used : |
| 318 | - low range : 1 - 32767 |
| 319 | - high range : >= (2^31) |
| Yann Collet | f6ff53c | 2016-07-15 17:03:38 +0200 | [diff] [blame] | 320 | |
| 321 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 322 | ### `Frame_Content_Size` |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 323 | |
| inikep | 2fc3752 | 2016-07-25 12:47:02 +0200 | [diff] [blame] | 324 | This is the original (uncompressed) size. This information is optional. |
| 325 | The `Field_Size` is provided according to value of `Frame_Content_Size_flag`. |
| 326 | The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 327 | Format is little-endian. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 328 | |
| inikep | 2fc3752 | 2016-07-25 12:47:02 +0200 | [diff] [blame] | 329 | | `Field_Size` | Range | |
| 330 | | ------------ | ---------- | |
| 331 | | 1 | 0 - 255 | |
| 332 | | 2 | 256 - 65791| |
| 333 | | 4 | 0 - 2^32-1 | |
| 334 | | 8 | 0 - 2^64-1 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 335 | |
| inikep | 2fc3752 | 2016-07-25 12:47:02 +0200 | [diff] [blame] | 336 | When `Field_Size` is 1, 4 or 8 bytes, the value is read directly. |
| 337 | When `Field_Size` is 2, _the offset of 256 is added_. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 338 | It's allowed to represent a small size (for example `18`) using any compatible variant. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 339 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 340 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 341 | The structure of `Data_Block` |
| 342 | ----------------------------- |
| 343 | The structure of `Data_Block` is following: |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 344 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 345 | | `Last_Block` | `Block_Type` | `Block_Size` | `Block_Content` | |
| 346 | |:------------:|:------------:|:------------:|:---------------:| |
| 347 | | 1 bit | 2 bits | 21 bits | n bytes | |
| 348 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 349 | The block header (`Last_Block`, `Block_Type`, and `Block_Size`) uses 3-bytes. |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 350 | |
| 351 | __`Last_Block`__ |
| 352 | |
| 353 | The lowest bit signals if this block is the last one. |
| 354 | Frame ends right after this block. |
| 355 | It may be followed by an optional `Content_Checksum` . |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 356 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 357 | __`Block_Type` and `Block_Size`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 358 | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 359 | The next 2 bits represent the `Block_Type`, |
| 360 | while the remaining 21 bits represent the `Block_Size`. |
| 361 | Format is __little-endian__. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 362 | |
| 363 | There are 4 block types : |
| 364 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 365 | | Value | 0 | 1 | 2 | 3 | |
| 366 | | ------------ | ----------- | ----------- | ------------------ | --------- | |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 367 | | `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`| |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 368 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 369 | - `Raw_Block` - this is an uncompressed block. |
| 370 | `Block_Size` is the number of bytes to read and copy. |
| 371 | - `RLE_Block` - this is a single byte, repeated N times. |
| 372 | In which case, `Block_Size` is the size to regenerate, |
| 373 | while the "compressed" block is just 1 byte (the byte to repeat). |
| 374 | - `Compressed_Block` - this is a [Zstandard compressed block](#the-format-of-compressed_block), |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 375 | detailed in another section of this specification. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 376 | `Block_Size` is the compressed size. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 377 | Decompressed size is unknown, |
| 378 | but its maximum possible value is guaranteed (see below) |
| Yann Collet | c991cc1 | 2016-07-28 00:55:43 +0200 | [diff] [blame] | 379 | - `Reserved` - this is not a block. |
| 380 | This value cannot be used with current version of this specification. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 381 | |
| 382 | Block sizes must respect a few rules : |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 383 | - In compressed mode, compressed size if always strictly `< decompressed size`. |
| 384 | - Block decompressed size is always <= maximum back-reference distance . |
| 385 | - Block decompressed size is always <= 128 KB |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 386 | |
| 387 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 388 | __`Block_Content`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 389 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 390 | The `Block_Content` is where the actual data to decode stands. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 391 | It might be compressed or not, depending on previous field indications. |
| 392 | A data block is not necessarily "full" : |
| 393 | since an arbitrary “flush” may happen anytime, |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 394 | block decompressed content can be any size, |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 395 | up to `Block_Maximum_Decompressed_Size`, which is the smallest of : |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 396 | - Maximum back-reference distance |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 397 | - 128 KB |
| 398 | |
| 399 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 400 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 401 | The format of `Compressed_Block` |
| 402 | -------------------------------- |
| 403 | The size of `Compressed_Block` must be provided using `Block_Size` field from `Data_Block`. |
| 404 | The `Compressed_Block` has a guaranteed maximum regenerated size, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 405 | in order to properly allocate destination buffer. |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 406 | See [`Data_Block`](#the-structure-of-data_block) for more details. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 407 | |
| 408 | A compressed block consists of 2 sections : |
| inikep | 0079425 | 2016-08-04 14:43:21 +0200 | [diff] [blame] | 409 | - [`Literals_Section`](#literals_section) |
| 410 | - [`Sequences_Section`](#sequences_section) |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 411 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 412 | ### Prerequisites |
| 413 | To decode a compressed block, the following elements are necessary : |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 414 | - Previous decoded blocks, up to a distance of `Window_Size`, |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 415 | or all previous blocks when `Single_Segment_flag` is set. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 416 | - List of "recent offsets" from previous compressed block. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 417 | - Decoding tables of previous compressed block for each symbol type |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 418 | (literals, literals lengths, match lengths, offsets). |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 419 | |
| 420 | |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 421 | ### `Literals_Section` |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 422 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 423 | During sequence phase, literals will be entangled with match copy operations. |
| 424 | All literals are regrouped in the first part of the block. |
| 425 | They can be decoded first, and then copied during sequence operations, |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 426 | or they can be decoded on the flow, as needed by sequence commands. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 427 | |
| inikep | 4f270ac | 2016-08-04 11:25:52 +0200 | [diff] [blame] | 428 | | `Literals_Section_Header` | [`Huffman_Tree_Description`] | Stream1 | [Stream2] | [Stream3] | [Stream4] | |
| 429 | | ------------------------- | ---------------------------- | ------- | --------- | --------- | --------- | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 430 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 431 | Literals can be stored uncompressed or compressed using Huffman prefix codes. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 432 | When compressed, an optional tree description can be present, |
| 433 | followed by 1 or 4 streams. |
| 434 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 435 | |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 436 | #### `Literals_Section_Header` |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 437 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 438 | Header is in charge of describing how literals are packed. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 439 | It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes, |
| Yann Collet | 198e6aa | 2016-07-20 20:12:24 +0200 | [diff] [blame] | 440 | using little-endian convention. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 441 | |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 442 | | `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] | |
| 443 | | --------------------- | ------------- | ------------------ | ----------------- | |
| 444 | | 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits | |
| Yann Collet | 198e6aa | 2016-07-20 20:12:24 +0200 | [diff] [blame] | 445 | |
| 446 | In this representation, bits on the left are smallest bits. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 447 | |
| Yann Collet | 70c2326 | 2016-08-21 00:24:18 +0200 | [diff] [blame] | 448 | __`Literals_Block_Type`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 449 | |
| Yann Collet | 198e6aa | 2016-07-20 20:12:24 +0200 | [diff] [blame] | 450 | This field uses 2 lowest bits of first byte, describing 4 different block types : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 451 | |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 452 | | `Literals_Block_Type` | Value | |
| 453 | | ----------------------------- | ----- | |
| 454 | | `Raw_Literals_Block` | 0 | |
| 455 | | `RLE_Literals_Block` | 1 | |
| 456 | | `Compressed_Literals_Block` | 2 | |
| 457 | | `Repeat_Stats_Literals_Block` | 3 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 458 | |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 459 | - `Raw_Literals_Block` - Literals are stored uncompressed. |
| 460 | - `RLE_Literals_Block` - Literals consist of a single byte value repeated N times. |
| 461 | - `Compressed_Literals_Block` - This is a standard Huffman-compressed block, |
| inikep | 586a055 | 2016-08-03 16:16:38 +0200 | [diff] [blame] | 462 | starting with a Huffman tree description. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 463 | See details below. |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 464 | - `Repeat_Stats_Literals_Block` - This is a Huffman-compressed block, |
| inikep | 586a055 | 2016-08-03 16:16:38 +0200 | [diff] [blame] | 465 | using Huffman tree _from previous Huffman-compressed literals block_. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 466 | Huffman tree description will be skipped. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 467 | |
| Yann Collet | 70c2326 | 2016-08-21 00:24:18 +0200 | [diff] [blame] | 468 | __`Size_Format`__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 469 | |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 470 | `Size_Format` is divided into 2 families : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 471 | |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 472 | - For `Compressed_Block`, it requires to decode both `Compressed_Size` |
| 473 | and `Regenerated_Size` (the decompressed size). It will also decode the number of streams. |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 474 | - For `Raw_Literals_Block` and `RLE_Literals_Block` it's enough to decode `Regenerated_Size`. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 475 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 476 | For values spanning several bytes, convention is little-endian. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 477 | |
| inikep | 9d003c1 | 2016-08-04 10:41:49 +0200 | [diff] [blame] | 478 | __`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 479 | |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 480 | - Value x0 : `Regenerated_Size` uses 5 bits (0-31). |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 481 | `Literals_Section_Header` has 1 byte. |
| 482 | `Regenerated_Size = Header[0]>>3` |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 483 | - Value 01 : `Regenerated_Size` uses 12 bits (0-4095). |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 484 | `Literals_Section_Header` has 2 bytes. |
| 485 | `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4)` |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 486 | - Value 11 : `Regenerated_Size` uses 20 bits (0-1048575). |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 487 | `Literals_Section_Header` has 3 bytes. |
| 488 | `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + (Header[2]<<12)` |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 489 | |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 490 | Note : it's allowed to represent a short value (for example `13`) |
| 491 | using a long format, accepting the increased compressed data size. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 492 | |
| inikep | 9d003c1 | 2016-08-04 10:41:49 +0200 | [diff] [blame] | 493 | __`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 494 | |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 495 | - Value 00 : _A single stream_. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 496 | Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023). |
| 497 | `Literals_Section_Header` has 3 bytes. |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 498 | - Value 01 : 4 streams. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 499 | Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023). |
| 500 | `Literals_Section_Header` has 3 bytes. |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 501 | - Value 10 : 4 streams. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 502 | Both `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383). |
| 503 | `Literals_Section_Header` has 4 bytes. |
| inikep | 0132375 | 2016-08-25 12:20:38 +0200 | [diff] [blame] | 504 | - Value 11 : 4 streams. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 505 | Both `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143). |
| 506 | `Literals_Section_Header` has 5 bytes. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 507 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 508 | Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention. |
| inikep | f896c1d | 2016-08-03 16:37:42 +0200 | [diff] [blame] | 509 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 510 | |
| inikep | 4f270ac | 2016-08-04 11:25:52 +0200 | [diff] [blame] | 511 | #### `Huffman_Tree_Description` |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 512 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 513 | This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`). |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 514 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 515 | Prefix coding represents symbols from an a priori known alphabet |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 516 | by bit sequences (codewords), one codeword for each symbol, |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 517 | in a manner such that different symbols may be represented |
| 518 | by bit sequences of different lengths, |
| 519 | but a parser can always parse an encoded string |
| 520 | unambiguously symbol-by-symbol. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 521 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 522 | Given an alphabet with known symbol frequencies, |
| 523 | the Huffman algorithm allows the construction of an optimal prefix code |
| 524 | using the fewest bits of any possible prefix codes for that alphabet. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 525 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 526 | Prefix code must not exceed a maximum code length. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 527 | More bits improve accuracy but cost more header size, |
| Yann Collet | e557fd5 | 2016-07-17 16:21:37 +0200 | [diff] [blame] | 528 | and require more memory or more complex decoding operations. |
| 529 | This specification limits maximum code length to 11 bits. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 530 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 531 | |
| 532 | ##### Representation |
| 533 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 534 | All literal values from zero (included) to last present one (excluded) |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 535 | are represented by `Weight` with values from `0` to `Max_Number_of_Bits`. |
| 536 | Transformation from `Weight` to `Number_of_Bits` follows this formula : |
| 537 | ``` |
| 538 | Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0 |
| 539 | ``` |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 540 | The last symbol's `Weight` is deduced from previously decoded ones, |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 541 | by completing to the nearest power of 2. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 542 | This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 543 | |
| 544 | __Example__ : |
| inikep | 586a055 | 2016-08-03 16:16:38 +0200 | [diff] [blame] | 545 | Let's presume the following Huffman tree must be described : |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 546 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 547 | | literal | 0 | 1 | 2 | 3 | 4 | 5 | |
| 548 | | ---------------- | --- | --- | --- | --- | --- | --- | |
| 549 | | `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 550 | |
| 551 | The tree depth is 4, since its smallest element uses 4 bits. |
| 552 | Value `5` will not be listed, nor will values above `5`. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 553 | Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`. |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 554 | Weight formula is : |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 555 | ``` |
| 556 | Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0 |
| 557 | ``` |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 558 | It gives the following serie of weights : |
| 559 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 560 | | `Weight` | 4 | 3 | 2 | 0 | 1 | |
| 561 | | -------- | --- | --- | --- | --- | --- | |
| 562 | | literal | 0 | 1 | 2 | 3 | 4 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 563 | |
| 564 | The decoder will do the inverse operation : |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 565 | having collected weights of literals from `0` to `4`, |
| 566 | it knows the last literal, `5`, is present with a non-zero weight. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 567 | The weight of `5` can be deducted by joining to the nearest power of 2. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 568 | Sum of `2^(Weight-1)` (excluding 0) is : |
| 569 | `8 + 4 + 2 + 0 + 1 = 15`. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 570 | Nearest power of 2 is 16. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 571 | Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 1`. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 572 | |
| 573 | ##### Huffman Tree header |
| 574 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 575 | This is a single byte value (0-255), |
| 576 | which tells how to decode the list of weights. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 577 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 578 | - if `headerByte` >= 128 : this is a direct representation, |
| 579 | where each `Weight` is written directly as a 4 bits field (0-15). |
| 580 | The full representation occupies `((Number_of_Symbols+1)/2)` bytes, |
| 581 | meaning it uses a last full byte even if `Number_of_Symbols` is odd. |
| 582 | `Number_of_Symbols = headerByte - 127`. |
| 583 | Note that maximum `Number_of_Symbols` is 255-127 = 128. |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 584 | A larger serie must necessarily use FSE compression. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 585 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 586 | - if `headerByte` < 128 : |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 587 | the serie of weights is compressed by FSE. |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 588 | The length of the FSE-compressed serie is equal to `headerByte` (0-127). |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 589 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 590 | ##### Finite State Entropy (FSE) compression of Huffman weights |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 591 | |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 592 | The serie of weights is compressed using FSE compression. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 593 | It's a single bitstream with 2 interleaved states, |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 594 | sharing a single distribution table. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 595 | |
| 596 | To decode an FSE bitstream, it is necessary to know its compressed size. |
| 597 | Compressed size is provided by `headerByte`. |
| Yann Collet | 38b75dd | 2016-07-24 15:35:59 +0200 | [diff] [blame] | 598 | It's also necessary to know its _maximum possible_ decompressed size, |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 599 | which is `255`, since literal values span from `0` to `255`, |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 600 | and last symbol value is not represented. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 601 | |
| 602 | An FSE bitstream starts by a header, describing probabilities distribution. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 603 | It will create a Decoding Table. |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 604 | Table must be pre-allocated, which requires to support a maximum accuracy. |
| inikep | 586a055 | 2016-08-03 16:16:38 +0200 | [diff] [blame] | 605 | For a list of Huffman weights, maximum accuracy is 7 bits. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 606 | |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 607 | FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format), |
| 608 | and so is [FSE bitstream](#bitstream). |
| 609 | The main difference is that Huffman header compression uses 2 states, |
| 610 | which share the same FSE distribution table. |
| Yann Collet | 38b75dd | 2016-07-24 15:35:59 +0200 | [diff] [blame] | 611 | Bitstream contains only FSE symbols (no interleaved "raw bitfields"). |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 612 | The number of symbols to decode is discovered |
| 613 | by tracking bitStream overflow condition. |
| 614 | When both states have overflowed the bitstream, end is reached. |
| 615 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 616 | |
| inikep | 586a055 | 2016-08-03 16:16:38 +0200 | [diff] [blame] | 617 | ##### Conversion from weights to Huffman prefix codes |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 618 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 619 | All present symbols shall now have a `Weight` value. |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 620 | It is possible to transform weights into Number_of_Bits, using this formula: |
| 621 | ``` |
| 622 | Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0 |
| 623 | ``` |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 624 | Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order. |
| 625 | Symbols with a `Weight` of zero are removed. |
| Yann Collet | 38b75dd | 2016-07-24 15:35:59 +0200 | [diff] [blame] | 626 | Then, starting from lowest weight, prefix codes are distributed in order. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 627 | |
| 628 | __Example__ : |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 629 | Let's presume the following list of weights has been decoded : |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 630 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 631 | | Literal | 0 | 1 | 2 | 3 | 4 | 5 | |
| 632 | | -------- | --- | --- | --- | --- | --- | --- | |
| 633 | | `Weight` | 4 | 3 | 2 | 0 | 1 | 1 | |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 634 | |
| 635 | Sorted by weight and then natural order, |
| 636 | it gives the following distribution : |
| 637 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 638 | | Literal | 3 | 4 | 5 | 2 | 1 | 0 | |
| 639 | | ---------------- | --- | --- | --- | --- | --- | ---- | |
| 640 | | `Weight` | 0 | 1 | 1 | 2 | 3 | 4 | |
| 641 | | `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 | |
| 642 | | prefix codes | N/A | 0000| 0001| 001 | 01 | 1 | |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 643 | |
| 644 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 645 | #### The content of Huffman-compressed literal stream |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 646 | |
| 647 | ##### Bitstreams sizes |
| 648 | |
| 649 | As seen in a previous paragraph, |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 650 | there are 2 types of Huffman-compressed literals : |
| 651 | a single stream and 4 streams. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 652 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 653 | Encoding using 4 streams is useful for CPU with multiple execution units and out-of-order operations. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 654 | Since each stream can be decoded independently, |
| 655 | it's possible to decode them up to 4x faster than a single stream, |
| 656 | presuming the CPU has enough parallelism available. |
| 657 | |
| 658 | For single stream, header provides both the compressed and regenerated size. |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 659 | For 4 streams though, |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 660 | header only provides compressed and regenerated size of all 4 streams combined. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 661 | In order to properly decode the 4 streams, |
| 662 | it's necessary to know the compressed and regenerated size of each stream. |
| 663 | |
| Yann Collet | 38b75dd | 2016-07-24 15:35:59 +0200 | [diff] [blame] | 664 | Regenerated size of each stream can be calculated by `(totalSize+3)/4`, |
| 665 | except for last one, which can be up to 3 bytes smaller, to reach `totalSize`. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 666 | |
| Yann Collet | 38b75dd | 2016-07-24 15:35:59 +0200 | [diff] [blame] | 667 | Compressed size is provided explicitly : in the 4-streams variant, |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 668 | bitstreams are preceded by 3 unsigned little-endian 16-bits values. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 669 | Each value represents the compressed size of one stream, in order. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 670 | The last stream size is deducted from total compressed size |
| Yann Collet | 38b75dd | 2016-07-24 15:35:59 +0200 | [diff] [blame] | 671 | and from previously decoded stream sizes : |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 672 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 673 | `stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 674 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 675 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 676 | ##### Bitstreams read and decode |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 677 | |
| 678 | Each bitstream must be read _backward_, |
| 679 | that is starting from the end down to the beginning. |
| 680 | Therefore it's necessary to know the size of each bitstream. |
| 681 | |
| 682 | It's also necessary to know exactly which _bit_ is the latest. |
| 683 | This is detected by a final bit flag : |
| 684 | the highest bit of latest byte is a final-bit-flag. |
| 685 | Consequently, a last byte of `0` is not possible. |
| 686 | And the final-bit-flag itself is not part of the useful bitstream. |
| Yann Collet | 38b75dd | 2016-07-24 15:35:59 +0200 | [diff] [blame] | 687 | Hence, the last byte contains between 0 and 7 useful bits. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 688 | |
| 689 | Starting from the end, |
| 690 | it's possible to read the bitstream in a little-endian fashion, |
| 691 | keeping track of already used bits. |
| 692 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 693 | Reading the last `Max_Number_of_Bits` bits, |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 694 | it's then possible to compare extracted value to decoding table, |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 695 | determining the symbol to decode and number of bits to discard. |
| 696 | |
| 697 | The process continues up to reading the required number of symbols per stream. |
| 698 | If a bitstream is not entirely and exactly consumed, |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 699 | hence reaching exactly its beginning position with _all_ bits consumed, |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 700 | the decoding process is considered faulty. |
| 701 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 702 | |
| inikep | 4f270ac | 2016-08-04 11:25:52 +0200 | [diff] [blame] | 703 | ### `Sequences_Section` |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 704 | |
| 705 | A compressed block is a succession of _sequences_ . |
| 706 | A sequence is a literal copy command, followed by a match copy command. |
| 707 | A literal copy command specifies a length. |
| 708 | It is the number of bytes to be copied (or extracted) from the literal section. |
| 709 | A match copy command specifies an offset and a length. |
| 710 | The offset gives the position to copy from, |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 711 | which can be within a previous block. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 712 | |
| Yann Collet | 70c2326 | 2016-08-21 00:24:18 +0200 | [diff] [blame] | 713 | When all _sequences_ are decoded, |
| 714 | if there is any literal left in the _literal section_, |
| 715 | these bytes are added at the end of the block. |
| 716 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 717 | The `Sequences_Section` regroup all symbols required to decode commands. |
| Yann Collet | 70c2326 | 2016-08-21 00:24:18 +0200 | [diff] [blame] | 718 | There are 3 symbol types : literals lengths, offsets and match lengths. |
| 719 | They are encoded together, interleaved, in a single _bitstream_. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 720 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 721 | The `Sequences_Section` starts by a header, |
| 722 | followed by optional probability tables for each symbol type, |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 723 | followed by the bitstream. |
| 724 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 725 | | `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream | |
| 726 | | -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 727 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 728 | To decode the `Sequences_Section`, it's required to know its size. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 729 | This size is deducted from `blockSize - literalSectionSize`. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 730 | |
| 731 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 732 | #### `Sequences_Section_Header` |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 733 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 734 | Consists in 2 items : |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 735 | - `Number_of_Sequences` |
| 736 | - Symbol compression modes |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 737 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 738 | __`Number_of_Sequences`__ |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 739 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 740 | This is a variable size field using between 1 and 3 bytes. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 741 | Let's call its first byte `byte0`. |
| 742 | - `if (byte0 == 0)` : there are no sequences. |
| 743 | The sequence section stops there. |
| 744 | Regenerated content is defined entirely by literals section. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 745 | - `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte. |
| 746 | - `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes. |
| 747 | - `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 748 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 749 | __Symbol compression modes__ |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 750 | |
| 751 | This is a single byte, defining the compression mode of each symbol type. |
| 752 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 753 | |Bit number| 7-6 | 5-4 | 3-2 | 1-0 | |
| 754 | | -------- | ----------------------- | -------------- | -------------------- | ---------- | |
| 755 | |Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 756 | |
| 757 | The last field, `Reserved`, must be all-zeroes. |
| 758 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 759 | `Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of |
| 760 | literals lengths, offsets, and match lengths respectively. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 761 | |
| 762 | They follow the same enumeration : |
| 763 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 764 | | Value | 0 | 1 | 2 | 3 | |
| 765 | | ------------------ | ----------------- | ---------- | --------------------- | ------------- | |
| 766 | | `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 767 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 768 | - `Predefined_Mode` : uses a predefined distribution table. |
| 769 | - `RLE_Mode` : it's a single code, repeated `Number_of_Sequences` times. |
| 770 | - `Repeat_Mode` : re-use distribution table from previous compressed block. |
| 771 | - `FSE_Compressed_Mode` : standard FSE compression. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 772 | A distribution table will be present. |
| 773 | It will be described in [next part](#distribution-tables). |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 774 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 775 | #### The codes for literals lengths, match lengths, and offsets. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 776 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 777 | Each symbol is a _code_ in its own context, |
| 778 | which specifies `Baseline` and `Number_of_Bits` to add. |
| 779 | _Codes_ are FSE compressed, |
| 780 | and interleaved with raw additional bits in the same bitstream. |
| 781 | |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 782 | ##### Literals length codes |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 783 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 784 | Literals length codes are values ranging from `0` to `35` included. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 785 | They define lengths from 0 to 131071 bytes. |
| 786 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 787 | | `Literals_Length_Code` | 0-15 | |
| 788 | | ---------------------- | ---------------------- | |
| 789 | | length | `Literals_Length_Code` | |
| 790 | | `Number_of_Bits` | 0 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 791 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 792 | | `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | |
| 793 | | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 794 | | `Baseline` | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 795 | | `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 796 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 797 | | `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | |
| 798 | | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 799 | | `Baseline` | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 800 | | `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 801 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 802 | | `Literals_Length_Code` | 32 | 33 | 34 | 35 | |
| 803 | | ---------------------- | ---- | ---- | ---- | ---- | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 804 | | `Baseline` | 8192 |16384 |32768 |65536 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 805 | | `Number_of_Bits` | 13 | 14 | 15 | 16 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 806 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 807 | ##### Default distribution for literals length codes |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 808 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 809 | When `Compression_Mode` is `Predefined_Mode`, |
| 810 | a predefined distribution is used for FSE compression. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 811 | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 812 | Below is its definition. It uses an accuracy of 6 bits (64 states). |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 813 | ``` |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 814 | short literalsLength_defaultDistribution[36] = |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 815 | { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, |
| 816 | 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, |
| 817 | -1,-1,-1,-1 }; |
| 818 | ``` |
| 819 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 820 | ##### Match length codes |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 821 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 822 | Match length codes are values ranging from `0` to `52` included. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 823 | They define lengths from 3 to 131074 bytes. |
| 824 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 825 | | `Match_Length_Code` | 0-31 | |
| 826 | | ------------------- | ----------------------- | |
| 827 | | value | `Match_Length_Code` + 3 | |
| 828 | | `Number_of_Bits` | 0 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 829 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 830 | | `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | |
| 831 | | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 832 | | `Baseline` | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 833 | | `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 834 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 835 | | `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | |
| 836 | | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 837 | | `Baseline` | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 838 | | `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 839 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 840 | | `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 | |
| 841 | | ------------------- | ---- | ---- | ---- | ---- | ---- | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 842 | | `Baseline` | 4098 | 8194 |16486 |32770 |65538 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 843 | | `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 844 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 845 | ##### Default distribution for match length codes |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 846 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 847 | When `Compression_Mode` is defined as `Predefined_Mode`, |
| 848 | a predefined distribution is used for FSE compression. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 849 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 850 | Below is its definition. It uses an accuracy of 6 bits (64 states). |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 851 | ``` |
| 852 | short matchLengths_defaultDistribution[53] = |
| 853 | { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, |
| 854 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
| 855 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, |
| 856 | -1,-1,-1,-1,-1 }; |
| 857 | ``` |
| 858 | |
| 859 | ##### Offset codes |
| 860 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 861 | Offset codes are values ranging from `0` to `N`. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 862 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 863 | A decoder is free to limit its maximum `N` supported. |
| 864 | Recommendation is to support at least up to `22`. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 865 | For information, at the time of this writing. |
| 866 | the reference decoder supports a maximum `N` value of `28` in 64-bits mode. |
| 867 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 868 | An offset code is also the number of additional bits to read, |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 869 | and can be translated into an `Offset_Value` using the following formulas : |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 870 | |
| 871 | ``` |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 872 | Offset_Value = (1 << offsetCode) + readNBits(offsetCode); |
| 873 | if (Offset_Value > 3) offset = Offset_Value - 3; |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 874 | ``` |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 875 | It means that maximum `Offset_Value` is `2^(N+1))-1` and it supports back-reference distance up to `2^(N+1))-4` |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 876 | but is limited by [maximum back-reference distance](#window_descriptor). |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 877 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 878 | `Offset_Value` from 1 to 3 are special : they define "repeat codes", |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 879 | which means one of the previous offsets will be repeated. |
| 880 | They are sorted in recency order, with 1 meaning the most recent one. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 881 | See [Repeat offsets](#repeat-offsets) paragraph. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 882 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 883 | |
| 884 | ##### Default distribution for offset codes |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 885 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 886 | When `Compression_Mode` is defined as `Predefined_Mode`, |
| 887 | a predefined distribution is used for FSE compression. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 888 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 889 | Below is its definition. It uses an accuracy of 5 bits (32 states), |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 890 | and supports a maximum `N` of 28, allowing offset values up to 536,870,908 . |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 891 | |
| 892 | If any sequence in the compressed block requires an offset larger than this, |
| 893 | it's not possible to use the default distribution to represent it. |
| 894 | |
| 895 | ``` |
| 896 | short offsetCodes_defaultDistribution[53] = |
| 897 | { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, |
| 898 | 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 }; |
| 899 | ``` |
| 900 | |
| 901 | #### Distribution tables |
| 902 | |
| 903 | Following the header, up to 3 distribution tables can be described. |
| Yann Collet | 10b9c13 | 2016-07-24 01:21:53 +0200 | [diff] [blame] | 904 | When present, they are in this order : |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 905 | - Literals lengths |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 906 | - Offsets |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 907 | - Match Lengths |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 908 | |
| Yann Collet | 10b9c13 | 2016-07-24 01:21:53 +0200 | [diff] [blame] | 909 | The content to decode depends on their respective encoding mode : |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 910 | - `Predefined_Mode` : no content. Use predefined distribution table. |
| 911 | - `RLE_Mode` : 1 byte. This is the only code to use across the whole compressed block. |
| 912 | - `FSE_Compressed_Mode` : A distribution table is present. |
| 913 | - `Repeat_Mode` : no content. Re-use distribution from previous compressed block. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 914 | |
| 915 | ##### FSE distribution table : condensed format |
| 916 | |
| 917 | An FSE distribution table describes the probabilities of all symbols |
| 918 | from `0` to the last present one (included) |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 919 | on a normalized scale of `1 << Accuracy_Log` . |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 920 | |
| 921 | It's a bitstream which is read forward, in little-endian fashion. |
| 922 | It's not necessary to know its exact size, |
| 923 | since it will be discovered and reported by the decoding process. |
| 924 | |
| 925 | The bitstream starts by reporting on which scale it operates. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 926 | `Accuracy_Log = low4bits + 5`. |
| Yann Collet | 70c2326 | 2016-08-21 00:24:18 +0200 | [diff] [blame] | 927 | Note that maximum `Accuracy_Log` for literal and match lengths is `9`, |
| 928 | and for offsets is `8`. Higher values are considered errors. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 929 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 930 | Then follows each symbol value, from `0` to last present one. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 931 | The number of bits used by each field is variable. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 932 | It depends on : |
| 933 | |
| 934 | - Remaining probabilities + 1 : |
| 935 | __example__ : |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 936 | Presuming an `Accuracy_Log` of 8, |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 937 | and presuming 100 probabilities points have already been distributed, |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 938 | the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included). |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 939 | Therefore, it must read `log2sup(156) == 8` bits. |
| 940 | |
| 941 | - Value decoded : small values use 1 less bit : |
| 942 | __example__ : |
| 943 | Presuming values from 0 to 156 (included) are possible, |
| 944 | 255-156 = 99 values are remaining in an 8-bits field. |
| 945 | They are used this way : |
| 946 | first 99 values (hence from 0 to 98) use only 7 bits, |
| 947 | values from 99 to 156 use 8 bits. |
| 948 | This is achieved through this scheme : |
| 949 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 950 | | Value read | Value decoded | Number of bits used | |
| 951 | | ---------- | ------------- | ------------------- | |
| 952 | | 0 - 98 | 0 - 98 | 7 | |
| 953 | | 99 - 127 | 99 - 127 | 8 | |
| 954 | | 128 - 226 | 0 - 98 | 7 | |
| 955 | | 227 - 255 | 128 - 156 | 8 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 956 | |
| 957 | Symbols probabilities are read one by one, in order. |
| 958 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 959 | Probability is obtained from Value decoded by following formula : |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 960 | `Proba = value - 1` |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 961 | |
| 962 | It means value `0` becomes negative probability `-1`. |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 963 | `-1` is a special probability, which means "less than 1". |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 964 | Its effect on distribution table is described in [next paragraph]. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 965 | For the purpose of calculating cumulated distribution, it counts as one. |
| 966 | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 967 | [next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables |
| 968 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 969 | When a symbol has a probability of `zero`, |
| 970 | it is followed by a 2-bits repeat flag. |
| 971 | This repeat flag tells how many probabilities of zeroes follow the current one. |
| 972 | It provides a number ranging from 0 to 3. |
| 973 | If it is a 3, another 2-bits repeat flag follows, and so on. |
| 974 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 975 | When last symbol reaches cumulated total of `1 << Accuracy_Log`, |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 976 | decoding is complete. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 977 | If the last symbol makes cumulated total go above `1 << Accuracy_Log`, |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 978 | distribution is considered corrupted. |
| 979 | |
| Yann Collet | 10b9c13 | 2016-07-24 01:21:53 +0200 | [diff] [blame] | 980 | Then the decoder can tell how many bytes were used in this process, |
| 981 | and how many symbols are present. |
| 982 | The bitstream consumes a round number of bytes. |
| 983 | Any remaining bit within the last byte is just unused. |
| 984 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 985 | ##### FSE decoding : from normalized distribution to decoding tables |
| 986 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 987 | The distribution of normalized probabilities is enough |
| 988 | to create a unique decoding table. |
| 989 | |
| 990 | It follows the following build rule : |
| 991 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 992 | The table has a size of `tableSize = 1 << Accuracy_Log`. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 993 | Each cell describes the symbol decoded, |
| 994 | and instructions to get the next state. |
| 995 | |
| inikep | 12731a9 | 2016-08-25 15:19:37 +0200 | [diff] [blame] | 996 | Symbols are scanned in their natural order for "less than 1" probabilities. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 997 | Symbols with this probability are being attributed a single cell, |
| 998 | starting from the end of the table. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 999 | These symbols define a full state reset, reading `Accuracy_Log` bits. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1000 | |
| 1001 | All remaining symbols are sorted in their natural order. |
| 1002 | Starting from symbol `0` and table position `0`, |
| 1003 | each symbol gets attributed as many cells as its probability. |
| 1004 | Cell allocation is spreaded, not linear : |
| 1005 | each successor position follow this rule : |
| 1006 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 1007 | ``` |
| 1008 | position += (tableSize>>1) + (tableSize>>3) + 3; |
| 1009 | position &= tableSize-1; |
| 1010 | ``` |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1011 | |
| 1012 | A position is skipped if already occupied, |
| 1013 | typically by a "less than 1" probability symbol. |
| 1014 | |
| 1015 | The result is a list of state values. |
| 1016 | Each state will decode the current symbol. |
| 1017 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1018 | To get the `Number_of_Bits` and `Baseline` required for next state, |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1019 | it's first necessary to sort all states in their natural order. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 1020 | The lower states will need 1 more bit than higher ones. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1021 | |
| 1022 | __Example__ : |
| 1023 | Presuming a symbol has a probability of 5. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 1024 | It receives 5 state values. States are sorted in natural order. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1025 | |
| 1026 | Next power of 2 is 8. |
| 1027 | Space of probabilities is divided into 8 equal parts. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1028 | Presuming the `Accuracy_Log` is 7, it defines 128 states. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1029 | Divided by 8, each share is 16 large. |
| 1030 | |
| 1031 | In order to reach 8, 8-5=3 lowest states will count "double", |
| 1032 | taking shares twice larger, |
| 1033 | requiring one more bit in the process. |
| 1034 | |
| 1035 | Numbering starts from higher states using less bits. |
| 1036 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1037 | | state order | 0 | 1 | 2 | 3 | 4 | |
| 1038 | | ---------------- | ----- | ----- | ------ | ---- | ----- | |
| 1039 | | width | 32 | 32 | 32 | 16 | 16 | |
| 1040 | | `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 | |
| 1041 | | range number | 2 | 4 | 6 | 0 | 1 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1042 | | `Baseline` | 32 | 64 | 96 | 0 | 16 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1043 | | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1044 | |
| 1045 | Next state is determined from current state |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1046 | by reading the required `Number_of_Bits`, and adding the specified `Baseline`. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 1047 | |
| 1048 | |
| 1049 | #### Bitstream |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 1050 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1051 | All sequences are stored in a single bitstream, read _backward_. |
| 1052 | It is therefore necessary to know the bitstream size, |
| 1053 | which is deducted from compressed block size. |
| 1054 | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 1055 | The last useful bit of the stream is followed by an end-bit-flag. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 1056 | Highest bit of last byte is this flag. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1057 | It does not belong to the useful part of the bitstream. |
| 1058 | Therefore, last byte has 0-7 useful bits. |
| 1059 | Note that it also means that last byte cannot be `0`. |
| 1060 | |
| 1061 | ##### Starting states |
| 1062 | |
| 1063 | The bitstream starts with initial state values, |
| 1064 | each using the required number of bits in their respective _accuracy_, |
| 1065 | decoded previously from their normalized distribution. |
| 1066 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1067 | It starts by `Literals_Length_State`, |
| 1068 | followed by `Offset_State`, |
| 1069 | and finally `Match_Length_State`. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1070 | |
| 1071 | Reminder : always keep in mind that all values are read _backward_. |
| 1072 | |
| 1073 | ##### Decoding a sequence |
| 1074 | |
| 1075 | A state gives a code. |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1076 | A code provides `Baseline` and `Number_of_Bits` to add. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1077 | See [Symbol Decoding] section for details on each symbol. |
| 1078 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1079 | Decoding starts by reading the `Number_of_Bits` required to decode `Offset`. |
| 1080 | It then does the same for `Match_Length`, |
| 1081 | and then for `Literals_Length`. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1082 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1083 | `Offset`, `Match_Length`, and `Literals_Length` define a sequence. |
| 1084 | It starts by inserting the number of literals defined by `Literals_Length`, |
| 1085 | then continue by copying `Match_Length` bytes from `currentPos - Offset`. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1086 | |
| 1087 | The next operation is to update states. |
| 1088 | Using rules pre-calculated in the decoding tables, |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1089 | `Literals_Length_State` is updated, |
| 1090 | followed by `Match_Length_State`, |
| 1091 | and then `Offset_State`. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1092 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1093 | This operation will be repeated `Number_of_Sequences` times. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1094 | At the end, the bitstream shall be entirely consumed, |
| 1095 | otherwise bitstream is considered corrupted. |
| 1096 | |
| inikep | de9d130 | 2016-08-25 14:59:08 +0200 | [diff] [blame] | 1097 | [Symbol Decoding]:#the-codes-for-literals-lengths-match-lengths-and-offsets |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1098 | |
| 1099 | ##### Repeat offsets |
| 1100 | |
| inikep | 12731a9 | 2016-08-25 15:19:37 +0200 | [diff] [blame] | 1101 | As seen in [Offset Codes], the first 3 values define a repeated offset and we will call them `Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`. |
| 1102 | They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one". |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1103 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1104 | There is an exception though, when current sequence's literals length is `0`. |
| inikep | 12731a9 | 2016-08-25 15:19:37 +0200 | [diff] [blame] | 1105 | In which case, repeated offsets are "pushed by one", |
| 1106 | so `Repeated_Offset1` becomes `Repeated_Offset2`, `Repeated_Offset2` becomes `Repeated_Offset3`, |
| 1107 | and `Repeated_Offset3` becomes `Repeated_Offset1 - 1_byte`. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1108 | |
| Yann Collet | 917fe18 | 2016-07-31 04:01:57 +0200 | [diff] [blame] | 1109 | On first block, offset history is populated by the following values : 1, 4 and 8 (in order). |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1110 | |
| 1111 | Then each block receives its start value from previous compressed block. |
| 1112 | Note that non-compressed blocks are skipped, |
| 1113 | they do not contribute to offset history. |
| 1114 | |
| 1115 | [Offset Codes]: #offset-codes |
| 1116 | |
| 1117 | ###### Offset updates rules |
| 1118 | |
| Yann Collet | 917fe18 | 2016-07-31 04:01:57 +0200 | [diff] [blame] | 1119 | New offset take the lead in offset history, |
| 1120 | up to its previous place if it was already present. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1121 | |
| inikep | 12731a9 | 2016-08-25 15:19:37 +0200 | [diff] [blame] | 1122 | It means that when `Repeated_Offset1` (most recent) is used, history is unmodified. |
| 1123 | When `Repeated_Offset2` is used, it's swapped with `Repeated_Offset1`. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 1124 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 1125 | |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1126 | Dictionary format |
| 1127 | ----------------- |
| 1128 | |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 1129 | `zstd` is compatible with "raw content" dictionaries, free of any format restriction. |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1130 | But dictionaries created by `zstd --train` follow a format, described here. |
| 1131 | |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 1132 | __Pre-requisites__ : a dictionary has a size, |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1133 | defined either by a buffer limit, or a file size. |
| 1134 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1135 | | `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` | |
| 1136 | | -------------- | --------------- | ---------------- | --------- | |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1137 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1138 | __`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, little-endian format |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1139 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1140 | __`Dictionary_ID`__ : 4 bytes, stored in little-endian format. |
| 1141 | `Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`). |
| Yann Collet | 722e14b | 2016-07-08 19:22:16 +0200 | [diff] [blame] | 1142 | It's used by decoders to check if they use the correct dictionary. |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1143 | |
| 1144 | _Reserved ranges :_ |
| Yann Collet | f6ff53c | 2016-07-15 17:03:38 +0200 | [diff] [blame] | 1145 | If the frame is going to be distributed in a private environment, |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1146 | any `Dictionary_ID` can be used. |
| Yann Collet | f6ff53c | 2016-07-15 17:03:38 +0200 | [diff] [blame] | 1147 | However, for public distribution of compressed frames, |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1148 | the following ranges are reserved for future use and should not be used : |
| Yann Collet | 6cacd34 | 2016-07-15 17:58:13 +0200 | [diff] [blame] | 1149 | |
| Yann Collet | 70c2326 | 2016-08-21 00:24:18 +0200 | [diff] [blame] | 1150 | - low range : 1 - 32767 |
| 1151 | - high range : >= (2^31) |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1152 | |
| inikep | e81f2cb | 2016-08-13 09:36:24 +0200 | [diff] [blame] | 1153 | __`Entropy_Tables`__ : following the same format as a [compressed blocks]. |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 1154 | They are stored in following order : |
| 1155 | Huffman tables for literals, FSE table for offsets, |
| 1156 | FSE table for match lengths, and FSE table for literals lengths. |
| 1157 | It's finally followed by 3 offset values, populating recent offsets, |
| 1158 | stored in order, 4-bytes little-endian each, for a total of 12 bytes. |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1159 | |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 1160 | __`Content`__ : The rest of the dictionary is its content. |
| 1161 | The content act as a "past" in front of data to compress or decompress. |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1162 | |
| inikep | f9c3cce | 2016-07-25 11:04:56 +0200 | [diff] [blame] | 1163 | [compressed blocks]: #the-format-of-compressed_block |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1164 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 1165 | |
| 1166 | Version changes |
| 1167 | --------------- |
| Yann Collet | 855766d | 2016-09-02 17:04:49 -0700 | [diff] [blame] | 1168 | - 0.2.1 : clarify field names, by Przemyslaw Skibinski |
| Yann Collet | 6fa05a2 | 2016-07-20 14:58:49 +0200 | [diff] [blame] | 1169 | - 0.2.0 : numerous format adjustments for zstd v0.8 |
| inikep | 586a055 | 2016-08-03 16:16:38 +0200 | [diff] [blame] | 1170 | - 0.1.2 : limit Huffman tree depth to 11 bits |
| Yann Collet | e557fd5 | 2016-07-17 16:21:37 +0200 | [diff] [blame] | 1171 | - 0.1.1 : reserved dictID ranges |
| 1172 | - 0.1.0 : initial release |