| Yann Collet | 5cc1882 | 2016-07-03 19:03:13 +0200 | [diff] [blame] | 1 | Zstandard Compression Format |
| 2 | ============================ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 3 | |
| 4 | ### Notices |
| 5 | |
| 6 | Copyright (c) 2016 Yann Collet |
| 7 | |
| 8 | Permission is granted to copy and distribute this document |
| 9 | for any purpose and without charge, |
| 10 | including translations into other languages |
| 11 | and incorporation into compilations, |
| 12 | provided that the copyright notice and this notice are preserved, |
| 13 | and that any substantive changes or deletions from the original |
| 14 | are clearly marked. |
| 15 | Distribution of this document is unlimited. |
| 16 | |
| 17 | ### Version |
| 18 | |
| Yann Collet | e557fd5 | 2016-07-17 16:21:37 +0200 | [diff] [blame^] | 19 | 0.1.2 (15/07/16) |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 20 | |
| 21 | |
| 22 | Introduction |
| 23 | ------------ |
| 24 | |
| 25 | The purpose of this document is to define a lossless compressed data format, |
| 26 | that is independent of CPU type, operating system, |
| 27 | file system and character set, suitable for |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 28 | file compression, pipe and streaming compression, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 29 | using the [Zstandard algorithm](http://www.zstandard.org). |
| 30 | |
| 31 | The data can be produced or consumed, |
| 32 | even for an arbitrarily long sequentially presented input data stream, |
| 33 | using only an a priori bounded amount of intermediate storage, |
| 34 | and hence can be used in data communications. |
| 35 | The format uses the Zstandard compression method, |
| 36 | and optional [xxHash-64 checksum method](http://www.xxhash.org), |
| 37 | for detection of data corruption. |
| 38 | |
| 39 | The data format defined by this specification |
| 40 | does not attempt to allow random access to compressed data. |
| 41 | |
| 42 | This specification is intended for use by implementers of software |
| 43 | to compress data into Zstandard format and/or decompress data from Zstandard format. |
| 44 | The text of the specification assumes a basic background in programming |
| 45 | at the level of bits and other primitive data representations. |
| 46 | |
| 47 | Unless otherwise indicated below, |
| 48 | a compliant compressor must produce data sets |
| 49 | that conform to the specifications presented here. |
| 50 | It doesn’t need to support all options though. |
| 51 | |
| 52 | A compliant decompressor must be able to decompress |
| 53 | at least one working set of parameters |
| 54 | that conforms to the specifications presented here. |
| 55 | It may also ignore informative fields, such as checksum. |
| 56 | Whenever it does not support a parameter defined in the compressed stream, |
| 57 | it must produce a non-ambiguous error code and associated error message |
| 58 | explaining which parameter is unsupported. |
| 59 | |
| 60 | |
| 61 | Definitions |
| 62 | ----------- |
| 63 | A content compressed by Zstandard is transformed into a Zstandard __frame__. |
| 64 | Multiple frames can be appended into a single file or stream. |
| 65 | A frame is totally independent, has a defined beginning and end, |
| 66 | and a set of parameters which tells the decoder how to decompress it. |
| 67 | |
| 68 | A frame encapsulates one or multiple __blocks__. |
| 69 | Each block can be compressed or not, |
| 70 | and has a guaranteed maximum content size, which depends on frame parameters. |
| 71 | Unlike frames, each block depends on previous blocks for proper decoding. |
| 72 | However, each block can be decompressed without waiting for its successor, |
| 73 | allowing streaming operations. |
| 74 | |
| 75 | |
| 76 | General Structure of Zstandard Frame format |
| 77 | ------------------------------------------- |
| 78 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 79 | | MagicNb | Frame Header | Block | (More blocks) | EndMark | |
| 80 | |:-------:|:-------------:| ----- | ------------- | ------- | |
| 81 | | 4 bytes | 2-14 bytes | | | 3 bytes | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 82 | |
| 83 | __Magic Number__ |
| 84 | |
| 85 | 4 Bytes, Little endian format. |
| 86 | Value : 0xFD2FB527 |
| 87 | |
| 88 | __Frame Header__ |
| 89 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 90 | 2 to 14 Bytes, detailed in [next part](#frame-header). |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 91 | |
| 92 | __Data Blocks__ |
| 93 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 94 | Detailed in [next chapter](#data-blocks). |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 95 | That’s where compressed data is stored. |
| 96 | |
| 97 | __EndMark__ |
| 98 | |
| 99 | The flow of blocks ends when the last block header brings an _end signal_ . |
| 100 | This last block header may optionally host a __Content Checksum__ . |
| 101 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 102 | ##### __Content Checksum__ |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 103 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 104 | Content Checksum verify that frame content has been regenerated correctly. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 105 | The content checksum is the result |
| 106 | of [xxh64() hash function](https://www.xxHash.com) |
| 107 | digesting the original (decoded) data as input, and a seed of zero. |
| 108 | Bits from 11 to 32 (included) are extracted to form a 22 bits checksum |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 109 | stored into the endmark body. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 110 | ``` |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 111 | mask22bits = (1<<22)-1; |
| 112 | contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits; |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 113 | ``` |
| 114 | Content checksum is only present when its associated flag |
| 115 | is set in the frame descriptor. |
| 116 | Its usage is optional. |
| 117 | |
| 118 | __Frame Concatenation__ |
| 119 | |
| 120 | In some circumstances, it may be required to append multiple frames, |
| 121 | for example in order to add new data to an existing compressed file |
| 122 | without re-framing it. |
| 123 | |
| 124 | In such case, each frame brings its own set of descriptor flags. |
| 125 | Each frame is considered independent. |
| 126 | The only relation between frames is their sequential order. |
| 127 | |
| 128 | The ability to decode multiple concatenated frames |
| 129 | within a single stream or file is left outside of this specification. |
| 130 | As an example, the reference `zstd` command line utility is able |
| 131 | to decode all concatenated frames in their sequential order, |
| 132 | delivering the final decompressed result as if it was a single content. |
| 133 | |
| 134 | |
| 135 | Frame Header |
| 136 | ------------- |
| 137 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 138 | | FHD | (WD) | (dictID) | (Content Size) | |
| 139 | | ------- | --------- | --------- |:--------------:| |
| 140 | | 1 byte | 0-1 byte | 0-4 bytes | 0 - 8 bytes | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 141 | |
| 142 | Frame header has a variable size, which uses a minimum of 2 bytes, |
| 143 | and up to 14 bytes depending on optional parameters. |
| 144 | |
| 145 | __FHD byte__ (Frame Header Descriptor) |
| 146 | |
| 147 | The first Header's byte is called the Frame Header Descriptor. |
| 148 | It tells which other fields are present. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 149 | Decoding this byte is enough to tell the size of Frame Header. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 150 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 151 | | BitNb | 7-6 | 5 | 4 | 3 | 2 | 1-0 | |
| 152 | | ------- | ------ | ------- | ------ | -------- | -------- | ------ | |
| 153 | |FieldName| FCSize | Segment | Unused | Reserved | Checksum | dictID | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 154 | |
| 155 | In this table, bit 7 is highest bit, while bit 0 is lowest. |
| 156 | |
| 157 | __Frame Content Size flag__ |
| 158 | |
| 159 | This is a 2-bits flag (`= FHD >> 6`), |
| 160 | specifying if decompressed data size is provided within the header. |
| 161 | |
| 162 | | Value | 0 | 1 | 2 | 3 | |
| 163 | | ------- | --- | --- | --- | --- | |
| 164 | |FieldSize| 0-1 | 2 | 4 | 8 | |
| 165 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 166 | Value 0 meaning depends on _single segment_ mode : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 167 | it either means `0` (size not provided) _if_ the `WD` byte is present, |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 168 | or `1` (frame content size <= 255 bytes) otherwise. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 169 | |
| 170 | __Single Segment__ |
| 171 | |
| 172 | If this flag is set, |
| 173 | data shall be regenerated within a single continuous memory segment. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 174 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 175 | In which case, `WD` byte __is not present__, |
| 176 | but `Frame Content Size` field necessarily is. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 177 | As a consequence, the decoder must allocate a memory segment |
| 178 | of size `>= Frame Content Size`. |
| 179 | |
| 180 | In order to preserve the decoder from unreasonable memory requirement, |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 181 | a decoder can reject a compressed frame |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 182 | which requests a memory size beyond decoder's authorized range. |
| 183 | |
| 184 | For broader compatibility, decoders are recommended to support |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 185 | memory sizes of at least 8 MB. |
| 186 | This is just a recommendation, |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 187 | each decoder is free to support higher or lower limits, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 188 | depending on local limitations. |
| 189 | |
| 190 | __Unused bit__ |
| 191 | |
| Yann Collet | f0bc673 | 2016-07-13 17:30:21 +0200 | [diff] [blame] | 192 | The value of this bit should be set to zero. |
| 193 | A decoder compliant with this specification version should not interpret it. |
| 194 | It might be used in a future version, |
| 195 | to signal a property which is not mandatory to properly decode the frame. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 196 | |
| 197 | __Reserved bit__ |
| 198 | |
| 199 | This bit is reserved for some future feature. |
| 200 | Its value _must be zero_. |
| 201 | A decoder compliant with this specification version must ensure it is not set. |
| 202 | This bit may be used in a future revision, |
| 203 | to signal a feature that must be interpreted in order to decode the frame. |
| 204 | |
| 205 | __Content checksum flag__ |
| 206 | |
| 207 | If this flag is set, a content checksum will be present into the EndMark. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 208 | The checksum is a 22 bits value extracted from the XXH64() of data, |
| 209 | and stored into endMark. See [__Content Checksum__](#content-checksum) . |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 210 | |
| 211 | __Dictionary ID flag__ |
| 212 | |
| 213 | This is a 2-bits flag (`= FHD & 3`), |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 214 | telling if a dictionary ID is provided within the header. |
| 215 | It also specifies the size of this field. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 216 | |
| 217 | | Value | 0 | 1 | 2 | 3 | |
| 218 | | ------- | --- | --- | --- | --- | |
| 219 | |FieldSize| 0 | 1 | 2 | 4 | |
| 220 | |
| 221 | __WD byte__ (Window Descriptor) |
| 222 | |
| 223 | Provides guarantees on maximum back-reference distance |
| 224 | that will be present within compressed data. |
| 225 | This information is useful for decoders to allocate enough memory. |
| 226 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 227 | `WD` byte is optional. It's not present in `single segment` mode. |
| 228 | In which case, the maximum back-reference distance is the content size itself, |
| 229 | which can be any value from 1 to 2^64-1 bytes (16 EB). |
| 230 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 231 | | BitNb | 7-3 | 0-2 | |
| 232 | | --------- | -------- | -------- | |
| 233 | | FieldName | Exponent | Mantissa | |
| 234 | |
| 235 | Maximum distance is given by the following formulae : |
| 236 | ``` |
| 237 | windowLog = 10 + Exponent; |
| 238 | windowBase = 1 << windowLog; |
| 239 | windowAdd = (windowBase / 8) * Mantissa; |
| 240 | windowSize = windowBase + windowAdd; |
| 241 | ``` |
| 242 | The minimum window size is 1 KB. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 243 | The maximum size is `15*(1<<38)` bytes, which is 1.875 TB. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 244 | |
| 245 | To properly decode compressed data, |
| 246 | a decoder will need to allocate a buffer of at least `windowSize` bytes. |
| 247 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 248 | In order to preserve decoder from unreasonable memory requirements, |
| 249 | a decoder can refuse a compressed frame |
| 250 | which requests a memory size beyond decoder's authorized range. |
| 251 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 252 | For improved interoperability, |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 253 | decoders are recommended to be compatible with window sizes of 8 MB. |
| 254 | Encoders are recommended to not request more than 8 MB. |
| 255 | It's merely a recommendation though, |
| 256 | decoders are free to support larger or lower limits, |
| 257 | depending on local limitations. |
| 258 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 259 | __Dictionary ID__ |
| 260 | |
| Yann Collet | f6ff53c | 2016-07-15 17:03:38 +0200 | [diff] [blame] | 261 | This is a variable size field, which contains |
| 262 | the ID of the dictionary required to properly decode the frame. |
| 263 | Note that this field is optional. When it's not present, |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 264 | it's up to the caller to make sure it uses the correct dictionary. |
| 265 | |
| 266 | Field size depends on __Dictionary ID flag__. |
| 267 | 1 byte can represent an ID 0-255. |
| 268 | 2 bytes can represent an ID 0-65535. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 269 | 4 bytes can represent an ID 0-4294967295. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 270 | |
| 271 | It's allowed to represent a small ID (for example `13`) |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 272 | with a large 4-bytes dictionary ID, losing some compacity in the process. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 273 | |
| Yann Collet | f6ff53c | 2016-07-15 17:03:38 +0200 | [diff] [blame] | 274 | _Reserved ranges :_ |
| 275 | If the frame is going to be distributed in a private environment, |
| 276 | any dictionary ID can be used. |
| 277 | However, for public distribution of compressed frames using a dictionary, |
| 278 | some ranges are reserved for future use : |
| 279 | - low : 1 - 32767 : reserved |
| 280 | - high : >= (2^31) : reserved |
| 281 | |
| 282 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 283 | __Frame Content Size__ |
| 284 | |
| 285 | This is the original (uncompressed) size. |
| 286 | This information is optional, and only present if associated flag is set. |
| 287 | Content size is provided using 1, 2, 4 or 8 Bytes. |
| 288 | Format is Little endian. |
| 289 | |
| 290 | | Field Size | Range | |
| 291 | | ---------- | ---------- | |
| 292 | | 0 | 0 | |
| 293 | | 1 | 0 - 255 | |
| 294 | | 2 | 256 - 65791| |
| 295 | | 4 | 0 - 2^32-1 | |
| 296 | | 8 | 0 - 2^64-1 | |
| 297 | |
| 298 | When field size is 1, 4 or 8 bytes, the value is read directly. |
| 299 | When field size is 2, _an offset of 256 is added_. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 300 | It's allowed to represent a small size (ex: `18`) using any compatible variant. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 301 | A size of `0` means `content size is unknown`. |
| 302 | In which case, the `WD` byte will necessarily be present, |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 303 | and becomes the only hint to guide memory allocation. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 304 | |
| 305 | In order to preserve decoder from unreasonable memory requirement, |
| 306 | a decoder can refuse a compressed frame |
| 307 | which requests a memory size beyond decoder's authorized range. |
| 308 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 309 | |
| 310 | Data Blocks |
| 311 | ----------- |
| 312 | |
| 313 | | B. Header | data | |
| 314 | |:---------:| ------ | |
| 315 | | 3 bytes | | |
| 316 | |
| 317 | |
| 318 | __Block Header__ |
| 319 | |
| 320 | This field uses 3-bytes, format is __big-endian__. |
| 321 | |
| 322 | The 2 highest bits represent the `block type`, |
| 323 | while the remaining 22 bits represent the (compressed) block size. |
| 324 | |
| 325 | There are 4 block types : |
| 326 | |
| 327 | | Value | 0 | 1 | 2 | 3 | |
| 328 | | ---------- | ---------- | --- | --- | ------- | |
| 329 | | Block Type | Compressed | Raw | RLE | EndMark | |
| 330 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 331 | - Compressed : this is a [Zstandard compressed block](#compressed-block-format), |
| 332 | detailed in another section of this specification. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 333 | "block size" is the compressed size. |
| 334 | Decompressed size is unknown, |
| 335 | but its maximum possible value is guaranteed (see below) |
| 336 | - Raw : this is an uncompressed block. |
| 337 | "block size" is the number of bytes to read and copy. |
| 338 | - RLE : this is a single byte, repeated N times. |
| 339 | In which case, "block size" is the size to regenerate, |
| 340 | while the "compressed" block is just 1 byte (the byte to repeat). |
| 341 | - EndMark : this is not a block. Signal the end of the frame. |
| 342 | The rest of the field may be optionally filled by a checksum |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 343 | (see [Content Checksum](#content-checksum)). |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 344 | |
| 345 | Block sizes must respect a few rules : |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 346 | - In compressed mode, compressed size if always strictly `< decompressed size`. |
| 347 | - Block decompressed size is always <= maximum back-reference distance . |
| 348 | - Block decompressed size is always <= 128 KB |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 349 | |
| 350 | |
| 351 | __Data__ |
| 352 | |
| 353 | Where the actual data to decode stands. |
| 354 | It might be compressed or not, depending on previous field indications. |
| 355 | A data block is not necessarily "full" : |
| 356 | since an arbitrary “flush” may happen anytime, |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 357 | block decompressed content can be any size, |
| 358 | up to Block Maximum Decompressed Size, which is the smallest of : |
| 359 | - Maximum back-reference distance |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 360 | - 128 KB |
| 361 | |
| 362 | |
| 363 | Skippable Frames |
| 364 | ---------------- |
| 365 | |
| 366 | | Magic Number | Frame Size | User Data | |
| 367 | |:------------:|:----------:| --------- | |
| 368 | | 4 bytes | 4 bytes | | |
| 369 | |
| 370 | Skippable frames allow the insertion of user-defined data |
| 371 | into a flow of concatenated frames. |
| 372 | Its design is pretty straightforward, |
| 373 | with the sole objective to allow the decoder to quickly skip |
| 374 | over user-defined data and continue decoding. |
| 375 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 376 | Skippable frames defined in this specification are compatible with [LZ4] ones. |
| 377 | |
| 378 | [LZ4]:http://www.lz4.org |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 379 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 380 | __Magic Number__ : |
| 381 | |
| 382 | 4 Bytes, Little endian format. |
| 383 | Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F. |
| 384 | All 16 values are valid to identify a skippable frame. |
| 385 | |
| 386 | __Frame Size__ : |
| 387 | |
| 388 | This is the size, in bytes, of the following User Data |
| 389 | (without including the magic number nor the size field itself). |
| 390 | 4 Bytes, Little endian format, unsigned 32-bits. |
| 391 | This means User Data can’t be bigger than (2^32-1) Bytes. |
| 392 | |
| 393 | __User Data__ : |
| 394 | |
| 395 | User Data can be anything. Data will just be skipped by the decoder. |
| 396 | |
| 397 | |
| 398 | Compressed block format |
| 399 | ----------------------- |
| 400 | This specification details the content of a _compressed block_. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 401 | A compressed block has a size, which must be known. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 402 | It also has a guaranteed maximum regenerated size, |
| 403 | in order to properly allocate destination buffer. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 404 | See [Data Blocks](#data-blocks) for more details. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 405 | |
| 406 | A compressed block consists of 2 sections : |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 407 | - [Literals section](#literals-section) |
| 408 | - [Sequences section](#sequences-section) |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 409 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 410 | ### Prerequisites |
| 411 | To decode a compressed block, the following elements are necessary : |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 412 | - Previous decoded blocks, up to a distance of `windowSize`, |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 413 | or all previous blocks in "single segment" mode. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 414 | - List of "recent offsets" from previous compressed block. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 415 | - Decoding tables of previous compressed block for each symbol type |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 416 | (literals, litLength, matchLength, offset). |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 417 | |
| 418 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 419 | ### Literals section |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 420 | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 421 | Literals are compressed using Huffman prefix codes. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 422 | During sequence phase, literals will be entangled with match copy operations. |
| 423 | All literals are regrouped in the first part of the block. |
| 424 | They can be decoded first, and then copied during sequence operations, |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 425 | or they can be decoded on the flow, as needed by sequence commands. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 426 | |
| 427 | | Header | (Tree Description) | Stream1 | (Stream2) | (Stream3) | (Stream4) | |
| 428 | | ------ | ------------------ | ------- | --------- | --------- | --------- | |
| 429 | |
| 430 | Literals can be compressed, or uncompressed. |
| 431 | When compressed, an optional tree description can be present, |
| 432 | followed by 1 or 4 streams. |
| 433 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 434 | #### Literals section header |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 435 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 436 | Header is in charge of describing how literals are packed. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 437 | It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes, |
| 438 | using big-endian convention. |
| 439 | |
| 440 | | BlockType | sizes format | (compressed size) | regenerated size | |
| 441 | | --------- | ------------ | ----------------- | ---------------- | |
| 442 | | 2 bits | 1 - 2 bits | 0 - 18 bits | 5 - 20 bits | |
| 443 | |
| 444 | __Block Type__ : |
| 445 | |
| 446 | This is a 2-bits field, describing 4 different block types : |
| 447 | |
| 448 | | Value | 0 | 1 | 2 | 3 | |
| 449 | | ---------- | ---------- | ------ | --- | ------- | |
| 450 | | Block Type | Compressed | Repeat | Raw | RLE | |
| 451 | |
| 452 | - Compressed : This is a standard huffman-compressed block, |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 453 | starting with a huffman tree description. |
| 454 | See details below. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 455 | - Repeat Stats : This is a huffman-compressed block, |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 456 | using huffman tree _from previous huffman-compressed literals block_. |
| 457 | Huffman tree description will be skipped. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 458 | - Raw : Literals are stored uncompressed. |
| 459 | - RLE : Literals consist of a single byte value repeated N times. |
| 460 | |
| 461 | __Sizes format__ : |
| 462 | |
| 463 | Sizes format are divided into 2 families : |
| 464 | |
| 465 | - For compressed block, it requires to decode both the compressed size |
| 466 | and the decompressed size. It will also decode the number of streams. |
| 467 | - For Raw or RLE blocks, it's enough to decode the size to regenerate. |
| 468 | |
| 469 | For values spanning several bytes, convention is Big-endian. |
| 470 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 471 | __Sizes format for Raw or RLE literals block__ : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 472 | |
| 473 | - Value : 0x : Regenerated size uses 5 bits (0-31). |
| 474 | Total literal header size is 1 byte. |
| 475 | `size = h[0] & 31;` |
| 476 | - Value : 10 : Regenerated size uses 12 bits (0-4095). |
| 477 | Total literal header size is 2 bytes. |
| 478 | `size = ((h[0] & 15) << 8) + h[1];` |
| 479 | - Value : 11 : Regenerated size uses 20 bits (0-1048575). |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 480 | Total literal header size is 3 bytes. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 481 | `size = ((h[0] & 15) << 16) + (h[1]<<8) + h[2];` |
| 482 | |
| 483 | Note : it's allowed to represent a short value (ex : `13`) |
| 484 | using a long format, accepting the reduced compacity. |
| 485 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 486 | __Sizes format for Compressed literals block__ : |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 487 | |
| 488 | Note : also applicable to "repeat-stats" blocks. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 489 | - Value : 00 : 4 streams. |
| 490 | Compressed and regenerated sizes use 10 bits (0-1023). |
| 491 | Total literal header size is 3 bytes. |
| 492 | - Value : 01 : _Single stream_. |
| 493 | Compressed and regenerated sizes use 10 bits (0-1023). |
| 494 | Total literal header size is 3 bytes. |
| 495 | - Value : 10 : 4 streams. |
| 496 | Compressed and regenerated sizes use 14 bits (0-16383). |
| 497 | Total literal header size is 4 bytes. |
| 498 | - Value : 10 : 4 streams. |
| 499 | Compressed and regenerated sizes use 18 bits (0-262143). |
| 500 | Total literal header size is 5 bytes. |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 501 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 502 | Compressed and regenerated size fields follow big endian convention. |
| 503 | |
| 504 | #### Huffman Tree description |
| 505 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 506 | This section is only present when literals block type is `Compressed` (`0`). |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 507 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 508 | Prefix coding represents symbols from an a priori known alphabet |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 509 | by bit sequences (codewords), one codeword for each symbol, |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 510 | in a manner such that different symbols may be represented |
| 511 | by bit sequences of different lengths, |
| 512 | but a parser can always parse an encoded string |
| 513 | unambiguously symbol-by-symbol. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 514 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 515 | Given an alphabet with known symbol frequencies, |
| 516 | the Huffman algorithm allows the construction of an optimal prefix code |
| 517 | using the fewest bits of any possible prefix codes for that alphabet. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 518 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 519 | Prefix code must not exceed a maximum code length. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 520 | More bits improve accuracy but cost more header size, |
| Yann Collet | e557fd5 | 2016-07-17 16:21:37 +0200 | [diff] [blame^] | 521 | and require more memory or more complex decoding operations. |
| 522 | This specification limits maximum code length to 11 bits. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 523 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 524 | |
| 525 | ##### Representation |
| 526 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 527 | All literal values from zero (included) to last present one (excluded) |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 528 | are represented by `weight` values, from 0 to `maxBits`. |
| 529 | Transformation from `weight` to `nbBits` follows this formulae : |
| 530 | `nbBits = weight ? maxBits + 1 - weight : 0;` . |
| 531 | The last symbol's weight is deduced from previously decoded ones, |
| 532 | by completing to the nearest power of 2. |
| 533 | This power of 2 gives `maxBits`, the depth of the current tree. |
| 534 | |
| 535 | __Example__ : |
| 536 | Let's presume the following huffman tree must be described : |
| 537 | |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 538 | | literal | 0 | 1 | 2 | 3 | 4 | 5 | |
| 539 | | ------- | --- | --- | --- | --- | --- | --- | |
| 540 | | nbBits | 1 | 2 | 3 | 0 | 4 | 4 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 541 | |
| 542 | The tree depth is 4, since its smallest element uses 4 bits. |
| 543 | Value `5` will not be listed, nor will values above `5`. |
| 544 | Values from `0` to `4` will be listed using `weight` instead of `nbBits`. |
| 545 | Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;` |
| 546 | It gives the following serie of weights : |
| 547 | |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 548 | | weights | 4 | 3 | 2 | 0 | 1 | |
| 549 | | ------- | --- | --- | --- | --- | --- | |
| 550 | | literal | 0 | 1 | 2 | 3 | 4 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 551 | |
| 552 | The decoder will do the inverse operation : |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 553 | having collected weights of literals from `0` to `4`, |
| 554 | it knows the last literal, `5`, is present with a non-zero weight. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 555 | The weight of `5` can be deducted by joining to the nearest power of 2. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 556 | Sum of 2^(weight-1) (excluding 0) is : |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 557 | `8 + 4 + 2 + 0 + 1 = 15` |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 558 | Nearest power of 2 is 16. |
| 559 | Therefore, `maxBits = 4` and `weight[5] = 1`. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 560 | |
| 561 | ##### Huffman Tree header |
| 562 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 563 | This is a single byte value (0-255), |
| 564 | which tells how to decode the list of weights. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 565 | |
| 566 | - if headerByte >= 242 : this is one of 14 pre-defined weight distributions : |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 567 | |
| 568 | | value |242|243|244|245|246|247|248|249|250|251|252|253|254|255| |
| Yann Collet | e0ce5b0 | 2016-07-06 01:50:44 +0200 | [diff] [blame] | 569 | | -------- |---|---|---|---|---|---|---|---|---|---|---|---|---|---| |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 570 | | Nb of 1s | 1 | 2 | 3 | 4 | 7 | 8 | 15| 16| 31| 32| 63| 64|127|128| |
| 571 | |Complement| 1 | 2 | 1 | 4 | 1 | 8 | 1 | 16| 1 | 32| 1 | 64| 1 |128| |
| 572 | |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 573 | _Note_ : complement is found by using "join to nearest power of 2" rule. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 574 | |
| 575 | - if headerByte >= 128 : this is a direct representation, |
| 576 | where each weight is written directly as a 4 bits field (0-15). |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 577 | The full representation occupies `((nbSymbols+1)/2)` bytes, |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 578 | meaning it uses a last full byte even if nbSymbols is odd. |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 579 | `nbSymbols = headerByte - 127;`. |
| 580 | Note that maximum nbSymbols is 241-127 = 114. |
| 581 | A larger serie must necessarily use FSE compression. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 582 | |
| 583 | - if headerByte < 128 : |
| 584 | the serie of weights is compressed by FSE. |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 585 | The length of the FSE-compressed serie is `headerByte` (0-127). |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 586 | |
| 587 | ##### FSE (Finite State Entropy) compression of huffman weights |
| 588 | |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 589 | The serie of weights is compressed using FSE compression. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 590 | It's a single bitstream with 2 interleaved states, |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 591 | sharing a single distribution table. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 592 | |
| 593 | To decode an FSE bitstream, it is necessary to know its compressed size. |
| 594 | Compressed size is provided by `headerByte`. |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 595 | It's also necessary to know its maximum decompressed size, |
| 596 | which is `255`, since literal values span from `0` to `255`, |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 597 | and last symbol value is not represented. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 598 | |
| 599 | An FSE bitstream starts by a header, describing probabilities distribution. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 600 | It will create a Decoding Table. |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 601 | Table must be pre-allocated, which requires to support a maximum accuracy. |
| 602 | For a list of huffman weights, recommended maximum is 7 bits. |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 603 | |
| Yann Collet | 26f6814 | 2016-07-08 10:42:59 +0200 | [diff] [blame] | 604 | FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format), |
| 605 | and so is [FSE bitstream](#bitstream). |
| 606 | The main difference is that Huffman header compression uses 2 states, |
| 607 | which share the same FSE distribution table. |
| 608 | Bitstream contains only FSE symbols, there are no interleaved "raw bitfields". |
| 609 | The number of symbols to decode is discovered |
| 610 | by tracking bitStream overflow condition. |
| 611 | When both states have overflowed the bitstream, end is reached. |
| 612 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 613 | |
| 614 | ##### Conversion from weights to huffman prefix codes |
| 615 | |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 616 | All present symbols shall now have a `weight` value. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 617 | Symbols are sorted by weight. |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 618 | Symbols with a weight of zero are removed. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 619 | Within same weight, symbols keep natural order. |
| 620 | Starting from lowest weight, |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 621 | symbols are being allocated to a `range`. |
| 622 | A `weight` directly represents a `range`, |
| 623 | following the formulae : `range = weight ? 1 << (weight-1) : 0 ;` |
| 624 | Similarly, it is possible to transform weights into nbBits : |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 625 | `nbBits = nbBits ? maxBits + 1 - weight : 0;` . |
| 626 | |
| 627 | |
| 628 | __Example__ : |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 629 | Let's presume the following list of weights has been decoded : |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 630 | |
| 631 | | Literal | 0 | 1 | 2 | 3 | 4 | 5 | |
| 632 | | ------- | --- | --- | --- | --- | --- | --- | |
| 633 | | weight | 4 | 3 | 2 | 0 | 1 | 1 | |
| 634 | |
| 635 | Sorted by weight and then natural order, |
| 636 | it gives the following distribution : |
| 637 | |
| 638 | | Literal | 3 | 4 | 5 | 2 | 1 | 0 | |
| 639 | | ------------ | --- | --- | --- | --- | --- | ---- | |
| 640 | | weight | 0 | 1 | 1 | 2 | 3 | 4 | |
| 641 | | range | 0 | 1 | 1 | 2 | 4 | 8 | |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 642 | | table entries| N/A | 0 | 1 | 2-3 | 4-7 | 8-15 | |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 643 | | nb bits | 0 | 4 | 4 | 3 | 2 | 1 | |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 644 | | prefix codes | N/A | 0000| 0001| 001 | 01 | 1 | |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 645 | |
| 646 | |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 647 | #### Literals bitstreams |
| 648 | |
| 649 | ##### Bitstreams sizes |
| 650 | |
| 651 | As seen in a previous paragraph, |
| 652 | there are 2 flavors of huffman-compressed literals : |
| 653 | single stream, and 4-streams. |
| 654 | |
| 655 | 4-streams is useful for CPU with multiple execution units and OoO operations. |
| 656 | Since each stream can be decoded independently, |
| 657 | it's possible to decode them up to 4x faster than a single stream, |
| 658 | presuming the CPU has enough parallelism available. |
| 659 | |
| 660 | For single stream, header provides both the compressed and regenerated size. |
| 661 | For 4-streams though, |
| 662 | header only provides compressed and regenerated size of all 4 streams combined. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 663 | In order to properly decode the 4 streams, |
| 664 | it's necessary to know the compressed and regenerated size of each stream. |
| 665 | |
| 666 | Regenerated size is easiest : |
| 667 | each stream has a size of `(totalSize+3)/4`, |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 668 | except the last one, which is up to 3 bytes smaller, to reach `totalSize`. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 669 | |
| 670 | Compressed size must be provided explicitly : in the 4-streams variant, |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 671 | bitstreams are preceded by 3 unsigned Little Endian 16-bits values. |
| 672 | Each value represents the compressed size of one stream, in order. |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 673 | The last stream size is deducted from total compressed size |
| 674 | and from already known stream sizes : |
| 675 | `stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;` |
| 676 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 677 | ##### Bitstreams read and decode |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 678 | |
| 679 | Each bitstream must be read _backward_, |
| 680 | that is starting from the end down to the beginning. |
| 681 | Therefore it's necessary to know the size of each bitstream. |
| 682 | |
| 683 | It's also necessary to know exactly which _bit_ is the latest. |
| 684 | This is detected by a final bit flag : |
| 685 | the highest bit of latest byte is a final-bit-flag. |
| 686 | Consequently, a last byte of `0` is not possible. |
| 687 | And the final-bit-flag itself is not part of the useful bitstream. |
| 688 | Hence, the last byte contain between 0 and 7 useful bits. |
| 689 | |
| 690 | Starting from the end, |
| 691 | it's possible to read the bitstream in a little-endian fashion, |
| 692 | keeping track of already used bits. |
| 693 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 694 | Reading the last `maxBits` bits, |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 695 | it's then possible to compare extracted value to decoding table, |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 696 | determining the symbol to decode and number of bits to discard. |
| 697 | |
| 698 | The process continues up to reading the required number of symbols per stream. |
| 699 | If a bitstream is not entirely and exactly consumed, |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 700 | hence reaching exactly its beginning position with _all_ bits consumed, |
| Yann Collet | d916c90 | 2016-07-04 00:42:58 +0200 | [diff] [blame] | 701 | the decoding process is considered faulty. |
| 702 | |
| Yann Collet | 698cb63 | 2016-07-03 18:49:35 +0200 | [diff] [blame] | 703 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 704 | ### Sequences section |
| 705 | |
| 706 | A compressed block is a succession of _sequences_ . |
| 707 | A sequence is a literal copy command, followed by a match copy command. |
| 708 | A literal copy command specifies a length. |
| 709 | It is the number of bytes to be copied (or extracted) from the literal section. |
| 710 | A match copy command specifies an offset and a length. |
| 711 | The offset gives the position to copy from, |
| Yann Collet | b21e9cb | 2016-07-15 17:31:13 +0200 | [diff] [blame] | 712 | which can be within a previous block. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 713 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 714 | There are 3 symbol types, `literalLength`, `matchLength` and `offset`, |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 715 | which are encoded together, interleaved in a single _bitstream_. |
| 716 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 717 | Each symbol is a _code_ in its own context, |
| 718 | which specifies a baseline and a number of bits to add. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 719 | _Codes_ are FSE compressed, |
| 720 | and interleaved with raw additional bits in the same bitstream. |
| 721 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 722 | The Sequences section starts by a header, |
| 723 | followed by optional Probability tables for each symbol type, |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 724 | followed by the bitstream. |
| 725 | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 726 | | Header | (LitLengthTable) | (OffsetTable) | (MatchLengthTable) | bitStream | |
| 727 | | ------ | ---------------- | ------------- | ------------------ | --------- | |
| 728 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 729 | To decode the Sequence section, it's required to know its size. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 730 | This size is deducted from `blockSize - literalSectionSize`. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 731 | |
| 732 | |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 733 | #### Sequences section header |
| 734 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 735 | Consists in 2 items : |
| 736 | - Nb of Sequences |
| 737 | - Flags providing Symbol compression types |
| 738 | |
| 739 | __Nb of Sequences__ |
| 740 | |
| 741 | This is a variable size field, `nbSeqs`, using between 1 and 3 bytes. |
| 742 | Let's call its first byte `byte0`. |
| 743 | - `if (byte0 == 0)` : there are no sequences. |
| 744 | The sequence section stops there. |
| 745 | Regenerated content is defined entirely by literals section. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 746 | - `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte. |
| 747 | - `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes. |
| 748 | - `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 749 | |
| 750 | __Symbol compression modes__ |
| 751 | |
| 752 | This is a single byte, defining the compression mode of each symbol type. |
| 753 | |
| 754 | | BitNb | 7-6 | 5-4 | 3-2 | 1-0 | |
| 755 | | ------- | ------ | ------ | ------ | -------- | |
| 756 | |FieldName| LLtype | OFType | MLType | Reserved | |
| 757 | |
| 758 | The last field, `Reserved`, must be all-zeroes. |
| 759 | |
| 760 | `LLtype`, `OFType` and `MLType` define the compression mode of |
| 761 | Literal Lengths, Offsets and Match Lengths respectively. |
| 762 | |
| 763 | They follow the same enumeration : |
| 764 | |
| 765 | | Value | 0 | 1 | 2 | 3 | |
| 766 | | ---------------- | ------ | --- | ------ | --- | |
| 767 | | Compression Mode | predef | RLE | Repeat | FSE | |
| 768 | |
| 769 | - "predef" : uses a pre-defined distribution table. |
| 770 | - "RLE" : it's a single code, repeated `nbSeqs` times. |
| 771 | - "Repeat" : re-use distribution table from previous compressed block. |
| 772 | - "FSE" : standard FSE compression. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 773 | A distribution table will be present. |
| 774 | It will be described in [next part](#distribution-tables). |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 775 | |
| 776 | #### Symbols decoding |
| 777 | |
| 778 | ##### Literal Lengths codes |
| 779 | |
| 780 | Literal lengths codes are values ranging from `0` to `35` included. |
| 781 | They define lengths from 0 to 131071 bytes. |
| 782 | |
| 783 | | Code | 0-15 | |
| 784 | | ------ | ---- | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 785 | | length | Code | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 786 | | nbBits | 0 | |
| 787 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 788 | |
| 789 | | Code | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | |
| 790 | | -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| 791 | | Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 | |
| 792 | | nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | |
| 793 | |
| 794 | | Code | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | |
| 795 | | -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| 796 | | Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | |
| 797 | | nb Bits | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
| 798 | |
| 799 | | Code | 32 | 33 | 34 | 35 | |
| 800 | | -------- | ---- | ---- | ---- | ---- | |
| 801 | | Baseline | 8192 |16384 |32768 |65536 | |
| 802 | | nb Bits | 13 | 14 | 15 | 16 | |
| 803 | |
| 804 | __Default distribution__ |
| 805 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 806 | When "compression mode" is "predef"", |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 807 | a pre-defined distribution is used for FSE compression. |
| 808 | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 809 | Below is its definition. It uses an accuracy of 6 bits (64 states). |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 810 | ``` |
| 811 | short literalLengths_defaultDistribution[36] = |
| 812 | { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, |
| 813 | 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, |
| 814 | -1,-1,-1,-1 }; |
| 815 | ``` |
| 816 | |
| 817 | ##### Match Lengths codes |
| 818 | |
| 819 | Match lengths codes are values ranging from `0` to `52` included. |
| 820 | They define lengths from 3 to 131074 bytes. |
| 821 | |
| 822 | | Code | 0-31 | |
| 823 | | ------ | -------- | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 824 | | value | Code + 3 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 825 | | nbBits | 0 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 826 | |
| 827 | | Code | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | |
| 828 | | -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| 829 | | Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 | |
| 830 | | nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | |
| 831 | |
| 832 | | Code | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | |
| 833 | | -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| 834 | | Baseline | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 | |
| 835 | | nb Bits | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 | |
| 836 | |
| 837 | | Code | 48 | 49 | 50 | 51 | 52 | |
| 838 | | -------- | ---- | ---- | ---- | ---- | ---- | |
| 839 | | Baseline | 4098 | 8194 |16486 |32770 |65538 | |
| 840 | | nb Bits | 12 | 13 | 14 | 15 | 16 | |
| 841 | |
| 842 | __Default distribution__ |
| 843 | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 844 | When "compression mode" is defined as "predef", |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 845 | a pre-defined distribution is used for FSE compression. |
| 846 | |
| 847 | Here is its definition. It uses an accuracy of 6 bits (64 states). |
| 848 | ``` |
| 849 | short matchLengths_defaultDistribution[53] = |
| 850 | { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, |
| 851 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
| 852 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, |
| 853 | -1,-1,-1,-1,-1 }; |
| 854 | ``` |
| 855 | |
| 856 | ##### Offset codes |
| 857 | |
| 858 | Offset codes are values ranging from `0` to `N`, |
| 859 | with `N` being limited by maximum backreference distance. |
| 860 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 861 | A decoder is free to limit its maximum `N` supported. |
| 862 | Recommendation is to support at least up to `22`. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 863 | For information, at the time of this writing. |
| 864 | the reference decoder supports a maximum `N` value of `28` in 64-bits mode. |
| 865 | |
| 866 | An offset code is also the nb of additional bits to read, |
| 867 | and can be translated into an `OFValue` using the following formulae : |
| 868 | |
| 869 | ``` |
| 870 | OFValue = (1 << offsetCode) + readNBits(offsetCode); |
| 871 | if (OFValue > 3) offset = OFValue - 3; |
| 872 | ``` |
| 873 | |
| 874 | OFValue from 1 to 3 are special : they define "repeat codes", |
| 875 | which means one of the previous offsets will be repeated. |
| 876 | They are sorted in recency order, with 1 meaning the most recent one. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 877 | See [Repeat offsets](#repeat-offsets) paragraph. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 878 | |
| 879 | __Default distribution__ |
| 880 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 881 | When "compression mode" is defined as "predef", |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 882 | a pre-defined distribution is used for FSE compression. |
| 883 | |
| 884 | Here is its definition. It uses an accuracy of 5 bits (32 states), |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 885 | and supports a maximum `N` of 28, allowing offset values up to 536,870,908 . |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 886 | |
| 887 | If any sequence in the compressed block requires an offset larger than this, |
| 888 | it's not possible to use the default distribution to represent it. |
| 889 | |
| 890 | ``` |
| 891 | short offsetCodes_defaultDistribution[53] = |
| 892 | { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, |
| 893 | 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 }; |
| 894 | ``` |
| 895 | |
| 896 | #### Distribution tables |
| 897 | |
| 898 | Following the header, up to 3 distribution tables can be described. |
| 899 | They are, in order : |
| 900 | - Literal lengthes |
| 901 | - Offsets |
| 902 | - Match Lengthes |
| 903 | |
| 904 | The content to decode depends on their respective compression mode : |
| 905 | - Repeat mode : no content. Re-use distribution from previous compressed block. |
| 906 | - Predef : no content. Use pre-defined distribution table. |
| 907 | - RLE : 1 byte. This is the only code to use across the whole compressed block. |
| 908 | - FSE : A distribution table is present. |
| 909 | |
| 910 | ##### FSE distribution table : condensed format |
| 911 | |
| 912 | An FSE distribution table describes the probabilities of all symbols |
| 913 | from `0` to the last present one (included) |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 914 | on a normalized scale of `1 << AccuracyLog` . |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 915 | |
| 916 | It's a bitstream which is read forward, in little-endian fashion. |
| 917 | It's not necessary to know its exact size, |
| 918 | since it will be discovered and reported by the decoding process. |
| 919 | |
| 920 | The bitstream starts by reporting on which scale it operates. |
| 921 | `AccuracyLog = low4bits + 5;` |
| 922 | In theory, it can define a scale from 5 to 20. |
| 923 | In practice, decoders are allowed to limit the maximum supported `AccuracyLog`. |
| 924 | Recommended maximum are `9` for literal and match lengthes, and `8` for offsets. |
| 925 | The reference decoder uses these limits. |
| 926 | |
| 927 | Then follow each symbol value, from `0` to last present one. |
| 928 | The nb of bits used by each field is variable. |
| 929 | It depends on : |
| 930 | |
| 931 | - Remaining probabilities + 1 : |
| 932 | __example__ : |
| 933 | Presuming an AccuracyLog of 8, |
| 934 | and presuming 100 probabilities points have already been distributed, |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 935 | the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included). |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 936 | Therefore, it must read `log2sup(156) == 8` bits. |
| 937 | |
| 938 | - Value decoded : small values use 1 less bit : |
| 939 | __example__ : |
| 940 | Presuming values from 0 to 156 (included) are possible, |
| 941 | 255-156 = 99 values are remaining in an 8-bits field. |
| 942 | They are used this way : |
| 943 | first 99 values (hence from 0 to 98) use only 7 bits, |
| 944 | values from 99 to 156 use 8 bits. |
| 945 | This is achieved through this scheme : |
| 946 | |
| 947 | | Value read | Value decoded | nb Bits used | |
| 948 | | ---------- | ------------- | ------------ | |
| 949 | | 0 - 98 | 0 - 98 | 7 | |
| 950 | | 99 - 127 | 99 - 127 | 8 | |
| 951 | | 128 - 226 | 0 - 98 | 7 | |
| 952 | | 227 - 255 | 128 - 156 | 8 | |
| 953 | |
| 954 | Symbols probabilities are read one by one, in order. |
| 955 | |
| 956 | Probability is obtained from Value decoded by following formulae : |
| 957 | `Proba = value - 1;` |
| 958 | |
| 959 | It means value `0` becomes negative probability `-1`. |
| 960 | `-1` is a special probability, which means `less than 1`. |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 961 | Its effect on distribution table is described in [next paragraph]. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 962 | For the purpose of calculating cumulated distribution, it counts as one. |
| 963 | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 964 | [next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables |
| 965 | |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 966 | When a symbol has a probability of `zero`, |
| 967 | it is followed by a 2-bits repeat flag. |
| 968 | This repeat flag tells how many probabilities of zeroes follow the current one. |
| 969 | It provides a number ranging from 0 to 3. |
| 970 | If it is a 3, another 2-bits repeat flag follows, and so on. |
| 971 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 972 | When last symbol reaches cumulated total of `1 << AccuracyLog`, |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 973 | decoding is complete. |
| 974 | Then the decoder can tell how many bytes were used in this process, |
| 975 | and how many symbols are present. |
| 976 | |
| 977 | The bitstream consumes a round number of bytes. |
| 978 | Any remaining bit within the last byte is just unused. |
| 979 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 980 | If the last symbol makes cumulated total go above `1 << AccuracyLog`, |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 981 | distribution is considered corrupted. |
| 982 | |
| 983 | ##### FSE decoding : from normalized distribution to decoding tables |
| 984 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 985 | The distribution of normalized probabilities is enough |
| 986 | to create a unique decoding table. |
| 987 | |
| 988 | It follows the following build rule : |
| 989 | |
| 990 | The table has a size of `tableSize = 1 << AccuracyLog;`. |
| 991 | Each cell describes the symbol decoded, |
| 992 | and instructions to get the next state. |
| 993 | |
| 994 | Symbols are scanned in their natural order for `less than 1` probabilities. |
| 995 | Symbols with this probability are being attributed a single cell, |
| 996 | starting from the end of the table. |
| 997 | These symbols define a full state reset, reading `AccuracyLog` bits. |
| 998 | |
| 999 | All remaining symbols are sorted in their natural order. |
| 1000 | Starting from symbol `0` and table position `0`, |
| 1001 | each symbol gets attributed as many cells as its probability. |
| 1002 | Cell allocation is spreaded, not linear : |
| 1003 | each successor position follow this rule : |
| 1004 | |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 1005 | ``` |
| 1006 | position += (tableSize>>1) + (tableSize>>3) + 3; |
| 1007 | position &= tableSize-1; |
| 1008 | ``` |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1009 | |
| 1010 | A position is skipped if already occupied, |
| 1011 | typically by a "less than 1" probability symbol. |
| 1012 | |
| 1013 | The result is a list of state values. |
| 1014 | Each state will decode the current symbol. |
| 1015 | |
| 1016 | To get the Number of bits and baseline required for next state, |
| 1017 | it's first necessary to sort all states in their natural order. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 1018 | The lower states will need 1 more bit than higher ones. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1019 | |
| 1020 | __Example__ : |
| 1021 | Presuming a symbol has a probability of 5. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 1022 | It receives 5 state values. States are sorted in natural order. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1023 | |
| 1024 | Next power of 2 is 8. |
| 1025 | Space of probabilities is divided into 8 equal parts. |
| 1026 | Presuming the AccuracyLog is 7, it defines 128 states. |
| 1027 | Divided by 8, each share is 16 large. |
| 1028 | |
| 1029 | In order to reach 8, 8-5=3 lowest states will count "double", |
| 1030 | taking shares twice larger, |
| 1031 | requiring one more bit in the process. |
| 1032 | |
| 1033 | Numbering starts from higher states using less bits. |
| 1034 | |
| 1035 | | state order | 0 | 1 | 2 | 3 | 4 | |
| 1036 | | ----------- | ----- | ----- | ------ | ---- | ----- | |
| 1037 | | width | 32 | 32 | 32 | 16 | 16 | |
| 1038 | | nb Bits | 5 | 5 | 5 | 4 | 4 | |
| 1039 | | range nb | 2 | 4 | 6 | 0 | 1 | |
| 1040 | | baseline | 32 | 64 | 96 | 0 | 16 | |
| 1041 | | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | |
| 1042 | |
| 1043 | Next state is determined from current state |
| 1044 | by reading the required number of bits, and adding the specified baseline. |
| Yann Collet | 23f05cc | 2016-07-04 16:13:11 +0200 | [diff] [blame] | 1045 | |
| 1046 | |
| 1047 | #### Bitstream |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 1048 | |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1049 | All sequences are stored in a single bitstream, read _backward_. |
| 1050 | It is therefore necessary to know the bitstream size, |
| 1051 | which is deducted from compressed block size. |
| 1052 | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 1053 | The last useful bit of the stream is followed by an end-bit-flag. |
| Yann Collet | cd25a91 | 2016-07-05 11:50:37 +0200 | [diff] [blame] | 1054 | Highest bit of last byte is this flag. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1055 | It does not belong to the useful part of the bitstream. |
| 1056 | Therefore, last byte has 0-7 useful bits. |
| 1057 | Note that it also means that last byte cannot be `0`. |
| 1058 | |
| 1059 | ##### Starting states |
| 1060 | |
| 1061 | The bitstream starts with initial state values, |
| 1062 | each using the required number of bits in their respective _accuracy_, |
| 1063 | decoded previously from their normalized distribution. |
| 1064 | |
| 1065 | It starts by `Literal Length State`, |
| 1066 | followed by `Offset State`, |
| 1067 | and finally `Match Length State`. |
| 1068 | |
| 1069 | Reminder : always keep in mind that all values are read _backward_. |
| 1070 | |
| 1071 | ##### Decoding a sequence |
| 1072 | |
| 1073 | A state gives a code. |
| 1074 | A code provides a baseline and number of bits to add. |
| 1075 | See [Symbol Decoding] section for details on each symbol. |
| 1076 | |
| 1077 | Decoding starts by reading the nb of bits required to decode offset. |
| 1078 | It then does the same for match length, |
| 1079 | and then for literal length. |
| 1080 | |
| Yann Collet | c40ba71 | 2016-07-08 15:39:02 +0200 | [diff] [blame] | 1081 | Offset / matchLength / litLength define a sequence. |
| 1082 | It starts by inserting the number of literals defined by `litLength`, |
| 1083 | then continue by copying `matchLength` bytes from `currentPos - offset`. |
| Yann Collet | 9ca7336 | 2016-07-05 10:53:38 +0200 | [diff] [blame] | 1084 | |
| 1085 | The next operation is to update states. |
| 1086 | Using rules pre-calculated in the decoding tables, |
| 1087 | `Literal Length State` is updated, |
| 1088 | followed by `Match Length State`, |
| 1089 | and then `Offset State`. |
| 1090 | |
| 1091 | This operation will be repeated `NbSeqs` times. |
| 1092 | At the end, the bitstream shall be entirely consumed, |
| 1093 | otherwise bitstream is considered corrupted. |
| 1094 | |
| 1095 | [Symbol Decoding]:#symbols-decoding |
| 1096 | |
| 1097 | ##### Repeat offsets |
| 1098 | |
| 1099 | As seen in [Offset Codes], the first 3 values define a repeated offset. |
| 1100 | They are sorted in recency order, with 1 meaning "most recent one". |
| 1101 | |
| 1102 | There is an exception though, when current sequence's literal length is `0`. |
| 1103 | In which case, 1 would just make previous match longer. |
| 1104 | Therefore, in such case, 1 means in fact 2, and 2 is impossible. |
| 1105 | Meaning of 3 is unmodified. |
| 1106 | |
| 1107 | Repeat offsets start with the following values : 1, 4 and 8 (in order). |
| 1108 | |
| 1109 | Then each block receives its start value from previous compressed block. |
| 1110 | Note that non-compressed blocks are skipped, |
| 1111 | they do not contribute to offset history. |
| 1112 | |
| 1113 | [Offset Codes]: #offset-codes |
| 1114 | |
| 1115 | ###### Offset updates rules |
| 1116 | |
| 1117 | When the new offset is a normal one, |
| 1118 | offset history is simply translated by one position, |
| 1119 | with the new offset taking first spot. |
| 1120 | |
| 1121 | - When repeat offset 1 (most recent) is used, history is unmodified. |
| 1122 | - When repeat offset 2 is used, it's swapped with offset 1. |
| 1123 | - When repeat offset 3 is used, it takes first spot, |
| 1124 | pushing the other ones by one position. |
| Yann Collet | 00d44ab | 2016-07-04 01:29:47 +0200 | [diff] [blame] | 1125 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 1126 | |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1127 | Dictionary format |
| 1128 | ----------------- |
| 1129 | |
| 1130 | `zstd` is compatible with "pure content" dictionaries, free of any format restriction. |
| 1131 | But dictionaries created by `zstd --train` follow a format, described here. |
| 1132 | |
| 1133 | __Pre-requisites__ : a dictionary has a known length, |
| 1134 | defined either by a buffer limit, or a file size. |
| 1135 | |
| 1136 | | Header | DictID | Stats | Content | |
| 1137 | | ------ | ------ | ----- | ------- | |
| 1138 | |
| 1139 | __Header__ : 4 bytes ID, value 0xEC30A437, Little Endian format |
| 1140 | |
| 1141 | __Dict_ID__ : 4 bytes, stored in Little Endian format. |
| 1142 | DictID can be any value, except 0 (which means no DictID). |
| Yann Collet | 722e14b | 2016-07-08 19:22:16 +0200 | [diff] [blame] | 1143 | It's used by decoders to check if they use the correct dictionary. |
| Yann Collet | f6ff53c | 2016-07-15 17:03:38 +0200 | [diff] [blame] | 1144 | _Reserved ranges :_ |
| 1145 | If the frame is going to be distributed in a private environment, |
| 1146 | any dictionary ID can be used. |
| 1147 | However, for public distribution of compressed frames, |
| 1148 | some ranges are reserved for future use : |
| Yann Collet | 6cacd34 | 2016-07-15 17:58:13 +0200 | [diff] [blame] | 1149 | |
| 1150 | - low range : 1 - 32767 : reserved |
| 1151 | - high range : >= (2^31) : reserved |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1152 | |
| 1153 | __Stats__ : Entropy tables, following the same format as a [compressed blocks]. |
| 1154 | They are stored in following order : |
| 1155 | Huffman tables for literals, FSE table for offset, |
| Yann Collet | 722e14b | 2016-07-08 19:22:16 +0200 | [diff] [blame] | 1156 | FSE table for matchLenth, and FSE table for litLength. |
| 1157 | It's finally followed by 3 offset values, populating recent offsets, |
| 1158 | stored in order, 4-bytes little endian each, for a total of 12 bytes. |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1159 | |
| 1160 | __Content__ : Where the actual dictionary content is. |
| Yann Collet | 722e14b | 2016-07-08 19:22:16 +0200 | [diff] [blame] | 1161 | Content size depends on Dictionary size. |
| Yann Collet | bd10607 | 2016-07-08 19:16:57 +0200 | [diff] [blame] | 1162 | |
| 1163 | [compressed blocks]: #compressed-block-format |
| 1164 | |
| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 1165 | |
| 1166 | Version changes |
| 1167 | --------------- |
| Yann Collet | e557fd5 | 2016-07-17 16:21:37 +0200 | [diff] [blame^] | 1168 | - 0.1.2 : limit huffman tree depth to 11 bits |
| 1169 | - 0.1.1 : reserved dictID ranges |
| 1170 | - 0.1.0 : initial release |